Gradient Descent: Learning Rate

 

Testing convergence of gradient descent

Make a plot with the number of iterations on the x-axis, and plot the cost function J(θ) over the number of iterations of gradient descent.

Ideally, J(θ) should decrease after every iteration.

[Figure 11: gradient descent convergence — J(θ) plotted against the number of iterations]
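A minimal sketch of how such a plot can be produced. Linear regression with the squared-error cost is assumed here, and the toy data, the learning rate α = 0.02, and the iteration count are all made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """J(θ) = (1 / 2m) · Σ (hθ(x) − y)², the squared-error cost."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Run batch gradient descent, recording J(θ) after every iteration."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        J_history.append(cost(X, y, theta))
    return theta, J_history

# Made-up toy data: y ≈ 2 + 3x plus noise (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])        # intercept column plus the feature

theta, J_history = gradient_descent(X, y, np.zeros(2), alpha=0.02, num_iters=400)

plt.plot(J_history)
plt.xlabel("Number of iterations")
plt.ylabel("J(θ)")
plt.show()
```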

The number of iterations gradient descent needs to converge can vary a lot from one problem to another.

Example automatic convergence test:
Declare convergence if J(θ) decreases by less than some small threshold ε (e.g., 10⁻³) in one iteration.
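A minimal sketch of this stopping rule. The `cost` and `step` callables below are placeholders for whatever cost function and gradient-descent update you are using, and ε = 10⁻³ is just the example threshold above:

```python
# Hedged sketch of the automatic convergence test: stop once J(θ)
# decreases by less than ε in a single iteration.
def run_until_converged(theta, cost, step, epsilon=1e-3, max_iters=10_000):
    """`cost` and `step` are placeholders for your cost function and
    one gradient-descent update, respectively."""
    prev_J = cost(theta)
    for i in range(max_iters):
        theta = step(theta)              # one gradient-descent update
        J = cost(theta)
        if prev_J - J < epsilon:         # decreased by less than ε: declare convergence
            print(f"Converged after {i + 1} iterations")
            break
        prev_J = J
    return theta
```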

However, in practice choosing a suitable threshold ε can be quite difficult, so plots like the one above are often a more reliable way to check whether gradient descent has converged.

Debugging: How to make sure gradient descent is working correctly

[Figure 12: debugging gradient descent — J(θ) plotted against the number of iterations]

If J(θ) ever increases, then you probably need to decrease α.

NOTE:
It has been proven that if the learning rate α is sufficiently small, J(θ) should decrease on every iteration.

How to choose the learning rate α

Recall that:

  • If α is too small:
    • slow convergence
  • If α is too large:
    • J(θ) may not decrease on every iteration, and thus may not converge
    • it may even diverge
    • sometimes slow convergence is also possible

To choose a suitable α, try a range of values and, for each value of α, plot J(θ) as a function of the number of iterations. Then pick the α that causes J(θ) to decrease rapidly.

When trying a range of values of α, you can scale successive values by roughly a factor of 3 (or 10), for example:
..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
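
A sketch of this procedure, again assuming linear regression on made-up toy data; only the smaller values from the example grid are plotted, since larger values such as 0.1 diverge on this particular data:

```python
# Try several learning rates on the same problem and compare the J(θ) curves.
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent(X, y, theta, alpha, num_iters):
    """Run batch gradient descent, recording J(θ) after every iteration."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        J_history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return theta, J_history

# Made-up toy data: y ≈ 2 + 3x plus noise (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])        # intercept column plus the feature

for alpha in [0.001, 0.003, 0.01, 0.03]:         # 0.1 and above diverge on this data
    _, J_history = gradient_descent(X, y, np.zeros(2), alpha=alpha, num_iters=200)
    plt.plot(J_history, label=f"α = {alpha}")

plt.xlabel("Number of iterations")
plt.ylabel("J(θ)")
plt.legend()
plt.show()
```

The curve that drops fastest without oscillating or blowing up indicates a good choice of α.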