Gradient Descent: Learning Rate


Testing convergence of gradient descent

Plot the cost function J(θ) against the number of iterations of gradient descent, with the number of iterations on the x-axis.

Ideally, J(θ) should decrease after every iteration.
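
As a concrete illustration, here is a minimal sketch that records J(θ) after every iteration of batch gradient descent on a linear-regression cost and plots it. The toy data, helper names, and the α value are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(theta, X, y):
    """Mean squared error cost J(theta) for linear regression."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))

def gradient_descent(X, y, alpha=0.01, num_iters=400):
    """Run batch gradient descent, recording J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / len(y)  # gradient of J(theta)
        theta = theta - alpha * grad
        J_history.append(cost(theta, X, y))
    return theta, J_history

# Hypothetical toy data: y ≈ 2 + 3x, with a column of ones for the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.column_stack([np.ones_like(x), x])

theta, J_history = gradient_descent(X, y, alpha=0.05)
plt.plot(J_history)
plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.show()
```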


The number of iterations gradient descent needs to converge can vary a lot from problem to problem.

Example automatic convergence test:
Declare convergence if J(θ) decreases by less than ϵ = 10⁻³ in one iteration.
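
A minimal sketch of such a test, assuming a J_history list of cost values like the one recorded in the snippet above; the function name and the default ϵ are illustrative:

```python
def has_converged(J_history, epsilon=1e-3):
    """True if J(theta) decreased by less than epsilon in the last iteration."""
    if len(J_history) < 2:
        return False
    decrease = J_history[-2] - J_history[-1]
    # A negative "decrease" means J(theta) went up, which suggests alpha is
    # too large rather than that gradient descent has converged.
    return 0 <= decrease < epsilon
```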

However, in practice choosing a suitable threshold ϵ can be quite difficult, so plots like the one above are often a more effective way to check whether gradient descent has converged.

Debugging: How to make sure gradient descent is working correctly


If J(θ) ever increases, then you probably need to decrease α.

It has been proven that if the learning rate α is sufficiently small, J(θ) will decrease on every iteration.

How to choose the learning rate α

Recall that:

  • If α is too small:
    • slow convergence
  • If α is too large:
    • J(θ) may not decrease on every iteration, and thus may not converge
    • it may even diverge (see the sketch after this list)
    • sometimes slow convergence is also possible
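
To make the too-small and too-large cases concrete, here is a small sketch on the one-dimensional cost J(θ) = θ², whose gradient is 2θ; the step sizes chosen are illustrative:

```python
def run(alpha, theta=1.0, num_iters=10):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    trace = [theta]
    for _ in range(num_iters):
        theta = theta - alpha * 2 * theta  # theta <- theta - alpha * dJ/dtheta
        trace.append(theta)
    return trace

print(run(0.01)[:4])  # too small:  1.0, 0.98, 0.9604, ...  (slow convergence)
print(run(0.4)[:4])   # reasonable: 1.0, 0.2, 0.04, ...     (rapid convergence)
print(run(1.1)[:4])   # too large:  1.0, -1.2, 1.44, ...    (|theta| grows: divergence)
```

With α = 1.1 each update multiplies θ by −1.2, so the iterates oscillate with growing magnitude instead of settling at the minimum.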

To choose a suitable α, try a range of values and, for each value of α, plot J(θ) as a function of the number of iterations. Then pick an α that causes J(θ) to decrease rapidly.

When trying a range of values for α, stepping by a factor of roughly 3 works well.
E.g., ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
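
Putting this together, here is a hedged sketch that sweeps this range of α values and overlays the resulting J(θ) curves, reusing the toy X, y, and gradient_descent from the first snippet; with that data the largest values overshoot, which shows up clearly on the plot:

```python
import matplotlib.pyplot as plt

# Reuses X, y, and gradient_descent from the first sketch above.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    _, J_history = gradient_descent(X, y, alpha=alpha, num_iters=100)
    plt.plot(J_history, label=f"alpha = {alpha}")
plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.yscale("log")  # log scale keeps converging and diverging runs visible together
plt.legend()
plt.show()
```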