Parameter Learning: Gradient Descent


So we have our hypothesis function and we have a way of measuring (the cost function) how well it fits into the data. Now we need to estimate the parameters (θ0\theta_0 and θ1\theta_1) in hypothesis function. That's where gradient descent comes in.

We put θ0\theta_0 on the x-axis and θ1\theta_1 on the y-axis, with the cost function on the vertical z-axis. We need to find the specific theta parameters for which our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.

Gradient descent is a more general algorithm that can be used to minimize an arbitrary function JJ.

We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter α\alpha, which is called the learning rate.

On a side note, we should adjust our parameter α\alpha to ensure that the gradient descent algorithm converges in a reasonable time. If α\alpha is too small, gradient decent can be slow to converge. On the other hand, if α\alpha is too large value, gradient descent can overshoot the minimum and it may fail to converge, or even diverge. Sometimes, if α\alpha is too large, slow convergence is also possible.

Even with a fixed learning rate α\alpha, as we approach a local minimum, gradient descent will automatically take smaller steps. At the minima, the derivative will always be 0 and thus the θ\theta values will no longer change.

The gradient descent algorithm is:

repeat until convergence: {

θj:=θjαθjJ(θ0,θ1)\,\,\,\,\,\, \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \,\,\,\,\,\, (simultaneously update for j=0j=0 and j=1j=1)


Note that θ0\theta_0 and θ1\theta_1 should be updated simultaneously.


θjJ(θ0,θ1)\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) is the slope of tangent also known as derivative in j dimension.

In higher dimensions, the analog of derivative is gradient. The gradient measures the vector direction of steepest descent of the function.

Depending on where one starts on the graph, one could end up at different points.