Logistic Regression: Gradient Descent


Recall that the general form of gradient descent is:

$$
\begin{aligned}
& \text{Repeat} \; \lbrace \\
& \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \\
& \rbrace
\end{aligned}
$$

Computing $\dfrac{\partial}{\partial \theta_j} J(\theta)$ for the logistic regression cost function, we get:

$$
\begin{aligned}
& \text{Repeat until convergence: } \lbrace \\
& \quad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}_j \right) \\
& \rbrace
\end{aligned}
$$

(simultaneously update $\theta_j$ for every $j \in \{0, \dots, n\}$)

Note that the gradient descent algorithm for logistic regression, written in terms of the hypothesis function $h_\theta(x)$, looks exactly the same as the one for linear regression. Note, however, that the actual hypothesis functions are different: linear regression uses $h_\theta(x) = \theta^T x$, while logistic regression uses $h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$.
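The per-parameter update above can be sketched in NumPy as follows. The function names here are illustrative, not from the original notes; `X` is assumed to be the $m \times (n+1)$ design matrix with a leading column of ones.

```python
import numpy as np

def sigmoid(z):
    """Logistic hypothesis g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(theta, X, y, alpha):
    """One gradient descent step, updating every theta_j simultaneously.

    X has shape (m, n+1) with a leading column of ones; theta has shape (n+1,).
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)       # h_theta(x^(i)) for every example i
    new_theta = theta.copy()     # simultaneous update: read only the old theta
    for j in range(len(theta)):
        grad_j = (1.0 / m) * np.sum((h - y) * X[:, j])
        new_theta[j] = theta[j] - alpha * grad_j
    return new_theta
```

Copying `theta` before the loop is what makes the update simultaneous: every $\theta_j$ is computed from the same old parameter vector.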

The idea of using feature scaling to make gradient descent converge faster also applies to logistic regression.
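A minimal sketch of one common form of feature scaling, standardization (zero mean, unit variance per feature); the function name is hypothetical and the stored `mu`/`sigma` would be reused when scaling new examples:

```python
import numpy as np

def scale_features(X):
    """Standardize each column of X to zero mean and unit variance.

    Returns the scaled matrix plus mu and sigma so the same transform
    can be applied to new examples at prediction time.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns (e.g. the intercept)
    return (X - mu) / sigma, mu, sigma
```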

Vectorized implementation:

$$
\theta := \theta - \alpha \, \delta
$$

where $\theta, \delta \in \mathbb{R}^{n+1}$ are the vectors shown below:

$$
\delta = \frac{1}{m} \sum_{i=1}^{m} \left( (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)} \right) = \frac{1}{m} X^T \left( g(X\theta) - y \right)
$$

$$
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}
$$
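The vectorized update can be sketched in a few lines of NumPy (function names are illustrative). Note how $\delta = \frac{1}{m} X^T (g(X\theta) - y)$ replaces the explicit sum and the loop over $j$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def vectorized_step(theta, X, y, alpha):
    """One vectorized gradient descent step: theta := theta - alpha * delta,
    where delta = (1/m) X^T (g(X theta) - y)."""
    m = X.shape[0]
    delta = (X.T @ (sigmoid(X @ theta) - y)) / m
    return theta - alpha * delta
```

Because the matrix product updates all $n+1$ components at once from the old `theta`, the simultaneous-update requirement is satisfied automatically.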