## Concise.org

Recall that the general form of gradient descent is:

$$\begin{aligned}& \text{Repeat} \; \lbrace \\ & \quad \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace\end{aligned}$$

Computing $\dfrac{\partial}{\partial \theta_j}J(\theta)$ for the logistic regression cost function, we get:

$\text{Repeat until convergence: } \{$

$\quad \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_{j}\right)$ $\quad$ (simultaneously update $\theta_j$ for every $j \in 0,\dots,n$)
$\}$

Note that the gradient descent algorithm for logistic regression, written in terms of the hypothesis function $h_\theta(x)$, looks exactly the same as the one for linear regression. Note, however, that the actual hypothesis functions are different: linear regression uses $h_\theta(x) = \theta^T x$, while logistic regression uses $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.
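The difference between the two hypothesis functions can be sketched in NumPy (function names here are illustrative, not from the course code):

```python
import numpy as np

def h_linear(theta, x):
    # Linear regression hypothesis: theta^T x
    return theta @ x

def h_logistic(theta, x):
    # Logistic regression hypothesis: sigmoid applied to theta^T x
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

theta = np.array([0.0, 1.0])
x = np.array([1.0, 2.0])  # x_0 = 1 is the intercept term

print(h_linear(theta, x))    # unbounded real value: 2.0
print(h_logistic(theta, x))  # squashed into (0, 1): sigmoid(2.0)
```

The update rule is identical in both cases; only the mapping from $\theta^T x$ to a prediction changes.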

The idea of using feature scaling to make gradient descent converge faster also applies to logistic regression.

Vectorized implementation:

$\theta := \theta - \alpha \space \delta$

where $\theta,\delta \in \mathbb{R}^{n+1}$ are vectors as shown below

$$\delta = \frac{1}{m} \sum\limits_{i=1}^{m} \left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}\right) = \frac{1}{m}X^T(g(X\theta) - y)$$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$
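The vectorized update above can be sketched in NumPy as follows (the toy dataset, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Vectorized logistic regression: theta := theta - alpha * delta,
    where delta = (1/m) * X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        delta = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * delta
    return theta

# Toy dataset: first column of X is the intercept term x_0 = 1
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
```

Note that the whole parameter vector $\theta$ is updated in one expression, which gives the simultaneous update for free.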