## Gradient Descent for Multiple Features

The gradient descent equation for $n$ features (variables) can be written as:

repeat until convergence: {

$\qquad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad$ (simultaneously update $\theta_j$ for every $j = 0, \dots, n$)

}

Substituting the partial derivative $\frac{\partial}{\partial \theta_j} J(\theta)$ of the cost function, the algorithm becomes:

repeat until convergence: {

$\qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}_{j}\right) \qquad$ (simultaneously update $\theta_j$ for every $j = 0, \dots, n$)

}
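As a concrete illustration of one such simultaneous update, here is a minimal NumPy sketch in explicit loop form. The names (`theta`, `X`, `y`, `alpha`) and the convention that column $0$ of `X` holds the intercept feature $x_0 = 1$ are assumptions for this example, not anything fixed above.

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j, written as an explicit loop.

    X is assumed to be the m x (n+1) design matrix whose first column is
    all ones (the x_0 = 1 intercept feature); y has shape (m,).
    """
    m = X.shape[0]
    h = X @ theta                              # h_theta(x^(i)) for all i at once
    new_theta = np.empty_like(theta, dtype=float)
    for j in range(X.shape[1]):
        # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        grad_j = np.sum((h - y) * X[:, j]) / m
        new_theta[j] = theta[j] - alpha * grad_j
    return new_theta                           # all theta_j updated simultaneously
```

Note that the gradients for all $j$ are computed from the same `theta` before any component is overwritten, which is exactly what "simultaneously update" requires.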

Vectorized implementation:

$\theta := \theta - \alpha\,\delta$

where $\theta, \delta \in \mathbb{R}^{n+1}$ are the vectors shown below:

$\delta = \frac{1}{m} \sum\limits_{i=1}^{m} \left((h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}\right) = \frac{1}{m} X^T (X\theta - y)$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$

$x^{(i)} = \begin{bmatrix} x^{(i)}_0 \\ x^{(i)}_1 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \in \mathbb{R}^{n+1}$

$X = \begin{bmatrix} \cdots \; (x^{(1)})^T \; \cdots \\ \cdots \; (x^{(2)})^T \; \cdots \\ \vdots \\ \cdots \; (x^{(m)})^T \; \cdots \end{bmatrix}$ (an $m \times (n+1)$ matrix)

$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$
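With $X$, $y$, and $\theta$ laid out as above, the vectorized update is a one-liner per iteration. The following sketch (a hypothetical `gradient_descent` helper that runs a fixed number of iterations rather than testing for convergence) shows one way to run it:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Vectorized gradient descent: theta := theta - alpha * delta,
    where delta = (1/m) * X^T (X theta - y)."""
    m = X.shape[0]
    for _ in range(num_iters):
        delta = X.T @ (X @ theta - y) / m
        theta = theta - alpha * delta
    return theta

# Toy usage: the data satisfy y = 2x exactly, so theta should approach [0, 2].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # x_0 = 1 column first
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
print(theta)  # approximately [0., 2.]
```

The whole inner loop over $j$ from the earlier version collapses into the single matrix expression `X.T @ (X @ theta - y) / m`, which is both shorter and faster than the explicit loop.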

### Derivation: Vectorized Implementation

Let $c^{(i)} = h_\theta(x^{(i)}) - y^{(i)}$. Then:

$m\,\delta = m \begin{bmatrix} \delta_0 \\ \delta_1 \\ \vdots \\ \delta_n \end{bmatrix} = \begin{bmatrix} \sum\limits_{i=1}^m c^{(i)} x^{(i)}_0 \\ \sum\limits_{i=1}^m c^{(i)} x^{(i)}_1 \\ \vdots \\ \sum\limits_{i=1}^m c^{(i)} x^{(i)}_n \end{bmatrix} = c^{(1)} x^{(1)} + c^{(2)} x^{(2)} + \dots + c^{(m)} x^{(m)} = \sum\limits_{i=1}^m c^{(i)} x^{(i)}$

Since the columns of $X^T$ are precisely the vectors $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, and $c = X\theta - y$ stacks the $c^{(i)}$, this sum is exactly $X^T c$. Hence $\delta = \frac{1}{m} X^T (X\theta - y)$, as claimed.
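As a quick numerical sanity check of the identity $\sum_{i=1}^m c^{(i)} x^{(i)} = X^T c$, here is a sketch with random data (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n + 1))            # rows are the (x^(i))^T
theta = rng.standard_normal(n + 1)
y = rng.standard_normal(m)

c = X @ theta - y                               # c^(i) = h_theta(x^(i)) - y^(i)
sum_form = sum(c[i] * X[i] for i in range(m))   # sum_i c^(i) * x^(i)
matrix_form = X.T @ c                           # X^T (X theta - y)
assert np.allclose(sum_form, matrix_form)
```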