## Normal Equation for optimization

Like gradient descent, the "Normal Equation" method is another way of minimizing $J$. Instead of iterating, the normal equation minimizes $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimal $\theta$ in a single step, without iteration.

The cost function is:

$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$

Find the $\theta \in \mathbb{R}^{n+1}$ that minimizes the cost function $J$, i.e. solve $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ for every $j \in \{0,1,\dots,n\}$.

Solve for $\theta_0, \theta_1, ..., \theta_n$

Given $n$ features and $m$ training examples $(x^{(1)},y^{(1)}), \dots, (x^{(m)},y^{(m)})$:

$x^{(i)} = \begin{bmatrix} x^{(i)}_0 \\ x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \in \mathbb{R}^{n+1}; \quad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$ (an $m \times (n+1)$ matrix)

$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$

The normal equation formula is given below:

$\theta = (X^T X)^{-1}X^T y \space \space \in \mathbb{R}^{n+1}$
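This closed form can be sketched by writing the cost in matrix form and setting its gradient to zero (assuming $X^TX$ is invertible):

$\nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0 \implies X^T X \theta = X^T y \implies \theta = (X^T X)^{-1} X^T y$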

*Octave/MATLAB Implementation:*

`pinv(X' * X) * X' * y`
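For comparison, here is a minimal sketch of the same computation in Python with NumPy, using hypothetical training data (the data values below are illustrative, not from the course):

```python
import numpy as np

# Hypothetical training data: m = 4 examples, n = 1 feature.
# The first column of X is the bias term x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y = 2x, so theta should be [0, 2]

# Normal equation: theta = pinv(X^T X) X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # approximately [0., 2.]
```

As in the Octave version, `pinv` (the Moore-Penrose pseudoinverse) is used instead of a plain inverse so the computation still succeeds when $X^TX$ is singular.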

*NOTE:*

There is *no need* to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose learning rate $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^TX$ |
| Works well even if $n$ is large | Slow if $n$ is very large |

In practice, when $n$ exceeds 10,000 it might be a good time to go from a normal equation solution to an iterative process.

When implementing the normal equation in Octave/MATLAB, we want to use the `pinv` function rather than `inv`. The `pinv` function will give you a value of $\theta$ even if $X^TX$ is not invertible.

If $X^TX$ is **non-invertible**, the common causes might be:

- *Redundant* features, where two features are very closely related (i.e. they are linearly dependent). In this case, delete a feature that is linearly dependent with another.
- *Too many* features (e.g. $m \leq n$). In this case, delete one or more features or use "regularization" (to be explained in a later lesson).
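The redundant-feature case can be demonstrated with a small NumPy sketch (hypothetical data): one column of $X$ is an exact multiple of another, so $X^TX$ is singular and a plain inverse would fail, but the pseudoinverse still returns a usable $\theta$.

```python
import numpy as np

# Hypothetical data with a redundant feature: the third column is
# exactly 2 * the second, so X^T X is singular and inv() would fail.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# pinv returns the minimum-norm least-squares solution instead of erroring.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

# The fitted values still reproduce y, even though theta is not unique.
print(np.allclose(X @ theta, y))
```

Deleting the redundant column (or adding regularization) restores a unique solution, but `pinv` lets the computation go through either way.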