Normal Equation for optimization


Like gradient descent, the "Normal Equation" method is another way of minimizing $J$. In the "Normal Equation" method, we minimize $J$ by explicitly taking its derivatives with respect to each $\theta_j$ and setting them to zero. This allows us to find the optimal $\theta$ without iteration.

The cost function is:

$$J(\theta) = \dfrac{1}{2m} \sum_{i=1}^m \left( h_\theta (x^{(i)}) - y^{(i)} \right)^2$$

Find $\theta \in \mathbb{R}^{n+1}$ that minimizes the cost function $J$: set $\frac{\partial}{\partial \theta_j} J(\theta) = 0$ (for every $j \in \{0, 1, \dots, n\}$)

Solve for $\theta_0, \theta_1, \dots, \theta_n$
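Although the closed form is simply stated below, it follows directly from this zero-gradient condition; a brief derivation sketch in matrix notation (using the design matrix $X$ and target vector $y$ defined next):

```latex
% Cost function in matrix form:
J(\theta) = \tfrac{1}{2m}\,(X\theta - y)^T (X\theta - y)

% Gradient with respect to \theta:
\nabla_\theta J(\theta) = \tfrac{1}{m}\, X^T (X\theta - y)

% Setting the gradient to zero yields the "normal equations":
X^T X\,\theta = X^T y

% If X^T X is invertible, solving for \theta gives the closed form:
\theta = (X^T X)^{-1} X^T y
```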


$n$ features;
$m$ training examples: $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$

$$x^{(i)} = \begin{bmatrix} x^{(i)}_0 \\ x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \in \mathbb{R}^{n+1}\,; \quad X = \begin{bmatrix} \dots \ (x^{(1)})^T \ \dots \\ \dots \ (x^{(2)})^T \ \dots \\ \vdots \\ \dots \ (x^{(m)})^T \ \dots \end{bmatrix} \quad (m \times (n+1) \text{ matrix})$$

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

The normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y \;\in\; \mathbb{R}^{n+1}$$

Octave/MATLAB Implementation:

pinv(X' * X) * X' * y
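For readers working in Python rather than Octave/MATLAB, here is a minimal NumPy equivalent of the same one-liner. The data below is synthetic, invented purely for illustration; with noise-free targets the normal equation recovers the generating parameters exactly (up to floating-point error):

```python
import numpy as np

# Synthetic data: m = 50 examples, n = 2 features, targets generated
# from known parameters theta_true = [3, 1, 2] with no noise.
rng = np.random.default_rng(0)
m, n = 50, 2
features = rng.standard_normal((m, n))

# Design matrix X: prepend a column of ones (x_0 = 1) -> shape m x (n+1)
X = np.hstack([np.ones((m, 1)), features])
theta_true = np.array([3.0, 1.0, 2.0])
y = X @ theta_true

# Normal equation: theta = pinv(X^T X) X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # recovers [3, 1, 2] up to floating-point error
```

As in the Octave version, `np.linalg.pinv` (the pseudoinverse) is used rather than a plain matrix inverse, for the reasons discussed below.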

There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose learning rate $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^TX$ |
| Works well even if $n$ is large | Slow if $n$ is very large |

In practice, when $n$ exceeds 10,000 it might be a good time to go from a normal equation solution to an iterative process.

When implementing the normal equation in Octave/MATLAB, we want to use the `pinv` function rather than `inv`. The `pinv` function will give you a value of $\theta$ even if $X^TX$ is not invertible.

If $X^TX$ is non-invertible, the common causes are:

  • Redundant features, where two features are very closely related (i.e., they are linearly dependent). In this case, delete a feature that is linearly dependent with another.
  • Too many features (e.g. $m \le n$). In this case, delete one or more features or use "regularization" (to be explained in a later lesson).
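To make the redundant-feature case concrete, here is a small NumPy sketch (the data is invented for illustration) in which one feature is an exact copy of another, so $X^TX$ is singular and has no true inverse; the pseudoinverse still returns a usable least-squares $\theta$:

```python
import numpy as np

# One feature duplicated: the last two columns of X are identical,
# so X^T X is singular (rank 2 instead of 3).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, x1])  # redundant (linearly dependent) features
y = np.array([5.0, 7.0, 9.0, 11.0])        # y = 3 + 2*x1

print(np.linalg.matrix_rank(X.T @ X))  # 2 -> inv(X^T X) does not exist

# The pseudoinverse still yields a least-squares solution:
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(X @ theta)  # reproduces y despite the singular matrix
```

The weight on the duplicated feature is split evenly between the two copies (here $\theta \approx [3, 1, 1]$), which is why deleting one of the linearly dependent features is harmless.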