Regularized Linear Regression


Cost function

The cost function for regularized linear regression is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
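As a minimal sketch of this cost (assuming a linear hypothesis $h_\theta(x) = \theta^T x$, a NumPy design matrix `X` of shape $(m, n+1)$ with a leading column of ones, a target vector `y` of shape $(m,)$, and `theta` of shape $(n+1,)$; all names here are illustrative):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X: (m, n+1) with a leading column of ones, y: (m,),
    theta: (n+1,), lam: regularization parameter lambda.
    """
    m = len(y)
    residuals = X @ theta - y                       # h_theta(x^(i)) - y^(i)
    squared_error = (residuals @ residuals) / (2 * m)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # theta_0 is not regularized
    return squared_error + reg
```

Note that the regularization sum runs over $j = 1, \dots, n$, so the bias term $\theta_0$ is left out of the penalty.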

Gradient descent

The gradient descent algorithm for regularized linear regression is:

$$
\begin{aligned}
& \text{Repeat}\ \lbrace \\
& \quad \theta_0 := \theta_0 - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_{0} \\
& \quad \theta_j := \theta_j - \alpha\, \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_{j} + \frac{\lambda}{m}\theta_j \right] && j \in \lbrace 1, 2, \dots, n \rbrace \\
& \rbrace
\end{aligned}
$$
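A sketch of one pass of this loop, under the same assumed array shapes as the cost example above (`alpha` is the learning rate):

```python
def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for linear regression."""
    m = len(y)
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i)
    grad = (X.T @ residuals) / m           # unregularized gradient, shape (n+1,)
    grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1 only
    return theta - alpha * grad
```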

With some manipulation, the update for $\theta_j$ can also be written as:

$$
\begin{aligned}
& \text{Repeat}\ \lbrace \\
& \quad \theta_0 := \theta_0 - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_{0} \\
& \quad \theta_j := \theta_j \left( 1 - \alpha\, \frac{\lambda}{m} \right) - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_{j} && j \in \lbrace 1, 2, \dots, n \rbrace \\
& \rbrace
\end{aligned}
$$

In this form of the update, the first factor $\left( 1 - \alpha\,\frac{\lambda}{m} \right)$ is always less than 1, since $\alpha, \lambda, m > 0$. Intuitively, we can see it as shrinking the value of $\theta_j$ by some amount on every update. Notice that the second term is now exactly the same as in the unregularized update.
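To see that the two forms compute the same update, here is a small numerical check on random data (names are illustrative; it reuses the `gradient_step` sketch above):

```python
rng = np.random.default_rng(0)
m, n = 20, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)
alpha, lam = 0.1, 1.0

# Form 1: subtract alpha * (gradient + (lambda/m) * theta_j)
theta_form1 = gradient_step(theta, X, y, alpha, lam)

# Form 2: shrink theta_j by (1 - alpha*lambda/m), then take the ordinary step
residuals = X @ theta - y
grad = (X.T @ residuals) / m               # unregularized gradient
theta_form2 = theta.copy()
theta_form2[0] -= alpha * grad[0]
theta_form2[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]

assert np.allclose(theta_form1, theta_form2)
```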

Normal Equation

Now let's apply regularization to the non-iterative normal equation method.

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

$$
\begin{aligned}
& \theta = \left( X^T X + \lambda L \right)^{-1} X^T y \\
& \text{where}\quad L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}_{(n+1) \ \times \ (n+1)}
\end{aligned}
$$

Intuitively, $L$ is the $(n+1) \times (n+1)$ identity matrix with its top-left entry set to $0$ (so that the term corresponding to $x_0$ is not regularized), scaled by the single real number $\lambda$.

Recall that in some cases (e.g., $m < n$), $X^TX$ may be non-invertible. However, it can be shown that when we add the term $\lambda \cdot L$ with $\lambda > 0$, the matrix $X^TX + \lambda \cdot L$ becomes invertible.
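A minimal sketch of the regularized normal equation, under the same assumed shapes as above:

```python
def normal_equation_regularized(X, y, lam):
    """Closed-form regularized linear regression: theta = (X^T X + lambda*L)^{-1} X^T y."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0                 # do not regularize theta_0
    A = X.T @ X + lam * L         # regularized matrix, as in the equation above
    return np.linalg.solve(A, X.T @ y)
```

Using `np.linalg.solve` rather than forming the inverse explicitly computes the same $\theta$ and is generally more numerically stable.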