## Regularized Linear Regression

#### Cost function

The cost function for regularized linear regression is:

$\displaystyle J(\theta) = \dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \ \sum_{j=1}^n \theta_j^2$
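As a minimal sketch, this cost can be computed directly with NumPy. The function name and the assumption that `X` is an $m \times (n+1)$ design matrix whose first column is all ones (the $x_0 = 1$ intercept feature) are illustrative, not from the original text:

```python
import numpy as np

def compute_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    Assumes X is an (m, n+1) design matrix whose first column
    is all ones, and theta has shape (n+1,).
    """
    m = len(y)
    residuals = X @ theta - y                        # h_theta(x^(i)) - y^(i)
    unreg = (residuals @ residuals) / (2 * m)        # squared-error term
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # skip theta_0
    return unreg + reg
```

Note that the regularization sum starts at $j = 1$, so `theta[1:]` is used and the intercept $\theta_0$ is left unpenalized.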

The gradient descent algorithm for regularized linear regression is:

\begin{aligned} & \text{Repeat}\ \lbrace \\ & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_{0}\right) \\ & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_{j}\right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\\ & \rbrace \end{aligned}

After some manipulation, this can be written as:

\begin{aligned} & \text{Repeat}\ \lbrace \\ & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_{0}\right) \\ & \ \ \ \ \theta_j := \theta_j \ \Big( 1 - \alpha \ \frac{\lambda}{m} \Big) - \alpha\ \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_{j}\right) &\ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\\ & \rbrace \end{aligned}

In the update for $\theta_j$, the factor $\Big( 1 - \alpha \ \frac{\lambda}{m} \Big)$ will always be less than 1, since $\alpha,\lambda,m>0$. Intuitively, each update first shrinks $\theta_j$ by some amount before applying the usual gradient step. Notice that the second term is now exactly the same as in unregularized gradient descent.
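The "shrink then step" form above can be sketched as a single vectorized update. Names and shapes are assumptions (as before, `X` is $m \times (n+1)$ with a leading column of ones):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One update of regularized gradient descent.

    theta_j is first scaled by (1 - alpha*lam/m) for j >= 1,
    while theta_0 is left unshrunk, then the ordinary
    (unregularized) gradient step is applied to all components.
    """
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m        # (1/m) * sum of residual * x_j
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                           # do not shrink theta_0
    return theta * shrink - alpha * grad
```

Repeating this step in a loop (until the cost stops decreasing) gives the full algorithm; with `lam = 0` it reduces to ordinary gradient descent.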

#### Normal Equation

Now let's use regularization for the non-iterative normal equation method.

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

\begin{aligned} & \theta = \left( X^TX + \lambda \ L \right)^{-1} X^Ty \\ & \text{where}\ \ L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}_{(n+1) \ \times \ (n+1)} \end{aligned}

Intuitively, $L$ is the $(n+1) \times (n+1)$ identity matrix with its top-left entry set to $0$ (so that $\theta_0$, the coefficient of $x_0$, is not regularized), multiplied by the single real number $\lambda$.

Recall that in some cases (e.g., $m < n$, or linearly dependent features), $X^TX$ could be non-invertible. However, it can be shown that as long as $\lambda > 0$, the matrix $(X^TX + \lambda \cdot L)$ is always invertible.
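A minimal sketch of the regularized normal equation, under the same assumptions as before (`X` is $m \times (n+1)$ with a leading column of ones; the function name is illustrative). Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically preferable choice:

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form solution theta = (X^T X + lam * L)^{-1} X^T y.

    L is the (n+1) x (n+1) identity matrix with its top-left
    entry zeroed, so theta_0 is not regularized. For lam > 0 the
    matrix X^T X + lam * L is invertible even when m < n.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                             # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```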