Regularization: Cost Function


Suppose a model overfits because it has a large number of features. Consider the hypothesis function $h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$, and suppose we want to reduce the influence of the $x^3$ and $x^4$ terms, which might be causing the overfitting. Without removing these features or changing the form of the hypothesis, we can instead modify the cost function.

For the $x^3$ and $x^4$ terms to contribute almost nothing to the hypothesis, $\theta_3$ and $\theta_4$ must be close to zero. We can rewrite the cost function so that any sizable value of $\theta_3$ or $\theta_4$ is heavily penalized, which pushes the minimizer to keep both near zero:

$$J(\theta) = \dfrac{1}{2m} \left[ \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \left( 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2 \right) \right]$$
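As a rough illustration of the penalized cost above, the sketch below evaluates it for a quartic hypothesis in plain Python. The function name `penalized_cost`, the `weight` argument, and the toy data are assumptions made for this example; only the weight of 1000 on the third- and fourth-order parameters comes from the text.

```python
def penalized_cost(theta, xs, ys, weight=1000.0):
    """Squared-error cost plus a heavy penalty on theta[3] and theta[4].

    theta holds the five coefficients of the quartic hypothesis,
    theta[0] + theta[1]*x + ... + theta[4]*x**4.
    """
    m = len(xs)
    squared_error = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(t * x ** j for j, t in enumerate(theta))
        squared_error += (prediction - y) ** 2
    # Only the cubic and quartic coefficients are penalized here.
    penalty = weight * theta[3] ** 2 + weight * theta[4] ** 2
    return (squared_error + penalty) / (2 * m)


# Hypothetical toy data generated from y = x**2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

quadratic_only = [0.0, 0.0, 1.0, 0.0, 0.0]   # theta[3] = theta[4] = 0
with_cubic = [0.0, 0.0, 1.0, 0.1, 0.0]       # small cubic coefficient

cost_quadratic = penalized_cost(quadratic_only, xs, ys)
cost_cubic = penalized_cost(with_cubic, xs, ys)
cost_cubic_unpenalized = penalized_cost(with_cubic, xs, ys, weight=0.0)
print(cost_quadratic, cost_cubic_unpenalized, cost_cubic)
```

On this data, even a cubic coefficient of 0.1 adds 1000 · 0.1² = 10 to the bracketed sum, which is more than the squared error that coefficient causes, so the minimizer has a strong incentive to keep it near zero.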


We could also regularize all of our theta parameters in a single summation as:

$$J(\theta) = \dfrac{1}{2m} \left[ \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$

Here $\lambda$ (lambda) is the regularization parameter. It controls how heavily large parameter values are penalized relative to the fit on the training data. The extra summation on the right is called the regularization term. Note that its index starts at $j = 1$, so the regularization term does not include $\theta_0$: we do not want to penalize $\theta_0$.
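The general regularized cost can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `hypothesis`, `regularized_cost`, and `lam` are chosen for this example, and the hypothesis is assumed to be a polynomial in a single input, as in the earlier example.

```python
def hypothesis(theta, x):
    """Polynomial hypothesis: theta[0] + theta[1]*x + theta[2]*x**2 + ..."""
    return sum(t * x ** j for j, t in enumerate(theta))


def regularized_cost(theta, xs, ys, lam):
    """Squared-error cost plus lam times the sum of squared parameters.

    The penalty runs over theta[1:], so theta[0] is left unpenalized.
    """
    m = len(xs)
    squared_error = sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys))
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
theta = [1.0, 2.0, 0.0, 0.0, 0.0]

print(regularized_cost(theta, xs, ys, lam=0.0))  # fit term only
print(regularized_cost(theta, xs, ys, lam=3.0))  # fit term plus penalty
```

Because `theta` fits this data exactly, the first call returns 0, while the second returns only the penalty contribution, 3 · 2² / (2 · 3) = 2. Changing `theta[0]` would change the fit term but never the penalty.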

If the regularization parameter $\lambda$ is chosen too large, it shrinks all the theta values (except $\theta_0$) to nearly zero and smooths out the hypothesis function too much. The hypothesis then essentially becomes $h_\theta(x) \approx \theta_0$, which causes underfitting (high bias).
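The effect of a very large regularization parameter can be seen in the simplest case, a straight-line hypothesis with one feature. For that case the minimizer of the regularized cost has a closed form, obtained by setting both partial derivatives to zero: the slope is S_xy / (S_xx + lambda), where S_xy and S_xx are the centered sums of products and squares, and the intercept follows from the means. The sketch below assumes that one-feature setting; the function name `fit_regularized_line` and the data are made up for illustration.

```python
def fit_regularized_line(xs, ys, lam):
    """Minimize (1/2m) * [sum of squared errors + lam * theta1**2] exactly.

    Returns (theta0, theta1) for the hypothesis theta0 + theta1 * x.
    Only theta1 is penalized; theta0 is not.
    """
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + lam)      # lam in the denominator shrinks the slope
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.0, -1.0, 1.0, 3.0, 5.0]

for lam in (0.0, 10.0, 1e6):
    theta0, theta1 = fit_regularized_line(xs, ys, lam)
    print(f"lambda={lam:g}: theta0={theta0:.6f}, theta1={theta1:.6f}")
```

With this data, a regularization parameter of 0 recovers the true slope of 2, a value of 10 halves it to 1, and a value of one million drives it to about 0.00002. The intercept stays at the mean of the targets, so the fitted line is practically the constant `theta0` and no longer follows the trend in the data.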