The Problem of Overfitting


When choosing a hypothesis function, it might seem that the more features we add, the better. However, there is also a danger in adding too many features.

This terminology is applied to both linear and logistic regression.


Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

Overfitting, or high variance, at the other extreme, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
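A minimal sketch of both extremes, assuming NumPy and a hypothetical noisy sine dataset: a degree-1 polynomial underfits (high training error), while a degree-9 polynomial through ten points interpolates the noise exactly, so its training error is near zero but it generalizes poorly.

```python
import numpy as np

# Hypothetical data: noisy samples from sin(2*pi*x); seed and noise scale are
# illustrative choices, not from the course.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # noiseless targets for evaluation

def fit_and_score(degree):
    """Fit a polynomial hypothesis of the given degree; return (train, test) MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train_mse_1, test_mse_1 = fit_and_score(1)   # too simple: underfits
train_mse_9, test_mse_9 = fit_and_score(9)   # too flexible: overfits
```

The degree-9 fit drives training error toward zero, yet its test error stays far above its training error; that generalization gap is the signature of high variance.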

Ways to address the issue of overfitting:

  • Reduce the number of features
    • Manually select which features to keep
    • Use a model selection algorithm (later in the course)
  • Regularization
    • Keep all the features, but reduce the magnitude/values of parameters θj
    • Works well when we have a lot of slightly useful features