The Problem of Overfitting
While coming up with a hypothesis function, it might seem that the more features we add, the better. However, there is also a danger in adding too many features.
The terminology below (underfitting and overfitting) applies to both linear and logistic regression.
Underfitting, or high bias, occurs when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.
Overfitting, or high variance, is the other extreme: the hypothesis function fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the underlying trend of the data.
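The two failure modes can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: the quadratic trend, the noise level, the sample sizes, and the polynomial degrees (1, 2, 9) are all made up for the example and do not come from the course. It fits polynomials of increasing degree to the same ten noisy points and compares the error on those points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_trend(x):
    # Assumed ground-truth trend, chosen for this illustration only.
    return x ** 2

# A small noisy training set and a larger held-out set from the same trend.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_trend(x_train) + rng.normal(scale=0.1, size=x_train.shape)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_trend(x_test) + rng.normal(scale=0.1, size=x_test.shape)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with these coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    # Least-squares polynomial fit; a higher degree means more features.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The straight line (degree 1) is too simple for a curved trend, so it does poorly on both sets: that is underfitting. The degree-9 polynomial passes through all ten training points, so its training error is essentially zero, but on new data it will typically do worse than the degree-2 fit: that is overfitting.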
Ways to address the issue of overfitting:
- Reduce the number of features
  - Manually select which features to keep
  - Use a model selection algorithm (later in the course)
- Keep all the features, but reduce the magnitude/values of the parameters
  - Works well when we have a lot of slightly useful features
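As a rough sketch of the last option, keeping every feature while shrinking the parameters, one common approach is to add an L2 penalty on the parameters to the least-squares cost. That choice of penalty is an assumption made here for illustration; the function name `fit_l2`, the penalty weight `lam`, and the synthetic data are likewise hypothetical and not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 examples, 8 features, each contributing a little.
m, n = 30, 8
X = rng.normal(size=(m, n))
true_theta = np.full(n, 0.3)
y = X @ true_theta + rng.normal(scale=0.5, size=m)

def fit_l2(X, y, lam):
    """Least squares with an added penalty lam * ||theta||^2 (no intercept)."""
    n_features = X.shape[1]
    # Closed-form solution of (X^T X + lam * I) theta = X^T y.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

theta_plain = fit_l2(X, y, lam=0.0)    # ordinary least squares
theta_shrunk = fit_l2(X, y, lam=10.0)  # same features, penalized parameters

print("parameter norm without penalty:", np.linalg.norm(theta_plain))
print("parameter norm with penalty:   ", np.linalg.norm(theta_shrunk))
```

All eight features stay in the model; the penalty only pulls their parameters toward zero, so no single feature is discarded. A larger `lam` shrinks the parameters further.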