The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.
Binary classification problem
The value we want to predict can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then may be some features of a piece of email, and may be 1 if it is a piece of spam mail, and 0 otherwise.
Hence, . 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+”.
Given , the corresponding is also called the label for the training example.
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.
If linear regression doesn't work on a classification task as in the above example, applying feature scaling will not make it work.