## Gradient Descent: Feature Scaling

#### Feature Scaling techniques

*Idea:* Make sure features are on a similar scale

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges.

When input values vary over a large range, the resulting cost-function contours become highly skewed, causing gradient descent to oscillate inefficiently toward the optimum along a longer trajectory.

So, we want to get every feature into approximately the same range.

Ideally, $(-1 \leq x_i \leq 1)$ or $(-0.5 \leq x_i \leq 0.5)$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

So, $(-3 \leq x_i \leq 3)$ or $(-\frac{1}{3} \leq x_i \leq \frac{1}{3})$ is also fine.

*Example:*

- If $0 \leq x_1 \leq 3$, then leave $x_1$ unchanged
- If $-2 \leq x_2 \leq 0.5$, then leave $x_2$ unchanged

#### Feature Scaling

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

$x_i := \dfrac{x_i}{s_i}$

Where $s_i$ is the range of values (max - min)

*Example:*

- If $100 \leq x_1 \leq 2000$, then $x_1 := \dfrac{x_1}{1900}$.
- If $0.001 \leq x_2 \leq 0.003$, then $x_2 := \dfrac{x_2}{0.002}$.
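The division by the range can be sketched in NumPy as follows (the sample values are hypothetical, chosen to match the ranges in the examples above):

```python
import numpy as np

# Hypothetical feature columns matching the example ranges above
x1 = np.array([100.0, 500.0, 2000.0])   # range: 2000 - 100 = 1900
x2 = np.array([0.001, 0.002, 0.003])    # range: 0.003 - 0.001 = 0.002

def scale_by_range(x):
    """Divide each value by the feature's range (max - min)."""
    return x / (x.max() - x.min())

# After scaling, each feature spans a range of exactly 1
print(scale_by_range(x1))
print(scale_by_range(x2))
```

Note that this only shrinks the spread of each feature to 1; it does not center the values around zero. That is what mean normalization, below, adds.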

#### Mean Normalization

Mean normalization involves subtracting the average value of an input variable from each of its values, resulting in a new average of zero for that variable.

$x_i := \dfrac{x_i - \mu_i}{s_i}$

Where $\mu_i$ is the *average* of all the values for feature $(i)$ and $s_i$ is the range of values (max - min), or $s_i$ is the standard deviation.

When $s_i$ is the range, $x_i$ turns out to be approximately in the range $(-0.5 \leq x_i \leq 0.5)$

*Example:*
If $100 \leq x_1 \leq 2000$ and $\mu_1 = 1000$, then $x_1 := \dfrac{x_1 - 1000}{1900}$ (here $s_1$ is the range).
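The example above can be sketched in NumPy. The values for $x_1$ are hypothetical, picked so that the mean is exactly 1000 and the range is 1900, as in the example:

```python
import numpy as np

# Hypothetical x1 values: mean is exactly 1000, range is 2000 - 100 = 1900
x1 = np.array([100.0, 900.0, 1000.0, 2000.0])

mu = x1.mean()               # mu_1 = 1000
s = x1.max() - x1.min()      # s_1 = 1900 (using the range as s_i)
x1_norm = (x1 - mu) / s

# Mean is now 0, and the values lie roughly in [-0.5, 0.5]
print(x1_norm)
```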

*NOTE:*
The idea is that the normalized version of the feature matrix $X$ should have, for each feature, a mean of 0 and a standard deviation of 1. This holds when $s_i$ is chosen to be the standard deviation rather than the range.
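A minimal sketch of normalizing a whole feature matrix column by column, using the standard deviation as $s_i$ (the matrix values are hypothetical):

```python
import numpy as np

# Hypothetical feature matrix X: one column per feature
X = np.array([[100.0,  0.001],
              [900.0,  0.002],
              [2000.0, 0.003]])

# Subtract each column's mean and divide by each column's std
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))   # approximately [0, 0]
print(X_norm.std(axis=0))    # [1, 1]
```

In practice the per-feature $\mu_i$ and $s_i$ computed on the training set are saved and reused to normalize any new inputs before prediction.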