Gradient Descent: Feature Scaling


Feature Scaling techniques

Idea: Make sure features are on a similar scale

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges.

When input values vary over a wide range, the contours of the cost function can be highly skewed, so gradient descent oscillates inefficiently on its way down to the optimum and follows a longer trajectory.
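To make the effect concrete, here is a minimal sketch that runs batch gradient descent on the same synthetic linear-regression problem twice, once with raw features and once with mean-normalized features. The data, the helper names (`steps_to_converge`, `mean_normalize`), the tolerance, the step budget and the step-size rule (one over the largest eigenvalue of $X^T X / m$) are all assumptions made for illustration.

```python
import numpy as np

def steps_to_converge(X, y, tol=1e-6, max_steps=20_000):
    """Batch gradient descent on the mean-squared-error cost.

    Returns the number of steps taken until the gradient norm drops
    below tol, or max_steps if that never happens.
    """
    m = X.shape[0]
    # Assumed step size: 1 / (largest eigenvalue of the Hessian X^T X / m),
    # which keeps the iteration stable for this quadratic cost.
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / m).max()
    theta = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = X.T @ (X @ theta - y) / m
        if np.linalg.norm(grad) < tol:
            return step
        theta -= lr * grad
    return max_steps

def mean_normalize(x):
    """Subtract the mean and divide by the range (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())

# Synthetic data (made up for this sketch): one small-range feature
# and one large-range feature.
rng = np.random.default_rng(0)
m = 200
x1 = rng.uniform(0, 1, m)
x2 = rng.uniform(0, 1000, m)
y = 4 + 3 * x1 + 0.002 * x2

ones = np.ones(m)  # intercept column
X_raw = np.column_stack([ones, x1, x2])
X_scaled = np.column_stack([ones, mean_normalize(x1), mean_normalize(x2)])

steps_raw = steps_to_converge(X_raw, y)
steps_scaled = steps_to_converge(X_scaled, y)
print(f"raw features:    {steps_raw} steps")
print(f"scaled features: {steps_scaled} steps")
```

In this setup the raw-feature run should use up its whole step budget without converging, while the scaled run should stop far earlier. The exact counts depend on the assumed data and tolerance.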

So, we want to get every feature into approximately similar range.
Ideally, $-1 \leq x_i \leq 1$ or $-0.5 \leq x_i \leq 0.5$.

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
So, $-3 \leq x_i \leq 3$ or $-\frac{1}{3} \leq x_i \leq \frac{1}{3}$ is also fine.


  • If $0 \leq x_1 \leq 3$, then leave $x_1$ unchanged
  • If $-2 \leq x_2 \leq 0.5$, then leave $x_2$ unchanged

Feature Scaling

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

$$x_i := \dfrac{x_i}{s_i}$$

Where $s_i$ is the range of values (max - min).


  • If $100 \leq x_1 \leq 2000$, then $x_1 := \dfrac{x_1}{1900}$.
  • If $0.001 \leq x_2 \leq 0.003$, then $x_2 := \dfrac{x_2}{0.002}$.
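As a minimal sketch of this rule, the snippet below divides a feature by its range using NumPy. The function name `scale_by_range` and the sample values are made up for illustration. They are chosen so that the minimum is 100 and the maximum is 2000, matching the first bullet above.

```python
import numpy as np

def scale_by_range(x):
    """Divide every value of a feature by its range (max - min)."""
    x = np.asarray(x, dtype=float)
    s = x.max() - x.min()
    if s == 0:
        # A constant feature has zero range and cannot be scaled this way.
        raise ValueError("feature is constant; its range is zero")
    return x / s

# Illustrative values with min 100 and max 2000, so s = 1900.
x1 = np.array([100.0, 500.0, 1200.0, 2000.0])
x1_scaled = scale_by_range(x1)

print(x1_scaled)
print("new range:", x1_scaled.max() - x1_scaled.min())
```

Note that dividing by the range fixes the width of the interval at 1 but does not center it. Here the scaled values start at 100/1900 rather than at 0.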

Mean Normalization

Mean normalization involves subtracting an input variable's average value from each of its values, so that the variable's new average is zero.

$$x_i := \dfrac{x_i - \mu_i}{s_i}$$

Where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is either the range of values (max - min) or the standard deviation.

When $s_i$ is the range, $x_i$ turns out to be approximately in the range $-0.5 \leq x_i \leq 0.5$.

Example: If $100 \leq x_1 \leq 2000$ and $\mu_1 = 1000$, then $x_1 := \dfrac{x_1 - 1000}{1900}$. (Here $s_1$ is the range.)
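The same example can be sketched in NumPy. The function name `mean_normalize` and the sample values are assumptions. The values are picked so that the minimum is 100, the maximum is 2000 and the mean is exactly 1000.

```python
import numpy as np

def mean_normalize(x):
    """Subtract the mean, then divide by the range (max - min)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    s = x.max() - x.min()
    return (x - mu) / s

# Illustrative values: min 100, max 2000, mean 1000.
x1 = np.array([100.0, 500.0, 1000.0, 1400.0, 2000.0])
x1_norm = mean_normalize(x1)

print(x1_norm)
print("mean after normalization:", x1_norm.mean())
```

With these values the smallest entry becomes -900/1900 and the largest becomes 1000/1900, so the result only approximately fits $-0.5 \leq x_1 \leq 0.5$, which is all we need.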

NOTE: The idea is that in the normalized version of the feature matrix $X$, each feature has a mean of 0 and, when $s_i$ is the standard deviation, a standard deviation of 1.
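A minimal sketch of this for a whole feature matrix, using the standard deviation as $s_i$, might look as follows. The function name `standardize` and the sample matrix are made up for illustration, and the sketch assumes no column is constant (a constant column would have zero standard deviation).

```python
import numpy as np

def standardize(X):
    """Normalize each column of X to mean 0 and standard deviation 1.

    Returns the normalized matrix together with the per-feature mean
    and standard deviation that were used.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)     # per-feature mean
    sigma = X.std(axis=0)   # per-feature (population) standard deviation
    return (X - mu) / sigma, mu, sigma

# Illustrative feature matrix: rows are examples, columns are features.
X = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [1534.0, 3.0],
    [852.0, 2.0],
])
X_norm, mu, sigma = standardize(X)

print("column means:", X_norm.mean(axis=0))
print("column stds: ", X_norm.std(axis=0))
```

Returning `mu` and `sigma` is a deliberate choice in this sketch: any new input has to be transformed with the same values before it is fed to a model trained on the normalized features.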