Linear Regression: Multiple Features


Linear regression with multiple variables (features) is also known as "multivariate linear regression"; statistics texts more commonly call it "multiple linear regression".

Hypothesis function for multiple features

The multivariable form of the hypothesis function:

$$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

Defining $x^{(i)}_0 = 1$ for every $i \in \{1, 2, \ldots, m\}$, we can rewrite the above as:

$$h_\theta(x) = \sum_{i=0}^n \theta_i x_i = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

where $\theta, x \in \mathbb{R}^{n+1}$.

The above transformation of $h_\theta(x)$ from $\sum_{i=0}^n \theta_i x_i$ to $\theta^T x$ is an example of vectorization, a technique that speeds up computation by expressing it as vector and matrix operations that optimized numerical linear algebra libraries can execute.
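As a minimal sketch of this idea, the snippet below computes the hypothesis for a single example twice: once as an explicit sum and once as an inner product. It assumes NumPy as the linear algebra library; the function names and the numeric values are made up for illustration.

```python
import numpy as np

def hypothesis_loop(theta, x):
    """Compute h_theta(x) as an explicit sum of theta_i * x_i."""
    total = 0.0
    for theta_i, x_i in zip(theta, x):
        total += theta_i * x_i
    return total

def hypothesis_vectorized(theta, x):
    """Compute h_theta(x) as the inner product theta^T x."""
    return float(np.dot(theta, x))

# Illustrative values with n = 2 features; x[0] = 1 plays the role of x_0.
theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 4.0, 5.0])

h_loop = hypothesis_loop(theta, x)       # 1*1 + 2*4 + 3*5
h_vec = hypothesis_vectorized(theta, x)  # same value via np.dot
```

Both versions return the same number; the vectorized one hands the loop to the library, which typically matters more as $n$ grows.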


  • $m$: Number of training examples
  • $n$: Number of features
  • $x^{(i)}_j$: Value of feature $j$ in the $i$-th training example
  • $x^{(i)}$: Input (features) of the $i$-th training example; this is a vector


$$x^{(i)} = \begin{bmatrix} x^{(i)}_0 \\ x^{(i)}_1 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

Vectorized Implementation: $h_\theta(X) = X\theta$


$$X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix}$$

(an $m \times (n+1)$ matrix whose $i$-th row is $(x^{(i)})^T$)

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

Note that in this vectorized implementation, we compute the hypotheses for all $m$ training examples at once: the product $X\theta$ is a vector in $\mathbb{R}^m$ whose $i$-th entry is $h_\theta(x^{(i)})$.
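The batch computation can be sketched as follows, assuming NumPy and a raw feature array of shape $(m, n)$ to which the $x_0 = 1$ column still has to be added. The helper names `add_intercept_column` and `predict`, and the data, are hypothetical.

```python
import numpy as np

def add_intercept_column(features):
    """Prepend the x_0 = 1 column, turning an (m, n) array into (m, n+1)."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def predict(X, theta):
    """Return h_theta(X) = X theta, one prediction per row of X."""
    return X @ theta

# Made-up data: m = 3 training examples, n = 2 features.
features = np.array([[2.0, 3.0],
                     [4.0, 5.0],
                     [6.0, 7.0]])
theta = np.array([1.0, 0.5, 2.0])

X = add_intercept_column(features)   # shape (3, 3)
predictions = predict(X, theta)      # shape (3,)
```

A single matrix-vector product replaces a loop over the training examples, and each entry of `predictions` corresponds to one row of `X`.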

Cost function for multiple features

Recall that the cost function $J(\theta)$ is defined as:

$$J(\theta) = \dfrac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Vectorized implementation:

$$J(\theta) = \dfrac{1}{2m} (X\theta - y)^T (X\theta - y)$$

where $y \in \mathbb{R}^m$ is the vector of targets $y^{(1)}, \ldots, y^{(m)}$.
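To check that the two forms of the cost agree, here is a small sketch that evaluates both on the same made-up data. It assumes NumPy and a design matrix `X` that already contains the leading column of ones; the function names are illustrative.

```python
import numpy as np

def cost_loop(X, y, theta):
    """J(theta) as an explicit sum of squared errors over the m examples."""
    m = len(y)
    total = 0.0
    for i in range(m):
        error = np.dot(theta, X[i]) - y[i]
        total += error ** 2
    return total / (2 * m)

def cost_vectorized(X, y, theta):
    """J(theta) = (1 / 2m) * (X theta - y)^T (X theta - y)."""
    m = len(y)
    residual = X @ theta - y
    return float(residual @ residual) / (2 * m)

# Made-up data: m = 3 examples, n = 2 features, first column is x_0 = 1.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
y = np.array([7.0, 13.0, 20.0])
theta = np.array([1.0, 0.5, 2.0])

j_loop = cost_loop(X, y, theta)
j_vec = cost_vectorized(X, y, theta)
```

Here the residuals are $(1, 0, -2)$, so both functions evaluate to $5/6$.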