Conceptual similarity between the normal equation used to find the best parameters in linear regression and an orthogonal projection in linear algebra?

Solution 1:

$$ y_i = a + bx_{1,i} + cx_{2,i} + \text{“error”}_i $$

In ordinary least squares one seeks the values of $\widehat a,\widehat{b\,}, \widehat{c\,}$ that, when put in the roles of $a,b,c,$ minimize the sum of squares of the residuals $y_i - \widehat{y\,}_i,$ where the fitted values $\widehat{y\,}_i$ are given by $$\widehat{y\,}_i = \widehat a + \widehat{b\,} x_{1,i} + \widehat{c\,}x_{2,i}. \tag 1$$
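Spelled out, the quantity being minimized over $\widehat a,\widehat{b\,},\widehat{c\,}$ is
$$\sum_{i=1}^n \big(y_i - \widehat{y\,}_i\big)^2 \;=\; \sum_{i=1}^n \big(y_i - \widehat a - \widehat{b\,}x_{1,i} - \widehat{c\,}x_{2,i}\big)^2.$$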

Since the vector $\widehat{\mathbf y} = \big( \widehat{y\,}_1, \ldots, \widehat{y\,}_n \big)^\top$ is thus closer to $\mathbf y = \big(y_1,\ldots,y_n\big)^\top$ than is any other vector whose components can be expressed by $(1),$ $\widehat{\mathbf y}$ is therefore the orthogonal projection of $\mathbf y$ onto the space spanned by $\big(1,\ldots,1\big)^\top$ (which will be multiplied by $\widehat a\,$), $\big(x_{1,1},\ldots,x_{1,n}\big)^\top,$ and $\big(x_{2,1},\ldots,x_{2,n}\big)^\top.$
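Orthogonal projection means precisely that the residual vector $\mathbf y - \widehat{\mathbf y}$ is perpendicular to each of those three spanning vectors. Collecting them as the columns of the matrix $X$ introduced below, this orthogonality condition reads
$$X^\top\big(\mathbf y - \widehat{\mathbf y}\big) = \mathbf 0,$$
which is the normal equation and is where the projection formula in the next paragraph comes from.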

The "design matrix" (something of a misnomer) $X$ is the matrix whose columns are those vectors that span the space onto which $\mathbf y$ gets projected. Thus we have $$ X\mathbf{\widehat a} = \widehat{y\,} = X(X^\top X)^{-1}X^\top \mathbf y. $$ One may be tempted to multiply both sides of $X\mathbf{\widehat a} = X(X^\top X)^{-1}X^\top \mathbf y$ on the left by $X^{-1},$ but since $X$ is a tall skinny matrix (i.e. has many more rows that columns) it doesn't have an inverse of the kind first considered in linear algebra courses. If the columns of $X$ are linearly independent (as is typical in these problems) $X$ does, however, have a left inverse, which is $(X^\top X)^{-1}X^\top.$ Multiply both sides of that equality on the left by that matrix and you get $$ \widehat{\mathbf a} = (X^\top X)^{-1}X^\top\mathbf y. $$