Derivation of the formula for Ordinary Least Squares Linear Regression

Solution 1:

I once wrote it up in detail for my blog, but I never published it because I thought people would not find it interesting. But since you ask, here it is. There are a lot of formulas, but there is nothing difficult in it.


Scientific calculators all have a "linear regression" feature, where you can put in a bunch of data and the calculator will tell you the parameters of the straight line that forms the best fit to the data. For example, suppose you have a bunch of data that looks like this:

[figure: linear regression data]

Linear regression will calculate that the data are approximated by the line $3.06148942993613\cdot x + 6.56481566146906$ better than by any other line.

[figure: the same linear regression data with the plotted linear approximation]

When the calculator does it, you just put in the data values and out pop the parameters of the line, its slope and its $y$-intercept. This can seem rather mysterious. When you see the formulas for the parameters of the line, they look even more mysterious. But it turns out that it is rather easy to figure out.

We want to find the line $y = mx + b$ that fits the data as well as possible. The first thing we need to decide is how to measure the goodness of the fit. If a data point is $(x, y)$ and the line actually goes through $(x, y')$ instead, how do we decide the extent to which that is a problem?

We need to pick some error-measuring function $\varepsilon(y, y')$ that tells us the error we attribute to the line going through $y'$ instead of $y$. Then we can do the calculation to figure out the line that has the minimum total error.

The $\varepsilon$ function we choose must satisfy some criteria. For example, $\varepsilon(y, y)$ should be 0 for all $y$: if the line actually goes exactly through the data point, we should attribute a 0 error to that. And $\varepsilon(y, y')$ should never be negative: an error of 0 is perfect, and you shouldn't be able to do better than perfect.

The $\varepsilon$ function that the pocket calculators all use is $$\varepsilon(y, y') = (y-y')^{2}.$$ This is for a few reasons. One is that it satisfies the important criteria above, and some others. For example, it is symmetric: if the line passes a certain distance above where the point actually is, that has the same error as if it passes the same distance below where the point actually is.

Also, this error function says that missing a data point by three feet isn't three times as bad as missing it by one foot; it's nine times as bad. It's okay to miss the point by a little bit, but missing by a lot is unacceptable. This is usually something like what we want.

But the real reason the calculators all use this error function is that it's really easy to calculate the best-fitting line for this particular definition of "best-fitting".

Let's do the calculation. Suppose that the data are the $N$ points $(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{N}, y_{N})$, and we're going to try to approximate them with the line $y = mx + b$. We want to find the values of $m$ and $b$ that minimize the total error.

The total error is:

$$E(m, b) = \sum{\varepsilon(y_i, mx_i+b)}$$

That is, for each point $(x_{i}, y_{i})$, we take where it really is (that's $y_{i}$) and where the line $y = mx + b$ predicts it should be (that's $mx_{i} + b$) and calculate the error of the prediction. We add up the error for each point that the line was supposed to predict, and that's the total error for the line.

Since $\varepsilon(y, y') = (y-y')^{2}$, the total error is:

$$E(m, b) = \sum{(y_i - mx_i - b)^2}$$

Or, grinding out the algebra:

$$E(m, b) = \sum{(y_i^2 + m^2x_i^2 + b^2 - 2mx_iy_i - 2by_i + 2bmx_i)}$$

We can add up each of the six kinds of components separately and then total them all at the end:

$$E(m, b) = \sum{y_i^2} + \sum{m^2x_i^2} + \sum{b^2} - \sum{2mx_iy_i} - \sum{2by_i} + \sum{2bmx_i}$$

We're going to need to spend a lot of time talking about things like $\sum x_iy_i$, so let's make some abbreviations:

$$\begin{array}{lcl} \mathcal X &= & \sum x_i \cr \mathcal Y &= & \sum y_i \cr \mathcal A &= & \sum x_i^2 \cr \mathcal B &= & \sum x_iy_i \cr \mathcal C &= & \sum y_i^2 \cr \end{array}$$

Also please recall that $N$ is the total number of points.

With these abbreviations, we can write the total error as:

$$E(m, b) = {\cal C} + m^2{\cal A} + b^2N - 2m{\cal B} - 2b{\cal Y} + 2bm{\cal X}$$
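As a quick sanity check on the algebra, here is a small Python sketch with some made-up data; it computes the total error both ways, directly from the definition and from the expanded form in terms of ${\cal X}$, ${\cal Y}$, ${\cal A}$, ${\cal B}$, and ${\cal C}$, and the two values agree:

```python
# A quick sanity check (made-up data): the expanded form of E(m, b) in terms
# of the sums X, Y, A, B, C agrees with the direct sum-of-squares definition.
points = [(1.0, 9.7), (2.0, 12.4), (3.0, 15.9), (4.0, 18.6)]
N = len(points)

X = sum(x for x, _ in points)      # sum of x_i
Y = sum(y for _, y in points)      # sum of y_i
A = sum(x * x for x, _ in points)  # sum of x_i^2
B = sum(x * y for x, y in points)  # sum of x_i * y_i
C = sum(y * y for _, y in points)  # sum of y_i^2

def E_direct(m, b):
    return sum((y - m * x - b) ** 2 for x, y in points)

def E_expanded(m, b):
    return C + m * m * A + b * b * N - 2 * m * B - 2 * b * Y + 2 * b * m * X

m, b = 3.0, 6.5
print(E_direct(m, b), E_expanded(m, b))  # the two values agree (up to rounding)
```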

Now, remember that we would like to find the values of $m$ and $b$ that result in the minimum error. That is, we would like to minimize $E$ with respect to $m$ and $b$. But that is just a straightforward calculus problem. We want to find the value of $m$ that minimizes $E$, so we take the derivative of $E$ with respect to $m$:

$\def\d#1{{\partial E\over \partial #1}}$ $$\d m = 2m{\cal A} - 2{\cal B} + 2b{\cal X}$$

And we know that if there is a minimum value for $E$, it will occur when $\d m = 0$:

$$2m{\cal A} - 2{\cal B} + 2b{\cal X} = 0\qquad{(*)}$$

And similarly, $\d b$ will also have to be 0:

$$\d b = 2bN - 2{\cal Y} + 2m{\cal X} = 0\qquad{(**)}$$

So we have two equations (* and **) in two unknowns ($m$ and $b$) and we can solve them. First we'll solve the second equation for $b$:

$$b = {{\cal Y} - m{\cal X}\over N}$$

Now we'll solve the first equation for $m$:

$$\eqalign{ m{\cal A} - {\cal B} + b{\cal X} & = & 0 \cr m{\cal A} - {\cal B} + {{\cal Y} - m{\cal X}\over N}{\cal X} & = & 0 \cr m{\cal A} - {\cal B} + {{\cal XY} - m{\cal X}^2\over N} & = & 0 \cr mN{\cal A} - N{\cal B} + {\cal XY} - m{\cal X}^2 & = & 0 \cr m(N{\cal A} - {\cal X}^2 ) & = & N{\cal B} - {\cal XY} \cr }$$ Thus:

$$m = { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 }$$

And that's it. We can get $m$ from $N$, $\cal A$, $\cal B$, $\cal X$, and $\cal Y$, and we can get $b$ from $m$, $N$, $\cal X$, and $\cal Y$.
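If you would rather let a computer grind out the algebra, here is a short sketch (assuming SymPy is available) that solves equations (*) and (**) symbolically and recovers the same closed forms:

```python
# A sketch using SymPy (assumed available): solve the two equations (*) and
# (**) symbolically in m and b, treating N and the sums A, B, X, Y as symbols.
import sympy as sp

m, b, N, A, B, X, Y = sp.symbols('m b N A B X Y')

eq1 = sp.Eq(2*m*A - 2*B + 2*b*X, 0)   # (*):  dE/dm = 0
eq2 = sp.Eq(2*b*N - 2*Y + 2*m*X, 0)   # (**): dE/db = 0

sol = sp.solve([eq1, eq2], [m, b])
print(sp.simplify(sol[m]))  # equivalent to (N*B - X*Y) / (N*A - X**2)
print(sp.simplify(sol[b]))  # equivalent to (A*Y - B*X) / (N*A - X**2)
```

The expression it produces for $b$, namely $({\cal A}{\cal Y} - {\cal B}{\cal X}) / (N{\cal A} - {\cal X}^2)$, is just what you get by substituting the formula for $m$ into $b = ({\cal Y} - m{\cal X})/N$.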

$\def\dd#1{{\partial^2 E\over \partial {#1}^2}}$ (Actually, we also need to check the second derivatives to make sure we haven't found the values of $m$ and $b$ that give the maximum error. But it should be clear from the geometry of the thing that it is impossible that there could be a very-worst line: No matter how badly the data are approximated by any given line, you could always find another line that was worse, just by taking the bad line and moving it another few miles away from the data. But we should take the second derivatives anyway, if for no other reason than to rule out the possibility of a saddle point in the $E$ function. The second derivatives are $\dd m = 2\cal A$ and $\dd b = 2N$, which are both positive, and the mixed partial is ${\partial^2 E\over\partial m\,\partial b} = 2\cal X$, so the determinant of the Hessian is $4(N{\cal A} - {\cal X}^2)$, which is positive whenever the $x_i$ are not all equal (by the Cauchy–Schwarz inequality). So we really have found a minimum for the error function.)

Now let's do some checking, and make sure it works and makes sense. First, it should fail if $N=1$, because there are plenty of lines that all go through any single point. And indeed, when there's only one point, say $(x, y)$, then ${\cal X}^{2} = x^{2}$, ${\cal A} = x^{2}$, the denominator of the expression for $m$ vanishes and the whole thing blows up.

It should also fail if all the $x$ values are equal, because then the line through the data is vertical, and has an infinite slope. And sure enough, if all the $x$ values are equal, then ${\cal X} = Nx$, ${\cal X}^{2} = N^{2}x^{2}$, ${\cal A} = Nx^{2}$, and the denominator of the expression for $m$ vanishes.

Now what if all the points happen to lie on a (non-vertical) straight line? Then there are some numbers $p$ and $q$ so that $y_{i} = px_{i} + q$ for all $i$. We hope that the equations for $m$ and $b$ will give $m = p$ and $b = q$. If they don't, we're in trouble, since the equations are supposed to find the best line through the points, and the points actually lie on the line $y = px + q$; this line will therefore have a total error of 0, and that's the best we can do. Well, let's see how it works out.

In such a case, we have $ {\cal Y} = \sum {y_i} = \sum {(px_i+q)} = p{\cal X} + qN$. Similarly, we also have:

$$\eqalign{ {\cal B} & = & \sum{x_iy_i} \cr & = & \sum{x_i(px_i+q)} \cr & = & \sum{px_i^2} + \sum{qx_i} \cr & = & p{\cal A} + q{\cal X} \cr }$$

Plugging these into the formula for $m$ above we get:

$$\eqalign{ m & = & { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 } \cr & = & { N(p{\cal A} + q{\cal X}) - {\cal X}(p{\cal X} + qN) \over N{\cal A} - {\cal X}^2 } \cr & = & { pN{\cal A} + qN{\cal X} - p{\cal X}^2 - qN{\cal X} \over N{\cal A} - {\cal X}^2 } \cr & = & { pN{\cal A} - p{\cal X}^2 \over N{\cal A} - {\cal X}^2 } \cr & = & { p(N{\cal A} - {\cal X}^2) \over N{\cal A} - {\cal X}^2 } \cr & = & p \cr }$$

as hoped, and then plugging into the formula for $b$, we get $b = q$, also as hoped.
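Here is a quick numerical version of that check (a sketch with a made-up line $y = \frac52 x - 1$); using exact rational arithmetic makes the equalities exact rather than approximate:

```python
# Check the collinear case numerically: points lying exactly on y = p*x + q
# should give back m = p and b = q.  Fractions keep the arithmetic exact.
from fractions import Fraction

p, q = Fraction(5, 2), Fraction(-1)            # made-up line y = (5/2)x - 1
points = [(Fraction(x), p * x + q) for x in range(1, 6)]

N = len(points)
X = sum(x for x, _ in points)
Y = sum(y for _, y in points)
A = sum(x * x for x, _ in points)
B = sum(x * y for x, y in points)

m = (N * B - X * Y) / (N * A - X * X)
b = (Y - m * X) / N
print(m == p, b == q)                          # True True
```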

It's handy to be able to fit a line to any data set. A while back I wrote a trivial little program to read data points from the standard input, do the least-squares analysis, and print out $m$ and $b$. Little utilities like this often come in handy.
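Here is a minimal sketch of such a utility in Python (not the original program, which isn't shown here): it reads whitespace-separated $x$ $y$ pairs from standard input, accumulates the sums defined above, and prints $m$ and $b$.

```python
#!/usr/bin/env python3
# Minimal least-squares utility: read "x y" pairs from standard input,
# accumulate the sums N, X, Y, A, B, and print the slope m and intercept b.
import sys

N = X = Y = A = B = 0.0
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue                     # skip blank or malformed lines
    x, y = float(fields[0]), float(fields[1])
    N += 1
    X += x                           # sum of x_i
    Y += y                           # sum of y_i
    A += x * x                       # sum of x_i^2
    B += x * y                       # sum of x_i * y_i

denom = N * A - X * X
if denom == 0:
    sys.exit("need at least two data points with distinct x values")

m = (N * B - X * Y) / denom
b = (Y - m * X) / N
print(f"m = {m}")
print(f"b = {b}")
```

If you save it as, say, `fit.py` (the name is just for illustration), you can run `python3 fit.py < data.txt` on a file with one $x$ $y$ pair per line.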

Solution 2:

Suppose you are given a set of data points $\{(x_i,y_i)\}$. The goal of linear regression is to find the line that minimizes the sum of the squared errors at the $x_i$. Let the equation of the desired line be $y=a+bx$.

To minimize: $E=\sum_i (y_i-a-b x_i)^2$

Differentiate $E$ with respect to $a$ and $b$, set both derivatives equal to zero, and solve for $a$ and $b$.
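For completeness, carrying out that differentiation gives the two normal equations

$$\sum_i (y_i - a - bx_i) = 0, \qquad \sum_i x_i\,(y_i - a - bx_i) = 0,$$

whose solution is

$$b = { N\sum_i x_iy_i - \sum_i x_i \sum_i y_i \over N\sum_i x_i^2 - \left(\sum_i x_i\right)^2 }, \qquad a = {\sum_i y_i - b\sum_i x_i \over N},$$

where $N$ is the number of data points. These are the same formulas as in Solution 1, with $a$ and $b$ here playing the roles of $b$ and $m$ there.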

Solution 3:

Since the question has a historical focus, that treatise of Jeffrey M. Stanton on Francis Galton (and Pearson for the extensions) (at the ams) might be what is really wanted. I am surprised that Wikipedia misses the historical reference completely.

(Quoted from the article at the AMS:)

In 1875, Galton had distributed packets of sweet pea seeds to seven friends; each friend received seeds of uniform weight (also see Galton 1894), but there was substantial variation across different packets. Galton's friends harvested seeds from the new generations of plants and returned them to him (see Appendix A). Galton plotted the weights of the daughter seeds against the weights of the mother seeds. Galton realized that the median weights of daughter seeds from a particular size of mother seed approximately described a straight line with positive slope less than 1.0:

"Thus he naturally reached a straight regression line, and the
 constant variability for all arrays of one character for a given
 character of a second. It was, perhaps, best for the progress of the
 correlational calculus that this simple special case should be
 promulgated first; it is so easily grasped by the beginner." (Pearson
 1930, p. 5)