Why do we use a Least Squares fit?

Solution 1:

Carl Friedrich Gauss (the most famous person of the 19th century, at least in the eyes of those who worked in the physical and mathematical sciences) showed that least-squares estimates coincide with maximum-likelihood estimates when one assumes independent, normally distributed errors with mean $0$ and equal variances.
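
To spell out the connection (a sketch, with $f_i(\theta)$ used here as generic shorthand for the model's prediction in the $i$th case, not notation from the question): if the observations are $Y_i = f_i(\theta) + \varepsilon_i$ with the $\varepsilon_i$ independent and normal with mean $0$ and common variance $\sigma^2$, then the log-likelihood is

$$\log L(\theta) = -\frac n2 \log(2\pi\sigma^2) - \frac 1{2\sigma^2}\sum_{i=1}^n \big(Y_i - f_i(\theta)\big)^2,$$

and for fixed $\sigma^2$ the only term that depends on $\theta$ is the sum of squares, so maximizing the likelihood over $\theta$ is exactly minimizing the sum of squared errors.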

POSTSCRIPT four years later:

Here are a couple of other points about raising errors to the power $2$ instead of $1.95$ or $2.05$ or whatever.

  • The variance is the mean squared deviation from the average. The variance of the sum of ten-thousand independent random variables is the sum of their variances. That doesn't work for other powers of the absolute value of the deviation. That means that if you roll a die $6000$ times, so that the expected number of $1$s you get is $1000$, then you also know that the variance of the number of $1$s is $6000\times\frac 1 6\times\frac 5 6$, so if you want the probability that the number of $1$s is between $990$ and $1020$, you can approximate the distribution by the normal distribution with the same mean and the same variance (see the first sketch after this list). You couldn't do that if you didn't know the variance, you couldn't know the variance without additivity of variances, and if the exponent is anything besides $2$, then you don't have that. (Oddly, you do have additivity with the $3$rd powers of the deviations, but not with the $3$rd powers of the absolute values of the deviations.)

  • Suppose the errors are not necessarily independent but are uncorrelated, and are not necessarily identically distributed but have identical variances and expected value $0$. You have $Y_i = \alpha + \beta x_i + \text{error}_i$. The $Y$s and $x$s are observed; the $x$s are treated as non-random (hence the lower-case letter); the coefficients $\alpha$ and $\beta$ are to be estimated. The least-squares estimate of $\beta$ is $$\widehat\beta = \frac{\sum_i (x_i-\bar x)(Y_i-\bar Y)}{\sum_i(x_i-\bar x)^2} \tag 1$$ where $\bar x$ and $\bar Y$ are the respective averages of the observed $x$ and $Y$ values. Notice that $(1)$ is linear in the vector of observed $Y$ values. Then among all unbiased estimators of $\beta$ that are linear in the vector of $Y$ values, the one with the smallest variance is the least-squares estimator. And similarly for $\widehat\alpha$. That is the Gauss–Markov theorem. (A numerical check of $(1)$ appears after this list.)
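
To see the first point in action, here is a minimal numerical sketch (assuming Python with `scipy` available; the numbers are the ones from the die example above) comparing the exact binomial probability with the normal approximation that additivity of variances justifies:

```python
# Number of 1s in 6000 die rolls is Binomial(6000, 1/6): mean 1000, variance 6000*(1/6)*(5/6).
from scipy.stats import binom, norm

n, p = 6000, 1/6
mean, var = n * p, n * p * (1 - p)   # 1000 and about 833.3

# P(990 <= number of 1s <= 1020): exact, and via the normal approximation
# (with a continuity correction).
exact  = binom.cdf(1020, n, p) - binom.cdf(989, n, p)
approx = norm.cdf(1020.5, loc=mean, scale=var**0.5) - norm.cdf(989.5, loc=mean, scale=var**0.5)
print(exact, approx)   # the two numbers agree closely
```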

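And to make formula $(1)$ concrete, here is a small sketch (Python with `numpy`; the data are invented purely for illustration, and the companion formula $\widehat\alpha = \bar Y - \widehat\beta\,\bar x$ is the standard one, not stated above) computing $\widehat\beta$ directly and checking the slope against `numpy.polyfit`:

```python
# Least-squares slope from formula (1), checked against numpy's built-in degree-1 fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly Y = 2x, with noise

beta_hat  = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean())**2)
alpha_hat = Y.mean() - beta_hat * x.mean()

slope, intercept = np.polyfit(x, Y, 1)    # least-squares line for comparison
print(beta_hat, slope)                    # the two slopes agree
```
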
Solution 2:

One very practical reason for using squared errors is that we are going to want to minimize the error, and minimizing a quadratic function is easy: you just differentiate it and set the derivative to zero, which results in a linear equation, and we have centuries of tricks to help us solve linear equations.

I'll walk through a simple example: finding the best line through the origin that fits the data points $(y_i,x_i)$ for $i=1,\dots,n$. Our model for the data is

$$y_i = ax_i + \varepsilon_i$$

where $\varepsilon_i$ is the error of the approximation for the $i$th data point. Let's raise each error to the power $2$ and then add them all up:

$$E = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - ax_i)^2$$

Minimizing this error with respect to $a$, we differentiate and set the derivative equal to zero:

$$\frac{\partial E}{\partial a} = -2\sum_{i=1}^n x_i (y_i - ax_i) = 0$$

which rearranges to

$$\sum_{i=1}^n x_iy_i = a\sum_{i=1}^n x_i^2$$

so we get the standard least-squares estimator of the slope,

$$a = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}$$
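
For concreteness, here is a short sketch (Python with `numpy`; the data are made up for illustration) that computes this estimator from the closed form and checks it against `numpy.linalg.lstsq` applied to the one-column system $ax_i \approx y_i$:

```python
# Slope of the best line through the origin: closed form vs. numpy's least-squares solver.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # roughly y = 2x, with noise

a_closed_form = np.sum(x * y) / np.sum(x**2)

# The same problem posed as a one-column linear least-squares system.
a_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print(a_closed_form, a_lstsq[0])          # identical up to rounding
```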

If we had raised the absolute errors to any power other than $2$ before summing them, the resulting equation would be much harder to solve. Minimizing anything other than the squared error generally requires a numerical method.
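
To illustrate that last point, here is a sketch (assuming `scipy` is available, and reusing the illustrative data from the snippet above) that minimizes $\sum_i |y_i - ax_i|^p$ for $p = 1.5$; with no closed form available, we simply hand the problem to a one-dimensional numerical optimizer:

```python
# For exponents other than 2 there is no closed-form slope, so we minimize numerically.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
p = 1.5                                    # any power other than 2

loss = lambda a: np.sum(np.abs(y - a * x) ** p)
result = minimize_scalar(loss)             # one-dimensional numerical search
print(result.x)                            # typically close to, but not equal to, the p = 2 slope
```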

Solution 3:

16 months after asking the question, I've run into a different and quite physicsy answer, which I hope is useful to someone.

Suppose that in order to determine $m$ parameters of some model:

$$\text{Output} = f(\text{Inputs}, \text{parameters})$$ we have conducted $N>m$ experiments. We want to use the information from these experiments to choose the $m$ parameters as well as possible (so that the outputs of the model and the actual experimental values are as close together as possible).

Now comes the physicsy part: let's construct an $N$-dimensional phase space so that the $N$ (experimentally determined) outputs from our $N$ experiments are represented by a single point in this space (the Cartesian coordinates of this point are the outputs from each experiment). Call this the 'data-point'.

Secondly, if we choose an arbitrary set of parameters for our model, we can use the inputs for each experiment to construct a 'predicted output' for each experiment (by passing the inputs through our model). There will be $N$ predicted outputs (one for each experiment) and these form a second point in the phase space, say the 'prediction-point'. As we vary the parameters this point moves about in an $m$-dimensional subspace of the phase space. And this is the important point:

The sum of the squares of the error terms (SSE) is the square of the distance between these two points in the $N$-dimensional phase space, just by Pythagoras' Theorem.

So minimising the sum-of-squares error is equivalent to minimising the distance between the data-point and the prediction-point in the $N$-dimensional phase space - a very natural way of calibrating our model.
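
This identity is easy to check numerically; here is a tiny sketch (Python with `numpy`, and invented numbers) showing that the sum of squared errors is exactly the squared Euclidean distance between the two points:

```python
# SSE equals the squared distance between the data-point and the prediction-point.
import numpy as np

data_point       = np.array([1.2, 0.7, 3.4, 2.1])   # the N experimentally determined outputs
prediction_point = np.array([1.0, 0.9, 3.0, 2.5])   # the N model predictions for some parameter choice

sse      = np.sum((data_point - prediction_point)**2)
distance = np.linalg.norm(data_point - prediction_point)
print(sse, distance**2)                              # the same number, by Pythagoras
```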

Finally, from this, Gauss's result makes some sense: if the errors are independent and normally distributed with mean $0$ and equal variances, the data-point is scattered spherically symmetrically about the point the true parameters would predict, so the likelihood is a decreasing function of the distance between the prediction-point and the data-point, and minimising this distance gives the maximum-likelihood estimator.