Why get the sum of squares instead of the sum of absolute values?
I'm self-studying machine learning and getting into the basics of linear regression models. From what I understand so far, a good regression model minimizes the sum of the squared differences between predicted values $h(x)$ and actual values $y$.
Something like the following:
$$\sum_{i=1}^m (h(x_i)-y_i)^2$$
Why do we square the differences? It seems that squaring them gives a positive contribution even when the predicted value is less than the actual value. But why can't this just be accounted for by taking the sum of the absolute values?
Like so:
$$\sum_{i=1}^m |h(x_i)-y_i|$$
Solution 1:
Actually, there are some good reasons that have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications for it. For example, if you assume you are performing this regression on variables with normally distributed error (a reasonable assumption in many cases), then least squares is the maximum likelihood estimator. It has several other important properties as well (the Gauss–Markov theorem, for example).
You can read some more here.
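To make the maximum-likelihood connection concrete, here is a short sketch of the standard derivation, assuming the errors are i.i.d. Gaussian, i.e. $y_i = h(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$. The likelihood of the data is
$$\mathcal{L} = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-h(x_i))^2}{2\sigma^2}\right),$$
so the log-likelihood is
$$\log\mathcal{L} = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^m (y_i-h(x_i))^2.$$
Maximizing this over the parameters of $h$ is exactly the same as minimizing $\sum_{i=1}^m (y_i-h(x_i))^2$. (If you instead assumed Laplace-distributed errors, the same argument would lead to the sum of absolute values.)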
Solution 2:
If $h(x)$ is linear with respect to the parameters, setting the derivatives of the sum of squares to zero leads to simple, explicit and direct solutions (immediate if you use matrix calculations).
This is not the case for the second objective function in your post: the problem becomes nonlinear with respect to the parameters and is much more difficult to solve. It is doable, though (I would generate the starting guesses from the first objective function).
For illustration purposes, I generated a $10\times 10$ table for $$y=a+b\log(x_1)+c\sqrt{x_2}$$ ($x_1=1,2,\cdots,10$, $x_2=1,2,\cdots,10$) and perturbed the values of $y$ with a random relative error between $-5\%$ and $+5\%$. The values used were $a=12.34$, $b=4.56$ and $c=7.89$.
Using the first objective function, the solution is immediate and leads to $a=12.180$, $b=4.738$, $c=7.956$.
Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver $\Large 20$ iterations to get $a=11.832$, $b=4.968$, $c=8.046$. And all these painful iterations reduced the objective function from $95.60$ down to $94.07$!
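For anyone who wants to try this themselves, here is a minimal sketch of a similar experiment in Python, assuming NumPy and SciPy are available. The random perturbation is a fresh draw, so the fitted numbers will not match the ones above exactly, and the Nelder–Mead choice for the nonlinear fit is just one reasonable option.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# 10x10 grid and the model y = a + b*log(x1) + c*sqrt(x2)
x1, x2 = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
x1, x2 = x1.ravel().astype(float), x2.ravel().astype(float)

a_true, b_true, c_true = 12.34, 4.56, 7.89
y = a_true + b_true * np.log(x1) + c_true * np.sqrt(x2)
y = y * (1 + rng.uniform(-0.05, 0.05, size=y.size))  # +/- 5% relative error

# The model is linear in (a, b, c), so build the design matrix
X = np.column_stack([np.ones_like(y), np.log(x1), np.sqrt(x2)])

# Objective 1 (sum of squares): explicit solution from the normal equations
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Objective 2 (sum of absolute values): no closed form, so iterate,
# starting (as suggested above) from the least-squares estimates
beta_l1 = minimize(lambda b: np.abs(y - X @ b).sum(),
                   beta_ls, method="Nelder-Mead").x

print("least squares        :", beta_ls)
print("least absolute values:", beta_l1)
```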
There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
Added later
A very small problem that you could (should, if I may) work through by hand: consider four data points $(1,4)$, $(2,11)$, $(3,14)$, $(4,21)$, take the model to be simply $y=a x$, and search for the best value of $a$ which minimizes either $$\Phi_1(a)=\sum_{i=1}^4 (y_i-a x_i)^2$$ or $$\Phi_2(a)=\sum_{i=1}^4 |y_i-a x_i|$$ Plot the values of $\Phi_1(a)$ and $\Phi_2(a)$ as a function of $a$ for $4 \leq a \leq 6$. For $\Phi_1(a)$ you will get a nice parabola (the minimum of which is easy to find), but for $\Phi_2(a)$ the plot shows a series of line segments with discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
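If you prefer to plot rather than compute by hand, here is a minimal sketch of that exercise, assuming NumPy and Matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([4, 11, 14, 21])

a = np.linspace(4, 6, 400)
residuals = y[None, :] - a[:, None] * x[None, :]  # one row of residuals per value of a

phi1 = (residuals ** 2).sum(axis=1)   # smooth parabola, minimum at sum(x*y)/sum(x*x)
phi2 = np.abs(residuals).sum(axis=1)  # piecewise linear, kinks where a = y_i / x_i

plt.plot(a, phi1, label=r"$\Phi_1(a)$")
plt.plot(a, phi2, label=r"$\Phi_2(a)$")
plt.xlabel("a")
plt.legend()
plt.show()
```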
Solution 3:
I study regression, and I used to wonder about this very question myself.
Now I've come to the conclusion that it's because of the geometry and linear algebra behind regression. Suppose we collect data on $n$ observations and run a regression. When we minimize the sum of squared residuals, the way we do this (using Ordinary Least Squares) is via projection matrices. We project the vector of the explained variable (the "$y$" values) onto the hyperplane spanned by the explanatory variables (the "regressors" or "$x$" variables). By using projections, we are able to find the "closest" vector in that hyperplane (call it $\mathbf{x}\hat{\mathbf{\beta}}$), making the "error" vector of residuals $\hat{\mathbf{u}}$ as small as possible.
This is the key: when we choose $\hat{\mathbf{\beta}}$ to make the vector of residuals as "small" as possible, we are minimizing its Euclidean length: \begin{align} \min \|\mathbf{y}-\mathbf{x\beta}\| &= \min \|\hat{\mathbf{u}}\| \\ &= \min \sqrt{\hat{\mathbf{u}}^\top\hat{\mathbf{u}}} \\ &= \min \sqrt{\hat{u}_1^2+\hat{u}_2^2+\cdots+\hat{u}_n^2} \end{align} Since the square root is monotone, minimizing this length is the same as minimizing its square, the sum of squared residuals. And this is where the sum of SQUARES comes in. It's actually a geometric result.
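To make that concrete, here is a minimal numerical sketch of the projection picture, assuming NumPy; the data are made up purely for illustration, and $P = \mathbf{x}(\mathbf{x}^\top\mathbf{x})^{-1}\mathbf{x}^\top$ is the standard projection ("hat") matrix onto the column space of the regressors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data, just to illustrate the geometry
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant + 2 regressors
y = rng.normal(size=n)

# Projection ("hat") matrix onto the column space of X
P = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = P @ y        # the closest point to y inside the column space of X
u_hat = y - y_hat    # the residual vector

# The residuals are orthogonal to every column of X (zeros up to rounding),
# which is what makes ||u_hat|| the shortest possible Euclidean length
print(X.T @ u_hat)
print(np.linalg.norm(u_hat))
```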