Motivation behind standard deviation?

Let's take the numbers 0-10. Their mean is 5, and the individual deviations from 5 are
$$-5,\ -4,\ -3,\ -2,\ -1,\ 0,\ 1,\ 2,\ 3,\ 4,\ 5,$$
so the average magnitude of the deviations from the mean is $30/11 \approx 2.73$.

However, this is not the standard deviation. The standard deviation is $\sqrt{10} \approx 3.16$.
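
For concreteness, here is a small Python sketch reproducing both numbers (assuming the population versions of the two measures, i.e. dividing by $n = 11$):

```python
import math

xs = list(range(11))                 # the numbers 0 through 10
mean = sum(xs) / len(xs)             # 5.0

# Mean absolute deviation: the average distance from the mean.
mad = sum(abs(x - mean) for x in xs) / len(xs)                # 30/11 ≈ 2.73

# Population standard deviation: root of the average squared deviation.
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))    # sqrt(10) ≈ 3.16

print(mad, sd)
```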

The mean absolute deviation is a simpler and far more intuitive measure of spread, so I'm sure it was the first definition statisticians worked with. However, for some reason they decided to adopt the second definition instead. What is the reasoning behind that decision?


Your guess is correct: least absolute deviations was the method tried first historically. The first to use it were astronomers attempting to combine observations subject to error. Boscovich published this method, together with a geometric solution, in 1755. Laplace used it later in a 1789 work on geodesy, formulating the problem more mathematically and describing an analytical solution.

Legendre appears to be the first to use least squares, doing so as early as 1798 for work in celestial mechanics. However, he supplied no probabilistic justification. A decade later, Gauss (in an 1809 treatise on celestial motion and conic sections) asserted axiomatically that the arithmetic mean was the best way to combine observations, invoked the maximum likelihood principle, and then showed that a probability distribution for which the likelihood is maximized at the mean must be proportional to $\exp(-x^2 / (2 \sigma^2))$ (now called a "Gaussian") where $\sigma$ quantifies the precision of the observations.

The likelihood (when the observations are statistically independent) is the product of these Gaussian terms which, due to the presence of the exponential, is most easily maximized by minimizing the negative of its logarithm. Up to an additive constant, the negative log of the product is the sum of the squared deviations (all divided by the constant $2 \sigma^2$, which does not affect the minimization). Thus, even historically, the method of least squares is intimately tied up with likelihood calculations and averaging. There are plenty of other modern justifications for least squares, of course, but this derivation by Gauss is memorable, not least for the almost magical appearance of the Gaussian, which had first appeared some 70 years earlier in De Moivre's work on sums of Bernoulli variables (the Central Limit Theorem).
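
To spell that step out in symbols (just the routine algebra): for independent observations $x_1, \ldots, x_n$ with common location $\mu$,
$$ -\log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = \text{const} + \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2, $$
so maximizing the likelihood over $\mu$ is the same as minimizing the sum of squared deviations, whose minimizer is the arithmetic mean $\bar{x}$.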

This story was researched, and is ably recounted, by Stephen Stigler in his *The History of Statistics: The Measurement of Uncertainty before 1900* (1986). Here I have merely given the highlights of parts of chapters 1 and 4.


Squaring is nicer than taking the absolute value: for example, it is smooth, and it leads to a definition of variance with nice mathematical properties, such as additivity. But for me the theorem that really justifies using the standard deviation over the mean absolute deviation is the central limit theorem. It is at work whenever we assume a distribution is normal (e.g. heights in a population), measure its mean and standard deviation, and use those two numbers to make predictions about the entire distribution, since a normal distribution is completely specified by its mean and standard deviation.
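
As an illustration of that last point (a minimal sketch with made-up numbers, not data from the answer above): once the normal model is accepted, the fitted mean and standard deviation alone determine every tail probability.

```python
from statistics import NormalDist

# Hypothetical height model: mean 170 cm, standard deviation 10 cm
# (illustrative numbers only).
heights = NormalDist(mu=170.0, sigma=10.0)

# Predicted share of the population taller than 190 cm,
# computed from nothing but the two fitted parameters.
print(1 - heights.cdf(190.0))   # ≈ 0.023
```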


$\newcommand{\var}{\operatorname{var}}$ Variances are additive: for independent random variables $X_1,\ldots,X_n$, $$ \var(X_1+\cdots+X_n)=\var(X_1)+\cdots+\var(X_n). $$

Notice what this makes possible: Say I toss a fair coin 900 times. What's the probability that the number of heads I get is between 440 and 455 inclusive? Just find the expected number of heads ($450$) and the variance of the number of heads ($225=15^2$), then find the probability that a normal (or Gaussian) random variable with expectation $450$ and standard deviation $15$ lies between $439.5$ and $455.5$. Abraham de Moivre did this with coin tosses in the 18th century, thereby first showing that the bell-shaped curve is worth something.
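
A quick numerical check of that recipe (a Python sketch using only the standard library; the exact binomial sum is included just for comparison):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 900, 0.5

# Exact binomial probability of between 440 and 455 heads inclusive.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(440, 456))

# Normal approximation with the continuity correction used above:
# expectation n*p = 450, standard deviation sqrt(n*p*(1-p)) = 15.
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = normal.cdf(455.5) - normal.cdf(439.5)

print(exact, approx)   # both come out near 0.40
```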


Variance is a natural measure of variability that comes up frequently in probability. Standard deviation, the square root of the variance, gives you a measure of dispersion that is on the same scale as the original data, which some may find more interpretable.