What is the purpose of subtracting the mean from data when standardizing?

What is the purpose of subtracting the mean from data when standardizing? And what is the purpose of dividing by the standard deviation?


Think of temperature measurements. The numerical value of the mean temperature depends on whether we use the Fahrenheit scale, the Celsius scale, or some other; it is subject to our arbitrary choice of the zero mark on the scale. By subtracting the mean, we remove the influence of that choice. But the choice of unit is still visible in the data, because the notion of "a $1$ degree change of temperature" differs between scales. Division by $\sigma$ removes the units: we get a unitless quantity (the "$z$-score") which is independent of the temperature scale used. (Well, as long as the scale is linear and warmer means higher temperature.) Now it makes sense to compare our data to some standard distribution such as $f(x)=\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$ (which is a unitless quantity).
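To make this concrete, here is a minimal sketch (Python with NumPy; the measurement values are hypothetical) showing that the same data standardized on the Celsius and on the Fahrenheit scale gives identical z-scores:

```python
import numpy as np

celsius = np.array([18.0, 21.0, 19.5, 25.0, 16.5])   # hypothetical measurements
fahrenheit = celsius * 9 / 5 + 32                     # same data, different scale

def zscore(x):
    # Subtract the mean (removes the arbitrary zero point), then divide by the
    # standard deviation (removes the unit).
    return (x - x.mean()) / x.std()

print(zscore(celsius))
print(zscore(fahrenheit))   # identical, up to floating-point rounding
```

Any linear rescaling $x \mapsto ax + b$ with $a > 0$ cancels out in $(x - \bar x)/\sigma$, which is exactly why the two printed arrays agree.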

Shorter version: the purpose of subtracting the mean from data when standardizing is to standardize.

Also, what copper.hat said in comments.


Another reason is numerical accuracy. When computing the variance, much accuracy can be lost if the mean is large compared to the spread of the data.

For example, the formula for the variance is $\dfrac{1}{n} \sum_{i=1}^n (x_i-\bar x)^2 $ (you can write $\dfrac1{n-1}$ instead of $\dfrac1{n}$ if it makes you feel better). If the $x_i$ are all close to each other, this will be quite small even if their mean is large.

If you write this in the mathematically equivalent form $\left(\dfrac{1}{n} \sum_{i=1}^n x_i^2\right) -\left(\dfrac{1}{n} \sum_{i=1}^n x_i \right)^2 $, you will be subtracting two large quantities to get a small quantity. This is the standard recipe for catastrophic cancellation and loss of accuracy.
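Here is a minimal sketch of that cancellation in practice (Python with NumPy; single precision and hypothetical values are used only to make the loss of accuracy visible at a modest mean):

```python
import numpy as np

x = np.float32(1e6) + np.arange(10, dtype=np.float32)  # mean ~1e6, spread ~10

# Stable form: subtract the mean first, then square.
mean = x.mean()
var_shifted = ((x - mean) ** 2).mean()

# "Mathematically equivalent" form: difference of two large quantities.
var_naive = (x ** 2).mean() - mean ** 2

print(var_shifted)  # ~8.25, the correct variance of 0..9
print(var_naive)    # wildly off (it can even come out negative) due to cancellation
```

The squares are on the order of $10^{12}$, so in single precision each is rounded far more coarsely than the true variance of $8.25$, and the subtraction wipes out what little signal remains.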

By the way, if you do a Google search for "online mean and variance", you get a number of useful links including this one from Wikipedia: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance.
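One of the methods described on that page is Welford's one-pass ("online") algorithm, which avoids the cancellation above while reading the data only once. A minimal Python sketch might look like this:

```python
def online_mean_variance(data):
    # Welford's algorithm: update the running mean and the sum of squared
    # deviations from the current mean as each value arrives.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # note: uses the *updated* mean
    variance = m2 / n if n > 0 else float("nan")   # use n - 1 for the sample variance
    return mean, variance

print(online_mean_variance([1e6 + i for i in range(10)]))  # (1000004.5, 8.25)
```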