Intuition about the Central Limit Theorem

Solution 1:

I don't think you should expect any short, snappy answers because I think this is a very deep question. Here is a guess at a conceptual explanation, which I can't quite flesh out.

Our starting point is something called the principle of maximum entropy, which says that in any situation where you're trying to assign a probability distribution to some events, you should choose the distribution with maximum entropy which is consistent with your knowledge. For example, if you don't know anything and there are $n$ events, then the maximum entropy distribution is the uniform one where each event occurs with probability $\frac{1}{n}$. There are lots more examples in this expository paper by Keith Conrad.

Now take a bunch of independent identically distributed random variables $X_i$ with mean $\mu$ and variance $\sigma^2$. You know exactly what the mean of $\frac{X_1 + ... + X_n}{n}$ is; it's $\mu$ by linearity of expectation. Variance is also additive, at least over independent variables (this is a probabilistic form of the Pythagorean theorem), hence

$$\text{Var}(X_1 + ... + X_n) = \text{Var}(X_1) + ... + \text{Var}(X_n) = n \sigma^2$$

but since variance scales quadratically under scalar multiplication, $\text{Var}(cX) = c^2 \, \text{Var}(X)$, the variance of $\frac{X_1 + ... + X_n}{n}$ is actually $\frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$; in other words, it goes to zero! This is a simple way to convince yourself of the (weak) law of large numbers.

So we can convince ourselves that (under the assumptions of finite mean and variance) the average of a bunch of iid random variables tends to its mean. If we want to study how it tends to its mean, we need to instead consider $\frac{(X_1 - \mu) + ... + (X_n - \mu)}{\sqrt{n}}$, which has mean $0$ and variance $\sigma^2$.
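A minimal numerical sketch of these two facts (my addition, not part of the original answer): the variance of the sample mean shrinks like $\sigma^2/n$, while the variance of the centered sum scaled by $\sqrt{n}$ stays near $\sigma^2$. The Exponential(1) distribution is used purely as an example ($\mu = 1$, $\sigma^2 = 1$); any distribution with finite variance would do.

```python
# Illustrative sketch: Var(sample mean) ~ sigma^2/n, while the sqrt(n)-scaled,
# centered sum keeps variance ~ sigma^2. Exponential(1) is the example choice.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 1.0          # mean and variance of Exponential(1)
trials = 100_000               # independent replications for each n

for n in (10, 100, 1000):
    samples = rng.exponential(scale=1.0, size=(trials, n))
    means = samples.mean(axis=1)                       # (X_1 + ... + X_n) / n
    scaled = (samples - mu).sum(axis=1) / np.sqrt(n)   # centered sum / sqrt(n)
    print(f"n={n:5d}  Var(mean)={means.var():.5f}  (sigma^2/n={sigma2/n:.5f})"
          f"  Var(scaled sum)={scaled.var():.5f}")
```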

Suppose we suspected, for one reason or another, that this tended to some fixed limiting distribution in terms of $\sigma^2$ alone. We might be led to this conclusion by seeing this behavior for several particular distributions, for example. Given that, it follows that we don't know anything about this limiting distribution except its mean and variance. So we should choose the distribution of maximum entropy with a fixed mean and variance. And this is precisely the corresponding normal distribution! Intuitively, each iid random variable is like a particle moving randomly, and adding up the contributions of all of the random particles adds "heat," or "entropy," to your system. (I think this is why the normal distribution shows up in the description of the heat kernel, but don't quote me on this.) In information-theoretic terms, the more iid random variables you sum, the less information you have about the result.
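As a small check on the maximum-entropy claim (my addition, using the standard closed-form differential entropies, not anything from the original answer), one can compare the normal against two other common distributions constrained to the same variance; the normal comes out on top.

```python
# Illustrative check: among these distributions with the same variance sigma^2,
# the normal has the largest differential entropy.
# Standard closed forms:
#   normal(sigma^2):                         0.5 * ln(2*pi*e*sigma^2)
#   uniform, variance sigma^2 (width s*√12): ln(sigma*sqrt(12))
#   Laplace, variance sigma^2 (scale s/√2):  1 + ln(2*sigma/sqrt(2))
import numpy as np

sigma = 1.5  # arbitrary choice
entropies = {
    "normal":  0.5 * np.log(2 * np.pi * np.e * sigma**2),
    "uniform": np.log(sigma * np.sqrt(12.0)),
    "laplace": 1.0 + np.log(2.0 * sigma / np.sqrt(2.0)),
}
for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} differential entropy = {h:.4f} nats")
```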

Solution 2:

There's an almost-formal argument using cumulants. Given a random variable $X$, define its moment generating function $$M(X) = E[e^{tX}].$$ It's called the moment generating function because, expanding the Taylor series of the exponential, we get $$M(X) = 1 + E[X]t + \frac{1}{2}E[X^2]t^2 + \cdots.$$ The moment generating function is useful because of how it behaves on sums of independent random variables (whose distributions convolve): $$M(X+Y) = E[e^{t(X+Y)}] = E[e^{tX}e^{tY}] = E[e^{tX}]E[e^{tY}] = M(X)M(Y).$$ One proof of the CLT takes the route of the mgf, but we would like to replace the multiplication with addition, since we only really know how to handle sums. So we define the cumulant generating function $$K(X) = \log M(X).$$

We can calculate the first few coefficients (which are called cumulants) by substituting into the (formal) power series of $\log(1+x) = x - x^2/2 + \cdots$: $$K(X) = \log \left(1+E[X]t+E[X^2]t^2/2 + \cdots\right) = E[X]t + E[X^2]t^2/2 - \left(E[X]^2t^2 + E[X]E[X^2]t^3 + E[X^2]^2t^4/4\right)/2 + \cdots = E[X]t + V[X]t^2/2 + \cdots.$$ Also, if $X$ and $Y$ are independent then $$K(X+Y) = K(X)+K(Y).$$

Now suppose $X_1,\ldots,X_n$ are iid variables distributed like $X$ with zero expectation. Then $$K(X_1+\cdots+X_n) = nK(X) = \frac{1}{2}nV[X]t^2 + \frac{1}{6}nK_3(X)t^3 + \cdots,$$ where the $K_m(X)$ are just the coefficients of the cgf normalized by $1/m!$, i.e. the cumulants. If we scale this sum down by $\sqrt{n}$, then the second cumulant becomes $V[X]$ (i.e. the variance is unchanged), but the cumulants $K_m$ for $m \geq 3$ get multiplied by $n^{1-m/2} \rightarrow 0$, so in the limit they disappear and $$K\left(\frac{X_1+\cdots+X_n}{\sqrt{n}}\right) \rightarrow \frac{1}{2}V[X]t^2.$$

Therefore all such distributions share a single 'domain of attraction': the limit must be the distribution whose cgf is $\frac{1}{2}V[X]t^2$, namely the normal distribution with zero mean and variance $V[X]$, which can be checked directly from this representation. The same idea can be used to analyze the case where the variables are independent but not identically distributed. The main step missing to make this proof formal is passing from convergence of the cgf to convergence of the distributions; this is Lévy's continuity theorem, which shows that the 'inverse Fourier transform' is continuous.
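The shrinking of the higher cumulants is easy to see numerically. Below is an illustrative sketch (my addition): sample skewness and excess kurtosis are the standardized third and fourth cumulants, and for the scaled sum they decay like $2/\sqrt{n}$ and $6/n$ when $X$ is Exponential(1), centered (whose cumulants are $K_m = (m-1)!$); the distribution is my example choice, not part of the argument above.

```python
# Illustrative sketch: skewness and excess kurtosis of (X_1+...+X_n)/sqrt(n)
# shrink like n^(-1/2) and n^(-1), matching the n^(1-m/2) factor in the answer.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
trials = 100_000

for n in (1, 4, 16, 64):
    x = rng.exponential(size=(trials, n)) - 1.0   # centered iid Exponential(1)
    s = x.sum(axis=1) / np.sqrt(n)                # scaled sum
    print(f"n={n:3d}  skewness={skew(s):+.3f}  (2/sqrt(n)={2/np.sqrt(n):.3f})"
          f"  excess kurtosis={kurtosis(s):+.3f}  (6/n={6/n:.3f})")
```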

Had we taken the route of mgf's, we would have had to use the identity $(1+x/n)^n \rightarrow e^x$ somewhere, but otherwise the argument would be much the same.
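For completeness, a quick numerical illustration of that limit (my addition; the value $x = 0.7$ is arbitrary):

```python
# Numerical illustration of (1 + x/n)^n -> e^x as n grows.
import numpy as np

x = 0.7
for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  (1 + x/n)^n = {(1 + x/n)**n:.6f}   e^x = {np.exp(x):.6f}")
```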

Solution 3:

Working out a few simple examples might help. This would show you that the theorem works in special cases, which goes a long way towards convincing yourself of its validity. The central limit theorem first appeared in the work of Abraham de Moivre, who used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. Later, Laplace showed the same for the general binomial distribution, again approximating it with the normal distribution. I suggest that you work out these two simpler cases to get a feeling for how the approximation happens; a numerical version of the de Moivre case is sketched below. All the necessary background for doing this yourself is available in the book of Hoel, Port and Stone.
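Here is that numerical sketch (my addition, not from the original answer): the de Moivre–Laplace approximation compares the Binomial$(n, \tfrac12)$ pmf with the normal density of matching mean $n/2$ and variance $n/4$.

```python
# Illustrative sketch of the de Moivre-Laplace approximation: for a fair coin,
# the Binomial(n, 1/2) pmf at k is close to the normal density with the same
# mean and variance.
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.5
mean, sd = n * p, np.sqrt(n * p * (1 - p))

for k in (40, 45, 50, 55, 60):
    exact = binom.pmf(k, n, p)
    approx = norm.pdf(k, loc=mean, scale=sd)
    print(f"k={k:3d}  binomial pmf={exact:.6f}  normal density={approx:.6f}")
```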

If you find the theorem hard to understand, it might make you feel better to hear that probabilists needed a long time to properly formulate and understand it. This was only accomplished in the early twentieth century, by Lyapunov.

If you are oriented towards practical applications, then studying some applications of your choice, for instance noise analysis in communication theory, might help convince you of the truth of the central limit theorem.

The best way to understand the central limit theorem would of course be to take a course in probability theory. An introductory course usually ends with a proof of this theorem. If you take a course, you will also see other interesting theorems, such as the weak and strong laws of large numbers, and this will put the central limit theorem in better perspective. Even after all this, you might still need to contemplate a bit to really absorb the theorem. The proof I have seen uses characteristic functions, which are a sort of Fourier transform. I have to regretfully confess that I didn't fully understand it when I took the course. I never had to take up probability theory later; but if the occasion arises, I intend to go through the proof properly and understand the machinery.

Solution 4:

This answer gives an outline of how to use the Fourier Transform to prove that the $n$-fold convolution of any probability distribution with finite variance, centered at its mean and contracted by a factor of $\sqrt{n}$, converges weakly to the normal distribution.

However, in his answer, Qiaochu Yuan mentions that one can use the Principle of Maximum Entropy to get a normal distribution. Below, I have endeavored to do just that using the Calculus of Variations.


Applying the Principle of Maximum Entropy

Suppose we want to maximize the entropy $$ -\int_{\mathbb{R}}\log(f(x))f(x)\,\mathrm{d}x\tag1 $$ over all densities $f$ whose mean is $0$ and variance is $\sigma^2$; that is, $$ \int_{\mathbb{R}}\left(1,x,x^2\right)f(x)\,\mathrm{d}x=\left(1,0,\sigma^2\right)\tag2 $$ We therefore want the variation of $(1)$ to vanish, $$ \int_{\mathbb{R}}(1+\log(f(x)))\,\delta f(x)\,\mathrm{d}x=0\tag3 $$ for all variations of $f$, $\delta f(x)$, under which the variation of $(2)$ vanishes, $$ \int_{\mathbb{R}}\left(1,x,x^2\right)\delta f(x)\,\mathrm{d}x=(0,0,0)\tag4 $$ Since $(3)$ must hold for every $\delta f$ satisfying $(4)$, orthogonality requires $1+\log(f(x))$ to lie in the span of $1$, $x$, and $x^2$; that is, $$ \log(f(x))=c_0+c_1x+c_2x^2\tag5 $$ To satisfy $(2)$, we need $c_0=-\frac12\log\left(2\pi\sigma^2\right)$, $c_1=0$, and $c_2=-\frac1{2\sigma^2}$. That is, $$ \bbox[5px,border:2px solid #C0A000]{f(x)=\frac1{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}}\tag6 $$
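A quick numerical check of $(6)$ against the constraints in $(2)$ (my addition; $\sigma = 2$ is an arbitrary choice):

```python
# Verify numerically that the boxed density in (6) has total mass 1, mean 0,
# and variance sigma^2, i.e. satisfies the constraints in (2).
import numpy as np
from scipy.integrate import quad

sigma = 2.0  # arbitrary choice

def f(x):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

mass, _ = quad(f, -np.inf, np.inf)
mean, _ = quad(lambda x: x * f(x), -np.inf, np.inf)
var, _  = quad(lambda x: x**2 * f(x), -np.inf, np.inf)
print(f"integral of f      = {mass:.6f}  (should be 1)")
print(f"integral of x f    = {mean:.6f}  (should be 0)")
print(f"integral of x^2 f  = {var:.6f}  (should be sigma^2 = {sigma**2})")
```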