How was the normal distribution derived?
Abraham de Moivre, when he came up with this formula, had to ensure that the points of inflection were exactly one standard deviation away from the center (so that it was bell-shaped), and that the area under the curve was exactly equal to one.
And somehow he came up with the standard normal distribution, which is as follows:
$$\displaystyle\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-\dfrac{1}{2}x^2}$$
And even cooler, he found the distribution for when the mean was not $0$ and the standard deviation was not $1$, and came up with:
$$\displaystyle f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\dfrac{(x - \mu)^2}{2\sigma^2}}$$
And so what I ask is: how? How did anyone come up with an equation that fits all the aforementioned criteria? Moreover, how do the numbers $\pi$ and $e$ come into this?
Suppose I throw a dart into a dartboard. I aim at the centre of the board $(0,0)$ but I'm not all that good with darts so the dart lands in a random position $(X,Y)$ which has a joint density function $f:\mathbb R^2\to\mathbb R^+$.
Let's make two assumptions about the way I play darts.
1.$\qquad$ The density is rotationally invariant so the distribution of where my dart lands only depends on the distance of the dart to the centre.
2.$\qquad$ The random variables $X$ and $Y$ are independent: how much I miss left and right makes no difference to the distribution of how much I miss up and down.
So by assumption one and Pythagoras I must be able to express the density $$f(x,y) = g(x^2 + y^2).$$
Now as the random variables $X$ and $Y$ are independent and identically distributed I must be able to express $$f(x,y) \propto f(x,0) f(0,y)$$ Combining these assumptions we get that for every pair $(x,y)$ we have $$g(x^2 + y^2) \propto g(x^2)g(y^2).$$
This means that $g$ must be an exponential function $$g(t) = A e^{-Bt}$$
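To see why, here is a quick sketch (assuming $g$ is continuous and strictly positive): the proportionality says $g(s+t) = c\,g(s)\,g(t)$ for all $s,t \ge 0$ and some constant $c>0$. Setting $h(t) = c\,g(t)$ gives
$$ h(s+t) = c\,g(s+t) = c^2\,g(s)\,g(t) = h(s)\,h(t) \quad\Longrightarrow\quad \ln h(s+t) = \ln h(s) + \ln h(t), $$
which is Cauchy's functional equation; its continuous solutions are linear, say $\ln h(t) = -Bt$, so $g(t) = \tfrac{1}{c}\,e^{-Bt}$ is an exponential (with $A = 1/c$).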
So $A$ will be some normalising constant, and $B$ reflects the units I'm measuring in: since $B$ multiplies the squared distance, if I measure the distance in cm then $B$ will be $100$ times as big as if I measured in mm. $B$ must be positive, because the density should be a decreasing function of distance (I'm not that bad at darts).
So to work out $A$ I need to integrate $f(\cdot,\cdot)$ over $\mathbb R^2$. A quick change to polar coordinates (followed by the substitution $u = r^2$) gives $$\iint_{\mathbb R^2} f(x,y)\, dx\, dy = 2\pi\int_0^\infty r\, g(r^2)\, dr = \pi\int_0^\infty g(u)\, du = \frac{\pi A}{B}. $$
So we should set $A = \frac{B}{\pi}$. It's convenient to choose $B$ in terms of the standard deviation, so we set $B = \frac 1{2\sigma^2}$ and $A = \frac{1}{2\pi\sigma^2}$.
So if I set $\tilde f(x) = \frac 1{\sqrt{2\pi}\sigma} e^{-\frac{x^2}{2\sigma^2}}$ then $f(x,y) = \tilde f(x) \tilde f(y)$.
So the $e$ comes from the fact I wanted my $X$ and $Y$ coordinates to be independent and the $\pi$ comes from the fact that I wanted rotational invariance so I'm integrating over a circle.
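As a quick numerical sanity check of the normalisation above (not part of the argument, and assuming `numpy`/`scipy` are available; the value of $\sigma$ below is an arbitrary choice), one can integrate $f(x,y) = \frac{B}{\pi} e^{-B(x^2+y^2)}$ over the plane and compare it pointwise with $\tilde f(x)\tilde f(y)$:

```python
# Sanity check (sketch): with A = B/pi and B = 1/(2*sigma^2), the joint density
# integrates to 1 and factorises as f(x, y) = f_tilde(x) * f_tilde(y).
import numpy as np
from scipy.integrate import dblquad

sigma = 1.3                      # arbitrary choice of units
B = 1.0 / (2.0 * sigma**2)
A = B / np.pi

f = lambda x, y: A * np.exp(-B * (x**2 + y**2))
f_tilde = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

total, _ = dblquad(f, -15, 15, -15, 15)   # wide box, captures essentially all the mass
print(total)                              # ~1.0
print(f(0.7, -1.2), f_tilde(0.7) * f_tilde(-1.2))   # the two numbers agree
```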
The interesting thing happens if I throw two darts. Suppose I throw my first dart aiming at $(0,0)$ which lands at $(X_1,Y_1)$, I aim my next dart at the first dart, so this one lands at $(X_2,Y_2)$ with $X_2 = X_1 + X$ and $Y_2 = Y_1 + Y$.
So the position of the second dart is the sum of the two errors. But my sum is still rotationally invariant and the variables $X_2$ and $Y_2$ are still independent, so $(X_2,Y_2)$ satisfies my two assumptions.
That means that when I add independent normal distributions together I get another normal distribution.
It's this property that makes the normal distribution so useful: if I take the average of a very long sequence of random variables I should get something with the same shape no matter how long the sequence is, because taking a sequence twice as long is like adding two independent copies of the original sequence together.
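A small simulation (a sketch, assuming `numpy` and `scipy` are installed) illustrates this stability under addition: the sum of two independent $N(0,\sigma^2)$ samples behaves like a single $N(0,2\sigma^2)$ sample.

```python
# Sketch: the sum of two independent normal samples again looks normal,
# with variance 2*sigma^2 (illustrating, not proving, the stability property).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 1.0
x1 = rng.normal(0.0, sigma, 100_000)
x2 = rng.normal(0.0, sigma, 100_000)
s = x1 + x2

print(s.std())   # close to sigma * sqrt(2) ~ 1.414
# Empirical CDF of the sum vs. the N(0, 2*sigma^2) CDF at a test point:
print(np.mean(s <= 1.0), stats.norm.cdf(1.0, loc=0.0, scale=sigma * np.sqrt(2)))  # both ~0.76
```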
The Normal distribution came about from approximations of the binomial distribution (de Moivre), from the theory of errors in linear regression (Gauss), and from the central limit theorem. The derivation given by Tim relates most closely to the linear-regression derivation, where the error is represented by a Normal distribution once the errors are assumed to be symmetric about a mean and to decrease away from the mean. I used Tim's answer and made it a little more formal.
Theorem: Let $X$ and $Y$ be independent, identically distributed random variables whose probability density functions (PDFs) are continuous and differentiable, symmetric about a mean, and decreasing towards zero away from the mean, and whose joint density depends only on the distance from the mean. Then $X$ and $Y$ follow the distribution called the normal distribution.
Proof: Let $X$ and $Y$ be identically distributed independent random variables with continuous and differentiable PDFs. It is assumed that the PDFs are even functions, i.e. $f_X(x) = f_X(-x)$, and that $f_X(x) \rightarrow 0 \text{ as } x\rightarrow \pm \infty$.
Their joint PDF, because of their independence, is $f_{XY}(x,y) = f_X(x)f_Y(y)$. As in the dart argument, we also use the assumption that the joint density depends on $x$ and $y$ only through the distance $\sqrt{x^2+y^2}$ from the origin (rotational invariance): interchanging or rotating $x$ and $y$ has no effect on the final probability. Therefore, there must exist a function $g(r)$ such that $$ f_{XY}(x,y) = g(\sqrt{x^2 + y^2}), $$ which, because $g$ is not yet determined, is equivalent to writing $$ f_{XY}(x,y) = g(x^2 + y^2). $$
From the definition of the joint distribution, $$ f_X(x)f_Y(y) = g(x^2 + y^2), $$ which, for $y=0$, gives $$ f_X(x) \propto g(x^2), $$ since $f_Y(0)$ is a constant. A similar argument gives $$ f_Y(y) \propto g(y^2). $$ These last two results are significant, because substitution shows that the product $g(x^2)g(y^2)$ is proportional to $g(x^2 + y^2)$: $$ g(x^2)g(y^2) \propto g(x^2 + y^2), $$ and the only continuous functions that turn sums of arguments into products of values in this way are exponential functions.
This is to say, $g(r)$ will be some type of exponential, $$ g(r) = Ae^{Br}, $$ where $A$ and $B$ are constants yet to be determined. We assume, now, that the probability of an error decreases as we move away from the expected value: the density should be largest near the expected value and decrease to zero away from it. Another way of saying this is that the mean must be the maximum of the density, and yet another way of saying this is that $g(r)$ must approach $0$ as $r\rightarrow \infty$. In any case, we need the argument of the exponential to be negative: $$ g(r) = Ae^{-Br}, \qquad B > 0. $$
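As a quick check that an exponential of this form really has the required product property,
$$ g(x^2)\,g(y^2) = A^2 e^{-B(x^2+y^2)} = A\,g(x^2+y^2) \;\propto\; g(x^2 + y^2). $$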
Now if we return to our joint PDF, $$ f_{XY}(x,y) = f_X(x)f_Y(y) = g(x^2 + y^2), $$ here again we can investigate the PDF $f_X(x)$ alone by setting $y=0$: $$ f_X(x) \propto g(x^2) = A e^{-Bx^2}. $$ Note that the mean of this distribution is $0$. In order to give a mean of $\mu$, this distribution becomes $$ f_X(x) \propto A e^{-B(x-\mu)^2} .$$
The constants are determined from the fact that the integral of the PDF $f_{X}(x)$ over its entire domain must be equal to 1; that is, the cumulative distribution function (CDF) must approach 1 at the upper limit (probability cannot be >100%). $$ \int_{-\infty}^{\infty} f_{X}(x)\, dx = 1 $$ The integral of $e^{-Bx^2}$ over the real line is $\sqrt{\frac{\pi}{B}}$, thus $$ A \int_{-\infty}^\infty e^{-Bx^2}\, dx = A \sqrt{\frac{\pi}{B}} = 1$$ $$ A = \sqrt{\frac{B}{\pi}} $$ The constant $B$ is, for convenience, substituted by $\sigma^2 = \frac{1}{2B}$, so that $A = \frac{1}{\sqrt{2\pi\sigma^2}}$ and $$ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} ,$$ which is, of course, the common form of what is known as the Normal distribution. Note that the proportionality symbol became an equals sign, which is justified because $f_X$ is a PDF, and every PDF integrates to $1$. This proves the theorem.
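The value of that Gaussian integral itself comes from the classic polar-coordinate trick (a standard computation, sketched here as a reminder):
$$ \left(\int_{-\infty}^{\infty} e^{-Bx^2}\,dx\right)^{2} = \iint_{\mathbb R^2} e^{-B(x^2+y^2)}\,dx\,dy = \int_0^{2\pi}\!\!\int_0^{\infty} e^{-Br^2}\,r\,dr\,d\theta = 2\pi\cdot\frac{1}{2B} = \frac{\pi}{B}, $$
so that $\int_{-\infty}^{\infty} e^{-Bx^2}\,dx = \sqrt{\pi/B}$.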
One will find that $\sigma^2$ is called the variance, and $\sigma$ the standard deviation. The parameters $\sigma^2$ and $\mu$, that is, the variance and the mean, may be chosen arbitrarily, and they uniquely define the distribution.
It's a beautiful thing: without knowing much about the random variables, other than that they are independent, come from the same symmetric distribution, and have a rotationally invariant joint density, we can logically determine the distribution they come from.
To write this answer, I followed the references supplied in comments of the question, the description given by Tim, and various online resources such as Wikipedia. Therefore, the proof may not be rigorous, and may contain errors. The errors are certainly mine alone, due to misinterpretations and/or miscalculations.
Let me present a derivation which does not use the rotational-symmetry ("circular") assumption of the dart proof and relies only on the central limit theorem, i.e. on the fact that for any suitable set of i.i.d. variables $\{ X_1,X_2,\ldots,X_n \}$ with $\mathrm{E}\left[X\right]=\mu$ and $\mathrm{Var}\left[X\right]=\sigma^2$, we have
$$Z_n := \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \overset{n\rightarrow\infty}{\longrightarrow} Z \sim N(0,1)$$
where $N(0,1)$ is the so-called normal distribution (of the variable $Z$). The central limit theorem gives us universality: no matter which set we choose, we will always get $Z\sim N(0,1)$, the same distribution. From this universality alone we will find the formula for the density function $f(z)$ of the standard normally distributed $Z$. Any other normal variable $X \sim N(\mu,\sigma)$ is obtained from $Z$ by shifting and scaling, and its density function must be $f\left((x-\mu)/\sigma\right)/\sigma$.
1) Stirling approximation
Without proof (can be found elsewhere) we state that for large $n$ we have
$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$$
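As a rough numerical illustration (assuming only the Python standard library), the approximation is already accurate to a fraction of a percent for moderate $n$:

```python
# Compare n! with Stirling's approximation for a moderate n (illustrative sketch).
import math

n = 30
exact = math.factorial(n)
stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
print(stirling / exact)   # ~0.9972, i.e. within about 0.3% of the exact value
```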
2) Poisson distribution
The key to deriving the normal density function is to choose some particular set of i.i.d. variables for which we can find the distribution of the sum (the average $\bar{X}$ is that sum divided by $n$), i.e. to choose $X_1$. There are multiple options for this choice; in many derivations of the normal distribution it is common to choose $X_1 \sim \mathrm{Ber}(p)$ Bernoulli, so that the sum $S_n = X_1 + X_2 + \ldots + X_n \sim \mathrm{Bin}(n,p)$ is Binomial. In our approach, we set $X_1 \sim \mathrm{Po}(1)$. A random variable $X$ with a Poisson distribution $\mathrm{Po}(\lambda)$ has the well-known discrete distribution
$$\mathbb{P}\left[X = k\right]=\frac{\lambda^k}{k!}e^{-\lambda}, \qquad k=0,1,2,\ldots$$
Moreover $\mathrm{E}\left[X\right] = \mathrm{Var}\left[X\right] = \lambda$. For $X_1 \sim \mathrm{Po}(1)$ we have $\mu = \lambda = 1$ and $\sigma = \sqrt{\lambda} = 1$.
The Poisson distribution is the distribution of the number $X$ of uniformly distributed events occurring in a given finite time interval, with a given average number of events $\lambda = \mathrm{E}\left[X\right]$.
We have chosen the Poisson distribution because of one special property: additivity. One can think of the sum of two identical (same $\lambda$) independent Poissonian variables $X_1 + X_2$ as the number of the same events in an interval with twice the duration of the previous one. Since
$$\mathrm{E}\left[X_1+X_2\right]=\mathrm{E}\left[X_1\right]+\mathrm{E}\left[X_2\right]=1 + 1 = 2$$
we immediately get (using the "events" description of the Poisson distribution) that also
$$S_n = X_1 + X_2 + \ldots + X_n \sim \mathrm{Po}(n)$$
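For readers who prefer a direct computation to the "events" picture, the additivity can also be verified by convolution; for two independent $\mathrm{Po}(1)$ variables,
$$ \mathbb{P}\left[X_1+X_2=k\right] = \sum_{j=0}^{k}\frac{e^{-1}}{j!}\cdot\frac{e^{-1}}{(k-j)!} = \frac{e^{-2}}{k!}\sum_{j=0}^{k}\binom{k}{j} = \frac{2^k}{k!}e^{-2}, $$
which is exactly the $\mathrm{Po}(2)$ mass function, and the same argument extends by induction to $S_n \sim \mathrm{Po}(n)$.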
This distribution is discrete, again with step equal to $1$. However, on the left-hand side of the CLT we get
$$Z_n = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} = \frac{S_n - n}{\sqrt{n}}$$
Clearly, $S_n$ is shifted by $n$ and then divided by $\sqrt{n}$. This division causes $\mathrm{Po}(n)$ to become continuous in the limit $n\rightarrow\infty$, since the step is now $1/\sqrt{n}$ and this eventually covers all the real numbers.
Consider now the limit of large $n$. Let $z$ be a real number. For natural $n$ and $k$ we can write $z=(k-n)/\sqrt{n}$, in the sense that for large $n$ there is a lattice point of this form arbitrarily close to the original real $z$. Then, using the definition of a (continuous) density function,
$$f(z)\approx\frac{1}{1/\sqrt{n}}\left(\mathbb{P}\left[Z_n < z + \frac{0.5}{\sqrt{n}}\right]-\mathbb{P}\left[Z_n < z - \frac{0.5}{\sqrt{n}}\right]\right)=\sqrt{n}\,\mathbb{P}\left[Z_n = z\right]$$
since there is only one atom of $Z_n$ in the interval $(z-0.5/\sqrt{n},\,z+0.5/\sqrt{n})$, namely $z$ itself. But since
$$\mathbb{P}\left[Z_n = z\right] = \mathbb{P}\left[\frac{S_n - n}{\sqrt{n}} = \frac{k - n}{\sqrt{n}}\right]=\mathbb{P}\left[S_n = k\right]=\frac{n^k}{k!}e^{-n}$$
In the limit $n\rightarrow\infty$ with $k = n + z\sqrt{n}$ we then have explicitly
$$f(z) = \lim_{n\rightarrow \infty} \sqrt{n}\frac{n^{n + z\sqrt{n}}}{(n + z\sqrt{n})!}e^{-n}$$
Using Stirling's approximation
$$f(z) = \lim_{n\rightarrow \infty} \sqrt{n}\frac{e^{n + z\sqrt{n}}}{\sqrt{2\pi (n + z\sqrt{n})}}\left(\frac{n + z\sqrt{n}}{n}\right)^{-n - z\sqrt{n}}e^{-n} = \frac{1}{\sqrt{2\pi}}\lim_{n\rightarrow \infty} e^{z\sqrt{n}}\left(1+\frac{z}{\sqrt{n}}\right)^{-n - z\sqrt{n}}$$
3) Exponential function limit
This can be further simplified using the well known
$$e^x = \lim_{n\rightarrow \infty}\left(1+\frac{x}{n}\right)^n$$
i.e. for example
$$\lim_{n\rightarrow \infty} \left(1+\frac{z}{\sqrt{n}}\right)^{- z\sqrt{n}} = e^{-z^2}$$
therefore
$$f(z) = \frac{e^{-z^2}}{\sqrt{2\pi}}\lim_{n\rightarrow \infty} e^{z\sqrt{n}}\left(1+\frac{z}{\sqrt{n}}\right)^{-n}$$
The last limit is performed by Taylor expansion of logarithm since
$$\lim_{n\rightarrow \infty} e^{z\sqrt{n}}\left(1+\frac{z}{\sqrt{n}}\right)^{-n} = \exp\left(\lim_{n\rightarrow \infty} z\sqrt{n}-n\ln\left(1+\frac{z}{\sqrt{n}}\right)\right) = \exp\left(\lim_{n\rightarrow \infty} z\sqrt{n}-z\sqrt{n}+\frac{z^2}{2}\right)$$
immediately recovering
$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$$
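One can also illustrate this limit numerically (a sketch, assuming `numpy` and `scipy` are available): $\sqrt{n}\,\mathbb{P}\left[S_n = n + z\sqrt{n}\right]$ approaches $e^{-z^2/2}/\sqrt{2\pi}$ as $n$ grows.

```python
# Numerical illustration: sqrt(n) * P[Po(n) = n + z*sqrt(n)] tends to the
# standard normal density f(z) = exp(-z^2/2)/sqrt(2*pi).
import numpy as np
from scipy.stats import norm, poisson

z = 1.0
print("target:", norm.pdf(z))             # 0.24197...
for n in (100, 10_000, 1_000_000):
    k = int(round(n + z * np.sqrt(n)))    # nearest lattice point to n + z*sqrt(n)
    print(n, np.sqrt(n) * poisson.pmf(k, n))
```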
The probability mass function of the binomial distribution is given by $$ P(X=x)=\binom{n}{x}p^xq^{n-x}=\frac{n!}{(n-x)!\,x!}\,p^xq^{n-x}, $$ where $q = 1-p$.
$$ \dfrac{P(X=x+1)}{P(X=x)}=\frac{x!\,(n-x)!}{(x+1)!\,(n-x-1)!}\cdot\frac{p}{q}=\frac{(n-x)}{(x+1)}\cdot\frac{p}{q} $$ Set $t=\dfrac{x-np}{\sqrt{npq}}\implies x=np+t\sqrt{npq}$, and write $\delta_n=\dfrac{1}{\sqrt{npq}}$, so that $\delta_n\to0$ as $n\to\infty$. $$ \dfrac{P(X=np+t\sqrt{npq}+1)}{P(X=np+t\sqrt{npq})}=\frac{(n-np-t\sqrt{npq})}{(np+t\sqrt{npq}+1)}\cdot\frac{p}{q} $$ $$ LHS=\frac{P\Big(\frac{X-np}{\sqrt{npq}}=t+\frac{1}{\sqrt{npq}}\Big)}{P\Big(\frac{X-np}{\sqrt{npq}}=t\Big)}=\frac{P(T=t+\delta_n)}{P(T=t)} $$ $$ RHS=\frac{(n-np-t\sqrt{npq})}{(np+t\sqrt{npq}+1)}\cdot\frac{p}{q}=\frac{q-t\sqrt{\frac{pq}{n}}}{p+t\sqrt{\frac{pq}{n}}+\frac{1}{n}}\cdot\frac{p}{q}=\frac{1-t\sqrt{\frac{p}{nq}}}{1+t\sqrt{\frac{q}{np}}+\frac{1}{np}}=\frac{1-tp\delta_n}{1+tq\delta_n+q\delta_n^2} \\\implies \frac{P(T=t+\delta_n)}{P(T=t)}=\frac{1-tp\delta_n}{1+tq\delta_n+q\delta_n^2} $$ Assume the existence of a sufficiently smooth probability density function $f(t)$ such that, for large $n$, the probability $P(T=t)$ can be approximated by the differential $f(t)\,dt$; i.e., $P(T=t)\approx f(t)\,dt$ and $P(T=t+\delta_n)\approx f(t+\delta_n)\,dt\implies$ $\dfrac{P(T=t+\delta_n)}{P(T=t)}\approx\dfrac{f(t+\delta_n)}{f(t)}$. Hence $$ \frac{f(t+\delta_n)}{f(t)}=\frac{1-tp\delta_n}{1+tq\delta_n+q\delta_n^2} $$
$$ \log f(t+\delta_n)-\log f(t)=\log(1-tp\delta_n)-\log(1+tq\delta_n+q\delta_n^2)\\ \lim_{n\to\infty\\\delta_n\to 0}\Big[\frac{\log f(t+\delta_n)}{\delta_n}-\frac{\log f(t)}{\delta_n}\Big]\approx\lim_{n\to\infty\\\delta_n\to 0}\Big[\frac{\log(1-tp\delta_n)}{\delta_n}-\frac{\log(1+tq\delta_n+q\delta_n^2)}{\delta_n}\Big]\\ \frac{d}{dt}\log f(t)=\lim_{\delta_n\to 0}\Big[\frac{-tp}{1-tp\delta_n}\Big]-\lim_{\delta_n\to 0}\Big[\frac{tq+2q\delta_n}{1+tq\delta_n+q\delta_n^2}\Big]=-tp-tq=-t\\\log f(t)=\frac{-t^2}{2}+C\implies \boxed{f(t)=\mathcal{N}e^{{-t^2}/{2}}} $$ Normalizing the above function gives the normal/Gaussian density function.
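A short numerical check of this binomial-to-normal limit (a sketch, assuming `numpy` and `scipy` are available; the values of $p$ and $t$ below are arbitrary choices): $\sqrt{npq}\,P\!\left[X = np + t\sqrt{npq}\right]$ approaches $e^{-t^2/2}/\sqrt{2\pi}$ as $n$ grows.

```python
# Numerical illustration: sqrt(npq) * P[Bin(n, p) = np + t*sqrt(npq)] tends to
# the standard normal density, matching the boxed result after normalization.
import numpy as np
from scipy.stats import binom, norm

p, t = 0.3, 0.5
q = 1.0 - p
print("target:", norm.pdf(t))                       # 0.35206...
for n in (100, 10_000, 1_000_000):
    x = int(round(n * p + t * np.sqrt(n * p * q)))  # nearest integer value of X
    print(n, np.sqrt(n * p * q) * binom.pmf(x, n, p))
```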