What is the intuition behind the Poisson distribution's function?

Explanation based on DeGroot, second edition, page 256. Consider the binomial distribution with fixed $p$ $$ P(X = k) = {n \choose k}p^k(1-p)^{n-k} $$

Now define $\lambda = np$ and thus $p = \frac{\lambda}{n}$.

$$ \begin{align} P(X = k) &= {n \choose k}p^k(1-p)^{n-k}\\ &=\frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}\frac{\lambda^k}{n^k}\left(1-\frac{\lambda}{n}\right)^{n-k}\\ &=\frac{\lambda^k}{k!}\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k} \end{align} $$ Let $n \to \infty$ and $p \to 0$ so $np$ remains constant and equal to $\lambda$.

Now $$ \lim_{n \to \infty}\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\left(1-\frac{\lambda}{n}\right)^{-k} = 1 $$ since each of the fractions tends to $1$ (numerator and denominator grow at the same rate), and in the last factor $\frac{\lambda}{n} \to 0$, so it tends to $1^{-k} = 1$. Furthermore $$ \lim_{n \to \infty}\left(1-\frac{\lambda}{n}\right)^n = e^{-\lambda} $$ so under our definitions $$ \lim_{n \to \infty} {n \choose k}p^k(1-p)^{n-k} = \frac{\lambda^k}{k!}e^{-\lambda} $$ In other words, as the probability of success becomes a rate applied to a continuum, as opposed to discrete selections, the binomial becomes the Poisson.
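As a quick sanity check of this limit, here is a minimal Python sketch (assuming scipy is available; the parameter values are purely illustrative) comparing the Binomial$(n, \lambda/n)$ pmf with its Poisson$(\lambda)$ limit as $n$ grows:

```python
# Numerical check of the limit: Binomial(n, lam/n) pmf -> Poisson(lam) pmf.
from scipy.stats import binom, poisson

lam, k = 4.0, 3  # illustrative rate and count, not from the text

for n in (10, 100, 1000, 10000):
    b = binom.pmf(k, n, lam / n)  # P(X = k) for X ~ Binomial(n, lam/n)
    print(f"n = {n:5d}: binomial pmf = {b:.6f}")

print(f"Poisson limit:   {poisson.pmf(k, lam):.6f}")  # lam^k e^{-lam} / k!
```

The binomial values approach the Poisson value as $n$ increases, exactly as the derivation predicts.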

Update with key point from comments

Think about a Poisson process. In a sense, it amounts to looking at very, very small intervals of time and checking whether something happened in each one. The intervals must be very, very small so that we see at most one event per interval. So what we have is essentially an infinite sum of Bernoulli trials, each with an infinitesimal success probability. When we have a finite sum of Bernoulli trials with a fixed success probability, that is binomial. When the number of trials becomes infinite but the mean $np=\lambda$ stays finite, it is Poisson.
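This picture is easy to simulate. Below is a minimal sketch (variable names and parameter values are my own choices) that chops one unit of time into $n$ tiny slots, runs an independent Bernoulli$(\lambda/n)$ trial in each slot, and tabulates the total number of successes; the empirical frequencies should be close to the Poisson pmf:

```python
# Simulate the "many tiny Bernoullis" picture of a Poisson process.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, n, trials = 4.0, 1000, 10_000  # illustrative parameters

# One row per trial: n tiny Bernoulli(lam/n) slots; count the successes.
events = (rng.random((trials, n)) < lam / n).sum(axis=1)

for k in range(8):
    empirical = (events == k).mean()
    theory = lam**k * exp(-lam) / factorial(k)  # Poisson(lam) pmf at k
    print(f"k={k}: simulated={empirical:.4f}  Poisson={theory:.4f}")
```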


Let $p_k(t)$ be the probability of $k$ events in time $t$. We first find $p_0(t)$. Let $h$ be small. By the independence of disjoint time intervals, $p_0(t+h)=p_0(t)p_0(h)$. The probability of an event in a very small time $h$ is roughly $\lambda h$; more precisely, $\lim_{h\to 0^+}\frac{1-p_0(h)}{h}=\lambda$, so $p_0(h)\approx 1-\lambda h$. Substituting, we get $$\frac{p_0(t+h)-p_0(t)}{h}\approx -\lambda p_0(t).$$ Letting $h\to 0$, we conclude that $p_0'(t)=-\lambda p_0(t)$. This is a familiar differential equation; since $p_0(0)=1$, its solution is $p_0(t)=e^{-\lambda t}$.
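If you want to double-check this step symbolically, here is a small sketch using sympy (assuming it is available) that solves the initial value problem $p_0'(t)=-\lambda p_0(t)$, $p_0(0)=1$:

```python
# Solve p0'(t) = -lam * p0(t) with p0(0) = 1 symbolically.
import sympy as sp

t = sp.symbols("t", nonnegative=True)
lam = sp.symbols("lambda", positive=True)
p0 = sp.Function("p0")

ode = sp.Eq(p0(t).diff(t), -lam * p0(t))
print(sp.dsolve(ode, p0(t), ics={p0(0): 1}))  # Eq(p0(t), exp(-lambda*t))
```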

Now make a similar argument for general $k$. If $h$ is a small time interval, then the probability of $2$ or more events in the interval is negligible compared with the probability of $1$ event. So by time $t+h$ we have $k$ events either because all $k$ occurred by time $t$ and none occurred in $(t,t+h]$, or because $k-1$ occurred by time $t$ and exactly one occurred in $(t,t+h]$. Thus $$p_k(t+h)\approx p_{k}(t)(1-\lambda h)+p_{k-1}(t)\lambda h.$$ Simplifying and letting $h\to 0$, we find that $$p_k'(t)=-\lambda p_k(t)+\lambda p_{k-1}(t).$$ This DE can be solved by induction, using the hypothesis $p_{k-1}(t)=e^{-\lambda t}\frac{(\lambda t)^{k-1}}{(k-1)!}$, which yields $p_k(t)=e^{-\lambda t}\frac{(\lambda t)^{k}}{k!}$. Alternatively, we can verify by substitution that the standard expressions do satisfy the DE.
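That verification by substitution is easy to automate. A minimal sympy sketch checking the first few values of $k$ (the helper p is my own naming):

```python
# Check that p_k(t) = exp(-lam*t) (lam*t)^k / k! satisfies
# p_k'(t) = -lam * p_k(t) + lam * p_{k-1}(t) for the first few k.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

def p(j):
    # Candidate solution p_j(t) = exp(-lam*t) * (lam*t)**j / j!
    return sp.exp(-lam * t) * (lam * t) ** j / sp.factorial(j)

for k in range(1, 6):
    residual = sp.simplify(p(k).diff(t) + lam * p(k) - lam * p(k - 1))
    print(k, residual)  # each residual simplifies to 0
```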