How do I derive the Beta function using the definition of the beta function as the normalizing constant of the Beta distribution and only common sense random experiments?

I'm pretty sure this is possible, but can't see how.

I can see that

$$\newcommand{\Beta}{\mathrm{Beta}}\sum_{a=0}^n {n \choose a} \Beta(a+1, n-a+1) = 1$$

because we can imagine that we are flipping a coin $n$ times. The $2^n$ unique sequences of flips partition the probability space. The Beta distribution with parameters $a$ and $n-a$ can be defined as the prior over the coin's bias probability $p$ given the observation of $a$ heads and $n-a$ tails. Since there are ${n \choose a}$ such sequences for any $n$ and $a$, that explains the scaling factor, and we know that it all sums to unity since the sequences partition the probability space, which has total measure 1.

What I can't figure out is why:

$${n \choose a} \Beta(a+1, n-a+1) = \frac{1}{n+1} \qquad \forall n \ge 0,\quad a \in \{0, \dots, n\}.$$

If we knew that, we could easily see that

$$\Beta(a + 1,n - a + 1) = \frac{1}{(n+1){n \choose a}} = \frac{a!(n-a)!}{(n+1)!}.$$


For non-negative integers $a, b$ and $t \in [0, 1]$, the expression $t^a (1 - t)^b$ describes the probability of randomly selecting $a+b$ real numbers in $[0, 1]$ such that the first $a$ are in $[0, t]$ and the last $b$ are in $[t, 1]$. The integral $\int_0^{1} t^a (1 - t)^b dt$ then describes the probability of randomly selecting $a+b+1$ real numbers such that the first number is $t$, the next $a$ numbers are in $[0, t]$, and the next $b$ numbers are in $[t, 1]$.

It follows that $ {a+b \choose b} \int_0^1 t^a (1 - t)^b dt$ describes the probability of randomly selecting $a+b+1$ real numbers such that the first number is $t$, some $a$ of the remaining numbers are in $[0, t]$, and some $b$ of the remaining numbers are in $[t, 1]$. But this is the same as the probability that the first number happens to be $(a+1)^{st}$ in order, and this is just $\frac{1}{a+b+1}$. Hence

$$\int_0^1 t^a (1 - t)^b dt= \frac{a! b!}{(a+b+1)!}$$

as desired. I learned this proof through an exercise in a Putnam training seminar; the multidimensional generalization also works.


The multinomial generalization mentioned by Qiaochu is conceptually simple but getting the details right is messy. The goal is to compute $$\int_0^1 \int_0^{1-t_1} \ldots \int_0^{1-t_1-\ldots-t_{k-2}} t_1^{n_1} t_2^{n_2} \ldots t_{k-1}^{n_{k-1}} t_k^{n_k} dt_1 \ldots dt_{k-1},$$ where $t_k = 1 - t_1 - \ldots - t_{k-1},$ for nonnegative integers $n_1, \ldots, n_k$.

Draw $k-1 + \sum_{i = 1}^{k}n_k$ numbers $X_1, \ldots, X_{k-1 + \sum_{i = 1}^{k}n_k}$ independently from a uniform $[0,1]$ distribution. Define $X_0 = 0$ and $X_{k + \sum_{i = 1}^{k}n_k} = 1$ for convenience. Let $E$ be the event that the numbers $X_1$ through $X_{k-1}$ are in ascending order and that the numbers $X_{j + \sum_{i = 1}^{j-1} n_i}$ through $X_{j + \sum_{i = 1}^{j}n_i - 1}$ are between $X_{j-1}$ and $X_j$ for $j = 1, \ldots, k$.

Define a linear transformation from $(X_1, \ldots, X_{k-1}) \to (T_1, \ldots, T_{k-1})$ by $T_i = X_i - X_{i-1}$ for $i = 1, \ldots, k-1$. Note that the determinant of this linear transformation is 1 and it is therefore measure-preserving. Given values of $X_1$ through $X_{k-1}$, the conditional probability of $E$ is $$\mathbb{P}[E|(X_1, \ldots, X_{k-1}) = (x_1, \ldots, x_{k-1})] = \prod_{i = 1}^{k}(x_i - x_{i-1})^{n_k} \mathbf{1}_{\{x_i > x_{i-1}\}}.$$ Marginalizing with respect to the distribution of $X_1 \times \ldots \times X_{k-1}$ gives $$\begin{aligned} \mathbb{P}[E] &= \int_{0}^1 \ldots \int_{0}^1 \prod_{i = 1}^{k}(x_i - x_{i-1})^{n_k} \mathbf{1}_{\{x_i > x_{i-1}\}} p_{X_1 \times \ldots \times X_{k-1}}(x_1, \ldots, x_{k-1}) dx_{k-1} \ldots dx_{1} \\ &= \int_{0}^1 \int_{-t_1}^{1-t_1} \ldots \int_{-t_1 - \ldots - t_{k-1}}^{1 -t_1 - \ldots - t_{k-1}} \prod_{i = 1}^{k} t_k^{n_k} \mathbf{1}_{\{t_k > 0\}} p_{T_1 \times \ldots \times T_{k-1}}(t_1, \ldots, t_{k-1}) dt_{k-1} \ldots dt_{1} \\ &= \int_0^1 \int_0^{1-t_1} \ldots \int_0^{1-t_1-\ldots-t_{k-2}} t_1^{n_1} \ldots t_{k-1}^{n_{k-1}} t_k^{n_k} dt_{k-1} \ldots dt_{1}, \end{aligned}$$ so if we can compute $\mathbb{P}[E]$ combinatorially we will have evaluated the desired intergral.

Let $\{R_i\}_{i \in \{1, \ldots, k-1 + \sum_{i = 1}^{k}n_k\}}$ be the ranks that the numbers $\{X_i\}_{i \in \{1, \ldots, n+m+1\}}$ would have if sorted in ascending order. (Note that the numbers are all distinct with probability 1). Since the numbers were drawn independently from a uniform distribution, the ranks are a random permutation of the integers $1$ through $k-1 + \sum_{i = 1}^{k}n_k$. Note that $E$ is exactly the event that $R_j = j + \sum_{i = 1}^j n_i$ for $j \in \{1, \ldots, k-1\}$ and that for each $l \in \{1, \ldots, k\}$, $$R_j \in \{l + \sum_{i = 1}^{l-1} n_i, \ldots, l + \sum_{i=1}^{l}n_i - 1\}$$ for $$j \in \{k+\sum_{i = 1}^{l-1}n_i, \ldots, k + \sum_{i = 1}^{l}n_i - 1\}.$$ There are $n_1!\ldots n_k!$ possible permutations which satisfy these conditions out of $(\sum_{i=1}^{k}n_i+k-1)!$ total possible permuations, so $$\mathbb{P}[E] = \frac{n_1!\ldots n_k!}{(\sum_{i=1}^{k}n_i+k-1)!}.$$