Motivation For Weight Choice In Pooled Variance

Solution 1:

This question was already asked at How to derive "Pooled Sample Variance"?, but the accepted answer is wrong and the author of the question hasn’t been on the site since $2014$, so instead of trying to get them to unaccept the answer, I’ll post my answer here and vote to close the other question as a duplicate of this one.

As shown there, the weights should be in inverse proportion to the variances of the individual variance estimators. This is easiest to see for a combination of two independent estimators, where we have $\hat v=\lambda\hat v_1+(1-\lambda)\hat v_2$ with $\lambda\in[0,1]$ and thus, by independence, $\mathsf{Var}[\hat v]=\lambda^2\mathsf{Var}[\hat v_1]+(1-\lambda)^2\mathsf{Var}[\hat v_2]$. Setting the derivative with respect to $\lambda$ to $0$, this is minimal for $\lambda\mathsf{Var}[\hat v_1]-(1-\lambda)\mathsf{Var}[\hat v_2]=0$ and thus for

$$ \frac\lambda{1-\lambda}=\frac{\mathsf{Var}[\hat v_2]}{\mathsf{Var}[\hat v_1]}\;. $$
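As a quick sanity check of this (my own illustration, not part of the argument): for two independent unbiased estimators with known variances $V_1$ and $V_2$, scanning over $\lambda$ numerically should put the minimum of $\lambda^2V_1+(1-\lambda)^2V_2$ at the inverse-variance weight $\lambda=V_2/(V_1+V_2)$. The values of $V_1$ and $V_2$ below are arbitrary.

```python
# Numerical check: the variance of lambda*v1 + (1-lambda)*v2 is
# lambda^2*V1 + (1-lambda)^2*V2 and is minimized at lambda = V2/(V1+V2).
import numpy as np

V1, V2 = 0.8, 0.2                       # arbitrary variances of the two estimators
lam = np.linspace(0.0, 1.0, 100001)
combined_var = lam**2 * V1 + (1.0 - lam)**2 * V2

print(lam[np.argmin(combined_var)])     # ~0.2
print(V2 / (V1 + V2))                   # 0.2, the inverse-variance weight
```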

The variance of the unbiased variance estimator $\hat v=\frac1{n-1}\sum_i(x_i-\bar x)^2$ of a single sample $x_1,\ldots,x_n$ is

\begin{eqnarray} \mathsf{Var}[\hat v] &=& \mathsf E\left[\hat v^2\right]-\mathsf E\left[\hat v\right]^2 \\ &=& \mathsf E\left[\left(\frac1{n-1}\sum_i(x_i-\bar x)^2\right)^2\right]-\sigma^4 \\ &=& \frac1{n^2}\mathsf E\left[\left(\sum_ix_i^2-\frac2{n-1}\sum_{i<j}x_ix_j\right)^2\right]-\sigma^4 \\ &=&\frac{\mu_4}n-\frac{n-3}{n(n-1)}\sigma^4\;, \end{eqnarray}

where $\mu_4$ is the fourth central moment (in the last step the mean is taken to be $0$ without loss of generality, which is legitimate because $\hat v$ is invariant under shifting all the $x_i$; the cross terms then have expectation $0$). So in general, even if the populations all have the same central moments, the optimal weights depend on the sample sizes in a more complicated way. However, for a normal distribution we have $\mu_4=3\sigma^4$ and thus

$$ \frac{\mu_4}n-\frac{n-3}{n(n-1)}\sigma^4=\frac{3\sigma^4}n-\frac{n-3}{n(n-1)}\sigma^4=\frac2{n-1}\sigma^4\;. $$
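A Monte Carlo check of these two formulas (my own sketch; the exponential distribution is just a convenient non-normal example, with $\sigma^2=1$ and $\mu_4=9$, and the sample size $n=10$ is arbitrary):

```python
# Monte Carlo check of Var[v_hat] = mu4/n - (n-3)/(n(n-1)) * sigma^4,
# which reduces to 2*sigma^4/(n-1) for normal data.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 300_000

# Non-normal example: Exponential(1), where sigma^2 = 1 and mu4 = 9.
v_exp = rng.exponential(1.0, size=(reps, n)).var(axis=1, ddof=1)
print(v_exp.var(), 9 / n - (n - 3) / (n * (n - 1)))             # ~0.822

# Normal example: N(0, sigma^2) with sigma = 2, where mu4 = 3*sigma^4.
sigma = 2.0
v_norm = rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=1)
print(v_norm.var(), 2 * sigma**4 / (n - 1))                     # ~3.556
```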

Thus, for a normal distribution, as you suspected, weighting the individual estimators by their degrees of freedom $n_i-1$ minimizes the variance of the pooled estimator.
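A small simulation of the pooled estimator itself (again my own sketch; the sample sizes, $\sigma$ and seed are arbitrary): weighting the per-group variance estimates by $n_i-1$ gives a visibly smaller variance than weighting the groups equally when the sample sizes differ.

```python
# For normal samples of unequal sizes, weighting each group's variance estimate
# by n_i - 1 gives a lower-variance pooled estimator than equal weighting.
import numpy as np

rng = np.random.default_rng(1)
sizes, sigma, reps = [5, 30], 2.0, 100_000
s2 = np.stack([rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=1)
               for n in sizes], axis=1)          # per-group sample variances

df = np.array([n - 1 for n in sizes], dtype=float)
pooled_df = s2 @ (df / df.sum())                 # weights proportional to n_i - 1
pooled_eq = s2.mean(axis=1)                      # equal weights

print(pooled_df.var(), pooled_eq.var())          # df-weighted variance is smaller
```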

It’s not a coincidence that this works out nicely for the normal distribution, as so many things do; it’s related to the way the normal density factorizes, which makes the sums of the data and of the squared data jointly sufficient statistics for the parameters of the distribution. Intuitively speaking, the data are additive, and each unknown mean acts like a missing data point.

Specifically, writing $x_{ij}$ for the $j$-th observation in the $i$-th population, $n_i$ for the size of the $i$-th sample and $n=\sum_in_i$, the likelihood of the data is proportional to

$$ \frac1{\sigma^n}\exp\left(-\frac1{2\sigma^2}\sum_{ij}\left(x_{ij}-\mu_i\right)^2\right)\\=\frac1{\sigma^n}\exp\left(-\frac1{2\sigma^2}\sum_i\left(n_i\left(\mu_i-\overline x_i\right)^2+\sum_j\left(x_{ij}-\overline x_i\right)^2\right)\right)\;, $$

(the cross terms $2\left(\overline x_i-\mu_i\right)\sum_j\left(x_{ij}-\overline x_i\right)$ vanish because the deviations from the sample mean sum to zero), so the sample means $\overline x_i$ and the sum of the squared deviations from them over all populations are jointly sufficient statistics; we wouldn’t gain any information by also keeping the separate sums of squared deviations for the individual populations. If we assume a uniform prior for the unknown means $\mu_i$ and integrate them out, the result is proportional to

$$ \frac1{\sigma^{n-m}}\exp\left(-\frac1{2\sigma^2}\sum_{ij}\left(x_{ij}-\overline x_i\right)^2\right)\;, $$

where $m$ is the number of populations. Setting the derivative with respect to $\sigma$ to $0$ shows that the pooled variance estimator

$$ \frac1{n-m}\sum_{ij}\left(x_{ij}-\overline x_i\right)^2 $$

maximizes this marginal likelihood for the common variance $\sigma^2$. (It is not quite the maximum likelihood estimator of the full model: maximizing over the $\mu_i$ instead of integrating them out would put $n$ rather than $n-m$ in the denominator.)
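For reference, the two steps compressed above are standard computations. Integrating out each mean is a Gaussian integral contributing a factor proportional to $\sigma$,

$$ \int_{-\infty}^{\infty}\exp\left(-\frac{n_i}{2\sigma^2}\left(\mu_i-\overline x_i\right)^2\right)\,\mathrm d\mu_i=\sqrt{\frac{2\pi\sigma^2}{n_i}}\propto\sigma\;, $$

which is why the prefactor changes from $\sigma^{-n}$ to $\sigma^{-(n-m)}$; and with $S=\sum_{ij}\left(x_{ij}-\overline x_i\right)^2$, the stationarity condition is

$$ \frac{\mathrm d}{\mathrm d\sigma}\left(-(n-m)\log\sigma-\frac S{2\sigma^2}\right)=-\frac{n-m}\sigma+\frac S{\sigma^3}=0\quad\Longleftrightarrow\quad\sigma^2=\frac S{n-m}\;. $$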

Solution 2:

From the question: "my guess is that the variance of the variance estimation is minimized by this choice."

Minimizing the variance of the estimator is indeed sometimes the reason for a particular choice of weights in this kind of problem, but in this case there is a reason that hits you in the face before that question even comes up, so I hadn’t actually thought of the variance-minimization argument before.

Say you have $X_1,\ldots, X_n\sim\text{i.i.d. }\operatorname N(\mu,\sigma^2)$ and, independently of them, $Y_1,\ldots, Y_m\sim\text{i.i.d. }\operatorname N(\nu,\sigma^2),$ and
\begin{align}
\overline X & = (X_1+\cdots+X_n)/n \\[4pt]
\overline Y & = (Y_1+\cdots+Y_m)/m \\[6pt]
S_X^2 & = \frac {(X_1-\overline X)^2 + \cdots +(X_n - \overline X)^2} {n-1} \\[6pt]
S_Y^2 & = \frac{(Y_1-\overline Y)^2 + \cdots + (Y_m - \overline Y)^2}{m-1}
\end{align}
The pooled estimator of $\sigma^2$ is
$$ \frac{(X_1-\overline X)^2 + \cdots + (X_n-\overline X)^2 + (Y_1-\overline Y\,)^2 + \cdots + (Y_m-\overline Y\,)^2}{(n-1) + (m-1)}. \tag 1 $$
Recall that
$$ \frac{(X_1-\overline X)^2 + \cdots +(X_n-\overline X)^2}{\sigma^2} \sim \chi^2_{n-1} $$
and
$$ \frac{(Y_1-\overline Y)^2 + \cdots +(Y_m-\overline Y)^2}{\sigma^2} \sim \chi^2_{m-1}. $$
Since these two quantities are independent, the numerator in $(1),$ divided by $\sigma^2,$ is distributed as $\chi^2_{(n-1)+(m-1)}.$
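A quick simulation of this distributional fact (my own check; the values of $n$, $m$, $\sigma$ and the two means are arbitrary):

```python
# The pooled numerator divided by sigma^2 should behave like a chi-square
# random variable with (n-1) + (m-1) degrees of freedom.
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma, reps = 8, 13, 1.5, 200_000
X = rng.normal(0.0, sigma, size=(reps, n))
Y = rng.normal(5.0, sigma, size=(reps, m))       # different mean, same variance

num = ((X - X.mean(axis=1, keepdims=True))**2).sum(axis=1) \
    + ((Y - Y.mean(axis=1, keepdims=True))**2).sum(axis=1)
q = num / sigma**2

k = (n - 1) + (m - 1)                            # 19 degrees of freedom
print(q.mean(), k)                               # mean of chi^2_k is k
print(q.var(), 2 * k)                            # variance of chi^2_k is 2k
```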

The reason for the weights is that the numerator in $(1)$ is $(n-1)S_X^2 + (m-1)S_Y^2,$ so the pooled estimator is just the average of $S_X^2$ and $S_Y^2$ weighted by their degrees of freedom; and since a $\chi^2_k$ random variable has mean $k,$ dividing by $(n-1)+(m-1)$ makes it unbiased for $\sigma^2.$
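And a tiny numerical confirmation of that identity (my own illustration; the sample sizes and parameters are arbitrary):

```python
# The pooled estimator (1) equals ((n-1)*S_X^2 + (m-1)*S_Y^2) / ((n-1) + (m-1)).
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 13
X, Y = rng.normal(0.0, 1.5, n), rng.normal(5.0, 1.5, m)

num = ((X - X.mean())**2).sum() + ((Y - Y.mean())**2).sum()
print(num / ((n - 1) + (m - 1)))
print(((n - 1) * X.var(ddof=1) + (m - 1) * Y.var(ddof=1)) / ((n - 1) + (m - 1)))
```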