Proof and precise formulation of Welch-Satterthwaite equation

In my statistics course notes the Welch-Satterthwaite equation, as used in the derivation of the Welch test, is formulated as follows:

Suppose $S_1^2, \ldots, S_n^2$ are the sample variances of $n$ samples, where the $k$-th sample is a sample of size $m_k$ from a population with distribution $N(\mu_k, \sigma_k^2)$. For any real numbers $a_1, \ldots, a_n$, the statistic $$ L = \frac{\nu}{\sum_{k=1}^{n} a_k\sigma_k^2}\sum_{k=1}^n a_kS_k^2 $$ with $$ \nu = \frac{\left(\sum_{k=1}^n a_kS_k^2\right)^2}{\sum_{k=1}^n \frac{(a_kS_k^2)^2}{m_k - 1}} $$ approximately (sic) follows a chi-squared distribution with $\nu$ degrees of freedom.

A quick Google search suggests that most sources follow a similar formulation. There are a few questions I have, though:

  1. How can the number of degrees of freedom of a chi-squared distribution depend on the statistics $S_i$? Shouldn't the number of degrees of freedom be a constant?
  2. What does 'approximately follow a distribution' mean? This is closely related to my third question:
  3. How can one justify this approximation? All 'proofs' I have seen on the internet seem to assume that $L$ follows a chi-squared distribution and then show that $\nu$ must have the stated value, using basic properties of expected value and variance. Even those arguments are confusing to me, since they seem to ignore the fact that the $S_i$ are themselves random variables and not known constants.

I found a link to the original papers by Welch and Satterthwaite, which I can't access freely. I wonder if there is a reasonably short answer to at least the first two of my questions. We students are not expected to be able to prove the equation, so it's not that bad if there is no readily available proof, but I'd like to at least understand the statement itself.


The following comes from my mulling over the approximation Welch points out in the paper you cite, with some more explicit mathematical manipulation and points of explanation.

The Setup

Let us consider a case where we are estimating the difference in population means between two independent Normally distributed random variables $A_1 \sim N(0, \sigma_1^2)$ and $A_2 \sim N(0, \sigma_2^2)$, and we use the unbiased estimators $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ for the unknown variances. Note that, assuming we have $n_1$ and $n_2$ independent and identically distributed observations from $A_1$ and $A_2$ respectively, and writing $f_i = n_i - 1$, we have $$\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2} \sim \frac{\sigma_1^2}{n_1f_1}\chi^2_{f_1} + \frac{\sigma_2^2}{n_2f_2}\chi^2_{f_2}$$ (since $f_i\hat{\sigma}_i^2/\sigma_i^2 \sim \chi^2_{f_i}$) -- this will be relevant for our "big reveal" at the end.
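If it helps to see this numerically, here is a minimal Monte Carlo sketch (the sample sizes and standard deviations are made-up numbers, purely for illustration) comparing the simulated moments of $\hat{\sigma}_1^2/n_1 + \hat{\sigma}_2^2/n_2$ with those of the scaled chi-square combination on the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 8, 12             # illustrative sample sizes
sigma1, sigma2 = 2.0, 1.0  # illustrative population SDs
f1, f2 = n1 - 1, n2 - 1
reps = 200_000

# Left-hand side: hat{sigma}_1^2/n1 + hat{sigma}_2^2/n2 computed from Normal samples
lhs = (rng.normal(0, sigma1, (reps, n1)).var(axis=1, ddof=1) / n1
       + rng.normal(0, sigma2, (reps, n2)).var(axis=1, ddof=1) / n2)

# Right-hand side: the claimed scaled chi-square representation
rhs = (sigma1**2 / (n1 * f1) * rng.chisquare(f1, reps)
       + sigma2**2 / (n2 * f2) * rng.chisquare(f2, reps))

print(lhs.mean(), rhs.mean())  # both close to sigma1^2/n1 + sigma2^2/n2
print(lhs.var(), rhs.var())    # both close to 2*sigma1^4/(n1^2*f1) + 2*sigma2^4/(n2^2*f2)
```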

The Derivation

We first start a bit more generally. Let us define a random variable $X = a\chi^2_{f_1} + b\chi^2_{f_2}$. Note that $E[X] = af_1 + bf_2$ and $Var(X) = 2(a^2f_1 + b^2f_2)$.

Rather than calculate $X$'s probability distribution exactly, say via convolution, we'd like to approximate $X$'s probability distribution using a chi-square distribution with appropriate parameters and possibly with some rescaling of our input; i.e., we'd like to find $f,g$ such that

$$f_X(x)\,dx \approx f_Y\left(\frac{x}{g}\right)d\left(\frac{x}{g}\right)$$

and $Y \sim \chi^2_f$. In particular, since distributions are completely characterized by their CDFs (and, for absolutely continuous random variables, by their PDFs), this would mean that $X/g$ is approximately distributed as $Y \sim \chi^2_f$ (remembering to take the scaling factor $g$ into account).

How do we make the two random variables "approximately equal" and/or determine $f$ and $g$? By matching the first few moments of $X$ and the rescaled variable $gY$. Using the substitution $z = y/g$, $dz = dy/g$, we can exploit our knowledge of the "regularly scaled" chi-square distribution and find that $E[gY] = gf$ and $Var(gY) = 2g^2f$. Setting $E[X] = E[gY]$ and $Var(X) = Var(gY)$ and solving for $f$ and $g$, we find that

$$f = \frac{(af_1 + bf_2)^2}{a^2f_1 + b^2f_2}, g = \frac{a^2f_1 + b^2f_2}{af_1+bf_2}$$
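For those who prefer to let a computer do the algebra, the moment-matching solve can be verified symbolically. This is just a checking sketch, not part of the derivation itself, and SymPy may print the solution in an equivalent but differently arranged form:

```python
import sympy as sp

a, b, f1, f2 = sp.symbols('a b f1 f2', positive=True)
f, g = sp.symbols('f g', positive=True)

# Match E[X] = E[gY] and Var(X) = Var(gY) for X = a*chi2(f1) + b*chi2(f2), Y ~ chi2(f)
solution = sp.solve(
    [sp.Eq(a*f1 + b*f2, g*f),                 # means
     sp.Eq(2*(a**2*f1 + b**2*f2), 2*g**2*f)], # variances
    (f, g), dict=True)

print(solution)
# Expect f = (a*f1 + b*f2)**2 / (a**2*f1 + b**2*f2)
#        g = (a**2*f1 + b**2*f2) / (a*f1 + b*f2)
```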

Now, the payoff! We saw earlier that for our combination of estimated variances, $\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}$, we have $a = \frac{\sigma_1^2}{n_1f_1}$ and $b = \frac{\sigma_2^2}{n_2f_2}$. Using the fact that $f_i = n_i - 1$, we can plug these values into the equations above and get

$$ f = \frac{(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2})^2}{\frac{1}{n_1-1}(\frac{\sigma_1^2}{n_1})^2 + \frac{1}{n_2-1}(\frac{\sigma_2^2}{n_2})^2}$$

We're still in a pickle, since we don't know $\sigma_1^2$ or $\sigma_2^2$. But we have unbiased, consistent estimators for both in the form of $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$. The formula you get is obtained by "plugging in" these estimators for the parameters. By Slutsky's Theorem, the random variable created by this "plug-in" method converges to the "correct" degrees of freedom. More importantly for the cases of smaller sample sizes where one would use this formula, the estimators are unbiased. (I'll leave calculating the variance of this estimator as "an exercise for the reader".)

The above argument was made for approximating the distribution of $\sum_{i=1}^{k} a_i\chi^2_{f_i}$ with $k=2$, but it generalizes quite simply to any $k \in \mathbb{N}$. This generalization leads to the formulas you provide.
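To make that concrete, here is a small plug-in implementation of the degrees-of-freedom formula for the common case $a_i = 1/n_i$ (the function name and example numbers are mine, purely for illustration):

```python
import numpy as np

def welch_satterthwaite_df(sample_vars, sample_sizes, weights=None):
    """Plug-in estimate of nu = (sum a_k S_k^2)^2 / sum (a_k S_k^2)^2 / (n_k - 1).

    By default the weights are a_k = 1/n_k, which corresponds to estimating
    Var(Abar_1 - Abar_2) in the two-sample Welch t-test."""
    s2 = np.asarray(sample_vars, dtype=float)
    n = np.asarray(sample_sizes, dtype=float)
    a = 1.0 / n if weights is None else np.asarray(weights, dtype=float)
    terms = a * s2
    return terms.sum() ** 2 / (terms ** 2 / (n - 1)).sum()

# Hypothetical two-sample example: s1^2 = 4.2 with n1 = 8, s2^2 = 1.1 with n2 = 12
print(welch_satterthwaite_df([4.2, 1.1], [8, 12]))
```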

Bonus point: Relation to t-statistic.

How does this relate back to the t-test (which is a common situation in which the Welch-Satterthwaite formula is used)? We have to remember a few things.

Recall that in our setup, we had $n_1$ and $n_2$ observations from Gaussian random variables $A_1$ and $A_2$. Denote their sample means as $\bar{A_1}$ and $\bar{A_2}$ respectively; then $\bar{A_1} - \bar{A_2} \sim N(0, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})$. For brevity, let's define $\bar{\sigma}^2 :=\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ and $\hat{\bar{\sigma}}^2 := \frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}$.

Recall also that a random variable $\frac{Z}{\sqrt{\chi_k^2/k}}$, where $Z \sim N(0,1)$ is independent of the $\chi_k^2$ variable, follows a t-distribution with $k$ degrees of freedom.

Also note that, given our earlier formulas for $f$ and $g$, in this particular instance $fg = af_1 + bf_2 = \bar{\sigma}^2$,

and recall that our derivation shows that $\hat{\bar{\sigma}}^2/g$ is approximately distributed as $\chi_f^2$ for the appropriate $f$ shown above.

Finally, we can consider the t-statistic:

$$\frac{\bar{A_1} - \bar{A_2}}{\sqrt{\hat{\bar{\sigma}}^2}} \times \frac{1/\bar{\sigma}}{1/\sqrt{fg}} = \frac{\dfrac{\bar{A_1} - \bar{A_2}}{\bar{\sigma}}}{\sqrt{\dfrac{\hat{\bar{\sigma}}^2}{fg}}} \approx \frac{Z}{\sqrt{\chi_f^2/f}} \sim t_f$$
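As a final sanity check (again just a sketch, with made-up sample sizes and variances), one can simulate this t-statistic under the null hypothesis and verify that rejecting when $|t|$ exceeds the $t_f$ quantile, using the plug-in $f$, gives close to the nominal 5% error rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2 = 8, 12             # illustrative sample sizes
sigma1, sigma2 = 2.0, 1.0  # illustrative population SDs (deliberately unequal)
reps = 20_000

x = rng.normal(0.0, sigma1, size=(reps, n1))
y = rng.normal(0.0, sigma2, size=(reps, n2))
v1 = x.var(axis=1, ddof=1) / n1
v2 = y.var(axis=1, ddof=1) / n2
t = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(v1 + v2)

# Plug-in Welch-Satterthwaite degrees of freedom, one value per simulated data set
f = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Fraction of |t| exceeding the t_f 97.5% quantile -- should be close to 0.05
print(np.mean(np.abs(t) > stats.t.ppf(0.975, f)))
```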

Directly answering the questions.

  1. The dependence on the statistics $\hat{\sigma_i^2}$ comes from not knowing the population variances, upon which our "approximate" chi-square distribution depends.
  2. In this case, "approximately distributed as..." refers to the fact that $f$ and $g$ were chosen such that a random variable distributed as $\chi^2_f$, after rescaling by $g$, has first and second moments that match those of the random variable with the exact distribution $\frac{\sigma_1^2}{n_1f_1}\chi^2_{f_1} + \frac{\sigma_2^2}{n_2f_2}\chi^2_{f_2}$.
  3. Hopefully the above derivation explains how justifiable the approximation is or isn't, as the case may be.