How do I combine standard deviations of two groups?

Solution 1:

Continuing on from BruceET's explanation, suppose that what is provided for each sample is the usual sample standard deviation (the square root of the unbiased estimator of the variance), namely $$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}.$$ For samples $\boldsymbol x = (x_1, \ldots, x_n)$, $\boldsymbol y = (y_1, \ldots, y_m)$, let $\boldsymbol z = (x_1, \ldots, x_n, y_1, \ldots, y_m)$ be the combined sample. The combined sample mean is then $$\bar z = \frac{1}{n+m} \left( \sum_{i=1}^n x_i + \sum_{j=1}^m y_j \right) = \frac{n \bar x + m \bar y}{n+m}.$$ Consequently, the combined sample variance is $$s_z^2 = \frac{1}{n+m-1} \left( \sum_{i=1}^n (x_i - \bar z)^2 + \sum_{j=1}^m (y_j - \bar z)^2 \right),$$ where it is important to note that the combined mean $\bar z$ is used.

In order to have any hope of expressing this in terms of $s_x^2$ and $s_y^2$, we need to decompose the sums of squares; for instance, $$(x_i - \bar z)^2 = (x_i - \bar x + \bar x - \bar z)^2 = (x_i - \bar x)^2 + 2(x_i - \bar x)(\bar x - \bar z) + (\bar x - \bar z)^2,$$ thus $$\sum_{i=1}^n (x_i - \bar z)^2 = (n-1)s_x^2 + 2(\bar x - \bar z)\sum_{i=1}^n (x_i - \bar x) + n(\bar x - \bar z)^2.$$ The middle term vanishes because $\sum_{i=1}^n (x_i - \bar x) = 0$, and the same argument applies to the $y_j$, so this gives $$s_z^2 = \frac{(n-1)s_x^2 + n(\bar x - \bar z)^2 + (m-1)s_y^2 + m(\bar y - \bar z)^2}{n+m-1}.$$

Upon simplification, using $\bar x - \bar z = \frac{m(\bar x - \bar y)}{n+m}$ and $\bar y - \bar z = -\frac{n(\bar x - \bar y)}{n+m}$, we find $$n(\bar x - \bar z)^2 + m(\bar y - \bar z)^2 = \frac{mn(\bar x - \bar y)^2}{m + n},$$ so the formula becomes $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$ The second term is the required correction term.
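
As a quick sanity check, the final formula can be compared with the SD computed directly from the combined sample. The sketch below is only illustrative: the function name `combined_sd`, its argument order, and the small data vectors are made up for this check and do not come from the question.

```r
# Hypothetical helper (name and arguments are illustrative only).
# Combined sample SD from the two sizes, means, and sample SDs.
combined_sd <- function(n, m, xbar, ybar, sx, sy) {
  # First term: weighted within-sample variances.
  within  <- ((n - 1) * sx^2 + (m - 1) * sy^2) / (n + m - 1)
  # Second term: correction for the difference between the sample means.
  between <- n * m * (xbar - ybar)^2 / ((n + m) * (n + m - 1))
  sqrt(within + between)
}

# Small made-up samples for the check.
x <- c(2, 4, 4, 5, 7, 9)
y <- c(10, 12, 15, 11)

from_summaries <- combined_sd(length(x), length(y), mean(x), mean(y), sd(x), sd(y))
direct <- sd(c(x, y))            # SD computed from the pooled raw data

all.equal(from_summaries, direct)  # the two should agree up to rounding error
```

Only the six summary values enter `combined_sd`, so the same call works when the raw data are no longer available.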

Solution 2:

Neither the suggestion in a previous (now deleted) Answer nor the suggestion in the following Comment is correct for the sample standard deviation of the combined sample.

Known data for reference: First, it is helpful to have actual data at hand to verify results, so I simulated samples of sizes $n_1 = 137$ and $n_2 = 112$ that are roughly the same as the ones in the question.

Combined sample mean: You say 'the mean is easy' so let's look at that first. The sample mean $\bar X_c$ of the combined sample can be expressed in terms of the means $\bar X_1$ and $\bar X_2$ of the first and second samples, respectively, as follows. Let $n_c = n_1 + n_2$ be the sample size of the combined sample, and let the notation using brackets in subscripts denote the indices of the respective samples.

$$ \bar X_c = \frac{\sum_{[c]} X_i}{n_c} = \frac{\sum_{[1]} X_i + \sum_{[2]} X_i}{n_1 + n_2} = \frac{n_1\bar X_1 + n_2\bar X_2}{n_1+n_2}.$$

Let's verify that much in R, using my simulated dataset (for now, ignore the standard deviations):

set.seed(2025); n1 = 137; n2 = 112  
x1 = rnorm(n1, 35, 45);  x2 = rnorm(n2, 31, 11)
x = c(x1,x2)              # combined dataset
mean(x1); sd(x1)
[1] 31.19363              # sample mean of sample 1
[1] 44.96014
mean(x2); sd(x2)
[1] 31.57042              # sample mean of sample 2
[1] 10.47946
mean(x); sd(x)
[1] 31.36311              # sample mean of combined sample
[1] 34.02507
(n1*mean(x1)+n2*mean(x2))/(n1+n2)  # displayed formula above
[1] 31.36311              # matches mean of comb samp

Suggested formulas give incorrect combined SD: Here is a demonstration that neither of the proposed formulas finds $S_c = 34.025,$ the SD of the combined sample:

According to the first formula $S_a = \sqrt{S_1^2 + S_2^2} = 46.165 \ne 34.025.$ One reason this formula is wrong is that it does not take account of the different sample sizes $n_1$ and $n_2.$

According to the second formula we have $S_b = \sqrt{(n_1-1)S_1^2 + (n_2 -1)S_2^2} = 535.82 \ne 34.025.$

To be fair, the formula $S_b^\prime= \sqrt{\frac{(n_1-1)S_1^2 + (n_2 -1)S_2^2}{n_1 + n_2 - 2}} = 34.093 \ne 34.025$ is more reasonable. This is the formula for the 'pooled standard deviation' in a pooled 2-sample t test. If the two samples may come from populations with different means, this is a reasonable estimate of the (assumed) common standard deviation $\sigma$ of the two populations. However, it is not a correct formula for the standard deviation $S_c$ of the combined sample.

sd.a = sqrt(sd(x1)^2 + sd(x2)^2);  sd.a
[1] 46.16528
sd.b = sqrt((n1-1)*sd(x1)^2 + (n2-1)*sd(x2)^2);  sd.b
[1] 535.8193
sd.b1 = sqrt(((n1-1)*sd(x1)^2 + (n2-1)*sd(x2)^2)/(n1+n2-2))
sd.b1
[1] 34.09336
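
In the simulated data the two sample means happen to be almost equal, which is why the pooled SD lands so close to $S_c.$ The gap becomes obvious when the means are far apart. Here is a minimal sketch with made-up toy samples (the names `g1`, `g2`, `sd_pooled`, and `sd_combined` are illustrative and are not part of the session above):

```r
# Made-up samples with the same spread but well-separated means.
g1 <- c(1, 2, 3)
g2 <- c(11, 12, 13)
m1 <- length(g1); m2 <- length(g2)

# Pooled SD: estimates the common within-group sigma.
sd_pooled <- sqrt(((m1 - 1) * var(g1) + (m2 - 1) * var(g2)) / (m1 + m2 - 2))

# SD of the combined sample: also reflects the gap between the group means.
sd_combined <- sd(c(g1, g2))

sd_pooled     # 1
sd_combined   # sqrt(30.8), about 5.55
```

The pooled SD ignores the distance between the two means, while the SD of the combined sample does not.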

Method for correct combined SD: It is possible to find $S_c$ from $n_1, n_2, \bar X_1, \bar X_2, S_1,$ and $S_2.$ I will give an indication of how this can be done. For now, let's look at sample variances in order to avoid square root signs.

$$S_c^2 = \frac{\sum_{[c]}(X_i - \bar X_c)^2}{n_c - 1} = \frac{\sum_{[c]} X_i^2 - n_c\bar X_c^2}{n_c - 1}$$

We have everything we need on the right-hand side except for $\sum_{[c]} X_i^2 = \sum_{[1]} X_i^2 + \sum_{[2]} X_i^2.$ The two terms in this sum can be obtained for $i = 1,2$ from $n_i, \bar X_i$ and $S_i^2$ by solving for $\sum_{[i]} X_i^2$ in a formula analogous to the last displayed equation. [In the code below we abbreviate this sum as $Q_c = \sum_{[c]} X_i^2 = Q_1 + Q_2.$]
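
Written out, the rearrangement for the first sample starts from $S_1^2 = \frac{\sum_{[1]} X_i^2 - n_1\bar X_1^2}{n_1 - 1},$ and the second sample is handled in the same way, so that

$$Q_1 = \sum_{[1]} X_i^2 = (n_1 - 1)S_1^2 + n_1\bar X_1^2, \qquad Q_2 = \sum_{[2]} X_i^2 = (n_2 - 1)S_2^2 + n_2\bar X_2^2.$$

Substituting $Q_c = Q_1 + Q_2$ into the expression for the combined variance then gives

$$S_c^2 = \frac{Q_1 + Q_2 - (n_1 + n_2)\bar X_c^2}{n_1 + n_2 - 1}.$$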

Although somewhat messy, this process of obtaining combined sample variances (and thus combined sample SDs) is used in many statistical programs, especially when updating archival information with a subsequent sample.
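
To illustrate the updating idea, here is a minimal sketch that folds several batches into one running summary using only each batch's size, mean, and SD. The helper names `merge_summary` and `summarise_batch`, the list layout, and the data are all hypothetical. This is not how any particular statistical program implements it.

```r
# Hypothetical helper: merge two summaries, each a list with n, mean, sd.
# Assumes every summary comes from at least two observations, so sd is defined.
merge_summary <- function(a, b) {
  # Recover each sum of squares: Q = (n - 1) * S^2 + n * mean^2.
  qa <- (a$n - 1) * a$sd^2 + a$n * a$mean^2
  qb <- (b$n - 1) * b$sd^2 + b$n * b$mean^2
  n  <- a$n + b$n
  mu <- (a$n * a$mean + b$n * b$mean) / n   # combined mean
  list(n = n, mean = mu, sd = sqrt((qa + qb - n * mu^2) / (n - 1)))
}

# Hypothetical helper: reduce a raw batch to its summary.
summarise_batch <- function(v) list(n = length(v), mean = mean(v), sd = sd(v))

# Made-up "archive" plus two later batches.
archive <- c(12, 15, 11, 19, 14)
batch1  <- c(22, 18, 25)
batch2  <- c(9, 13, 16, 10)

summaries <- lapply(list(archive, batch1, batch2), summarise_batch)
running   <- Reduce(merge_summary, summaries)   # merge left to right

# Compare with the SD of all the raw data; should agree up to rounding error.
all.equal(running$sd, sd(c(archive, batch1, batch2)))
```

One caveat with this sum-of-squares route: when the mean is very large relative to the SD, subtracting $n\bar X^2$ from $Q$ can lose precision in floating-point arithmetic.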

Numerical verification of correct method: The code below verifies that this method gives $S_c = 34.02507,$ which is the result we obtained above directly from the combined sample.

q1 = (n1-1)*var(x1) + n1*mean(x1)^2; q1
[1] 408219.2 
q2 = (n2-1)*var(x2) + n2*mean(x2)^2; q2
[1] 123819.4
qc = q1 + q2
sc = sqrt( (qc - (n1+n2)*mean(x)^2)/(n1+n2-1) ); sc
[1] 34.02507