Why does sample standard deviation underestimate population standard deviation?

Referring to the Wikipedia page Unbiased estimation of standard deviation, it says that "it follows from Jensen's inequality that the square root of the sample variance is an underestimate".

I do know that for the concave square root function, Jensen's inequality says that the square root of the mean > mean of the square root.

So, how do we conclude that the square root of the sample variance underestimates population standard deviation?

Since we know from Jensen's inequality that square root of the mean > mean of the square root, does "square root of sample variance" somehow relate to "mean of the square root" while "population standard deviation" somehow relates to "square root of the mean"?

Added after joriki's response:

Given joriki's response about using a single sampling of data, I am now left with why $s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N{(x_i-\overline{x})^2}}$ will underestimate the population standard deviation. In order to use Jensen's inequality (mean of the square root < square root of the mean), I need to somehow relate the expression for $s$ to the "mean of the square root". I do see the square root sign in the expression for $s$, but where is the "mean" of this square root quantity?


The mean is part of what it means for an estimator to be biased. You can't make the estimator unbiased by averaging over several estimates; to the contrary, you can show that it's biased by averaging over estimates and showing that the expected average isn't the value to be estimated. (You can reduce the bias and the variance of the estimator by averaging several estimates, but as discussed above you can do that even better by using all the data for one estimate.)

For example, if your population has equidistributed values $-1,0,1$, with variance $\frac23$, and you take a sample of $2$, you'll get variance estimates of $0$, $\frac12$ and $2$ with probabilities $\frac13$, $\frac49$ and $\frac29$, respectively, yielding the correct mean $\frac13\cdot0+\frac49\cdot\frac12+\frac29\cdot2=\frac23$, whereas the estimates for the standard deviation, $0$, $\sqrt{\frac12}$ and $\sqrt2$ average to $\frac13\cdot0+\frac49\cdot\sqrt{\frac12}+\frac29\cdot\sqrt2=\frac49\sqrt2\neq\sqrt{\frac23}$, with $\frac49\sqrt2\approx0.6285\lt0.8165\approx\sqrt{\frac23}$, an underestimate as expected. If you take a sample of $3$ instead, the mean improves to $\frac19\cdot0+\frac49\cdot\sqrt{\frac13}+\frac29\cdot\sqrt{\frac43}+\frac29\cdot1=\frac19(8\sqrt{\frac13}+2)\approx0.7354$.
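The enumeration above for samples of size $2$ can be checked by brute force. A minimal sketch in Python, using the standard library's `statistics.variance` (which divides by $n-1$, matching the sample variance used here):

```python
import itertools
import math
from statistics import variance

values = [-1, 0, 1]  # equidistributed population with true variance 2/3

# All 9 equally likely ordered samples of size 2.
samples = list(itertools.product(values, repeat=2))

# Average the sample variance and the sample standard deviation
# over all samples (each sample has probability 1/9).
mean_var = sum(variance(s) for s in samples) / len(samples)
mean_std = sum(math.sqrt(variance(s)) for s in samples) / len(samples)

print(mean_var)  # 2/3: the sample variance is unbiased
print(mean_std)  # 4*sqrt(2)/9 ≈ 0.6285 < sqrt(2/3) ≈ 0.8165
```

The first average reproduces the true variance $\frac23$, while the second falls short of $\sqrt{\frac23}$, exactly as computed above.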


Let's assume we're picking $n$ independent samples from the same (unknown) distribution. Thus, the samples $x_1, x_2, \dotsc, x_n$ are independent and identically distributed random variables with some unknown mean $\mu$ (which we may approximate by the sample mean $\bar x = \frac 1 n \sum x_i$) and standard deviation $\sigma$, which we wish to estimate.

As André Nicolas notes in his first comment, the sample variance $$\tilde \sigma^2 = \frac 1{n-1} \sum_{i=1}^n(x_i-\bar x)^2$$ is a random variable whose mean or expected value $\mathrm E[\tilde \sigma^2]$ is equal to the true variance $\sigma^2$ of the unknown distribution. Thus, $\tilde \sigma^2$ is an unbiased estimator of $\sigma^2$. However, because the square root function is concave, by Jensen's inequality the mean $\mathrm E[\tilde \sigma]$ of its square root $$ \tilde \sigma = \sqrt{\tilde \sigma^2} = \sqrt{\frac 1{n-1} \sum_{i=1}^n(x_i-\bar x)^2} $$ is (except in trivial cases) less than the square root $\sigma$ of its mean $\mathrm E[\tilde \sigma^2] = \sigma^2$. Thus, $\tilde \sigma$ is an underestimate of the true standard deviation $\sigma$.
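To see this bias numerically, here is a Monte Carlo sketch under an assumed setup (a normal population with known $\sigma = 1$ and sample size $n = 5$, both hypothetical choices): averaging $\tilde\sigma$ over many samples gives a value noticeably below $\sigma$.

```python
import random
from statistics import stdev

random.seed(0)
sigma, n, trials = 1.0, 5, 100_000

# Average the sample standard deviation (n-1 denominator, via
# statistics.stdev) over many independent samples of size n.
est = sum(
    stdev(random.gauss(0, sigma) for _ in range(n))
    for _ in range(trials)
) / trials

print(est, "<", sigma)  # the average sample std dev underestimates sigma
```

For a normal population the expected shortfall is known exactly (for $n = 5$, $\mathrm E[\tilde\sigma] \approx 0.94\,\sigma$), so the simulated average should come out near $0.94$ rather than $1$.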