What is the difference between "expectation", "variance" for statistics versus probability textbooks?

It seems that there are two ideas of expectation, variance, etc. going on in our world.

In any probability textbook:

I have a random variable $X$, which is a function from the sample space to the real line. Ok, now I define the expectation operator, which is a function that maps this random variable to a real number. For a discrete $X$ taking finitely many values $x_1, \dots, x_n$, this function looks like $$\mathbb{E}[X] = \sum\limits_{i = 1}^n x_i p(x_i)$$ where $p$ is the probability mass function, $p: \text{range}(X) \to [0,1]$, $\sum_{i = 1}^n p(x_i) = 1$, and $x_i \in \text{range}(X)$. The variance is $$\mathbb{E}[(X - \mathbb{E}[X])^2]$$
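
To make the probability-textbook definition concrete, here is a minimal sketch that computes $\mathbb{E}[X]$ and the variance directly from a probability mass function. The fair six-sided die and the function names `expectation` and `variance` are illustrative assumptions, not part of the original question.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum of x * p(x) over the values a discrete random variable takes."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2], computed directly from the pmf."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die, p(x) = 1/6 for x = 1..6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
assert sum(die.values()) == 1  # a pmf must sum to 1

print(expectation(die))  # 7/2
print(variance(die))     # 35/12
```

Note that no data appears anywhere: everything is computed from the assumed distribution.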

The definition is similar for a continuous RV.


However, in statistics, data science, finance, bioinformatics (and I guess everyday language when talking to your mother)

I have a multi-set of data $D = \{x_i\}_{i = 1}^n$ (weight of onions, height of school children). The mean of this dataset is

$$\dfrac{1}{n}\sum\limits_{i= 1}^n x_i$$

The variance of this dataset (according to "science buddy" and "mathisfun dot com" and government of Canada) is,

$$\dfrac{1}{n}\sum\limits_{i= 1}^n(x_i - \sum\limits_{j= 1}^n \dfrac{1}{n} x_j)^2$$


I mean, I can already see what's going on here (one is assuming uniform distribution), however, I want an authoritative explanation on the following:

  1. Is the distinction real? Meaning, is there a universe where expectation/mean/variance... are defined for functions/random variables and another universe where expectation/mean/variance... are defined for raw data? Or are they essentially the same thing (with a hidden/implicit assumption)?

  2. Why is it that no probabilistic assumption is made when talking about the mean or variance of data in statistics or data science (or other areas of real life)?

  3. Is there some consistent language for distinguishing these two seemingly different mean and variance terminologies? For example, if my cashier asks me about the "mean weight" of two items, do I ask him/her for the probabilistic distribution of the random variable whose realization are the weights of these two items (def 1), or do I just add up the value and divide (def 2)? How do I know which mean the person is talking about?


Solution 1:

You ask a very insightful question that I wish were emphasized more often.

EDIT: It appears you are seeking reputable sources to justify the above. Sources and relevant quotes have been provided.

Here's how I would explain this:

  • In probability, the emphasis is on population models. You have assumptions that are built-in for random variables, and can do things like saying that "in this population following such distribution, the probability of this value is given by the probability mass function."
  • In statistics, the emphasis is on sampling models. With most real-world data, you do not have access to the data-generating process governed by the population model. Probability provides tools to make guesses on what the data-generating process might be. But there is always some uncertainty behind it. We therefore attempt to estimate characteristics about the population given data.

From Wackerly et al.'s Mathematical Statistics with Applications, 7th edition, chapter 1.6:

The objective of statistics is to make an inference about a population based on information contained in a sample taken from that population...

A necessary prelude to making inferences about a population is the ability to describe a set of numbers...

The mechanism for making inferences is provided by the theory of probability. The probabilist reasons from a known population to the outcome of a single experiment, the sample. In contrast, the statistician utilizes the theory of probability to calculate the probability of an observed sample and to infer from this the characteristics of an unknown population. Thus, probability is the foundation of the theory of statistics.

From Shao's Mathematical Statistics, 2nd edition, section 2.1.1:

In statistical inference... the data set is viewed as a realization or observation of a random element defined on a probability space $(\Omega, \mathcal{F}, P)$ related to the random experiment. The probability measure $P$ is called the population. The data set or random element that produces the data is called a sample from $P$... In a statistical problem, the population $P$ is at least partially unknown and we would like to deduce some properties of $P$ based on the available sample.

So, the probability formulas of the mean and variance assume you have sufficient information about the population to calculate them.

The statistics formulas for the mean and variance are attempts to estimate the population mean and variance, given a sample of data. You could estimate the mean and variance in any number of ways, but the formulas you've provided are some standard ways of estimating the population mean and variance.

Now, one logical question is: why do we choose those formulas to estimate the population mean and variance?

For the mean formula you have there, one can observe that if you assume that your $n$ observations can be represented as observed values of independent and identically distributed random variables $X_1, \dots, X_n$ with mean $\mu$, $$\mathbb{E}\left[\dfrac{1}{n}\sum_{i=1}^{n}X_i \right] = \mu$$ which is the population mean. We say then that $\dfrac{1}{n}\sum_{i=1}^{n}X_i$ is an "unbiased estimator" of the population mean.
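
The unbiasedness claim above can be checked exactly on a small example. The sketch below assumes a hypothetical three-valued population (the values and probabilities are made up for illustration) and enumerates every possible i.i.d. sample of size $n$, weighting each sample mean by the sample's probability.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: X takes the values 0, 1, 5 with these probabilities.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
mu = sum(x * p for x, p in pmf.items())  # population mean E[X]

def expected_sample_mean(pmf, n):
    """Exact E[(1/n) * sum X_i] for n i.i.d. draws, by enumerating all samples."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]  # independence: joint probability is a product
        total += prob * Fraction(sum(sample), n)
    return total

for n in (1, 2, 3, 4):
    # The expected value of the sample mean equals mu for every sample size.
    assert expected_sample_mean(pmf, n) == mu
```

Exact rational arithmetic is used so that the equality holds without any rounding tolerance.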

From Wackerly et al.'s Mathematical Statistics with Applications, 7th edition, chapter 7.1:

For example, suppose we want to estimate a population mean $\mu$. If we obtain a random sample of $n$ observations $y_1, y_2, \dots, y_n$, it seems reasonable to estimate $\mu$ with the sample mean $$\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n}y_i$$

The goodness of this estimate depends on the behavior of the random variables $Y_1, Y_2, \dots, Y_n$ and the effect this has on $\bar{Y} = (1/n)\sum_{i=1}^{n}Y_i$.

Note. In statistics, it is customary to use lowercase $x_i$ to represent observed values of random variables; we then call $\frac{1}{n}\sum_{i=1}^{n}x_i$ an "estimate" of the population mean (notice the difference between "estimator" and "estimate").

For the variance estimator, it is customary to use $n-1$ in the denominator, because if we assume the random variables have finite variance $\sigma^2$, it can be shown that $$\mathbb{E}\left[\dfrac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \dfrac{1}{n}\sum_{j=1}^{n}X_j \right)^2 \right] = \sigma^2\text{.}$$ Thus $\dfrac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \dfrac{1}{n}\sum_{j=1}^{n}X_j \right)^2$ is an unbiased estimator of $\sigma^2$, the population variance.
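
For readers who want the "it can be shown" step, here is one standard derivation, sketched under the same assumptions (i.i.d. $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$), writing $\bar{X} = \frac{1}{n}\sum_{j=1}^{n}X_j$:

$$\begin{aligned}
\sum_{i=1}^{n}(X_i - \bar{X})^2 &= \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2, \\
\mathbb{E}\left[\sum_{i=1}^{n}(X_i - \mu)^2\right] &= n\sigma^2, \qquad \mathbb{E}\left[(\bar{X} - \mu)^2\right] = \text{Var}(\bar{X}) = \dfrac{\sigma^2}{n}, \\
\mathbb{E}\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right] &= n\sigma^2 - n \cdot \dfrac{\sigma^2}{n} = (n-1)\sigma^2.
\end{aligned}$$

Dividing the sum of squared deviations by $n-1$ therefore gives an estimator with expected value exactly $\sigma^2$.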

It is also worth noting that the variance formula in your question, with $n$ in the denominator, has expected value $$\dfrac{n-1}{n}\sigma^2\text{,}$$ and since $\dfrac{n-1}{n} < 1$, it will on average underestimate the population variance.
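
The two expected values can be verified exactly on a toy example. This sketch assumes a hypothetical three-valued population and a sample size of $n = 3$ (both chosen only for illustration), and enumerates all i.i.d. samples to compute the expected value of the sum of squared deviations divided by either $n$ or $n-1$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: X takes the values 0, 1, 5 with these probabilities.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())  # population variance

def expected_variance_estimator(pmf, n, denominator):
    """Exact expected value of sum((x_i - xbar)^2) / denominator over all i.i.d. samples."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]  # independence: joint probability is a product
        xbar = Fraction(sum(sample), n)
        squared_deviations = sum((x - xbar) ** 2 for x in sample)
        total += prob * squared_deviations / denominator
    return total

n = 3
with_n = expected_variance_estimator(pmf, n, n)            # denominator n
with_n_minus_1 = expected_variance_estimator(pmf, n, n - 1)  # denominator n - 1

assert with_n == Fraction(n - 1, n) * sigma2  # biased low by the factor (n-1)/n
assert with_n_minus_1 == sigma2               # unbiased
```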

From Wackerly et al.'s Mathematical Statistics with Applications, 7th edition, chapter 7.2:

For example, suppose that we wish to make an inference about the population variance $\sigma^2$ based on a random sample $Y_1, Y_2, \dots, Y_n$ from a normal population... a good estimator of $\sigma^2$ is the sample variance $$S^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\text{.}$$

The estimators for the mean and variance above are examples of point estimators. From Casella and Berger's Statistical Inference, Chapter 7.1:

The rationale behind point estimation is quite simple. When sampling is from a population described by a pdf or pmf $f(x \mid \theta)$, knowledge of $\theta$ yields knowledge of the entire population. Hence, it is natural to seek a method of finding a good estimator of the point $\theta$, that is, a good point estimator. It is also the case that the parameter $\theta$ has a meaningful physical interpretation (as in the case of a population) so there is direct interest in obtaining a good point estimate of $\theta$. It may also be the case that some function of $\theta$, say $\tau(\theta)$ is of interest.

There is, of course, a lot more that I'm ignoring for now (and one could write an entire textbook, honestly, on this topic), but I hope this clarifies things.

Note. I know that many textbooks use the terms "sample mean" and "sample variance" to describe the estimators above. While "sample mean" tends to be very standard terminology, I disagree with the use of "sample variance" to describe an estimator of the variance; some use $n - 1$ in the denominator, and some use $n$ in the denominator. Also, as I mentioned above, there are a multitude of ways that one could estimate the mean and variance; I personally think the use of the word "sample" used to describe such estimators makes it seem like other estimators don't exist, and is thus misleading in that way.
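
The $n$ versus $n-1$ ambiguity shows up directly in software. As one illustration, Python's standard-library `statistics` module exposes both conventions under different names; the data values below are made up.

```python
import statistics

# Hypothetical data set of 8 observations.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mean = statistics.fmean(data)
squared_deviations = sum((x - mean) ** 2 for x in data)

# pvariance ("population variance") divides by n.
var_n = statistics.pvariance(data)
# variance ("sample variance") divides by n - 1.
var_n_minus_1 = statistics.variance(data)

assert abs(var_n - squared_deviations / n) < 1e-12
assert abs(var_n_minus_1 - squared_deviations / (n - 1)) < 1e-12
```

Other libraries choose their own defaults (for example, `numpy.var` uses `ddof=0`, i.e. the $n$ denominator, unless told otherwise), so it is worth checking which convention a given function follows before comparing results.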


In Common Parlance

This answer is informed primarily by my practical experience in statistics and data analytics, having worked in the fields for about 6 years as a professional. (As an aside, I find one serious deficiency of statistics and data analysis books is that they rarely provide both the mathematical theory and guidance on how to approach problems in practice.)

You ask:

Is there some consistent language for distinguishing these two seemingly different mean and variance terminologies? For example, if my cashier asks me about the "mean weight" of two items, do I ask him/her for the probabilistic distribution of the random variable whose realization are the weights of these two items (def 1), or do I just add up the value and divide (def 2)? How do I know which mean the person is talking about?

In most cases, you want to just stick with the statistical definitions. Most people do not think of statistics as attempting to estimate quantities relevant to a population, and thus are not thinking "I am trying to estimate a population quantity using an estimate driven by data." In such situations, people are just looking for summaries of the data they've provided you, known as descriptive statistics.

The whole idea of estimating quantities relevant to a population using a sample is known as inferential statistics. While (from my perspective) most of statistics tends to focus on statistical inference, in practice, most people - especially if they've not had substantial statistical training - do not approach statistics with this mindset. Most people whom I've worked with think "statistics" is just descriptive statistics.

Shao's Mathematical Statistics, 2nd edition, Example 2.1 talks a little bit about this difference:

In descriptive data analysis, a few summary measures may be calculated, for example, the sample mean... and the sample variance... However, what is the relationship between $\bar{x}$ and $\theta$ [a population quantity]? Are they close (if not equal) in some sense? The sample variance $s^2$ is clearly an average of squared deviations of $x_i$'s from their mean. But, what kind of information does $s^2$ provide?... These questions cannot be answered in descriptive data analysis.


Other remarks about the sample mean and sample variance formulas

Let $\bar{X}_n$ and $S^2_n$ denote the sample mean and the sample variance (with $n-1$ in the denominator) given earlier. The following are properties of these estimators:

  • They are unbiased for $\mu$ and $\sigma^2$, as explained earlier. This is a relatively simple probability exercise.
  • They are consistent for $\mu$ and $\sigma^2$. Since you know measure theory, assume all random variables are defined over a probability space $(\Omega, \mathcal{F}, P)$. It follows that $\bar{X}_n \overset{P}{\to} \mu$ and $S^2_n \overset{P}{\to} \sigma^2$, where $\overset{P}{\to}$ denotes convergence in probability, also known as convergence in measure with respect to $P$. See https://math.stackexchange.com/a/1655827/81560 for the sample variance (observe that the estimator with $n$ in the denominator is used there; simply multiply by $\dfrac{n}{n-1}$, which converges to $1$, and apply Slutsky's theorem), and the question "Proving a sample mean converges in probability to the true mean" for the sample mean. As a stronger result, convergence is almost sure with respect to $P$ in both cases (see the question "Sample variance converge almost surely").
  • If one assumes $X_1, \dots, X_n$ are independent and identically distributed based on a normal distribution with mean $\mu$ and variance $\sigma^2$, one has that $\dfrac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{S_n^2}}$ follows a $t$-distribution with $n-1$ degrees of freedom, which converges in distribution, as $n \to \infty$, to a normally-distributed random variable with mean $0$ and variance $1$. This is a modification of the central limit theorem.
  • If one assumes $X_1, \dots, X_n$ are independent and identically distributed based on a normal distribution with mean $\mu$ and variance $\sigma^2$, $\bar{X}_n$ and $S^2_n$ are uniformly minimum-variance unbiased estimators (UMVUEs) for $\mu$ and $\sigma^2$ respectively. It also follows that $\bar{X}_n$ and $S^2_n$ are independent, through - as mentioned by Michael Hardy - showing that $\text{Cov}(\bar{X}_n, X_i - \bar{X}_n) = 0$ for each $i = 1, \dots, n$, or as one can learn from more advanced statistical inference courses, an application of Basu's Theorem (see, e.g., Casella and Berger's Statistical Inference).

Solution 2:

The first definitions you gave are correct and standard, and statisticians and data scientists will agree with this. (These definitions are given in statistics textbooks.) The second set of quantities you described are called the "sample mean" and the "sample variance", not mean and variance.

Given a random sample from a random variable $X$, the sample mean and sample variance are natural ways to estimate the expected value and variance of $X$.

Solution 3:

Other answers — particularly Clarinetist’s — give excellent rundowns of the most important side of the answer. Given a random variable, we can sample it, and use the sample mean (defined in the statistical sense) to estimate the actual mean of the random variable (defined in the sense of probability theory), and similarly for variance, etc.

But the connection in the other direction doesn’t seem to have been mentioned yet. This is not as important, but it’s much more straightforward, and worth pointing out. Given a sample, i.e. a finite multiset of values $\{x_i\}_{i \in I}$, we can “consider this as a distribution”, i.e. take a random variable $X$, with value $x_i$ for $i$ distributed uniformly over $I$. Then the mean, variance, etc. of $X$ (in the sense of probability theory) will be precisely the mean, variance, etc. of the original multiset (defined in the statistical sense).
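
This correspondence is easy to check on a small example. The sketch below uses a made-up data set, builds the pmf of the random variable $X$ obtained by picking an index uniformly at random, and confirms that the probability-theory mean and variance of $X$ match the data-set mean and the variance with $n$ in the denominator.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical data set (a multiset: the value 1 appears twice).
data = [3, 1, 4, 1, 5]
n = len(data)

# "Statistical" definitions: add up and divide by n.
data_mean = Fraction(sum(data), n)
data_var = sum((x - data_mean) ** 2 for x in data) / n

# "Probability" definitions: pick an index uniformly, so p(x) = count(x) / n.
pmf = {x: Fraction(count, n) for x, count in Counter(data).items()}
prob_mean = sum(x * p for x, p in pmf.items())
prob_var = sum((x - prob_mean) ** 2 * p for x, p in pmf.items())

assert prob_mean == data_mean
assert prob_var == data_var
```

The match is with the $n$-denominator variance specifically; the $n-1$ version is an estimator of an unknown population variance, not the variance of this uniform distribution.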