Why do we ask for *absolute* convergence of a series to define the mean of a discrete random variable?

It's because if the series converges but not absolutely, the Riemann rearrangement theorem says you can reorder the terms to make the sum converge to any value you like, or even diverge. Any good notion of "mean" or "expectation" should not depend on the ordering of the $x_i$'s.
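To see the rearrangement phenomenon concretely, here is a minimal sketch in Python (my own illustration, not part of the original answer): it greedily reorders the terms of the alternating harmonic series $1 - \frac{1}{2} + \frac{1}{3} - \cdots$, which sums to $\ln 2$ in its natural order, so that the partial sums chase any target you pick.

```python
import math

def rearranged_sum(target, n_terms=100_000):
    """Greedily rearrange the alternating harmonic series
    1 - 1/2 + 1/3 - 1/4 + ...  so the partial sums approach `target`.
    Every term 1/k appears exactly once with its original sign, so
    this is a genuine rearrangement of the same series."""
    pos = 1       # next odd denominator (positive terms 1, 1/3, 1/5, ...)
    neg = 2       # next even denominator (negative terms -1/2, -1/4, ...)
    total = 0.0
    for _ in range(n_terms):
        if total <= target:
            total += 1.0 / pos   # below target: spend a positive term
            pos += 2
        else:
            total -= 1.0 / neg   # above target: spend a negative term
            neg += 2
    return total

print(math.log(2))           # natural-order limit, about 0.6931
print(rearranged_sum(0.5))   # same terms, rearranged: about 0.5
print(rearranged_sum(3.0))   # same terms again: about 3.0
```

The greedy strategy works because the positive and negative parts each diverge on their own while the individual terms shrink to zero, which is exactly the situation conditional convergence permits.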

For a more abstract reason, note that the expectation $E[X]$ of a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is defined as the Lebesgue integral $\int_{\Omega} X \, dP$. By definition, the Lebesgue integral is only well-defined when the integrand is absolutely integrable, i.e. $\int_{\Omega} |X| \, dP < \infty$. If you learn more about measure theory, you will also learn why this definition makes sense: it is set up to avoid indeterminate expressions like $\infty - \infty$ in the theory.
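Concretely, the Lebesgue integral handles signs by splitting $X$ into its positive and negative parts, and the expectation is only declared to exist when both halves are finite:

$$X^+ = \max(X, 0), \qquad X^- = \max(-X, 0), \qquad X = X^+ - X^-,$$

$$E[X] = \int_{\Omega} X^+ \, dP - \int_{\Omega} X^- \, dP, \quad \text{defined only when both integrals are finite,}$$

$$\text{i.e. exactly when } E|X| = \int_{\Omega} |X| \, dP < \infty.$$

If both pieces were allowed to be infinite, $E[X]$ would be the indeterminate form $\infty - \infty$; absolute integrability is precisely the condition that rules this out.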


I think any explanation is going to make reference to the fact that without absolute convergence, the value of an infinite sum or an improper Riemann integral depends on the order in which the "pieces" are summed up. That alone may satisfy you, but it didn't fully satisfy me.

The more specific reason that satisfied me is "without absolute convergence the law of large numbers fails". The intuitive reason for this is that when you're taking sample averages, instead of integrating $x f(x)$ symmetrically, you're integrating it by Monte Carlo integration, literally picking locations randomly. As a consequence, if the integral for the mean only converges conditionally, then there is no guarantee that the sample averages have the same behavior along different sequences of samples, or even that the sample averages converge at all.

To see this on a computer, try running a program like this, which takes successive sample averages from the standard Cauchy distribution (which is symmetric about $0$, so its mean "would be zero if it made sense").

n = 1e4;                      % number of samples
x = pi*(rand(1,n) - 1/2);     % uniform draws on (-pi/2, pi/2)
y = cumsum(tan(x))./(1:n);    % tan(x) is standard Cauchy; y holds the running sample means
plot(1:n, y)

This program will run as-is in Matlab or Octave, but very similar programs can be written in any software with support for random numbers and plotting. What you see are quite dramatic jumps in the sample mean, which occur whenever an entry of x gets too close to $\pi/2$ or $-\pi/2$, and which continue to occur even after thousands of samples have been drawn.
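For readers without Matlab or Octave handy, here is an equivalent sketch using only the Python standard library (printing the running mean at checkpoints instead of plotting it):

```python
import math
import random

random.seed(0)   # fixed seed for reproducibility; remove it to see fresh jumps
n = 10_000
total = 0.0
running_mean = []
for i in range(1, n + 1):
    u = math.pi * (random.random() - 0.5)  # uniform on (-pi/2, pi/2)
    total += math.tan(u)                   # tan of that uniform is a standard Cauchy draw
    running_mean.append(total / i)

# Inspect the running mean every 1000 samples: it keeps wandering
# instead of settling toward 0 as the law of large numbers would predict
for i in range(999, n, 1000):
    print(i + 1, running_mean[i])
```

With a Cauchy distribution the law of large numbers does not apply, so unlike a finite-mean distribution, these checkpoint values never stabilize no matter how large n gets.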


There is a distinction to be made in math between features that are an essential part of a mathematical object and features that are used in a more arbitrary manner, simply as labels so that we can discuss the objects.

For instance, suppose we're encoding demographic data and one of the variables is race. Encoding that variable as an integer requires assigning a number to each race. What number a race gets is not an essential feature of the race, but simply an arbitrary label used to keep track of it. It would be rather weird if our calculation of the average value of some metric over the different races gave a different answer depending on how we labeled the races.

Similarly, the index that we assign to different $x$ values is not an essential part of the data. Even if there's an "obvious" order, it's not the only ordering, and we would want our definition not to depend on picking the right order, or on there being an obvious ordering at all. Consider the set of rational numbers. This set is countable, so it's possible to come up with a labeling of the rational numbers by integers, but there are many different ways of going about this. If two people used different labelings of the rational numbers and got different numbers for the mean, that would be a problem.

The fact that the formula has $\Sigma$ rather than $\Sigma_{i=0}^{\infty}$ emphasizes this: the mean is an attribute of the events and their probabilities, not of any particular indexing. We can define the $\Sigma$ operator without any reference to an indexing. The sum over a finite set is easily defined, and for an infinite set $X$, we can define its sum to be the number $L$ such that for any positive $\epsilon$, $X$ can be split into a finite set $H$ and an infinite set $T$ such that the sum over $H$ plus the sum over any finite subset of $T$ is within $\epsilon$ of $L$. That is:

$\Sigma X = L$ if $\forall \epsilon > 0, \exists H, T : H \cup T = X,\ H \cap T = \emptyset,\ |H| < \infty,\ \forall S \subset T,\ |S| < \infty \rightarrow |\Sigma H + \Sigma S - L| < \epsilon$

This definition (unless I've messed up somewhere) is equivalent to unconditional convergence; "unconditional" is the qualifier we have to add when defining sums with respect to an indexing to make the sum independent of the indexing. For series of real numbers, unconditional convergence is in turn equivalent to absolute convergence, which brings us back to the original question.
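As a sanity check, the order-free definition can be tested numerically for an absolutely convergent family of terms. The following sketch (my own illustration) uses the geometric terms $a_k = (-1/2)^k$ with sum $L = 2/3$: it fixes a finite head $H$ whose tail is smaller than $\epsilon$ in absolute value, then verifies that adding any random finite subset $S$ of the tail keeps the total within $\epsilon$ of $L$.

```python
import random

def term(k):
    # terms of an absolutely convergent series: sum over k >= 0 of (-1/2)^k = 2/3
    return (-0.5) ** k

L = 2.0 / 3.0
eps = 1e-6
K = 21                    # tail bound: sum_{k >= 21} |term(k)| = 2**-20 < eps
head = sum(term(k) for k in range(K))   # the finite set H is {0, ..., K-1}

random.seed(0)
for _ in range(100):
    # pick an arbitrary finite subset S of the tail indices {K, K+1, ...}
    S = random.sample(range(K, K + 200), random.randint(0, 50))
    assert abs(head + sum(term(k) for k in S) - L) < eps

print("definition verified for 100 random finite tail subsets")
```

For a conditionally convergent family, no such head $H$ exists: the tail always contains finite subsets with arbitrarily large sums, which is exactly why the order-free sum fails to exist there.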