Can I derive the formula for expected value of continuous random variables from the discrete case?

I've previously asked this question on Stats SE, but I guess it fits Math SE better. Is it possible to rigorously derive the formula for the expected value of a continuous random variable starting from the expected value in the discrete case, i.e.

$$E[X] = \sum_{i=1}^{n}p_i x_i$$

to obtain

$$E[X]=\int_{-\infty}^{\infty}xf(x)dx$$

When formulating the definition for the continuous case, the intention was, I believe, to make it 'equivalent' to the discrete case. So, for example, I'd like $E[X]$ of a continuous random variable $X$ to be equal to the sum, over every possible value of $X$, of that value times the probability of that particular value.

The problem is that the probability of any particular value of $X$ is $0$, so the expected value calculated that way would always be $0$. But some people have tried to convince me that it's possible to overcome these issues with the help of the Lebesgue integral. Could anyone explain intuitively how that is possible? I'm convinced that no matter what kind of integration we use, we cannot somehow magically assign non-zero probabilities to the single values the random variable $X$ might take. They will always be $0$!

Or maybe there's no magic involved and the best we can do is work with infinitely thin intervals of $X$? From what I've managed to find out (I don't have time to study all of measure theory and Lebesgue integration at the moment to figure it out on my own), it's about approximating the original continuous random variable $X$ with a step function, and the more steps there are, the better the approximation. But still, all we are ever doing is calculating probabilities of intervals (infinitely thin intervals, perhaps, but intervals nonetheless).
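For what it's worth, this 'thin intervals' intuition can be checked numerically. Here is a minimal sketch (my own example, not from any answer): take $X \sim \mathrm{Exponential}(1)$, whose true mean is $1$. No single point ever gets positive probability; we only sum midpoint times interval probability, yet the sums converge to the mean.

```python
import math

# Sketch (my example): X ~ Exponential(1), true mean 1.
# We never assign positive probability to any single point --
# we only sum midpoint * P(X in interval) over thin intervals.
def cdf(x):
    """CDF of the Exponential(1) distribution."""
    return 1.0 - math.exp(-x)

def approx_mean(n, T=30.0):
    """Approximate E[X] using n intervals covering [0, T]."""
    h = T / n
    total = 0.0
    for i in range(n):
        a, b = i * h, (i + 1) * h
        # midpoint of the interval times its probability mass
        total += ((a + b) / 2) * (cdf(b) - cdf(a))
    return total

for n in (10, 100, 10000):
    print(n, approx_mean(n))
```

As the intervals get thinner, the approximation approaches $1$, even though every singleton keeps probability $0$.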

The fact that something gets closer and closer to the original function in the limit doesn't mean it behaves the same as the original function (see the very popular example here).

In the 'very popular example' above, even though the curve approaches the circle, its length never equals the perimeter of that circle. Similarly, in the 'The Riemann–Stieltjes integral: intuition' part, they say the discrete r.v. $X_n$ converges to the continuous r.v. $X$ as $n \to \infty$, so that in the limit it becomes the same variable. So it's important to ask whether it's reasonable to expect the approximation of the continuous r.v. $X$ to behave the same as the original random variable $X$, even if in the limit they are 'indistinguishable', whatever that word means in mathematics. The curve and the circle are 'indistinguishable' too, but they still have different properties.

So apparently the continuous case is not derived from the discrete case, but is a generalization of it. I guess in any mathematical theory there is no such thing as 'the only correct' generalization, so the continuous formula could look different. If you claim this is the only 'valid' one, then shouldn't we call it a derivation of the formula?


You are slightly missing the point. The 'right way' is not to define continuous r.v. expectation to match the discrete case, but to find a structure that they both share. Lebesgue integration defines both continuous and discrete expectations at the same time, basing everything on the probability set-function $P$, which is normally called a probability measure, or just a probability. $P$ measures the size of sets in the event space, e.g. $P(\text{$X$ is heads}) = 1/2$. We can also talk about the induced probability measure (or law, or distribution) of $X$, $\mu_X = P\circ X^{-1}$. In the same example, $P(X^{-1}(\text{heads})) = 1/2$.

Suppose a random variable $X$ takes only finitely many values, $X = \sum_{i=1}^N x_i\Bbb 1_{A_i}$. Such an $X$ is called a 'simple function'. Here $\Bbb 1_A$ is the indicator function $$ \Bbb 1_A(ω) := \begin{cases} 1 & ω ∈ A \\ 0 & ω \not∈ A \end{cases}$$ and, as long as the $A_i$ are disjoint sets, it follows that $A_i = [X = x_i]$.

We then define the expectation, i.e. the integral of $X$ with respect to the probability $P$, to be

$$\Bbb E X:= ∫_Ω X(ω) dP(ω) := \sum_{i=1}^N x_i P(A_i) $$

($∫_Ω X(ω) dP(ω)$ is also written $∫_Ω X dP, ∫_Ω X(ω) P(dω),\dots$)

Note that if a singleton, e.g. $A_i = \{\text{heads}\}$, has positive probability, $P(\{\text{heads}\}) > 0$, then this integration method does assign positive probability to a single point, which I believe resolves one of your concerns.
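To make the simple-function definition concrete, here is a minimal sketch (the fair-die example is mine, not from the discussion above): $X$ takes the values $1,\dots,6$ on six disjoint events $A_i$ with $P(A_i) = 1/6$ each, so $\Bbb E X = \sum_i x_i P(A_i)$.

```python
from fractions import Fraction

# Sketch (my example): a fair six-sided die as a simple function.
# X takes values 1..6 on six disjoint events A_i with P(A_i) = 1/6,
# so E X = sum_i x_i * P(A_i).
outcomes = [(x, Fraction(1, 6)) for x in range(1, 7)]
EX = sum(x * p for x, p in outcomes)
print(EX)  # 7/2
```

Using exact fractions keeps the finite sum exact, which matches the fact that no limits are needed in the simple-function case.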

For a random variable that can take more than finitely many values, we approximate it by simple functions $X_n$ and take limits, $$\Bbb E X:=∫_Ω X(ω) dP(ω) := \lim_{n→∞}∫_Ω X_n(ω) dP(ω)$$ (one then checks that the limit does not depend on the particular approximating sequence).
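To illustrate this limiting step numerically (my own example): for $X \sim \mathrm{Uniform}(0,1)$, the standard approximating simple functions $X_n = \lfloor 2^n X\rfloor/2^n$ take the $2^n$ values $k/2^n$, each on an event of probability $1/2^n$, so each $\Bbb E X_n$ is a finite sum, and these sums increase to $\Bbb E X = 1/2$.

```python
# Sketch (my example): X ~ Uniform(0, 1), approximated by the standard
# simple functions X_n = floor(2^n X) / 2^n.  Each X_n takes the 2^n
# values k/2^n, each with probability 1/2^n, so E X_n is a finite sum.
def E_Xn(n):
    m = 2 ** n
    return sum((k / m) * (1 / m) for k in range(m))

for n in (1, 4, 10):
    print(n, E_Xn(n))  # increases toward E X = 1/2
```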

Once this is out of the way, we can then distinguish discrete and continuous (or combinations of) random variables by looking at their distribution (induced measures). For a discrete $X$, we would have the distribution $\mu_X = \sum_{i=1}^∞ P(X=x_i)δ_{x_i}$, where $δ_x$ are the Dirac point masses,

$$ δ_x(A) := \begin{cases} 1 & x ∈ A \\ 0 & x \not∈ A \end{cases}$$

It is not hard to check that

$$\Bbb E X = ∫_{-∞}^∞ x d\mu_X(x)$$

In the case of a continuous random variable, this means that there is a density function $f$ such that $d\mu_X = f(x) dx$. Again, it is only a little harder to check that

$$\Bbb E X = ∫_{-∞}^∞ x f(x) dx$$
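As a quick numeric sanity check of the density formula (my own example, not from the answer): for $X \sim N(2, 1)$, approximating $∫ x f(x)\,dx$ by a midpoint sum should recover the mean $2$.

```python
import math

# Sketch (my example): X ~ Normal(mu=2, sigma=1), so E X should be 2.
def f(x, mu=2.0, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def E_numeric(steps=200000, lo=-8.0, hi=12.0):
    """Midpoint approximation of the integral of x * f(x) over [lo, hi]."""
    h = (hi - lo) / steps
    return sum((lo + (i + 0.5) * h) * f(lo + (i + 0.5) * h) * h
               for i in range(steps))

print(E_numeric())  # close to 2.0
```

The truncation to $[-8, 12]$ is ten standard deviations on each side of the mean, so the missing tail mass is negligible.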


From the comments:

Is that structure unique, or might it be possible to find another structure they would both share, with a different formula for the continuous case?

That question is very broad, and I don't know whether one day we will discover a better foundation for probability than measure theory. However, another good thing about Lebesgue integration is that it comes with some very powerful convergence theorems, e.g.

Monotone Convergence Theorem. If $f_n,f$ are nonnegative $\mu$-measurable (i.e. 'nice') functions such that $f_n$ increases to $f$, i.e. $f_n(x)\uparrow f(x)$ for every $x$, then $$∫_Ω f_n \ d\mu \uparrow ∫_Ω f \ d\mu$$
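A toy numeric check of monotone convergence (my own example): take the counting measure on $\{0,1,2,\dots\}$ with $f(k) = 2^{-k}$, and let $f_n = f\cdot\Bbb 1_{\{0,\dots,n\}}$. The $f_n$ increase pointwise to $f$, and their integrals, which here are just partial sums, increase to the full sum $2$.

```python
# Sketch (my example): counting measure on {0, 1, 2, ...} with f(k) = 2^{-k}.
# The truncations f_n = f * 1_{[0, n]} increase pointwise to f, and their
# integrals (partial sums) increase to the full sum, 2.
def integral_fn(n):
    return sum(2.0 ** (-k) for k in range(n + 1))

vals = [integral_fn(n) for n in (1, 5, 20)]
print(vals)  # increasing, approaching 2
```

Note the same theorem statement covers this purely discrete measure and Lebesgue measure alike, which is exactly the shared structure being advertised.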

and

Dominated Convergence Theorem. If the measurable functions $f_n,f$ are such that $f_n$ converges pointwise to $f$, and there is a measurable $g\ge 0$ such that $∫_Ω g \ d\mu <∞ $ and $|f_n|\le g$ for all $n$, then $$∫_Ω f_n\ d\mu → ∫_Ω f\ d\mu $$
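And a toy check of dominated convergence (again my own example): on $[0,1]$ with Lebesgue measure, $f_n(x) = x^n \to 0$ for every $x < 1$, the sequence is dominated by $g \equiv 1$, and indeed $∫ f_n = 1/(n+1) \to 0 = ∫ \lim f_n$.

```python
# Sketch (my example): on [0, 1] with Lebesgue measure, f_n(x) = x^n -> 0
# for x < 1, dominated by g = 1, and the integrals 1/(n+1) -> 0.
def int_xn(n, steps=100000):
    h = 1.0 / steps
    # midpoint approximation of the integral of x^n over [0, 1]
    return sum(((i + 0.5) * h) ** n * h for i in range(steps))

for n in (1, 10, 100):
    print(n, int_xn(n))  # approx 1/(n+1), tending to 0
```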

IMO, these results are very intuitive, and they work for any measure (discrete and continuous alike!), so if we somehow had a different way to integrate that violated one of them, I would view it as the 'wrong' way to integrate. But if a new integration method doesn't violate them, is it really a different formula?