Intuitive explanation of the tower property of conditional expectation

First, recall that in $E[X|Y]$ we are taking the expectation with respect to $X$, so it can be written as $E[X|Y]=E_X[X|Y]=g(Y)$. Because it is a function of $Y$, it is a random variable, and hence we can take its expectation (with respect to $Y$ now). So the double expectation should be read as $E_Y[E_X[X|Y]]$.
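For concreteness, here is a minimal numerical sketch (the small joint distribution is made up purely for illustration): it computes $g(y)=E[X|Y=y]$ for each value of $y$, and then checks that averaging $g(Y)$ over $Y$ recovers $E[X]$.

```python
# Hypothetical joint pmf p(x, y) on a tiny discrete support, chosen only for illustration.
joint = {
    (0, 0): 0.1, (1, 0): 0.3,   # values of X paired with Y = 0
    (0, 1): 0.2, (1, 1): 0.4,   # values of X paired with Y = 1
}

# Marginal of Y, and the conditional expectation g(y) = E[X | Y = y].
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p
g = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y] for y in p_y}

# E[X] directly, and the outer expectation E_Y[g(Y)] = sum_y g(y) P(Y = y).
e_x = sum(x * p for (x, _), p in joint.items())
e_g = sum(g[y] * p_y[y] for y in p_y)
print(e_x, e_g)  # both print 0.7 (up to floating point)
```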

As for the intuitive meaning, there are several approaches. I like to think of the expectation as a kind of predictor/guess (indeed, it is the predictor that minimizes the mean squared error).

Suppose for example that $X, Y$ are two (positively) correlated variables, say the weight and height of persons from a given population. The expectation of the weight $E(X)$ would be my best guess of the weight of an unknown person: I'd bet on this value, if not given more data (my uninformed bet is constant). Instead, if I know the height, I'd bet on $E(X | Y)$: that means that for different persons I'd bet a different value, and my informed bet would not be constant: sometimes I'd bet more than the "uninformed bet" $E(X)$ (for tall persons), sometimes less. The natural question arises: can I say something about my informed bet on average? Well, the tower property answers: on average, you'll bet the same.
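A quick simulation shows the bookkeeping behind this (the height/weight population below is entirely made up for the sketch): averaging the informed bets $E(X|Y=y)$ over the population, weighting each height by how often it occurs, gives back the uninformed bet $E(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up population: height in cm, weight positively correlated with height.
height = np.round(rng.normal(170, 10, n))          # discretize to whole centimetres
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 5, n)

# Uninformed bet: the overall mean weight E(X).
print(weight.mean())

# Informed bet: E(X | Y = y) estimated as the mean weight among persons of height y,
# then averaged over the population, weighting each height by its frequency.
heights, counts = np.unique(height, return_counts=True)
informed = np.array([weight[height == y].mean() for y in heights])
print((informed * counts / n).sum())  # same number: the informed bets average out to E(X)
```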


Added: I agree (ten years later) with @Did's comment below. My notation here is misleading: an expectation is defined in itself, and it makes little or no sense to specify "with respect to $Y$". In my answer here I try to clarify this, and to reconcile this fact with the (many) examples where one qualifies (subscripts) the expectation (with respect to ...).


For simple discrete situations from which one obtains most basic intuitions, the meaning is clear.

I have a large bag of biased coins. Suppose that half of them favour heads with probability $0.7$, two-fifths of them favour heads with probability $0.8$, and the rest favour heads with probability $0.9$.

Pick a coin at random, toss it, say once. To find the expected number of heads, calculate the expectations, given the various biasing possibilities. Then average the answers, taking into consideration the proportions of the various types of coin.

It is intuitively clear that this formal procedure "should" give about the same answer as the highly informal process of, say, repeating the experiment $1000$ times and dividing the total number of heads by $1000$. For if we do that, in about $500$ cases we will get the first type of coin, and out of these $500$ we will get about $350$ heads, and so on. The informal arithmetic mirrors exactly the more formal process described in the preceding paragraph.

If it is more persuasive, we can imagine tossing the chosen coin $12$ times.
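Concretely, for a single toss the averaging described above gives

$$E[\text{number of heads}] = \tfrac{1}{2}\cdot 0.7 + \tfrac{2}{5}\cdot 0.8 + \tfrac{1}{10}\cdot 0.9 = 0.35 + 0.32 + 0.09 = 0.76,$$

so in $1000$ repetitions we expect about $760$ heads, and for $12$ tosses of the chosen coin about $12 \times 0.76 = 9.12$ heads.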


The expected value of $X$ is still the expected value of $X$ when you take into account the possible values of $Y$.


I will further expand on leonbloy's answer by emphasizing the role of change of variables for integrals, but this will be a self-contained answer.

Assume $X$ is a real-valued random variable (defined on $(\Omega, P, \mathcal F)$). Let's say $X$ is some very complicated random variable and you wish to calculate its expectation, because maybe it is an exercise problem and you have to submit a solution. You don't see how to calculate its expectation until one day you get the feeling that it might be easy to calculate the expectation of $X$ restricted to the slice $Y = y$, where $Y$ is another (auxiliary) real-valued random variable (on $\Omega$) that you came up with and $y$ can be an arbitrary real number. You guess that the expectation along the slice is $h(y)$ (where $h$ is some explicitly written function; maybe you guess $h(y) = y^2$, or maybe $h(y) = y+1$, ...). Now you think you just need to calculate $\int_{-\infty}^{\infty} h(y) d\mu(y)$ (where $\mu$ is the probability distribution of $Y$), because your intuition says that the result of that calculation is exactly the value of $E(X)$. (The intuition comes from generalizing the intuition in André Nicolas's answer.)

Let's say you successfully calculated $\mu$ and then also $\int_{-\infty}^{\infty} h(y) d\mu(y)$. Now it only remains to rigorously prove that $\int_{-\infty}^{\infty} h(y) d\mu(y)$ is actually equal to $E(X)$ and you immediately see a little problem: the expectation along a particular slice such as $Y=2$ may have no meaning at all because $Y=2$ may be a null event.

I must mention that if one finishes a course on measure theory, one will have passed through denial, anger, bargaining, depression, and arrived at acceptance of the fact that one needs to live with many measurable functions that are only a.e. well defined (as opposed to everywhere defined), and with measurable functions that come from theorems that only guarantee uniqueness up to a.e. equality. Then you will get some intuition that the little problem has a way out (though not in the naive way: see the Borel–Kolmogorov paradox).

While the expression $E[X | Y=2]$ (conditional expectation of $X$ given the possibly null event $Y=2$) eludes you for now, the expression $E[X|Y]$ is fine. If you are lucky, you may be able to prove that $E[X|Y] = h(Y)$ holds almost everywhere. Proving that might require applying properties of conditional expectation several times until you arrive at the desired conclusion $E[X|Y] = h(Y)$. Sometimes you start with an informal argument for why you find the (meaningless) equation $E[X | Y = 2] = h(2)$ plausible, and then you keep replacing steps of the informal argument with rigorous steps until you come up with a rigorous proof of the (not meaningless) equation $E[X|Y] = h(Y)$.
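For instance (a toy choice of $X$ and $Y$, made up just to illustrate the mechanism): suppose $X = Y^2 + Z$ with $Y^2$ and $Z$ integrable, $Z$ independent of $Y$, and $E[Z] = 0$, so the informal guess is $E[X | Y = y] = y^2$, i.e. $h(y) = y^2$. The rigorous counterpart follows from standard properties of conditional expectation:

$$E[X | Y] = E[Y^2 | Y] + E[Z | Y] = Y^2 + E[Z] = Y^2 = h(Y) \quad \text{a.e.},$$

using linearity, the fact that $Y^2$ is $\sigma(Y)$-measurable, and the independence of $Z$ and $Y$.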

Let's say you manage to prove $E[X|Y] = h(Y)$. Now what remains? The tower property says $E[X] = E[h(Y)]$, and the change of variables (for integrals) says $E[h(Y)] = \int_{-\infty}^{\infty} h(y) d\mu(y)$, which you have already calculated. So you are done: you have shown that $E[X]$ is equal to your calculation result.
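Continuing the toy example above (all distributions invented for illustration): with $Y \sim \mathrm{Uniform}(0,1)$ the slice-wise calculation is $\int h(y) d\mu(y) = \int_0^1 y^2 dy = 1/3$, and the tower property plus change of variables promises that this equals $E[X]$. A quick Monte Carlo check agrees:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

# Toy example: Y ~ Uniform(0, 1) and X = Y**2 + Z with Z ~ N(0, 1) independent of Y,
# so E[X | Y] = h(Y) with h(y) = y**2, and mu (the distribution of Y) is Uniform(0, 1).
y = rng.uniform(0.0, 1.0, n)
x = y**2 + rng.normal(0.0, 1.0, n)

# The slice-wise calculation: integral of h(y) d mu(y) = integral_0^1 y**2 dy = 1/3.
print(1 / 3)

# Direct Monte Carlo estimate of E[X]; by the tower property it should agree (up to noise).
print(x.mean())
```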

Remark 1:

Some people write $h(y)$ as $E[X | Y= y]$. It is okay, after all, to use expressions like $E[X | Y= y]$, as long as you and the readers are aware of the gotchas.

Remark 2:

Also known as the law of total expectation.

Remark 3:

If $\mathcal G$ is a sub-sigma-algebra, then we still have $E[X] = E[E[X|\mathcal G]]$. If $\mathcal G$ is generated by a finite or countably infinite family of random variables, you can still give a similar interpretation. For example, if $\mathcal G$ is generated by two real-valued random variables $Y, Z$, then $E[X] = E[E[X|\mathcal G]]$ is just another way of saying $E[X] = \int h(y,z) d\mu(y,z)$, where $h: \mathbb R^2 \to \mathbb R$ is some measurable function satisfying $E[X| \mathcal G] = h(Y,Z)$ and $\mu$ is the probability distribution of the joint $(Y,Z)$.

The intuition.

The intuition for the tower property is that, for example, $E[X] = E[E[X| Y,Z]]$ (where $E[X| Y,Z]$ is just $E[X| (Y,Z)]$) is simply a more succinct way of saying $E[X] = \int h(y,z) d\mu(y,z)$.
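As a concrete (made-up) instance of the two-variable case: take $Y, Z$ independent $\mathrm{Uniform}(0,1)$ and $X = YZ$ plus mean-zero noise, so that $h(y,z) = yz$ and $\int h(y,z) d\mu(y,z) = E[YZ] = 1/4$; the succinct form $E[X] = E[E[X|Y,Z]]$ then predicts $E[X] = 1/4$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

# Made-up example: Y, Z independent Uniform(0, 1), X = Y*Z + mean-zero noise,
# so E[X | Y, Z] = h(Y, Z) with h(y, z) = y*z and mu = joint law of (Y, Z).
y = rng.uniform(0.0, 1.0, n)
z = rng.uniform(0.0, 1.0, n)
x = y * z + rng.normal(0.0, 1.0, n)

# Integral of h against mu: E[Y*Z] = E[Y] * E[Z] = 1/4 by independence.
print(0.25)

# Direct estimate of E[X]; the tower property says the two agree (up to Monte Carlo noise).
print(x.mean())
```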

Remark 4:

The convenience of using something like $E[X] = E[E[X|Y, Z]]$ rather than $E[X] = \int h(y,z) d\mu(y,z)$ is that with the former you keep the setup in a single probability space $\Omega$, while the latter involves two probability spaces, $(\Omega, P)$ and $(\mathbb R^2, \mu)$.