Could someone explain conditional independence?

My understanding right now is that an example of conditional independence would be:

If two people live in the same city, the probability that person A gets home in time for dinner and the probability that person B gets home in time for dinner are independent; that is, we wouldn't expect one to have an effect on the other. But if a snow storm hits the city and introduces a probability C that traffic will be at a standstill, you would expect that the probability of A getting home in time for dinner and the probability of B getting home in time for dinner would both change.

If this is a correct understanding, I guess I still don't understand what exactly conditional independence is, or what it does for us (why does it have a separate name, as opposed to just compounded probabilities), and if this isn't a correct understanding, could someone please provide an example with an explanation?


The scenario you describe provides a good example of conditional independence, though you haven't quite described it as such. As the Wikipedia article puts it,

$R$ and $B$ are conditionally independent [given $Y$] if and only if, given knowledge of whether $Y$ occurs, knowledge of whether $R$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $R$ occurring.

In this case, $R$ and $B$ are the events of persons A and B getting home in time for dinner, and $Y$ is the event of a snow storm hitting the city. Certainly the probabilities of $R$ and $B$ will depend on whether $Y$ occurs. However, just as it's plausible to assume that if these two people have nothing to do with each other their probabilities of getting home in time are independent, it's also plausible to assume that, while they will both have a lower probability of getting home in time if a snow storm hits, these lower probabilities will nevertheless still be independent of each other. That is, if you already know that a snow storm is raging and I tell you that person A is getting home late, that gives you no new information about whether person B is getting home late. You're getting information on that from the fact that there's a snow storm, but given that fact, the fact that A is getting home late doesn't make it more or less likely that B is getting home late, too.

So conditional independence is the same as normal independence, but restricted to the case where you know that a certain condition is or isn't fulfilled. Not only can you not find out about A by finding out about B in general (normal independence), but you also can't do so under the condition that there's a snow storm (conditional independence).

An example of events that are independent but not conditionally independent would be: You randomly sample two people A and B from a large population and consider the probabilities that they will get home in time. Without any further knowledge, you might plausibly assume that these probabilities are independent. Now you introduce event $Y$, which occurs if the two people live in the same neighbourhood (however that might be defined). If you know that $Y$ occurred and I tell you that A is getting home late, then that would tend to increase the probability that B is also getting home late, since they live in the same neighbourhood and any traffic-related causes of A getting home late might also delay B. So in this case the probabilities of A and B getting home in time are not conditionally independent given $Y$, since once you know that $Y$ occurred, you are able to gain information about the probability of B getting home in time by finding out whether A is getting home in time.

Strictly speaking, this scenario only works if there's always the same amount of traffic delay in the city overall and it just moves to different neighbourhoods. If that's not the case, then it wouldn't be correct to assume independence between the two probabilities, since the fact that one of the two is getting home late would already make it somewhat likelier that there's heavy traffic in the city in general, even without knowing that they live in the same neighbourhood.

To give a precise example: Say you roll a blue die and a red die. The two results are independent of each other. Now you tell me that the blue result isn't a $6$ and the red result isn't a $1$. You've given me new information, but that hasn't affected the independence of the results. By taking a look at the blue die, I can't gain any knowledge about the red die; after I look at the blue die I will still have a probability of $1/5$ for each number on the red die except $1$. So the probabilities for the results are conditionally independent given the information you've given me. But if instead you tell me that the sum of the two results is even, this allows me to learn a lot about the red die by looking at the blue die. For instance, if I see a $3$ on the blue die, the red die can only be $1$, $3$ or $5$. So in this case the probabilities for the results are not conditionally independent given this other information that you've given me. This also underscores that conditional independence is always relative to the given condition -- in this case, the results of the dice rolls are conditionally independent with respect to the event "the blue result is not $6$ and the red result is not $1$", but they're not conditionally independent with respect to the event "the sum of the results is even".
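To make the dice example fully concrete, here is a small enumeration sketch (my own addition, not part of the answer above); it checks both conditioning events by brute force over the 36 equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (blue, red) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def cond_prob(event, given):
    """P(event | given), computed by counting outcomes."""
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in kept if event(o)), len(kept))

# Condition 1: "the blue result isn't a 6 and the red result isn't a 1".
c1 = lambda o: o[0] != 6 and o[1] != 1
print(cond_prob(lambda o: o[1] == 3, c1))                             # 1/5
print(cond_prob(lambda o: o[1] == 3, lambda o: c1(o) and o[0] == 2))  # still 1/5

# Condition 2: "the sum of the two results is even".
c2 = lambda o: (o[0] + o[1]) % 2 == 0
print(cond_prob(lambda o: o[1] == 2, c2))                             # 1/6
print(cond_prob(lambda o: o[1] == 2, lambda o: c2(o) and o[0] == 3))  # 0
```

Under the first condition, learning the blue result leaves the distribution of the red result unchanged; under the second, it changes it drastically.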


The example you've given (the snowstorm) is usually presented as a case where you might think the two events are truly independent (since the two people take totally different routes home), i.e.

$p(A|B)=p(A)$.

However, in this case they are not truly independent; they are "only" conditionally independent given the snowstorm, i.e.

$p(A|B,Z) = p(A|Z)$.

A clearer example paraphrased from Norman Fenton's website: if Alice (A) and Bob (B) both flip the same coin, but that coin might be biased, we cannot say

$p(A=H|B=H) = p(A=H)$

(i.e. that they are independent), because if we see Bob flip heads, the coin is more likely to be biased towards heads, and hence the left-hand probability should be higher. However, if we denote by Z the event "the coin is biased towards heads", then

$p(A=H|B=H,Z)=p(A=H|Z)$

That is, we can remove Bob from the equation because we already know the coin is biased. Given the fact that the coin is biased, the two flips are conditionally independent.
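If it helps, here is a numerical sketch of this coin story. The specific numbers (a 50/50 prior over a fair coin and a coin that lands heads with probability 0.9) are my own assumption, chosen only to put numbers on the asymmetry:

```python
# Assumed model: the coin is either fair or biased towards heads, 50/50 prior.
p_Z = 0.5                          # P(Z): the coin is biased towards heads
p_heads = {True: 0.9, False: 0.5}  # P(H | biased?); flips are i.i.d. given the coin

def p_bias(z):
    return p_Z if z else 1 - p_Z

# Marginalising over the unknown bias:
p_A = sum(p_bias(z) * p_heads[z] for z in (True, False))        # P(A=H)
p_AB = sum(p_bias(z) * p_heads[z] ** 2 for z in (True, False))  # P(A=H, B=H)

print(p_A)            # 0.7
print(p_AB / p_A)     # P(A=H | B=H) ~= 0.757 > 0.7  (P(B=H) = P(A=H) by symmetry)

# Conditioning on Z restores independence: given the bias, B tells us nothing new.
print(p_heads[True])  # P(A=H | Z) = P(A=H | B=H, Z) = 0.9
```

The first pair of numbers shows that seeing Bob's heads raises the probability of Alice's heads; the last line shows that once the bias is known, Bob's flip adds nothing.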

This is the common form of conditional independence: events that are not statistically independent, but are conditionally independent given some third event.

It is also possible for events to be statistically independent but not conditionally independent. To borrow an example from Wikipedia: if $A$ and $B$ independently take the value $0$ or $1$, each with probability $0.5$, and $C$ denotes the product of the values of $A$ and $B$ ($C=A\times B$), then $A$ and $B$ are independent:

$p(A=0|B=0) = p(A=0) = 0.5$

but they are not conditionally independent given $C$:

$p(A=0|B=0,C=0) = 0.5 \neq \frac{2}{3} = p(A=0|C=0)$
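A quick brute-force enumeration (not part of the original argument) confirms these numbers:

```python
from fractions import Fraction
from itertools import product

# The four equally likely worlds (a, b, c) with c = a * b.
worlds = [(a, b, a * b) for a, b in product((0, 1), repeat=2)]

def cond_prob(event, given):
    kept = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in kept if event(w)), len(kept))

print(cond_prob(lambda w: w[0] == 0, lambda w: w[1] == 0))                # P(A=0|B=0)     = 1/2
print(cond_prob(lambda w: w[0] == 0, lambda w: True))                     # P(A=0)         = 1/2
print(cond_prob(lambda w: w[0] == 0, lambda w: w[1] == 0 and w[2] == 0))  # P(A=0|B=0,C=0) = 1/2
print(cond_prob(lambda w: w[0] == 0, lambda w: w[2] == 0))                # P(A=0|C=0)     = 2/3
```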


Other answers have provided great responses elaborating on the intuitive meaning of conditional independence. Here, I won't add to that; instead I want to address your question about "what it does for us," focusing on the computational implications.

There are three events/propositions/random variables in play, $A$, $B$, and $C$. They have a joint probability, $P(A,B,C)$. In general, a joint probability for three events can be factored in many different ways: \begin{align} P(A,B,C) &= P(A)P(B,C|A)\\ &= P(A)P(B|A)P(C|A,B) \;=\; P(A)P(C|A)P(B|A,C)\\ &= P(B)P(A,C|B)\\ &= P(B)P(A|B)P(C|A,B) \;=\; P(B)P(C|B)P(A|B,C)\\ &= P(C)P(A,B|C)\\ &= P(C)P(A|C)P(B|A,C) \;=\; P(C)P(B|C)P(A|B,C)\\ \end{align} Something to notice here is that every expression on the RHS includes a factor involving all three variables.

Now suppose our information about the problem tells us that $A$ and $B$ are conditionally independent given $C$. A conventional notation for this is: $$ A \perp\!\!\!\perp B \,|\, C, $$ which means (among other implications), $$ P(A|B,C) = P(A|C). $$ This means that the last of the many expressions I displayed for $P(A,B,C)$ above can be written, $$ P(A,B,C) = P(C)P(B|C)P(A|C). $$ From a computational perspective, the key thing to note is that conditional independence here means we can write the 3-variable function $P(A,B,C)$ in terms of 1-variable and 2-variable functions. In a nutshell, conditional independence means that joint distributions are simpler than they might have been. When there are lots of variables, conditional independence can imply grand simplifications of joint probabilities. And if (as is often the case) you have to sum or integrate over some of the variables, conditional independence can let you pull some factors through a sum/integral, simplifying the summand/integrand.
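To illustrate the computational point, here is a small numpy sketch (the variable cardinalities and distributions below are made up) that builds the full joint from the factors $P(C)$, $P(B|C)$, $P(A|C)$, compares the number of table entries, and checks the defining property on the reassembled joint:

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nB, nC = 4, 5, 3   # assumed cardinalities of the three discrete variables

p_C = rng.dirichlet(np.ones(nC))                    # P(C), shape (nC,)
p_B_given_C = rng.dirichlet(np.ones(nB), size=nC)   # P(B|C), rows indexed by c
p_A_given_C = rng.dirichlet(np.ones(nA), size=nC)   # P(A|C), rows indexed by c

# Reassemble the full joint P(A,B,C) = P(C) P(B|C) P(A|C).
joint = np.einsum('c,cb,ca->abc', p_C, p_B_given_C, p_A_given_C)
assert np.isclose(joint.sum(), 1.0)

# Storage: three small tables instead of one 3-way table.
print(p_C.size + p_B_given_C.size + p_A_given_C.size)  # 3 + 15 + 12 = 30 entries
print(joint.size)                                       # 4 * 5 * 3  = 60 entries

# The defining property holds on the reassembled joint: P(A|B,C) = P(A|C).
p_A_given_BC = joint / joint.sum(axis=0, keepdims=True)   # divide by P(B,C)
assert np.allclose(p_A_given_BC, p_A_given_C.T[:, None, :])
```

With only three small variables the saving is modest, but the gap between the factored and full representations grows rapidly as variables are added.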

This can be very important for computational implementation of Bayesian inference. When you want to quantify how strongly some observed data, $D$, support rival hypotheses $H_i$ (with $i$ a label distinguishing the hypotheses), you are probably used to seeing Bayes's theorem (BT) in its "posterior $\propto$ prior times likelihood" form: $$ P(H_i|D) = \frac{P(H_i)P(D|H_i)}{P(D)}, $$ where the terms in the numerator are the prior probability for $H_i$ and the sampling (or conditional predictive) probability for $D$ (aka, the likelihood for $H_i$), and the term in the denominator is the prior predictive probability for $D$ (aka the marginal likelihood, since it is the marginal of $P(D,H_i)$). But recall that $P(H_i,D) = P(H_i)P(D|H_i)$ (in fact, one typically derives BT using this, and equating it to the alternative factorization). So BT can be written as $$ P(H_i|D) = \frac{P(H_i,D)}{P(D)}, $$ or, in words, $$ \mbox{Posterior} = \frac{\mbox{Joint for everything}}{\mbox{Marginal for observations}}. $$

In models with complex dependence structures, this turns out to be the easiest way to think of modeling: The modeler expresses the joint probability for the data and all hypotheses (possibly including latent parameters for things you don't know but need to know in order to predict the data). From the joint, you compute the marginal for the data, to normalize the joint to give you the posterior (you may not even need to do this, e.g., if you use MCMC methods that don't depend on normalization constants).
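As a tiny discrete illustration of the "joint over marginal" form (the numbers below are mine, not from the answer): three rival hypotheses, one observed datum $D$.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # P(H_i) for three rival hypotheses
likelihood = np.array([0.10, 0.40, 0.70])  # P(D | H_i) for the observed data D

joint = prior * likelihood      # P(H_i, D), the "joint for everything"
marginal = joint.sum()          # P(D), the "marginal for observations"
posterior = joint / marginal    # P(H_i | D)

print(posterior)                # approx. [0.161, 0.387, 0.452]
print(posterior.sum())          # 1.0 (up to floating point)
```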

Now you can see the value of conditional independence. Since the starting point of computation is the joint for everything, anything you can do to simplify the expression for the joint (and its sums/integrals) can be a great help to computation. Probabilistic programming languages (e.g., BUGS, JAGS, and to some degree Stan) use graphical representations of conditional dependence assumptions to organize and simplify computations.


No independence

Take a random sample of school children and for each child obtain data on:

  • Foot Size ($F$)
  • Literacy Score ($L$).

The two will be (positively) correlated, in that the bigger the foot size the higher the literacy score.

The random variables $F$ and $L$ are not independent.

Confounder

[Figure: a fork structure with a parent node, age, and two child nodes, foot size and literacy score]

Obviously a bigger foot size is not the direct cause of a higher literacy score. What correlates the two is the child's age ($A$), which is the confounder in the fork structure above.

If I tell you someone's foot size, it hints at their age, which in turn hints at their literacy score. So we can write:

$$ P(L|F) \neq P(L) $$

Again, the random variables $F$ and $L$ are not independent.

Conditioning

By conditioning on age (the confounder), we no longer consider the relationship between foot size and literacy for the whole sample, but within each age group separately.

Doing so annihilates the correlation caused by the confounder, and makes foot size and literacy score independent.

While age does hint at literacy score, if I now tell you someone's foot size, it doesn't hint a smidgen about their age, because their age is already given (we condition on it), and so it tells you nothing new about their literacy score: no correlation.

$$ P(L|F, A) = P(L|A) $$

Note that this does not make $F$ and $L$ unconditionally independent: for the whole sample, $P(L|F) \neq P(L)$ still holds. The independence only appears once we condition on $A$.
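A simulation sketch (my own, with an entirely made-up generative model in which foot size and literacy score each depend on age plus independent noise) reproduces this pattern: strong overall correlation, essentially zero correlation within each age group.

```python
import numpy as np

# Made-up generative model: F and L each depend on age A plus independent noise.
rng = np.random.default_rng(1)
n = 100_000
age = rng.integers(6, 13, size=n)                    # A: ages 6..12
foot = 14 + 0.8 * age + rng.normal(0, 1.0, size=n)   # F: driven only by A
lit = 10 * age + rng.normal(0, 8.0, size=n)          # L: driven only by A

# Unconditionally, F and L are strongly correlated (via the confounder A)...
print(round(np.corrcoef(foot, lit)[0, 1], 2))        # roughly 0.8

# ...but within each age group the correlation is essentially zero.
for a in range(6, 13):
    group = age == a
    print(a, round(np.corrcoef(foot[group], lit[group])[0, 1], 3))
```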

Conclusion

So this was just an example of two random variables $F$ and $L$ that were:

  • dependent when not conditioned on $A$
  • independent when conditioned on $A$

We say that $F$ is conditionally independent of $L$ given $A$:

$$ (F \perp L | A) $$