Conditional Probability and Division by Zero

Solution 1:

This is a surprisingly philosophical question, and as such, here is a link to a philosophical paper about it: What Conditional Probability Could Not Be

Practically speaking, though, you're absolutely correct: this probability is $1/2$. However, it is difficult to describe this fact using conditional probability as it is usually understood.

The way I would "rigorously" approach this problem is the following: let's say you have a probability space $(X,\Sigma,P)$ and a subset $Y\subset X$ such that $P(Y)=0$. How do we 'condition' on this set? Well, the same way we consider a "line" integral in $\mathbb{R}^2$: $Y$ becomes your new universe, so you have to define a new probability space $(Y,\Sigma_2,P_2)$ in which you can answer questions such as this. Using the formula $P(A\cap B)/P(B)$ here is somewhat like trying to measure the length of a line segment with a bathroom scale: the scale does not register the line segment at all, so you have to get a ruler instead!

Solution 2:

Not an answer, but for those who might have approached it in a different way and got an answer different from $\frac12$, a comment on why that might be ok.

To see why several different answers might be ok in some sense, let's start with the purpose of conditioning. Even when we condition on an event of positive probability, it is usually so that we can later invoke the law of total probability or the law of total expectation, and thereby compute some probability or expected value by case-by-case analysis. That is, we divide into cases $B_1,\dots, B_n$ (forming a partition of the probability space) and restrict our attention to each smaller probability space $B_i$. We observe that some random variables become constant in the smaller space, and we exploit that to compute $P(A|B_i)$. Then we move back to the original big probability space and obtain $P(A)$ from the law of total probability, $P(A) = \sum_i P(A|B_i) P(B_i)$. Enabling such case-by-case analysis is why we define and use conditional probabilities. In other words, the purpose of conditioning is to enable the trick of double counting in probability theory.
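As a concrete illustration of the case-by-case computation described above, here is a minimal sketch in Python. The two-dice setting, the event $A$ = "the sum is 7", and the helper names `prob` and `cond_prob` are all assumptions chosen for illustration; they are not part of the original question. The partition $B_1,\dots,B_6$ is given by the value of the first die.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of two fair six-sided dice (illustrative choice).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event under the uniform measure on `outcomes`."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); only valid when P(b) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

# Event A: the two dice sum to 7.
A = lambda w: w[0] + w[1] == 7

# Partition B_1, ..., B_6: the first die shows i.
partition = [lambda w, i=i: w[0] == i for i in range(1, 7)]

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i).
total = sum(cond_prob(A, B) * prob(B) for B in partition)

print(total, prob(A))
```

Inside each $B_i$ the first die is constant, so $P(A|B_i)$ reduces to a question about the second die alone. Note that `cond_prob` divides by `prob(b)`, which is exactly the step that breaks down when $P(B)=0$.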

Since such case-by-case analysis has been shown to be extremely useful at least in discrete situations and since there are many situations where one would want to divide the probability space into continuum-many cases, one is bound to ask the following questions:

(1) Can we have a good probability theory with probability spaces which are not discrete at all?

(2) If so, can we have a good theory of dividing such a space into uncountably many parts in a way that enables most case-by-case analysis that we might want to employ?

Now as you know, Question (1) is answered yes at the inevitable cost of abandoning the requirement that every subset be assigned a probability.

Question (2) is also answered yes, this time at the inevitable cost of abandoning the requirement that $P(A|B)$, for events $A$, $B$ with $P(B)=0$, be assigned a unique number independent of how we partition the ambient probability space (and independent of how we approximate $B$ by positive-probability events). Seeing that this cost is necessary is easy, as the Borel–Kolmogorov paradox demonstrates. What's slightly less easy is seeing whether paying this cost (plus some other minor costs) is sufficient, say, when you are dealing with continuous random variables instead of discrete ones. But as you know, the notion of a conditional probability density function works quite well for assigning numbers to $P(A|B)$ in a way that enables double-counting arguments, provided a partition is fixed and that partition comes from a continuous random variable.
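To see the dependence on the approximating events numerically, here is a small sketch in the spirit of the Borel–Kolmogorov paradox. The setup is an assumption made for illustration and is not the original question: $(X,Y)$ is uniform on the unit square, $B=\{X=Y\}$ has probability zero, and $A=\{X<1/2\}$. We thicken $B$ in two ways, $\{|Y-X|<\varepsilon\}$ and $\{|Y/X-1|<\varepsilon\}$, and compute $P(A\mid\text{thickened }B)$ with a midpoint rule over $x$. As $\varepsilon\to 0$ the first tends to $1/2$ and the second to $1/4$. The function names are hypothetical.

```python
def slice_length_difference(x, eps):
    """Length of {y in [0, 1] : |y - x| < eps}."""
    return max(0.0, min(1.0, x + eps) - max(0.0, x - eps))

def slice_length_ratio(x, eps):
    """Length of {y in [0, 1] : |y / x - 1| < eps}, for x > 0 and 0 < eps < 1."""
    return max(0.0, min(1.0, x * (1 + eps)) - x * (1 - eps))

def conditional_prob_x_below_half(slice_length, eps, n=100_000):
    """P(X < 1/2 | thickened event), for (X, Y) uniform on the unit square.

    The thickened event is described by the length of its vertical slice
    at each x; the integral over x is approximated by the midpoint rule.
    """
    h = 1.0 / n
    below = 0.0
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of the i-th subinterval
        w = slice_length(x, eps) * h
        total += w
        if x < 0.5:
            below += w
    return below / total

eps = 1e-3
p_diff = conditional_prob_x_below_half(slice_length_difference, eps)
p_ratio = conditional_prob_x_below_half(slice_length_ratio, eps)
print(p_diff, p_ratio)
```

Both families of events shrink to the same diagonal $\{X=Y\}$, yet they weight points along it differently (uniformly in the first case, proportionally to $x$ in the second), which is why no single value of $P(A|B)$ can be singled out without fixing the approximation or the partition.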

Even when we are working with random variables that are neither continuous nor discrete (random walk, Brownian motion, etc.), we can say yes to Question (2). If the probability space is in some sense approximated by discrete ones (such a limiting probability space is called a standard probability space), and if we partition the space using a measurable function (intuitively, this means that the way we partition the space is also approximated by discrete ones), then we can assign numbers to $P(A|B)$ for all events $A$ and partition elements $B$ in such a way that double-counting arguments go through. (For those who want to google precise results, this is known as disintegration of measures.)
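For orientation, the statement one finds under that name can be sketched roughly as follows; the notation ($T$, $S$, $\nu$, $P_s$) is my own choice and regularity hypotheses are omitted. Given a measurable map $T: X \to S$ that defines the partition into fibers $T^{-1}(s)$, let $\nu = P\circ T^{-1}$ be the distribution of $T$. A disintegration is a family of probability measures $(P_s)_{s\in S}$ on $X$ such that $P_s$ is concentrated on the fiber $T^{-1}(s)$ for $\nu$-almost every $s$, and

$$P(A) = \int_S P_s(A)\, d\nu(s) \quad \text{for every event } A.$$

This is the continuum analogue of $P(A) = \sum_i P(A|B_i) P(B_i)$, with $P_s(A)$ playing the role of $P(A \mid T = s)$. The family is determined only up to a $\nu$-null set of $s$, and it depends on the choice of $T$, which matches the cost described above.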