Conditional probability on zero probability events (Definition)
Let $(\Omega ,{\mathcal {F}},P)$ be a probability space, let $X$ be an $(M,{\mathcal {M}})$-valued random variable, let $Y$ be an $(N,{\mathcal {N}})$-valued random variable, and let $f$ be a measurable function from $M$ to $N$. We do not know whether the joint probability density of $(X, f(X))$ exists. Is $P(X \in E \vert f(X)=y)$ well-defined? In other words, does the limit $\lim_{r \downarrow 0} P(X \in E \vert f(X) \in (y-r, y+r))$ exist?
This question is about the definition of conditional probability on zero-probability events. Even if it is well-defined, the application of the definition is not clear to me. I asked a question about the application of conditional probability on zero-probability events here.
Solution 1:
The short answer
Yes, if $N=\mathbb{R}^n$ and $\mathcal{N}$ is the Borel field, then, for every $A \in \mathcal{M}$, the limit $\lim_{\Delta y \downarrow 0} P(X \in A\ |\ Y \in (y-\Delta y, y+\Delta y))$ exists $P_Y$-almost surely, and this is true whether or not $Y = f(X)$, and whether or not $X,Y$ have a joint density. (This is the content of Theorem 9(1) below.)
Furthermore, the function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ obtained by setting $f(y) := \lim_{\Delta y \downarrow 0} P(X \in A\ |\ Y \in (y-\Delta y, y+\Delta y))$ wherever possible, and, say, $f(y):=0$ elsewhere, is consistent with the traditional measure-theoretic definition of $P(X\in A\ |\ Y=y)$, given by \eqref{CondProb}. (This is the content of Corollary 17(1).)
The proof of these facts is a consequence of theorems 1.29 ("Differentiating measures") and 1.30 ("Differentiation of Radon measures") in reference [3] (a shout-out to user Del, who pointed me to this reference). It makes use of the concept of the derivative of an outer measure w.r.t. another outer measure.
I will devote the rest of this answer to carefully deriving facts 1 and 2 stated above. As far as I know, this is the first time this fundamental, intuitive result, which is often claimed (for instance, on p. 157 of [4], p. 136 of [1]), is proved. I'll be grateful (if somewhat disappointed) to anyone who can cite a precedent.
Example
Before embarking on a formal proof, let's see how the results of the next section can be used to introduce the concept of "probability conditioned on a non-discrete random variable" in a way that is simultaneously intuitive and mathematically sound.
Consider, for instance, the following excerpt taken from a popular undergraduate textbook ([6] example 5e, p. 255).
Consider $n + m$ trials having a common probability of success. Suppose, however, that this success probability is not fixed in advance but is chosen from a uniform $(0,1)$ population.
Letting $N$ denote the number of successes, and letting $X$ denote the probability that a given trial is a success, this excerpt naturally gives rise to the concept of conditional probability; but when we attempt to parlay our intuition into formulas, we discover that an expression of the form $P(N = n\ |\ X=x)$ is not well-defined via the familiar formula $P(A|B) = \frac{P(A\cap B)}{P(B)}$, since $P(X=x)=0$.
Intuition suggests overcoming this obstacle by defining $$ P(N=n\ |\ X=x) = \lim_{\Delta x\downarrow 0} \frac{P(N=n, x - \Delta x < X < x + \Delta x)}{P(x - \Delta x < X < x + \Delta x)}, $$ provided the limit exists. Theorem 9(1) assures us that the limit indeed exists almost everywhere. Corollary 17(1) implies that if we so define $P(N=n\ |\ X=x)$ wherever possible, we will obtain a function that is a conditional probability in the traditional measure-theoretic sense (described in the next paragraph), hence we may soundly subject it to the usual manipulations involving conditional probabilities, such as the law of total probability. Note that in this example the joint random variable $(N,X)$ does not have a joint density (more precisely, no joint density w.r.t. the Lebesgue measure on $\mathbb{R}^2$).
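A quick Monte Carlo sketch of this limiting definition (my own addition, not part of the argument; all names are made up): conditionally on $X$ falling in a narrow window around $x$, the probability of $k$ successes should come out close to the binomial pmf evaluated at $p = x$.

```python
import math
import random

random.seed(0)

def cond_prob_sim(trials, k, x, delta, reps=100_000):
    """Estimate P(N = k | x - delta < X < x + delta), where
    X ~ Uniform(0, 1) and, given X, N ~ Binomial(trials, X).
    Conditioned on the window, X is uniform on (x - delta, x + delta)."""
    hits = 0
    for _ in range(reps):
        p = random.uniform(x - delta, x + delta)
        successes = sum(random.random() < p for _ in range(trials))
        if successes == k:
            hits += 1
    return hits / reps

# n + m = 10 trials, conditioning on X near x = 0.3, asking for k = 3 successes.
trials, k, x = 10, 3, 0.3
exact = math.comb(trials, k) * x**k * (1 - x)**(trials - k)  # binomial pmf at p = x
approx = cond_prob_sim(trials, k, x, delta=0.01)
print(approx, exact)  # the estimates should agree to roughly two decimals
```

Shrinking `delta` further (with correspondingly more repetitions) tightens the agreement, which is exactly what Theorem 9(1) predicts.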
Now scratch everything we have discussed so far, and suppose we start by defining the conditional probability $P(X\in A\ |\ Y=y)$ in the traditional measure-theoretic manner (cf. [2] theorem 5.3.1, p. 205) as
any solution, $\varphi$, to the integral system of equations $$ \int_B \varphi\ dP_Y = P(X\in A, Y\in B),\hspace{1cm}B\text{ Borel}, \tag{*}\label{CondProb} $$ where $P_Y$ is the distribution of $Y$, i.e. the probability measure induced on the Borel field via the formula $P_Y(E) = P(Y\in E)$.
(This concept of conditional probability is sometimes called "conditional distribution", as in [5] theorem 6.3, p. 107, and the term "conditional probability" is reserved to a closely-related, but different concept. I will keep to the "conditional probability" terminology.)
Given a two-dimensional random variable $(X,Y)$ with a joint density $f(x,y)$, we may now prove that the familiar definition of "conditional density", namely $$ f_{X|Y=y}(x) = \frac{f(x,y)}{f_Y(y)},\hspace{1cm}\text{wherever the denominator does not vanish} $$ can be used to generate conditional probabilities of the form $P(X\in A\ |\ Y=y)$.
Applying this technique to the following problem, taken from the same textbook ([6] example 5b, p. 252), we find that $P(X > 1\ |\ Y=y) = e^{-1/y}$, $y>0$.
Suppose that the joint density of $X$ and $Y$ is given by $$ f(x,y) = \begin{cases} \frac{e^{-x/y}e^{-y}}{y} & 0 < x < \infty, 0 < y < \infty \\ 0 & \text{otherwise} \end{cases} $$ Find $P(X > 1\ |\ Y=y)$.
Since the solution we obtained is continuous, Corollary 17(2) yields that, for every $y>0$, $$ e^{-1/y} = \lim_{\Delta y\downarrow 0}\frac{P(X>1, y-\Delta y<Y<y+\Delta y)}{P(y-\Delta y<Y<y+\Delta y)}. $$
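This last display can be checked numerically (my own sanity check, not part of the text, using composite Simpson's rule). Integrating the joint density over $x$ gives the marginal $f_Y(y) = e^{-y}$ and the partial integral $\int_1^\infty f(x,y)\,dx = e^{-1/y}e^{-y}$, so both probabilities reduce to one-dimensional integrals over the window:

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Integrating the joint density over x: f_Y(y) = e^{-y}, and the partial
# integral over {X > 1} is ∫_1^∞ f(x, y) dx = e^{-1/y} e^{-y}.
y0, delta = 2.0, 1e-3
num = simpson(lambda y: math.exp(-1 / y) * math.exp(-y), y0 - delta, y0 + delta)
den = simpson(lambda y: math.exp(-y), y0 - delta, y0 + delta)
ratio = num / den
print(ratio, math.exp(-1 / y0))  # both ≈ 0.6065
```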
The formal derivation
Notation 1 Let $n \in \{1, 2, \dots\}$ and let $r \in (0,\infty)$. For every $x \in \mathbb{R}^n$ we denote the open $n$-ball of (Euclidean) radius $r$ about $x$ by $B^{(n)}_r(x)$.
Notation 2 Let $n \in \{1, 2, \dots\}$. We denote the Euclidean topology on $\mathbb{R}^n$ by $\mathcal{E}_n$.
Notation 3 Let $n \in \{1, 2, \dots\}$. We denote the Borel $\sigma$-algebra on $\mathbb{R}^n$ by $\mathcal{B}_n$.
Notation 4 Let $n \in \{1, 2, \dots\}$. For every outer measure $\mu$ on $\mathbb{R}^n$, we denote the collection of $\mu$-measurable sets by $\mathcal{M}_\mu$.
Fix $n \in \{1, 2, \dots\}$ for the remainder of the proof.
Definition 5 An outer measure $\mu$ on $\mathbb{R}^n$ is Radon iff the following three conditions hold.
$\mathcal{B}_n \subseteq \mathcal{M}_\mu$.
For every $A\subseteq\mathbb{R}^n$ there exists a $B\in\mathcal{B}_n$ such that $A\subseteq B$ and $\mu(A) = \mu(B)$.
For every $\mathcal{E}_n$-compact $K\subseteq\mathbb{R}^n$, $\mu(K) < \infty$.
Definition 6 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We denote by $\mathrm{Diff}^\nu_\mu$ the set consisting of all $x \in \mathbb{R}^n$ for which the following pair of conditions hold.
For all $r \in (0,\infty)$, $\mu\left(B^{(n)}_r(x)\right) > 0$.
There exists some $d \in \mathbb{R}$ that satisfies: $$ d = \lim_{r \downarrow 0} \frac{\nu\left(B^{(n)}_r(x)\right)}{\mu\left(B^{(n)}_r(x)\right)}. $$
Definition 7 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We set $$ D^\nu_\mu(x) := \begin{cases} \lim_{r \downarrow 0} \frac{\nu\left(B^{(n)}_r(x)\right)}{\mu\left(B^{(n)}_r(x)\right)} &, x \in \mathrm{Diff}^\nu_\mu \\ 0 &, \text{otherwise}. \end{cases} $$
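To make Definition 7 concrete, here is a toy computation (my own illustration, not part of the derivation): with $n = 1$, let $\mu$ be Lebesgue measure and let $\nu$ have the continuous density $g(t) = t^2$ (both are Radon on $\mathbb{R}$). The small-ball ratios converge to $g(x)$, so $D^\nu_\mu(x) = g(x)$ at every such point.

```python
# Toy instance of Definition 7 with n = 1: μ = Lebesgue measure, ν the
# measure with continuous density g(t) = t².
def nu(a, b):
    """ν((a, b)) = ∫_a^b t² dt."""
    return (b**3 - a**3) / 3

def mu(a, b):
    """μ((a, b)) = Lebesgue length of the interval."""
    return b - a

# The small-ball ratios ν(B_r(x)) / μ(B_r(x)) converge to g(x) as r ↓ 0.
x = 0.5
for r in (0.1, 0.01, 0.001):
    ratio = nu(x - r, x + r) / mu(x - r, x + r)
    print(r, ratio)  # approaches g(0.5) = 0.25
```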
Definition 8 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We denote with $\mathrm{\mathbf{Diff}}^\nu_\mu$ the collection consisting of all $Z \subseteq \mathbb{R}^n$ such that the following pair of conditions hold.
$Z \subseteq \mathrm{Diff}^\nu_\mu$.
$Z = \mathbb{R}^n\setminus A$ for some $A \in \mathcal{M}_\mu$ with $\mu(A) = 0$.
Theorem 9 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$.
$\mathrm{\mathbf{Diff}}^\nu_\mu \neq \emptyset$.
For every $Z \in \mathrm{\mathbf{Diff}}^\nu_\mu$, $D^\nu_\mu$ is $\mathcal{Z}/\mathcal{B}_1$-measurable, where $\mathcal{Z}$ is the subset $\sigma$-algebra induced on $Z$ by $\mathcal{M}_\mu$.
Proof See [3], theorem 1.29, p. 48. Q.E.D.
Definition 10
Let $\mu, \nu$ be outer measures on $\mathbb{R}^n$. $\nu$ is absolutely continuous w.r.t. $\mu$, written $\nu \ll \mu$, provided that $\mu(A) = 0$ implies $\nu(A) = 0$ for every $A \subseteq \mathbb{R}^n$.
Let $\mathcal{F}$ be a $\sigma$-algebra on $\mathbb{R}^n$, and let $\mu, \nu$ be measures on $\mathcal{F}$. $\nu$ is absolutely continuous w.r.t. $\mu$, written $\nu \ll \mu$, provided that $\mu(A) = 0$ implies $\nu(A) = 0$ for every $A \in \mathcal{F}$.
Lemma 11 Let $\mu, \nu$ be measures on $\mathcal{B}_n$ such that $\nu\ll\mu$, and such that, for every $\mathcal{E}_n$-compact $K$, $\mu(K), \nu(K) < \infty$. Then $\mu, \nu$ can be extended to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that $\nu^*\ll\mu^*$.
Proof
For every $A \subseteq \mathbb{R}^n$ define $$ \begin{align} \mu^*(A) &:= \inf \left\{\sum_{k = 1}^\infty \mu(B_k)\ \middle|\ \{B_1, B_2, \dots\} \subseteq \mathcal{B}_n,\ A \subseteq \bigcup_{k=1}^\infty B_k\right\}, \\ \nu^*(A) &:= \inf \left\{\sum_{k = 1}^\infty \nu(B_k)\ \middle|\ \{B_1, B_2, \dots\} \subseteq \mathcal{B}_n,\ A \subseteq \bigcup_{k=1}^\infty B_k\right\}. \end{align} $$
According to [7] theorem 2.21 (p. 38), $\mu^*, \nu^*$ are outer-measures on $\mathbb{R}^n$. According to [7] theorem 20.1(b) (p. 502), $\mu^*, \nu^*$ are extensions of $\mu, \nu$, respectively. This implies, in particular, that, for every $\mathcal{E}_n$-compact $K$, $\mu^*(K) = \mu(K) < \infty$. According to [7] theorem 20.1(a) (p. 502), $\mathcal{B}_n \subseteq \mathcal{M}_{\mu^*} \cap \mathcal{M}_{\nu^*}$. According to [7] Proposition 20.9 (p. 507), for every $A\subseteq\mathbb{R}^n$ there exists a $B \in \mathcal{B}_n$ such that $A \subseteq B$ and both $\mu^*(A) = \mu^*(B)$ and $\nu^*(A) = \nu^*(B)$. Thus, $\mu^*$ and $\nu^*$ are each Radon.
Let $A \subseteq \mathbb{R}^n$ be such that $\mu^*(A) = 0$. By the preceding paragraph there exists some $B \in \mathcal{B}_n$ such that $A \subseteq B$ and both $\mu^*(A) = \mu^*(B)$ and $\nu^*(A) = \nu^*(B)$. So $$ \mu(B) = \mu^*(B) = \mu^*(A) = 0. $$ So $$ \nu^*(A) = \nu^*(B) = \nu(B) \overset{\nu\ll\mu}{=} 0. $$ Thus $\nu^* \ll \mu^*$.
Q.E.D.
Theorem 12 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$, and let $Z \in \mathrm{\mathbf{Diff}}^\nu_\mu$. If $\nu \ll \mu$, then, for every $B \in \mathcal{M}_\mu$, $$ \nu(B) = \int_B D^\nu_\mu\mathbb{1}_Z d\mu. $$
Proof See [3], theorem 1.30, p. 50. Q.E.D.
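Theorem 12 can be checked numerically on a toy pair (my own sketch, not part of the proof): take $\mu$ to be Lebesgue measure on $\mathbb{R}$ and $\nu$ the measure with density $t^2$, so $\nu \ll \mu$. Computing $D^\nu_\mu$ from small-ball ratios, rather than assuming it, and integrating it over $B = (0,1)$ recovers $\nu(B) = 1/3$:

```python
# Sketch of Theorem 12 with n = 1: μ = Lebesgue measure, ν with density t²
# (so ν ≪ μ).  D is computed from small-ball ratios, not assumed.
def nu(a, b):
    """ν((a, b)) = ∫_a^b t² dt."""
    return (b**3 - a**3) / 3

def D(x, r=1e-4):
    """Small-ball approximation of the derivative D^ν_μ at x."""
    return nu(x - r, x + r) / (2 * r)

# Midpoint-rule integration of D over B = (0, 1) should recover ν(B) = 1/3.
n = 10_000
h = 1.0 / n
integral = sum(D((i + 0.5) * h) * h for i in range(n))
print(integral, nu(0, 1))  # both ≈ 1/3
```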
Definition 13 Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $Y:\Omega\rightarrow\mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. We denote with $P_Y$ the probability measure induced on $\mathcal{B}_n$ by $Y$ via $(\Omega, \mathcal{F}, P)$.
Notation 14 For every probability measure $\mu$ on $\mathcal{B}_n$, we denote by $\overline{\mathcal{B}_n^\mu}$ the completion of $\mathcal{B}_n$ w.r.t. $\mu$, and we denote by $\overline{\mu}$ the unique extension of $\mu$ to $\overline{\mathcal{B}_n^\mu}$.
Definition 15 Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $A \in \mathcal{F}$, and let $Y:\Omega \rightarrow \mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. We denote by $P(A\ |\ Y)$ the set of conditional probabilities of $A$ conditioned on $Y$, as follows. $P(A\ |\ Y)$ shall consist of all functions $f:\mathbb{R}^n\rightarrow\mathbb{R}$ that are $\overline{\mathcal{B}_n^{P_Y}}/\mathcal{B}_1$-measurable, $\overline{P_Y}$-semi-integrable, and such that, for every $B \in \mathcal{B}_n$, $$ \int_B f\ d\overline{P_Y} = P\left(A\cap\{Y \in B\}\right). $$
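When $Y$ is discrete, Definition 15 reduces to the elementary formula $f(y) = P(A \cap \{Y=y\})/P(Y=y)$. The following toy check (my own construction; all names are made up) verifies the defining integral equation on every subset $B$ of $Y$'s range:

```python
import itertools

# Toy discrete model: Ω = {0, …, 11} with equal weights, Y = ω mod 3,
# and the event A = {ω < 5}.
omega = list(range(12))
weight = {w: 1 / 12 for w in omega}
Y = {w: w % 3 for w in omega}
A = {w for w in omega if w < 5}

def p_Y(B):
    """P(Y ∈ B), the distribution of Y evaluated on a set of values."""
    return sum(weight[w] for w in omega if Y[w] in B)

def p_joint(B):
    """P(A ∩ {Y ∈ B})."""
    return sum(weight[w] for w in omega if w in A and Y[w] in B)

# Elementary conditional probability, defined wherever P(Y = y) > 0:
f = {y: p_joint({y}) / p_Y({y}) for y in {0, 1, 2}}

# Check the defining equation ∫_B f dP_Y = P(A ∩ {Y ∈ B}) on every subset B:
for k in range(4):
    for B in itertools.combinations((0, 1, 2), k):
        assert abs(sum(f[y] * p_Y({y}) for y in B) - p_joint(set(B))) < 1e-12
print(f)  # f[0] = 0.5, f[1] = 0.5, f[2] = 0.25
```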
Definition 16 Let $\mu$ be a measure on $\mathcal{B}_n$. We denote $\mu$'s support by $\mathrm{supp}_\mu$. In other words, $\mathrm{supp}_\mu$ consists of all $x \in \mathbb{R}^n$ such that, for every $\mathcal{E}_n$-open-neighborhood, $G$, of $x$, $\mu(G) > 0$.
Corollary 17 Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $A \in \mathcal{F}$, and let $Y:\Omega \rightarrow \mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. Set $\mu := P_Y$, and consider the measure $\nu:\mathcal{B}_n\rightarrow\mathbb{R}$ assigning to every $B \in \mathcal{B}_n$ $\nu(B) := P\left(A\cap\{Y \in B\}\right)$. Then $\mu, \nu$ can be extended to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that:
$D^{\nu^*}_{\mu^*} \in P(A\ |\ Y)$.
For every $y \in \mathrm{supp}_\mu$ at which some $f \in P(A\ |\ Y)$ is $\mathcal{E}_n/\mathcal{E}_1$-continuous, $y \in \mathrm{Diff}^{\nu^*}_{\mu^*}$.
Proof
Since $\mu, \nu$ are finite measures on $\mathcal{B}_n$ with $\nu \ll \mu$ (indeed, $\nu(B) \leq \mu(B)$ for every $B \in \mathcal{B}_n$), lemma 11 allows us to extend them to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that $\nu^* \ll \mu^*$. Letting $Z \in \mathrm{\mathbf{Diff}}^{\nu^*}_{\mu^*}$, theorem 12 yields that $D^{\nu^*}_{\mu^*}\mathbb{1}_Z \in P(A\ |\ Y)$. Since, by choice of $Z$, $D^{\nu^*}_{\mu^*} = D^{\nu^*}_{\mu^*}\mathbb{1}_Z$ $P_Y$-a.e., the conclusion follows.
We now turn to the second claim.
Let $y \in \mathrm{supp}_\mu$, and let $f \in P(A\ |\ Y)$ be $\mathcal{E}_n/\mathcal{E}_1$-continuous at $y$.
Let $\varepsilon \in (0,\infty)$. Choose $\delta \in (0,\infty)$ such that, for all $z \in B^{(n)}_\delta(y)$, $f(z) \in B^{(1)}_\varepsilon\left(f(y)\right)$. Let $r \in (0,\delta]$. Since $y \in \mathrm{supp}_\mu$, $P_Y\left(B^{(n)}_r(y)\right) > 0$, and we have $$ \begin{align} \frac{\nu^*\left(B^{(n)}_r(y)\right)}{\mu^*\left(B^{(n)}_r(y)\right)} &= \frac{\nu\left(B^{(n)}_r(y)\right)}{\mu\left(B^{(n)}_r(y)\right)} \\ &= \frac{P\left(A\cap\left\{Y\in B^{(n)}_r(y)\right\}\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= \frac{\int_{B^{(n)}_r(y)}\ f\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &<\frac{\int_{B^{(n)}_r(y)}\ f(y) + \varepsilon\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= \frac{(f(y)+\varepsilon)\ \int_{B^{(n)}_r(y)}\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= (f(y)+\varepsilon)\frac{\overline{P_Y}\left(B^{(n)}_r(y)\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= (f(y)+\varepsilon)\frac{P_Y\left(B^{(n)}_r(y)\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= f(y)+\varepsilon. \end{align} $$
Analogously, $$ f(y)-\varepsilon < \frac{\nu^*\left(B^{(n)}_r(y)\right)}{\mu^*\left(B^{(n)}_r(y)\right)}. $$ Since $\varepsilon$ was arbitrary, the ratio converges to $f(y)$ as $r \downarrow 0$; together with $\mu^*\left(B^{(n)}_r(y)\right) > 0$ for every $r \in (0,\infty)$, this shows $y \in \mathrm{Diff}^{\nu^*}_{\mu^*}$.
Q.E.D.
References
[1] Robert B. Ash, Basic Probability Theory, Dover, 2008. (An online version is freely available on the author's website.)
[2] Robert B. Ash, Catherine A. Doléans-Dade, Probability and Measure Theory, 2nd ed., Academic Press, 2000.
[3] Lawrence C. Evans, Ronald F. Gariepy, Measure Theory and Fine Properties of Functions, revised edition, CRC Press, 2015.
[4] William Feller, An Introduction to Probability Theory and Its Applications, Vol. 2, 2nd ed., John Wiley & Sons, 1971.
[5] Olav Kallenberg, Foundations of Modern Probability, 2nd ed., Springer, 2001.
[6] Sheldon M. Ross, A First Course in Probability, 9th ed., Pearson, 2013.
[7] James Yeh, Real Analysis: Theory of Measure and Integration, 3rd ed., World Scientific, 2014.