Bayes classifier: handling conditional expectation / probability

I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page: https://en.wikipedia.org/wiki/Bayes_classifier#Proof_of_Optimality

The question arises already in part a). Let me give some definitions first:

Let $(X,Y)$ be a pair of random variables taking values in $\mathbb{R}^d \times \{0,1\}$.

For $x$ in the support of $X$ let $\eta(x) = \mathbb{P}(Y=1|X=x)$, and let $h$ be a classifier, meaning that $h(X) \in \{0, 1\}$. We further define the risk of a classifier as $R(h) := \mathbb{P}(h(X) \neq Y)$. Let me now state the proof (same as on Wikipedia) and the part where I am struggling. For any classifier $h$ we have:

$$R(h) = P(h(X) \neq Y) = \mathbb{E}_{XY}[\Bbb{1}_{h(X)\neq Y}] = \Bbb{E}_X\Bbb{E}_{Y|X}[\Bbb{1}_{h(X)\neq Y}|X=x],$$

where the last equality is the law of iterated expectations. Continuing, we get

\begin{align}
\Bbb{E}_X\Bbb{E}_{Y|X}[\Bbb{1}_{h(X)\neq Y}|X=x]
&= \Bbb{E}_X[\Bbb{P}(Y\neq h(X)|X=x)] \\
&= \Bbb{E}_X[\Bbb{1}_{h(X) = 0}\cdot\Bbb{P}(Y=1|X=x) + \Bbb{1}_{h(X) = 1}\cdot\Bbb{P}(Y=0|X=x)] \\
&= \Bbb{E}_X[\Bbb{1}_{h(X) = 0}\cdot \eta(x) + \Bbb{1}_{h(X) = 1}\cdot(1-\eta(x))]
\end{align}

Now the last line is pretty much what is written in the proof on Wikipedia (and also in the proof I have seen in class), except that there the argument of the function $\eta$ is not $x$ but $X$; in words, it is not a point $x$ in the support of $X$ but the random variable itself. I wonder how this exchange is justified. Since I have not yet taken a class that deals with conditional expectation, there might be a fairly straightforward justification for this that I am not aware of. It is also possible that I have made a mistake in the computations above.

I found a very thorough explanation of something that looks similar in an answer here (provided by @Stefan Hansen):

https://math.stackexchange.com/a/498338/874549

but it is quite advanced for me, so it is hard to tell whether that is actually what I am looking for.

If anyone sees a mistake, or has a somewhat elementary explanation, that would be very much appreciated!


As demonstrated by the Borel–Kolmogorov paradox, the term "$\mathsf P(Y=1\mid X=x)$" cannot be defined in terms of the single event $\{X=x\}$ when $\mathsf P(\{X=x\})=0$. Instead, the term "$\mathsf P(Y=1\mid X=x)$" is very intimately related to the random variable $X$ as a whole. Here is the usual definition:

For any Lebesgue-integrable (real) random variable $Y$ and any (real) random variable $X$ on the same probability space, one can define the conditional expectation $\mathsf E(Y\mid X)$ as, informally speaking, the "best approximation to $Y$ if $X$ is known". See for instance [1; Definition 8.11] for a formal definition.
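In case it helps to see the defining property spelled out (this is a paraphrase of the standard measure-theoretic definition, not a verbatim quote of [1]): $\mathsf E(Y\mid X)$ is an integrable, $\sigma(X)$-measurable random variable $Z$ satisfying $$\mathsf E\big(Z\,\mathbf 1_{\{X\in A\}}\big) = \mathsf E\big(Y\,\mathbf 1_{\{X\in A\}}\big)\qquad\text{for every measurable } A\subset\mathbb R,$$ and any two such random variables agree $\mathsf P$-almost everywhere.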

Now, it follows from the definition that $\mathsf E(Y\mid X)$ is $\sigma(X)$-measurable, so by [1; Korollar 1.97], there exists a measurable function $\eta: \mathbb R\to\mathbb R$ such that $$\mathsf E(Y\mid X) = \eta\circ X$$ $\mathsf P$-almost everywhere. The function $\eta$ is uniquely determined $X_\#\mathsf P$-almost everywhere. (TODO: Prove this.)

(Here, $X_\#\mathsf P$ denotes the pushforward of $\mathsf P$ under $X$, i.e. $X_\#\mathsf P(A)\overset{\text{Def.}}=\mathsf P(X^{-1}(A))$ for all measurable $A\subset\mathbb R$.)

For example, suppose that we have a random variable $Y$ and a random variable $X$ such that $\mathsf E(Y\mid X)=X^2$. Then we have $\eta(x)=x^2$ $X_\#\mathsf P$-almost everywhere.
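To connect this with the binary setup of the question, here is a minimal numerical sketch (my own illustration, not part of the original argument), assuming the hypothetical toy model $X\sim\mathrm{Uniform}(0,1)$ and $Y\mid X=x\sim\mathrm{Bernoulli}(x^2)$, so that $\mathsf E([Y=1]\mid X)=X^2$ and $\eta(x)=x^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000

# Hypothetical toy model (my choice, for illustration only):
# X ~ Uniform(0, 1) and P(Y = 1 | X = x) = eta(x) = x^2,
# so E([Y = 1] | X) = eta(X) = X^2.
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(0.0, 1.0, size=n) < X**2).astype(int)

# Estimate x -> E(Y | X = x) by averaging Y over narrow bins of X
# and compare with eta(x) = x^2 at the bin centres.
bins = np.linspace(0.0, 1.0, 21)
centres = 0.5 * (bins[:-1] + bins[1:])
idx = np.digitize(X, bins) - 1          # bin index of each sample, 0..19
for k, c in enumerate(centres):
    est = Y[idx == k].mean()
    print(f"x ~ {c:.3f}:  binned estimate {est:.4f}   eta(x) = x^2 = {c*c:.4f}")
```

The binned averages should match $x^2$ up to Monte Carlo error: the quantity "given $X=x$" is the function $\eta$ evaluated at the point $x$, while $\mathsf E([Y=1]\mid X)$ is the random variable $\eta(X)=\eta\circ X$.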

Therefore, with an abuse of notation, one can now make the following definition (using Iverson brackets: if $A$ is an event, then $[A]$ denotes the random variable that is the indicator function of $A$, often also written $\mathbf 1_A$): $$\mathsf P(Y=1\mid X=x)=\eta(x),$$ where $\eta$ is a function satisfying $$\mathsf E([Y=1]\mid X) = \eta\circ X$$ $\mathsf P$-almost everywhere.


So, the Wikipedia article (with, in my opinion, very confusing notation) just says this: $$R(h)=\mathsf P(h(X)\neq Y)=\mathsf E([h(X)\neq Y]).$$ Since $h(X)$ and $Y$ only take the values $0$ and $1$, we have the pointwise identity $[h(X)\neq Y] = [h(X)=0][Y=1]+[h(X)=1][Y=0]$, and hence $$\mathsf E([h(X)\neq Y]) = \mathsf E([h(X)=0][Y=1])+\mathsf E([h(X)=1][Y=0]).$$

By the tower property of the conditional expectation [1; Satz 8.14 (iv)], we have $$\mathsf E([h(X)=0][Y=1]) = \mathsf E(\mathsf E([h(X)=0][Y=1]\mid X)).$$ By [1; Satz 8.14 (iii)] ($\sigma(X)$-measurable factors may be pulled out of the conditional expectation), and since $h(X)$ is $\sigma(X)$-measurable (assuming $h$ is measurable), we have $$\mathsf E([h(X)=0][Y=1]\mid X) = [h(X)=0]\,\mathsf E([Y=1]\mid X).$$
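If it helps, this identity can also be checked numerically in the toy model from the sketch above (again my own illustration, with a hypothetical classifier $h(x)=\mathbf 1_{x>1/2}$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

# Same hypothetical toy model as before: X ~ Uniform(0, 1),
# P(Y = 1 | X = x) = eta(x) = x^2, plus a hypothetical classifier h(x) = 1{x > 1/2}.
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(0.0, 1.0, size=n) < X**2).astype(int)
h = (X > 0.5).astype(int)
eta = X**2

# Left-hand side: E([h(X) = 0][Y = 1]), estimated by a sample mean.
lhs = np.mean((h == 0) & (Y == 1))
# Right-hand side after the tower property and pulling out [h(X) = 0]:
# E([h(X) = 0] * eta(X)).
rhs = np.mean((h == 0) * eta)

print(f"E([h(X)=0][Y=1])    ~ {lhs:.5f}")
print(f"E([h(X)=0] eta(X))  ~ {rhs:.5f}")
# Both estimates should be close to the exact value
# int_0^{1/2} x^2 dx = 1/24 ~ 0.04167.
```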

But we chose the notation $\mathsf E([Y=1]\mid X)=\eta(X)$, so we get $$\mathsf E([h(X)=0][Y=1]) = \mathsf E([h(X)=0]\, \eta(X)).$$ Analogously (exercise; a sketch is spelled out below), we have $$\mathsf E([h(X)=1][Y=0]) = \mathsf E([h(X)=1] (1-\eta(X))),$$ and this is enough to conclude the proof of what you wanted to show.
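For completeness, here is the analogous step written out (my own sketch of the "exercise"): using $[Y=0]=1-[Y=1]$ and the linearity of the conditional expectation, $\mathsf E([Y=0]\mid X) = 1-\eta(X)$, and hence

$$\mathsf E([h(X)=1][Y=0]) = \mathsf E\big(\mathsf E([h(X)=1][Y=0]\mid X)\big) = \mathsf E\big([h(X)=1]\,\mathsf E([Y=0]\mid X)\big) = \mathsf E\big([h(X)=1](1-\eta(X))\big).$$

Adding the two terms back together gives $$R(h) = \mathsf E\big([h(X)=0]\,\eta(X) + [h(X)=1](1-\eta(X))\big),$$ which is exactly the Wikipedia expression, with the random variable $\eta(X)$ (not $\eta(x)$) inside the expectation.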

Literature

[1] Achim Klenke: Wahrscheinlichkeitstheorie. 3rd edition (2012/2013). Springer Spektrum.


I think the confusion arises from the notation itself. Note that

\begin{align}
&\,\mathbb{E}_X[\mathbb{P}(Y \neq h(X)|X=x)]\\
=&\,\mathbb{E}_X[\mathbb{P}(Y=0,h(X)=1 | X=x) + \mathbb{P}(Y=1,h(X)=0 | X=x)]\\
=&\,\mathbb{E}_X[\mathbb{P}(Y=0|X=x)\mathbb{P}(h(X)=1|X=x) + \mathbb{P}(Y=1|X=x)\mathbb{P}(h(X)=0|X=x)]\\
=&\,\mathbb{E}_X[\mathbb{P}(Y=0|X=x)\boldsymbol{1}_{h(x)=1} + \mathbb{P}(Y=1|X=x)\boldsymbol{1}_{h(x)=0}]\\
=&\,\mathbb{E}_X[(1-\eta(x))\boldsymbol{1}_{h(x)=1} + \eta(x)\boldsymbol{1}_{h(x)=0}]
\end{align}

(In the second and third steps, note that given $X=x$ the value $h(X)=h(x)$ is deterministic, which justifies both the factorization and the replacement of $\mathbb{P}(h(X)=1|X=x)$ by $\boldsymbol{1}_{h(x)=1}$.)

Inside the expectation you have the function $g(x) = (1-\eta(x))\boldsymbol{1}_{h(x)=1} + \eta(x)\boldsymbol{1}_{h(x)=0}$, defined on $\mathbb{R}^d$, but of course the expectation you actually take is that of the random variable $g(X)$. That is just how the notation plays out once you condition on a particular value $x$ of $X$. If you wrote out the integrals explicitly instead of the expectation operators, it would become clear.
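To spell out that last remark (my own rendering, writing $\mu$ for the distribution $X_\#\mathbb P$ of $X$): the outer expectation is an integral over points $x$,

$$R(h) = \mathbb{E}[g(X)] = \int_{\mathbb{R}^d} g(x)\,\mathrm{d}\mu(x) = \int_{\mathbb{R}^d}\Big[(1-\eta(x))\boldsymbol{1}_{h(x)=1} + \eta(x)\boldsymbol{1}_{h(x)=0}\Big]\,\mathrm{d}\mu(x),$$

so under the integral sign $x$ is a dummy point of $\mathbb{R}^d$ and $\eta(x)$ is a number, while the same quantity written as an expectation contains the random variable $\eta(X)$; the two notations describe the same integral.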