Why is the inequality true?

I am studying the book "Understanding Machine Learning: From Theory to Algorithms". I am struggling to understand the solution to exercise 3 (2) on page 41.

Exercise:

An axis aligned rectangle classifier in the plane is a classifier that assigns 1 to a point if and only if it is inside a certain rectangle. Formally, given real numbers $a_1\leq b_1, a_2\leq b_2,$ define the classifier $h_{(a_1, b_1, a_2, b_2)}$ by $$ h_{(a_1, b_1, a_2, b_2)}(x_1, x_2) = \begin{cases}1&\textrm{if $a_1\leq x_1\leq b_1$ and $a_2\leq x_2\leq b_2$}\\ 0&\textrm{otherwise}\end{cases} $$ The class of all axis aligned rectangles in the plane is defined as $\mathcal{H}_\mathrm{rec}^2 = \{h_{(a_1, b_1, a_2, b_2)}:\textrm{$a_1\leq b_1$ and $a_2\leq b_2$}\}$... We rely on the realizability assumption. Let $A$ be the algorithm that returns the smallest rectangle enclosing all positive examples in the training set. It is shown in part (1) that $A$ is an ERM.

(2): Show that if $A$ receives a training set of size $m \geq \frac{4\log(4/\delta)}{\epsilon}$, then, with probability of at least $1-\delta$, it returns a hypothesis with error of at most $\epsilon$.
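For concreteness, here is a minimal Python sketch of what $A$ computes (my own code, not from the book; the names `fit_rectangle` and `predict` are mine, and I assume the training set is a list of `((x1, x2), label)` pairs):

```python
import numpy as np

def fit_rectangle(S):
    """Algorithm A: return the smallest axis-aligned rectangle
    (a1, b1, a2, b2) enclosing all positive examples in S, where S
    is a list of ((x1, x2), label) pairs with labels in {0, 1}."""
    pos = np.array([x for x, y in S if y == 1])
    if pos.size == 0:
        # Degenerate case: no positive examples. Return an "empty"
        # rectangle (a1 > b1), so the hypothesis is identically 0.
        return (1.0, 0.0, 1.0, 0.0)
    return (pos[:, 0].min(), pos[:, 0].max(),
            pos[:, 1].min(), pos[:, 1].max())

def predict(rect, x1, x2):
    """The classifier h_{(a1, b1, a2, b2)} from the exercise."""
    a1, b1, a2, b2 = rect
    return 1 if (a1 <= x1 <= b1 and a2 <= x2 <= b2) else 0
```

For instance, `fit_rectangle([((0, 0), 1), ((2, 3), 1), ((5, 5), 0)])` returns `(0, 2, 0, 3)`, the tightest rectangle around the two positive points.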

Hint: Fix some distribution $\mathcal{D}$ over $\mathcal{X}$, let $R^*=R(a_1^*,b_1^*,a_2^*,b_2^*)$ be the rectangle that generates the labels, and let $f$ be the corresponding hypothesis. Let $a_1\geq a_1^*$ be a number such that the probability mass (w.r.t. $\mathcal{D}$) of the rectangle $R_1=R(a_1^*,a_1,a_2^*,b_2^*)$ is exactly $\epsilon/4$. Similarly, let $b_1,a_2,b_2$ be numbers such that the probability masses of the rectangles $R_2=R(b_1,b_1^*,a_2^*,b_2^*),R_3=R(a_1^*,b_1^*,a_2^*,a_2),R_4=R(a_1^*,b_1^*,b_2,b_2^*)$ are all exactly $\epsilon/4$. Let $R(S)$ be the rectangle returned by $A$. See the following illustration:

[Figure from the book: the rectangle $R^*$ with the four strips $R_1,R_2,R_3,R_4$, each of mass $\epsilon/4$, along its left, right, bottom, and top sides.]

  • Show that $R(S)\subseteq R^*$.
  • Show that if $S$ contains (positive) examples in all of the rectangles $R_1,R_2,R_3,R_4$, then the hypothesis returned by $A$ has error of at most $\epsilon$.
  • For each $i\in\{1,\ldots,4\}$, upper bound the probability that $S$ does not contain an example from $R_i$.
  • Use the union bound to conclude the argument.

From the solution manual, here is the answer on page 2:

Fix some distribution $\mathcal{D}$ over $\mathcal{X}$, and define $R^*$ as in the hint. Let $f$ be the hypothesis associated with $R^*$. Given a training set $S$, denote by $R(S)$ the rectangle returned by the proposed algorithm and by $A(S)$ the corresponding hypothesis. The definition of algorithm $A$ implies that $R(S)\subseteq R^*$ for every $S$. Thus, $$L_{(\mathcal{D},f)}(A(S))=\mathcal{D}(R^*\setminus R(S))$$

Fix some $\epsilon\in(0,1)$. Define $R_1,R_2,R_3,R_4$ as in the hint. For each $i\in[4]$, define the event $$F_i=\{S|_x:S|_x\cap R_i=\emptyset\}$$

Applying the union bound we obtain

$$\mathcal{D}^m(\{S:L_{(\mathcal{D},f)}(A(S))\gt \epsilon\})\leq \mathcal{D}^m\left(\bigcup^4_{i=1}F_i\right)\leq \sum^4_{i=1}\mathcal{D}^m(F_i)$$

So the above is what I don't understand:

Why is $$\mathcal{D}^m(\{S:L_{(\mathcal{D},f)}(A(S))\gt \epsilon\})\leq \mathcal{D}^m\left(\bigcup^4_{i=1}F_i\right)$$

I know what these quantities mean:

  • $\mathcal{D}^m(\{S:L_{(\mathcal{D},f)}(A(S))\gt \epsilon\})$ is "the probability of observing a sample $S$ such that, when algorithm $A$ is applied to $S$, the true error of the returned hypothesis is greater than $\epsilon$."
  • $\mathcal{D}^m\left(\bigcup^4_{i=1}F_i\right)$ is "the probability of observing a sample that misses at least one of the rectangles $R_1,R_2,R_3,R_4$."

But why is the inequality true? The root of my confusion is that I can't see how $\epsilon$ and $A(S)$ are supposed to be involved in interpreting this inequality. That is, I don't understand, geometrically, how $A(S)$ and $\epsilon$ define subsets in the context of the figure above. I can easily picture the sets $F_i$, though.


So in this question $R^*$ is the rectangle corresponding to the true labeling function $f$: in the population, any point inside $R^*$ is positive and any point outside $R^*$ is negative. Since $R(S)\subseteq R^*$, the hypothesis $A(S)$ correctly classifies every positive point inside $R(S)$ and every negative point outside $R^*$, but it errs exactly on the points lying inside $R^*$ but outside $R(S)$. The true error of $A(S)$ is therefore the probability mass of the region $R^*\setminus R(S)$. Now, if $S$ contains a (positive) example in each of the strips $R_1,R_2,R_3,R_4$, then $R(S)$ reaches into all four strips, so the error region $R^*\setminus R(S)$ is covered by $R_1\cup R_2\cup R_3\cup R_4$, and the true error is at most $4\cdot\frac{\epsilon}{4}=\epsilon$.
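Written out as set inclusions, this is exactly the step the question asks about:

$$S|_x\cap R_i\neq\emptyset\ \textrm{for all}\ i\in[4]\;\Longrightarrow\; R^*\setminus R(S)\subseteq\bigcup_{i=1}^{4}R_i\;\Longrightarrow\; L_{(\mathcal{D},f)}(A(S))\leq\sum_{i=1}^{4}\mathcal{D}(R_i)=\epsilon.$$

Taking the contrapositive, if $L_{(\mathcal{D},f)}(A(S))>\epsilon$, then $S|_x\cap R_i=\emptyset$ for at least one $i$. That is,

$$\{S:L_{(\mathcal{D},f)}(A(S))\gt\epsilon\}\subseteq\bigcup_{i=1}^{4}F_i,$$

and since probability measures are monotone with respect to inclusion, applying $\mathcal{D}^m$ to both sides gives the inequality from the solution manual.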

Now, $R_1,R_2,R_3,R_4$ were constructed so that the probability mass of each under the distribution $\mathcal{D}$ that generates the samples is exactly $\frac{\epsilon}{4}$. So for each example drawn, the probability that it does not lie in $R_1$ is $1-\frac{\epsilon}{4}$, and similarly for $R_2,R_3,R_4$. Since the $m$ examples are drawn i.i.d., the probability that none of them lies in $R_1$ is $\mathcal{D}^m(F_1)=\left(1-\frac{\epsilon}{4}\right)^m$. From here the union bound and the inequality $1-x\leq e^{-x}$ finish the argument.
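Spelling out that last step:

$$\mathcal{D}^m\left(\bigcup_{i=1}^{4}F_i\right)\leq\sum_{i=1}^{4}\mathcal{D}^m(F_i)=4\left(1-\frac{\epsilon}{4}\right)^m\leq 4e^{-\epsilon m/4},$$

and $4e^{-\epsilon m/4}\leq\delta$ exactly when $m\geq\frac{4\log(4/\delta)}{\epsilon}$, which is the sample size assumed in the exercise.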