Why does the sup norm make the results of approximation theory independent from the unknown distribution of the input data?
I was reading the paper "Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality: a Review" and I was trying to understand the following statement in section 3.1:
On the other hand, our main results on compositionality require the sup norm in order to be independent from the unknown distribution of the input data. This is important for machine learning.
My questions are:
- What does it mean for the sup norm $\| f \|_{\infty} = \sup_{x \in X} |f(x)|$ to make the results of the paper independent of the input distribution?
- Additionally, why does the sup norm make the results of the paper independent of the input distribution?
- Why is that important for machine learning?
I don't know the answers, perhaps because of my lack of experience in functional analysis and approximation theory, but here are my guesses:
- I think what it means is that, since the paper is concerned with proving bounds on the smallest distance between a target function and a space of functions (a space of neural networks), denoted by the degree of approximation $\mathrm{dist}(f,V_N) = \inf_{P \in V_N} \|f - P \|_{\infty}$, the claim is that upper bounds on this quantity are independent of (not a function of) the probability distribution on the input space $X$, where $f:X \to Y$. Does it matter because it means the bounds apply to any distribution on $X$? (See the inequality I write out after this list for what I think the formal statement would be.) I guess the reason I find this confusing is that I don't particularly see an issue with the bound depending on the data distribution. What matters more, I think, is that the bound on the degree of approximation is not vacuous, i.e. that it is not infinity; if it is infinity for some distribution, then the results are useless. However, I don't see why independence of the distribution would matter; I'd assume that boundedness or compactness, rather than the distribution, is what matters (since that is what keeps things from exploding).
- I don't understand why the sup norm makes things not explode. The reason things don't explode should be boundedness or compactness, not anything to do with the sup norm. Obviously the sup norm implies things don't explode if it is bounded, but that happens because of an a priori boundedness/compactness assumption, not because of the sup norm, right?
- I guess it's important for machine learning because we care that results hold for any probability/data distribution. But as I've said before, I don't understand why talking about the data distribution matters. In my opinion it doesn't, since that's not what makes things explode; what matters is the boundedness/compactness of $X$, $f(x)$ and $Y$.
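To make my guess concrete, the formal statement I have in mind (my own attempt, not something from the paper) is that a sup-norm bound controls the error under every input distribution simultaneously: for any probability measure $\mu$ on $X$ and any $P \in V_N$,
$$\mathbb{E}_{x \sim \mu}\big[\,|f(x) - P(x)|^2\,\big] \;\le\; \|f - P\|_{\infty}^2,$$
so an upper bound on $\mathrm{dist}(f, V_N)$ in the sup norm immediately gives the same bound on the $L_2(\mu)$ error, no matter what $\mu$ is. Is that the point being made?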
Is this on the right track? Or am I misunderstanding the paper a lot?
Solution 1:
Though I can only guess at the authors' intention, I interpret this passage as follows:
Unlike the sup-norm, the $L_2$ norm (or any $L_p$ norm for $p < \infty$) can be "forgiving" of large errors if they occur "infrequently" over the domain, whereas the sup-norm always takes the worst error.

For a more concrete example, assume $d(x)=f(x)-g(x)$ is zero everywhere except on a measurable set $A$, on which $d(x) = n$ for some $n \in \mathbb{N}$. Let $\lambda$ be the Lebesgue measure. If $\lambda(A) = \frac{\epsilon^2}{n^2}$, i.e. the large error occurs "infrequently", then $\|d\|_2 = \epsilon$ while $\|d\|_\infty = n$.

This "forgiveness" property is not generally a bad thing, as in many cases it's actually quite desirable to ignore "infrequent" errors, but here lies the problem in the specific context of ML: we implicitly assumed that the measure of $A$ is proportional to how frequently the data falls in it, which might be a reasonable assumption in other contexts (for bounded domains it's the same as assuming a uniform distribution), whereas in ML we want to be agnostic to the data distribution. More specifically, consider the previous example, but now suppose the data distribution is concentrated exactly on $A$, i.e. $\mathrm{Pr}(x \in A) = 1$. Then the true (root-mean-square) error is $\sqrt{\mathbb{E}[d(x)^2]} = n$ -- quite far from the $\epsilon$ suggested by the $L_2$ bound. In other words, depending on the data distribution, the $L_2$ bound might be irrelevant, whereas the sup-norm bound is always relevant if you have no prior information on the distribution.
So to sum it up: under the general setting of ML, we do not make any assumptions on the data distribution, because we typically do not know what our learning algorithms will encounter in practice. Hence, it is preferable for theoretical results to be distribution-independent. In the context of this specific paper, which deals with the approximation properties of neural networks, using the sup-norm ensures that the degree of approximation (how large networks have to be to approximate a function) is not affected by the data distribution, because the sup-norm always takes the worst error, while under the $L_2$ norm the bounds might not be relevant for data distributions that treat seemingly "infrequent" parts of the domain as highly frequent ones.
As a final note, while it is generally preferable to have data-independent results (because we ideally want to prove that our algorithms work in any case), making assumptions on the data distribution can lead to significantly better guarantees in the cases where we believe they apply. At present, there are certain behaviors we observe in practice that "classical" ML theory cannot fully explain (e.g. why neural networks generalize in practice), and many believe that moving beyond the distribution-independent setting is a key step toward addressing these issues.