Since you mention the universal approximation theorem, I think the explanation you are looking for is one in the framework of approximation theory. In that regard, there are many works showing that deep networks allow for a more efficient approximation than shallow networks; I will list a couple of them below.

Let me recall a couple of results and a suitable framework below and, at the end, I also address polynomial and Fourier approximation.

1. Approximation theory of feed-forward networks

To understand this better, we need to give a complexity measure to neural networks. Recall that a feed-forward neural network is a function of the form $$ \Phi(x) = T_L(\rho(T_{L-1}(\rho(\dots \rho(T_1(x)))))), $$ where $\rho:\mathbb{R} \to \mathbb{R}$ is the so-called activation function, which we apply coordinate-wise here. The maps $T_\ell: \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $\ell \in \{1, \dots, L\}$, are affine linear maps, where $N_0, \dots, N_L \in \mathbb{N}$ and $L$ is the number of layers. Hence $\Phi$ is a map from $\mathbb{R}^{N_0}$ to $\mathbb{R}^{N_L}$. In this notation, the number $N_\ell$ corresponds to the number of neurons in the $\ell$-th layer.
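
For concreteness, here is a minimal NumPy sketch of such a network (my own illustration; the choice of ReLU as the activation $\rho$ and the random weights are assumptions, nothing from the question):

```python
import numpy as np

def relu(x):
    # a common choice for the activation rho, applied coordinate-wise
    return np.maximum(x, 0.0)

def feed_forward(x, weights, biases, rho=relu):
    # Evaluates Phi(x) = T_L(rho(T_{L-1}(... rho(T_1(x)) ...))),
    # where T_l(z) = A_l z + b_l and rho is applied after every map except the last.
    z = np.asarray(x, dtype=float)
    for A, b in zip(weights[:-1], biases[:-1]):
        z = rho(A @ z + b)
    return weights[-1] @ z + biases[-1]

# Example: N_0 = 3 inputs, hidden widths N_1 = 5, N_2 = 4, output width N_3 = 2 (L = 3).
rng = np.random.default_rng(0)
dims = [3, 5, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(m) for m in dims[1:]]
print(feed_forward(rng.standard_normal(3), weights, biases))
```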

In most works on this topic, people now measure the complexity of a network in one of two ways:

  1. Count the number of neurons: $N(\Phi) := \sum_{\ell = 1}^L N_\ell$.
  2. Count the number of parameters: Here, one observes that $T_\ell = A_\ell(\cdot) + b_\ell$ for a matrix $A_\ell$ and a vector $b_\ell$. A way to measure the complexity of the full network is using the number of nonzero entries of both $A_\ell$ and $b_\ell$: $W(\Phi) := \sum_{\ell = 1}^L \|A_\ell\|_0 + \|b_\ell\|_0$, where $\|\cdot\|_0$ denotes the number of nonzero entries of a matrix/vector. (Both measures are computed in the sketch below.)
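
Both measures are straightforward to compute from the matrices $A_\ell$ and vectors $b_\ell$; here is a small sketch in the same NumPy representation as above (again just my own illustration):

```python
import numpy as np

def num_neurons(weights):
    # N(Phi) = N_1 + ... + N_L: each matrix A_l has N_l rows, one per neuron
    return sum(A.shape[0] for A in weights)

def num_parameters(weights, biases):
    # W(Phi): number of nonzero entries of all matrices A_l and vectors b_l
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b))
               for A, b in zip(weights, biases))

# For the example network above (widths 3 -> 5 -> 4 -> 2):
# num_neurons(weights) == 5 + 4 + 2 == 11, and num_parameters(weights, biases)
# counts at most (3*5 + 5) + (5*4 + 4) + (4*2 + 2) == 54 nonzero entries.
```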

We have now associated a complexity measure to every network. This means we can ask a more nuanced question than the one of the universal approximation theorem.

Neural network approximation theory: Let $f \in L^\infty(\mathbb{R}^d)$ and $\epsilon>0$. Does there exist a network $\Phi$ with $L$ layers such that $$ \|f-\Phi\|_\infty \leq \epsilon, $$ and how large do $N(\Phi)$ and $W(\Phi)$ need to be?
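
To make this concrete, here is a toy illustration (my own, not taken from the literature below): a one-hidden-layer ReLU network realizing the piecewise linear interpolant of a smooth function on an interval. Doubling the number of neurons roughly quarters the sup-norm error, i.e. $\epsilon \sim N(\Phi)^{-2}$, which is exactly the kind of relationship between $\epsilon$ and $N(\Phi)$ (an approximation rate) that the question above asks about.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def relu_interpolant(f, a, b, n):
    # One-hidden-layer ReLU network (n hidden neurons) realizing the piecewise
    # linear interpolant of f at n + 1 uniform nodes in [a, b].
    xs = np.linspace(a, b, n + 1)
    ys = f(xs)
    slopes = np.diff(ys) / np.diff(xs)
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))  # slope changes at the kinks
    return lambda x: ys[0] + sum(c * relu(np.asarray(x, dtype=float) - xk)
                                 for c, xk in zip(coeffs, xs[:-1]))

grid = np.linspace(0.0, np.pi, 10_001)
for n in (4, 8, 16, 32):
    phi = relu_interpolant(np.sin, 0.0, np.pi, n)
    print(n, np.max(np.abs(np.sin(grid) - phi(grid))))  # sup error decays roughly like n**-2
```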

2. Power of depth

The beautiful thing about the question above is that, for certain types of functions $f$, if $L$ is allowed to be large, then for the same accuracy $\epsilon$ the numbers of neurons/parameters, $N(\Phi)$ and $W(\Phi)$, can be significantly smaller than if $L$ is small. Let me list a number of results in this direction; this list is certainly not complete.

  • The Power of Depth for Feedforward Neural Networks: For radial functions, deep networks are more efficient than shallow ones.
  • Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks: Radial functions are hard to approximate by small networks. Additionally, smooth non-linear functions cannot be efficiently approximated by networks that are not sufficiently deep.
  • Benefits of depth in neural networks: There exist fixed-width deep networks that cannot be approximated by shallow networks unless those have exponentially more neurons.
  • Optimal approximation of piecewise smooth functions using deep ReLU neural networks: Depending on the smoothness of the pieces of a piecewise smooth function, networks need to have a certain depth to achieve optimal approximation rates, where an approximation rate is the relationship between $\epsilon$ and $N(\Phi)$ or $W(\Phi)$.

In all of these papers, functions are identified which, when approximated with shallow networks, require much larger $N(\Phi)$ and $W(\Phi)$ than when deep networks are used. A concrete toy example of such a separation is sketched below.
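
Here is the classical "sawtooth" example in the spirit of Benefits of depth in neural networks, sketched from memory (so treat the constants as illustrative): composing a hat function with itself $k$ times gives a function with $2^k$ linear pieces using only about $3k$ ReLU neurons arranged in depth $k$, while a one-hidden-layer ReLU network with $N$ neurons on $\mathbb{R}$ is piecewise linear with at most $N+1$ pieces and therefore needs on the order of $2^k$ neurons to do the same.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def hat(x):
    # hat(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1]; three ReLU neurons suffice
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def sawtooth(x, k):
    # k-fold composition hat o ... o hat: a depth-k ReLU network, ~3 neurons per layer
    y = np.asarray(x, dtype=float)
    for _ in range(k):
        y = hat(y)
    return y

x = np.linspace(0.0, 1.0, 2**14 + 1)
for k in (2, 4, 8):
    y = sawtooth(x, k)
    pieces = 1 + np.count_nonzero(np.diff(np.sign(np.diff(y))))  # count slope sign changes
    print(k, pieces)  # ~2**k linear pieces from only ~3*k neurons arranged in depth k
```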

There are also other advantages of deep networks over shallow networks from an approximation point of view, such as the fact that shallow networks do not allow for what people call localized approximation. But I do not want to go into detail on this.

3. Convolutional networks

Let me also make a second point. You ask "Why do deep networks work so well?". Another answer to this question is that they don't: plain deep feed-forward networks hardly work at all in practice, and in virtually every application people use convolutional neural networks instead. You could, of course, try to approach this question from an approximation-theoretic perspective again. I know of three works that explain the power of depth for CNNs:

  1. On the Expressive Power of Deep Learning: A Tensor Analysis
  2. Equivalence of approximation by convolutional neural networks and fully-connected networks
  3. Benefits of depth in neural networks

The first two works consider quite specific CNN architectures but highlight that deep CNNs yield better approximation rates for some functions than shallow networks. (In the second paper this is not explicit, but it shows that every approximation rate for feed-forward networks translates to a similar one for convolutional networks and vice versa, so the aforementioned results for feed-forward networks carry over.) The third result does not restrict the architecture, but only shows that for some very specific functions deep convolutional networks are better approximators than shallow ones.
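
As a side remark of my own (not taken from these papers): a convolutional layer fits directly into the framework of Section 1, since it is simply an affine map $T_\ell$ whose matrix is sparse, banded, and has shared entries. It therefore contributes far fewer nonzero entries to $W(\Phi)$ than a dense layer of the same size, and even fewer free parameters because of weight sharing. A minimal 1-D illustration:

```python
import numpy as np

def conv_as_matrix(kernel, n):
    # The (n - k + 1) x n banded matrix A with A @ x equal to the "valid"
    # sliding-window correlation of x with the length-k kernel.
    k = len(kernel)
    A = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        A[i, i:i + k] = kernel
    return A

kernel = np.array([1.0, -2.0, 1.0])   # a single length-3 filter
A = conv_as_matrix(kernel, n=10)
x = np.random.default_rng(1).standard_normal(10)
assert np.allclose(A @ x, np.correlate(x, kernel, mode="valid"))
print(np.count_nonzero(A))  # 24 nonzero entries (vs. 80 for a dense 8 x 10 matrix),
                            # and only 3 distinct values due to weight sharing
```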

Here a lot more work needs to be done, especially in understanding the role of pooling.

There are other aspects of deep convolutional networks that potentially make them work well, such as the fact that they introduce invariances, which makes them good classifiers. Since this is not approximation-theoretic, I shall not go into detail here, but I want to mention Mallat's original work on this: Understanding Deep Convolutional Networks.

4. Comparison to polynomials and Fourier series

You also asked for a comparison to polynomial approximation and Fourier approximation. The approximation spaces of these classical methods are well known, and one can simply check for which types of functions deep networks yield better approximation rates.

One point is that both polynomials and Fourier series are terrible at approximating functions with discontinuities. For example, in Optimal approximation of piecewise smooth functions using deep ReLU neural networks it was shown that deep networks approximate piecewise $n$-times differentiable functions essentially as well as $n$-times differentiable functions. The second point is that deep networks are very efficient at representing polynomials, as famously shown for ReLU networks in Error bounds for approximations with deep ReLU networks. Clearly, if polynomials can be represented efficiently, then every function that is well approximated by polynomials is also well approximated by networks, so polynomial approximation rates carry over. Similarly, Approximation properties of a multilayered feedforward artificial neural network shows that certain neural networks are at least as good as polynomial approximation. As for Fourier approximation, Deep Neural Network Approximation Theory has some results on how to re-approximate sinusoidal functions.
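
To illustrate the second point numerically, here is the well-known construction for the square function from Error bounds for approximations with deep ReLU networks, written down from memory (so treat the exact constants as an assumption): subtracting rescaled compositions of a hat function (the same one as in the sawtooth sketch above) from the identity gives a ReLU network of depth and size $O(m)$ that approximates $x^2$ on $[0,1]$ with sup-norm error $4^{-(m+1)}$.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
hat = lambda x: 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def relu_square(x, m):
    # f_m(x) = x - sum_{s=1}^m hat^(s)(x) / 4**s, where hat^(s) is the s-fold
    # composition; realizable as a ReLU network of depth and size O(m).
    x = np.asarray(x, dtype=float)
    g, out = x, x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out = out - g / 4.0**s
    return out

xs = np.linspace(0.0, 1.0, 10_001)
for m in (1, 2, 4, 8):
    err = np.max(np.abs(xs**2 - relu_square(xs, m)))
    print(m, err, 4.0**(-(m + 1)))  # the observed sup-norm error matches the bound
```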

Many other works show that DNNs are in fact optimal for approximating certain function classes in a certain framework, e.g. Optimal Approximation with Sparsely Connected Deep Neural Networks, but also Error bounds for approximations with deep ReLU networks. In this case, they are of course at least as good as polynomials or Fourier series.

In essence, deep networks are at least as good as polynomials/Fourier series in almost all approximation tasks, and in some tasks they are way better.


I would say the key to their success is their huge number of parameters, which allows for the modelling of very complex phenomena. Also, neural networks do not require the relationships between the variables to be linear (or, more generally, to follow any predefined pattern).

However, I think the key is in the parameters. If you take a neural network with a number of parameters comparable to a linear regression model, it will actually perform poorly!