Why does Monte-Carlo integration work better than naive numerical integration in high dimensions?
Can anyone explain simply why Monte-Carlo works better than naive Riemann integration in high dimensions? I do not understand how choosing the evaluation points at random can yield a more precise result than distributing them evenly over the domain.
More precisely:
Let $f:[0,1]^d \to \mathbb{R}$ be a continuous, bounded, integrable function, with $d\geq3$. I want to compute $A=\int_{[0,1]^d} f(x)\,dx$ using $n$ points. Compare two simple methods.
The first method is the Riemann approach. Let $x_1, \dots, x_n$ be $n$ regularly spaced points in $[0,1]^d$ and $A_r=\frac{1}{n}\sum_{i=1}^n f(x_i)$. I have that $A_r \to A$ as $n\to\infty$. The error will be of order $O(\frac{1}{n^{1/d}})$.
The second method is the Monte-Carlo approach. Let $u_1, \dots, u_n$ be $n$ points chosen randomly but uniformly over $[0,1]^d$. Let $A_{mc}=\frac{1}{n}\sum_{i=1}^n f(u_i)$. The central limit theorem tells me that $A_{mc} \to A$ as $n\to \infty$ and that $A_{mc}-A$ will in the limit be a Gaussian random variable centered at $0$ with variance $O(\frac{1}{n})$. So with high probability the error will be smaller than $\frac{C}{\sqrt{n}}$, where $C$ does not depend (much?) on $d$.
An obvious problem with the Riemann approach is that if I want to increase the number of points while keeping a regular grid, I have to go from $n=k^d$ to $n=(k+1)^d$, which adds a lot of points. I do not have this problem with Monte-Carlo.
But if the number of points is fixed at $n$, does Monte-Carlo really yield better results than Riemann? It seems true in most cases, but I do not understand how choosing the points randomly can be better. Does anybody have an intuitive explanation for this?
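A quick numerical sketch of the comparison in the question, with $d=5$ and $n=3^5=243$ points for both methods (the test function $e^{-\|x\|^2}$ is my own arbitrary choice, not from the question):

```python
import itertools
import math
import random

def f(x):
    # Arbitrary smooth test function: exp(-||x||^2)
    return math.exp(-sum(t * t for t in x))

d, k = 5, 3          # 5 dimensions, 3 grid points per axis
n = k ** d           # 243 points total for both methods

# Riemann-style estimate on a regular grid (one point per cell)
grid_1d = [(i + 0.5) / k for i in range(k)]
A_r = sum(f(x) for x in itertools.product(grid_1d, repeat=d)) / n

# Monte Carlo estimate with the same number of points
random.seed(0)
A_mc = sum(f([random.random() for _ in range(d)]) for _ in range(n)) / n

# Reference value: the integrand is separable, so a fine 1D sum suffices
m = 100_000
I1 = sum(math.exp(-((i + 0.5) / m) ** 2) for i in range(m)) / m
A = I1 ** d
```

With only $3$ grid points per axis the grid is very coarse, while the same 243 samples spent by Monte Carlo already give an error on the order of $1/\sqrt{243}$ times the integrand's standard deviation.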
I don't think it's fair to compare Riemann integration to Monte Carlo. First off, using the usual Riemann sums is a terrible way of computing integrals, especially ones that oscillate wildly or are just plain naughty in certain areas. Just look at $d=1$: Riemann integration is $O(1/n)$, whereas the trapezoid rule gives $O(1/n^2)$, assuming your function is twice continuously differentiable. So for starters, instead of Riemann integration one has a plethora of numerical quadrature methods available. There are dozens of such methods, and some adaptive ones try to concentrate the quadrature sample points in areas where the function behaves badly. The midpoint rule is a typical non-adaptive way of obtaining better accuracy, by simply evaluating at the midpoint of each cell instead of an endpoint.
However, as Hagen pointed out, you have to deal with the curse of dimensionality. Quadrature methods start failing very badly in higher dimensions because you need a huge number of points. So you need to be smart about it: focus on the regions that contribute most to the integral. Something like VEGAS Monte Carlo takes advantage of this by sampling more densely near areas of large function values. By doing so, you concentrate your quadrature points where the function behaves the worst.
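A minimal one-dimensional sketch of the importance-sampling idea behind methods like VEGAS (the target function $3x^2$ and sampling density $2x$ are my own toy choices): draw samples from a density that is large where the integrand is large, then divide each sample by that density.

```python
import math
import random

random.seed(1)

def f(x):
    return 3 * x * x        # target: integral of 3x^2 over [0,1] is exactly 1

def p(x):
    return 2 * x            # sampling density, large where f is large

n = 10_000
# Draw from p by inverse transform: if u ~ Uniform(0,1), sqrt(u) has density 2x
samples = [math.sqrt(random.random()) for _ in range(n)]

# Importance-sampling estimator: average of f/p over the samples
est = sum(f(x) / p(x) for x in samples) / n
```

Because $f/p = 1.5x$ varies much less than $f$ itself, the estimator's variance is smaller than that of plain uniform sampling; VEGAS builds such a density adaptively rather than taking it as given.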
The simple Monte Carlo integration algorithm just samples random points. If your function isn't too crazy, this will tend to do slightly better than a fixed grid, especially on functions that vary (but not too much!). As a thought experiment, imagine trying to integrate something very sharply peaked and zero elsewhere. Then Monte Carlo will do a terrible job, because there are only a few points where the function is nonzero, whereas Riemann integration may do slightly better, especially if the width of the peak is a bit bigger than the grid spacing. On the other end of the spectrum, consider a function whose derivative goes from small to large in small bursts. Then Riemann integration will be pretty terrible, but Monte Carlo will tend to average out better.
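The sharp-peak thought experiment is easy to make concrete in one dimension (the peak location and width below are arbitrary choices of mine): when the peak's full width slightly exceeds the grid spacing, the grid is guaranteed to register it, while uniform random samples usually miss it.

```python
import random

def peak(x, center=0.37, half_width=0.006):
    # Indicator of a narrow spike; true integral = 2 * half_width = 0.012
    return 1.0 if abs(x - center) < half_width else 0.0

n = 100
true_val = 0.012

# Regular grid: spacing 1/n = 0.01 is smaller than the peak's full width
# 0.012, so at least one grid point must land inside the spike.
grid_est = sum(peak(i / n) for i in range(n)) / n

# Monte Carlo: each sample hits the spike with probability only 0.012,
# so most runs see zero or one hit and the estimate is very noisy.
random.seed(0)
mc_est = sum(peak(random.random()) for _ in range(n)) / n
```

Here the grid reliably returns $0.01$ (one hit out of 100 points, close to the true $0.012$), while the Monte Carlo estimate jumps between $0$, $0.01$, $0.02$, ... depending on luck.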
Well, I guess you answered the question yourself, minus the intuitive explanation. If you accept that the Riemann integration error behaves like $O(\frac{1}{n^{1/d}})$ and the MC error like $\frac{C}{\sqrt{n}}$, then it's obvious that MC will in general give lower errors for $d > 2$. The only thing that might still prevent us from being sure is that parameter $C$, which does not depend (much?) on $d$. That might conceivably mean that MC still gives a larger error for, say, $d = 3$ or $d = 4$, but since $C$ really doesn't depend much on $d$, it's clear that as $d$ grows further, MC wins hands down.
As Alex mentioned above, you did sell the Riemann approach a bit short by breaking each dimension into equal subintervals of length $\Delta s$ and evaluating at the endpoints (which does give an $O(\frac{1}{n^{1/d}})$ error). To be fairer to the "placing points on a regular grid" approach (which is, I presume, what you are in essence contrasting with placing them randomly), you should still use a regular grid but place the points "better" in your hypercube. By "better" I mean putting the first point on each dimension's axis at $0.5\,\Delta s$, the second at $1.5\,\Delta s$, etc., instead of at $\Delta s$, $2\Delta s$, etc. This gives an error of $O(\frac{1}{n^{2/d}})$ instead, as it effectively uses the midpoint rule. This is the best you can do when calculating an integral by just "sampling" points placed on a regular grid, but Monte Carlo will still beat it easily beyond $d = 4$.
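The endpoint-versus-midpoint difference is easy to check numerically in one dimension (the quadratic test function is my own choice): with $n=100$ points, the endpoint error should be about $\frac{1}{2n}=0.005$, while the midpoint error should be about $\frac{1}{12n^2}\approx 8\times10^{-6}$.

```python
# Compare left-endpoint sampling (error O(1/n)) with midpoint sampling
# (error O(1/n^2)) for f(x) = x^2 on [0,1], whose integral is exactly 1/3.
n = 100
f = lambda x: x * x
true_val = 1.0 / 3.0

left_est = sum(f(i / n) for i in range(n)) / n          # points at 0, 1/n, 2/n, ...
mid_est = sum(f((i + 0.5) / n) for i in range(n)) / n   # points at 0.5/n, 1.5/n, ...

left_err = abs(left_est - true_val)   # roughly 1/(2n) = 0.005
mid_err = abs(mid_est - true_val)     # roughly 1/(12 n^2), about 8.3e-6
```

Both use exactly $n$ evaluations on a regular grid; only the placement within each cell changes, and it buys a whole extra order of accuracy.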
So this is where it still seems counter-intuitive: how can MC beat the regular grid with the same total number of points when it just places them randomly?! OK, so here's the interesting part, and really the answer to your question: placing points on a regular grid is not the best you can do integration-wise, as it leaves large holes that can be avoided by smarter placement. Think of a 3D regular grid. When you look at it from any side, i.e. at a 2D projection of the points, all the points from the third dimension fall on the same spots (they overlap) and thus leave large holes that don't help the integration. The same happens in higher dimensions, so for better integral estimation every lower-dimensional face (projection) should be filled with points as homogeneously and uniformly as possible, without points overlapping and creating holes. This is what quasi-Monte Carlo sequences try to do: they place points (deterministically) at locations that (should) avoid overlapping in lower-dimensional projections, and for this reason they can beat the regular grid even in 2D and 3D.
Think then of Monte Carlo as having the same advantage as QMC (i.e. leaving less glaring holes in lower-dimensional faces, precisely because it lacks the regular structure that is responsible for the holes) but with the extra disadvantage of not placing those points very uniformly and homogeneously (as a regular grid and QMC do). At some point (at around $d>4$) the advantage wins over the disadvantage and MC starts beating the regular grid.
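A tiny illustration of a quasi-Monte Carlo sequence (the 2D Halton sequence in bases 2 and 3, a standard construction, though the answer above does not name a specific one): each coordinate is a radical-inverse sequence, so the 1D projections never stack points on top of each other the way a grid does.

```python
def van_der_corput(i, base):
    # Radical inverse of integer i in the given base: mirror the digits of i
    # around the radix point, e.g. i = 6 = 110 in base 2 -> 0.011 = 0.375
    x, denom = 0.0, 1.0
    while i > 0:
        denom *= base
        i, digit = divmod(i, base)
        x += digit / denom
    return x

def halton_2d(n):
    # 2D Halton sequence: base 2 for the first coordinate, base 3 for the second
    return [(van_der_corput(i, 2), van_der_corput(i, 3)) for i in range(1, n + 1)]

points = halton_2d(8)
```

Note how the first coordinates come out as $0.5, 0.25, 0.75, 0.125, \dots$: every new point lands in the largest remaining gap of the projection, instead of on top of an existing point.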
EDIT: I just added a typical numerical experiment testing the methods here
As you say, one primary reason to choose an MC method is precisely that you don't use the same numbers of points. That said, we assume we must use $n$ samples. Your essential question is "does choosing $n$ points randomly do better or worse than putting them on a grid?"
The answer is: it depends. On two key things.
- What sort of function are you integrating? This is crucial in determining your $C$ which essentially measures the variance of the function.
- What 'random' do you use? Again, you say "uniformly", which is a very tight restriction on a technique most often used with some sort of importance sampling, but let's go with uniformly.
Now the point is essentially that the error of a method really depends on the function, and will vary drastically. For example, one could construct a function (linear?) for which an even grid always gives an exact answer but for which MC methods have nonzero error. And so on. Similarly, you could probably engineer worst-case situations in which even grids do badly at almost all scales because of wild fluctuations in one tiny corner, to which they are always sensitive but which do not contribute overall. Then MC might never see them and do better. Swings and roundabouts.
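The linear-function remark is easy to verify concretely (with my own choice $f(x)=x$ in one dimension): a midpoint grid integrates any linear function exactly, while uniform random sampling almost surely does not.

```python
import random

n = 100
f = lambda x: x          # true integral over [0,1] is exactly 0.5

# Midpoint grid: the midpoints average to exactly 1/2, so the grid
# estimate has zero error for any linear function.
grid_est = sum(f((i + 0.5) / n) for i in range(n)) / n

# Monte Carlo: the sample mean equals 0.5 only on a measure-zero event,
# so it almost surely carries an error of order 1/sqrt(n).
random.seed(0)
mc_est = sum(f(random.random()) for _ in range(n)) / n
```

This is the "it depends on the function" point in miniature: for this integrand the grid's error is identically zero at every $n$, and no amount of sampling luck lets plain MC match that.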