Is the Law of Large Numbers empirically proven?

Does this reflect the real world and what is the empirical evidence behind this?

[Wikipedia illustration of the law of large numbers]

Layman here so please avoid abstract math in your response.

The Law of Large Numbers states that the average of the results from multiple trials will tend to converge to its expected value (e.g. 0.5 in a coin toss experiment) as the sample size increases. The way I understand it, while the first 10 coin tosses may result in an average closer to 0 or 1 than to 0.5, after 1000 tosses a statistician would expect the average to be very close to 0.5, and exactly 0.5 in the limit of infinitely many trials.

Given that a coin has no memory and each coin toss is independent, what physical laws would determine that the average of all trials will eventually reach 0.5? More specifically, why does a statistician believe that a random event with 2 possible outcomes will have a close-to-equal amount of both outcomes over, say, 10,000 trials? What prevents the coin from landing on heads 9900 times instead of 5200?

Finally, since gambling and insurance institutions rely on such expectations, are there any experiments that have conclusively shown the validity of the LLN in the real world?

EDIT: I do differentiate between the LLN and the Gambler's fallacy. My question is NOT if or why any specific outcome or series of outcomes become more likely with more trials--that's obviously false--but why the mean of all outcomes tends toward the expected value?

FURTHER EDIT: LLN seems to rely on two assumptions in order to work:

  1. The universe is indifferent towards the result of any one trial, because each outcome is equally likely
  2. The universe is NOT indifferent towards any one particular outcome coming up too frequently and dominating the rest.

Obviously, we as humans would label a 50/50 or similar distribution in a coin toss experiment "random", but if heads or tails turned out to be, say, 60-70% after thousands of trials, we would suspect there is something wrong with the coin and that it isn't fair. Thus, if the universe is truly indifferent towards the average of large samples, there is no way we can have true randomness and consistent predictions--there will always be a suspicion of bias unless the total distribution is somehow kept in check by something that preserves the relative frequencies.

Why is the universe NOT indifferent towards big samples of coin tosses? What is the objective reason for this phenomenon?

NOTE: A good explanation would not be circular: justifying probability with probabilistic assumptions (e.g. "it's just more likely"). Please check your answers, as most of them fall into this trap.


Solution 1:

Reading between the lines, it sounds like you are committing the fallacy of the layman interpretation of the "law of averages": that if a coin comes up heads 10 times in a row, then it needs to come up tails more often from then on, in order to balance out that initial asymmetry.

The real point is that no divine presence needs to take corrective action in order for the average to stabilize. The simple reason is attenuation: once you've tossed the coin another 1000 times, the effect of those initial 10 heads has been diluted to mean almost nothing. What used to look like 100% heads is now a small blip only strong enough to move the needle from 50% to 51%.
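Here is a minimal sketch of that dilution effect, using nothing but Python's built-in `random` module (the seed and counts are arbitrary choices for illustration):

```python
import random

random.seed(1)  # arbitrary seed, purely for reproducibility

# Start from an extreme streak: 10 heads in a row (1 = heads, 0 = tails).
flips = [1] * 10
print(sum(flips) / len(flips))   # 1.0 -- "100% heads" so far

# Add 1000 ordinary fair flips; nothing corrects for the streak,
# it simply gets diluted.
flips += [random.randint(0, 1) for _ in range(1000)]
print(sum(flips) / len(flips))   # typically just a shade above 0.5
```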

Now combine this observation with the easily verified fact that 9900 out of 10000 heads is simply a less common combination than 5000 out of 10000. The reason for that is combinatorial: there is simply less freedom in hitting an extreme target than a moderate one.

To take a tractable example, suppose I ask you to flip a coin 4 times and get 4 heads. If you flip tails even once, you've failed. But if instead I ask you to aim for 2 heads, you still have options (albeit slimmer) no matter how the first two flips turn out. Numerically, we can see that 2 out of 4 can be achieved in 6 ways: HHTT, HTHT, HTTH, THHT, THTH, TTHH. But the 4-out-of-4 goal can be achieved in only one way: HHHH. If you work out the numbers for 9900 out of 10000 versus 5000 out of 10000 (or any specific number in that neighbourhood), that disparity becomes truly immense.
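If you want to check those counts yourself, a couple of lines of Python (using the standard `math.comb`) reproduce the 6-versus-1 count and give a feel for how immense the 10,000-flip disparity is:

```python
from math import comb

# 4 flips: exactly 2 heads vs. exactly 4 heads
print(comb(4, 2), comb(4, 4))            # 6 1

# 10,000 flips: 5,000 heads vs. 9,900 heads
ratio = comb(10_000, 5_000) // comb(10_000, 9_900)
print(len(str(ratio)))                   # the ratio runs to thousands of digits
```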

To summarize: it takes no conscious effort to get an empirical average to tend towards its expected value. In fact it would be fair to think in the exact opposite terms: the effect that requires conscious effort is forcing the empirical average to stray from its expectation.

Solution 2:

Nice question! In the real world, we don't get to let $n \to \infty$, so the question of why the LLN should be of any comfort is important.

The short answer to your question is that we cannot empirically verify the LLN, since we can never perform an infinite number of experiments. It's a theoretical result that is very well founded, but, as with all applied mathematics, the question of whether a particular model or theory holds is a perennial concern.

A more useful tool from a statistical standpoint is the Central Limit Theorem, together with the various probability inequalities (Chebyshev, Markov, Chernoff, etc.). These allow us to place bounds on, or approximate, the probability of our sample average being far from the true value for a finite sample.
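As a rough illustration (not a full treatment), Chebyshev's inequality for a fair coin, whose per-flip variance is $0.25$, already gives a finite-sample bound of the kind meant here:

```python
def chebyshev_bound(n: int, eps: float, var: float = 0.25) -> float:
    """Chebyshev's bound on P(|sample mean - true mean| >= eps) for n
    independent trials with per-trial variance `var` (0.25 for a fair coin)."""
    return min(1.0, var / (n * eps ** 2))

# Chance that 10,000 fair flips give a heads frequency outside [0.45, 0.55]:
print(chebyshev_bound(10_000, 0.05))   # 0.01 -- at most 1%

# Bound on the "9,900 heads out of 10,000" scenario (mean off by 0.49):
print(chebyshev_bound(10_000, 0.49))   # ~1e-4, and this bound is extremely loose
```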

As for an actual experiment to test the LLN, one can hardly do better than John Kerrich's 10,000-flip coin experiment-- he got 50.67% heads!
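A quick back-of-the-envelope check of that figure (taking the 50.67% at face value): for 10,000 flips of a fair coin, the standard deviation of the heads proportion is $0.5/\sqrt{10000} = 0.005$, so Kerrich's result sits well within ordinary fluctuation:

```python
n = 10_000
p_hat = 0.5067               # Kerrich's reported heads frequency
se = 0.5 / n ** 0.5          # standard deviation of the proportion for a fair coin
print((p_hat - 0.5) / se)    # about 1.3 standard deviations -- unremarkable
```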

So, in general, I would say the LLN is empirically well supported by the fact that scientists from all fields rely upon sample averages to estimate models, and this approach has been largely successful; the sample averages appear to be converging nicely for finite, feasible sample sizes.

There are "pathological" cases that one can construct (I'll spare you the details) where one needs astronomical sample sizes to get a reasonable probability of being close to the true mean. This is apparent if you are using the Central Limit Theorem, but the LLN is simply not informative enough to give me much comfort in day-to-day practice.

The physical basis for probability

It seems you still have an issue with why long-run averages exist in the real world at all, apart from what probability theory says about the behavior of these averages assuming they exist. Let me state a fact that may help you:

Fact: Neither probability theory nor the existence of long-run averages requires randomness!

The determinism vs. indeterminism debate is for philosophers, not mathematicians. The notion of probability as a physical observable comes from ignorance of, or the absence of, the detailed dynamics of what you are observing. You could just as easily apply probability theory to a boring ol' pendulum as to the stock market or coin flips...it's just that with pendulums we have a nice, detailed theory that allows us to make precise estimates of future observations. I have no doubt that a full physical analysis of a coin flip would allow us to predict which face would come up...but in reality, we will never know this!

This isn't an issue, though. We don't need to assume a guiding hand or true indeterminism to apply probability theory. Let's say that coin flips are truly deterministic; we can still apply probability theory meaningfully if we assume a couple of basic things:

  1. The underlying process is "ergodic"...okay, this is a bit technical, but it basically means that the process dynamics are stable over the long term (e.g., we are not flipping coins in a hurricane or where tornadoes pop in and out of the vicinity!). Note that I said nothing about randomness...this could be a totally deterministic, albeit very complex, process (see the sketch just after this list)...all we need is that the dynamics are stable (i.e., we could write down a series of equations with specific parameters for the coin flips and they wouldn't change from flip to flip).
  2. The values the process can take on at any time are "well behaved". Basically, as with the Cauchy distribution discussed below, the system should not keep producing values comparable to, or exceeding, the sum of all previous observations. It may happen once in a while, but it should become very rare, very fast (the precise definition is somewhat technical).
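The sketch promised above: a completely deterministic toy process -- the fractional parts of $n\sqrt{2}$, which is my own choice of example, not anything specific to coins -- whose running average still settles down, because the dynamics are stable and the values are bounded:

```python
import math

# A fully deterministic sequence: x_n = fractional part of n * sqrt(2).
# No randomness anywhere, yet the dynamics are stable and the values are
# bounded in [0, 1), so a long-run average exists (it converges to 0.5).
alpha = math.sqrt(2)
total = 0.0
for n in range(1, 1_000_001):
    x = (n * alpha) % 1.0
    total += x
    if n in (10, 1_000, 100_000, 1_000_000):
        print(n, round(total / n, 4))
```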

With these two assumptions, we now have a physical basis for the existence of a long-run average of a physical process. Now, if it's complicated, then instead of using physics to model it exactly, we can apply probability theory to describe the statistical properties of this process (i.e., its behavior aggregated over many observations).

Note that the above is independent of whether or not we have selected the correct probability model. Models are made to match reality...reality does not conform itself to our models. Therefore, it is the job of the modeler, not nature or divine providence, to ensure that the results of the model match the observed outcomes.

Hope this helps clarify when and how probability applies to the real world.

Solution 3:

This isn't an answer, but I thought this group would appreciate it. Just to show that the behavior in the graph above is not universal, I plotted the sequence of sample averages for a standard Cauchy distribution for $n = 1, \dots, 10^6$. Note how, even at extremely large sample sizes, the sample average jumps around.

If my computer weren't so darn slow, I could increase this by another order of magnitude and you'd see no difference. The sample average of a Cauchy distribution behaves nothing like that of coin flips, so one needs to be careful about invoking the LLN: the expected value of your underlying process needs to exist first!

[Plot: running sample averages of $10^6$ standard Cauchy draws]
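For anyone who wants to reproduce the qualitative behavior without the original data, here is a small simulation sketch (standard Cauchy draws via the inverse-CDF tangent trick; the seed and checkpoints are arbitrary):

```python
import math
import random

random.seed(7)  # arbitrary

n = 1_000_000
coin_sum = 0.0
cauchy_sum = 0.0
for i in range(1, n + 1):
    coin_sum += random.randint(0, 1)                           # fair coin flip
    cauchy_sum += math.tan(math.pi * (random.random() - 0.5))  # standard Cauchy draw
    if i in (100, 10_000, 1_000_000):
        print(i, coin_sum / i, cauchy_sum / i)
# The coin average settles near 0.5; the Cauchy average keeps lurching
# around, because there is no expected value for it to converge to.
```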

Response to OP concerns

I did not bring this example up to worry you further, but merely to point out that "averaging" does not always reduce the variability of an estimate. The vast majority of the time we are dealing with phenomena that possess an expected value (e.g., tosses of a fair coin). However, the Cauchy is pathological in this regard: it does not possess an expected value, so there is no number for your sample averages to converge to.

Now, many moons ago when I first encountered this fact, it blew my mind...and shook my confidence in statistics for a short time! However, I've come to be comfortable with it. At the intuitive level (and as many of the posters here have pointed out), what the LLN relies upon is the fact that no single outcome can consistently dominate the sample average...sure, the first few tosses have a large influence, but after you've accumulated $10^6$ tosses, you would not expect the next toss to change your sample average from, say, 0.1 to 0.9, right? It's just not mathematically possible.
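For what it's worth, that "not mathematically possible" claim can be made precise with one line of algebra (my own addition, not part of the original post): adding one more observation $x_{n+1}$ changes the running average by

$$\bar{x}_{n+1}-\bar{x}_n=\frac{n\,\bar{x}_n+x_{n+1}}{n+1}-\bar{x}_n=\frac{x_{n+1}-\bar{x}_n}{n+1},$$

so for coin tosses, where both $x_{n+1}$ and $\bar{x}_n$ lie in $[0,1]$, one toss can move the average by at most $1/(n+1)$ (about $10^{-6}$ after a million tosses), while for the Cauchy there is no such bound, because $x_{n+1}$ itself is unbounded.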

Now enter the Cauchy distribution...it has the peculiar property that, no matter how many values you are currently averaging over, the absolute value of the next observation has a non-negligible chance (this part is somewhat technical, so maybe just accept the point) of being comparable to -- or much larger than -- the sum of everything observed so far. Take a moment to think about this: it means that at any moment, your sample average can be settling toward some number, then WHAM, it gets shot off in a different direction. This will happen infinitely often, so your sample average will never settle down the way it does for processes that possess an expected value (e.g., coin tosses, normally distributed variables, Poisson, etc.). Thus, you will never have an observed sum and an $n$ large enough to swamp the next observation.

I've asked @sonystarmap if he/she would mind calculating the sequence of medians, as opposed to the sequence of averages, in their post (similar to my post above, but with 100x more samples!). What you should see is that the median of a sequence of Cauchy random variables does converge in LLN fashion. This is because the Cauchy, like all random variables, does possess a median. This is one of the many reasons I like using medians in my work, where normality is almost surely (sorry, couldn't help myself) false and there are extreme fluctuations. Not to mention that the sample median minimizes the average absolute deviation.

Second Addition: Cauchy DOES have a Median

To add another detail (read: wrinkle) to this story, the Cauchy does have a median, and so the sequence of sample medians does converge to the true median (i.e., $0$ for the standard Cauchy). To show this, I took the exact same sequence of standard Cauchy variates I used to make my first graph of the sample averages, took the first 20,000 of them, and broke them up into four intervals of 5,000 observations each (you'll see why in a moment). I then plotted the sequence of sample medians as the sample size approaches 5,000 for each of the four independent sequences. Note the dramatic difference in convergence properties!

This is another application of the law of large numbers, but to the sample median. Details can be seen here.

[Plot: running sample medians of four independent sequences of 5,000 standard Cauchy variates each]
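If you'd like to see this without the original figures, a short simulation sketch (again using the tangent trick for standard Cauchy draws; seed and checkpoints are arbitrary) shows the running sample median settling near $0$:

```python
import math
import random
import statistics

random.seed(3)  # arbitrary

draws = []
for i in range(1, 5_001):
    draws.append(math.tan(math.pi * (random.random() - 0.5)))  # standard Cauchy draw
    if i in (10, 100, 1_000, 5_000):
        # Unlike the running mean, the running median settles down,
        # approaching the true median (0 for the standard Cauchy).
        print(i, round(statistics.median(draws), 4))
```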

Solution 4:

One has to distinguish between the mathematical model of coin tossing and factual coin tossing in the real world.

The mathematical model has been set up in such a way that it provably behaves according to the rules of probability theory. These rules do not come out of thin air: they encode and describe, in the most economical way, what we observe when we toss real coins.

The deep problem is: why do real coins behave the way they do? I'd say this is a question for physicists. An important point is symmetry: if there is a clear-cut "probability" for heads, symmetry demands that it should be ${1\over2}$. Concerning independence: there are so many physical influences determining the outcome of the next toss that the face the coin showed when we picked it up from the table seems negligible. And so on. This is really a matter of philosophy of physics, and I'm sure there are dozens of books dealing with exactly this question.

Solution 5:

Based on your remarks, I think you are actually asking

"Do we observe the physical world behaving in a mathematically predictable way?"

"Why should it do so?"

Leading to:

"Will it continue to do so?"

See, for example, this Philosophy Stack Exchange question.

My take on the answer is that, "Yes", for some reason the physical universe seems to be a machine obeying fixed laws, and this is what allows science to use mathematics to predict behaviour.

So, if the coin is unbiased and the world behaves consistently, then the number of heads will vary in a predictable way.

But please note that it is not expected to converge to exactly half. In fact, the typical excess or deficit of heads grows like $\sqrt N$, which actually increases with $N$. It is the excess as a proportion of the total number of trials $N$ that goes to zero.
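A small simulation sketch of that point (each run of $N$ flips is generated as $N$ random bits; the run count and seed are arbitrary choices of mine):

```python
import random
import statistics

random.seed(11)  # arbitrary

# For each N, simulate many runs of N fair flips and record the typical
# absolute excess |heads - N/2| and that excess as a proportion of N.
for n in (100, 10_000, 1_000_000):
    excesses = []
    for _ in range(500):
        heads = bin(random.getrandbits(n)).count("1")  # heads among n fair flips
        excesses.append(abs(heads - n / 2))
    typical = statistics.mean(excesses)
    print(n, round(typical, 1), round(typical / n, 6))
# The absolute excess grows roughly like sqrt(N) (about 4, 40, 400 here),
# while the proportional excess shrinks toward zero.
```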

However, no one can ever prove in principle whether, for example, the universe actually has a God who decides how each coin will fall. I recall that in Peter Bernstein's book about risk, the story is told that the Romans (who did not have probability as a concept) had rules for knucklebone-based games that effectively assumed this.

Finally, if you ask which state of affairs is "well supported by evidence", the evidence available would include at least all of science and the finance industry. That's enough for most of us.