Do Kolmogorov's axioms permit speaking of frequencies of occurrence in any meaningful sense?

Solution 1:

I. I agree with you that no version of the Law of Large Numbers tells us anything about real life frequencies, for the simple reason that no purely mathematical statement tells us anything about real life at all, without first giving the mathematical objects in it a "real life interpretation" (which can never be stated, let alone "proven", within mathematics itself).

Rather, I think of the LLN as something which, within any useful mathematical model of probabilities and statistical experiments, should hold true! In the sense that: If you show me a new set of axioms for probability theory, which you claim have some use as a model for real life dice rolling etc.; and those axioms do not imply some version of the Law of Large Numbers -- then I would dismiss your axiom system, and I think so should you.


II. Most people would agree there is a real life experiment which we can call "tossing a fair coin" (or "rolling a fair die", "spinning a fair roulette wheel" ...), where we have a clearly defined finite set of outcomes, none of the outcomes is more likely than any other, we can repeat the experiment as many times as we want, and the outcome of the next experiment has nothing to do with any of the outcomes we have seen so far.

And we could be interested in questions like: Should I play this game where I win/lose this much money in case ... happens? Is it more likely that after a hundred rolls, the sum of the numbers rolled is between 370 and 380, or between 345 and 350? Etc.

To gather quantitative insight into answering these questions, we need to model the real life experiment with a mathematical theory. One can debate (but again, such a debate happens outside of mathematics) what such a model could tell us, whether it could tell us something with certainty, whatever that might mean; but most people would agree that it seems we can get some insight here by doing some kind of math.

Indeed, we are looking for two things which only together have any chance to be of use for real life: namely, a "purely" mathematical theory, together with a real life interpretation (like a translation table) thereof, which allows us to perform the routine we (should) always do:

Step 1: Translate our real life question into a question in the mathematical model.

Step 2: Use our math skills to answer the question within the model.

Step 3: Translate that answer back into the real life interpretation.

The axioms of probability, as for example Kolmogorov's, do that: They provide us with a mathematical model which will give out very concrete answers. As with every mathematical model, those concrete answers -- say, $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$ -- are absolutely true within the mathematical theory (foundational issues à la Gödel aside for now). They also come with a standard interpretation (or maybe, a standard set of interpretations, one for each philosophical school). None of these interpretations is justifiable by mathematics itself; and what any result of the theory (like $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$) tells us about our real life experiment is not a mathematical question. It is philosophical, and very much up for debate. Maybe a frequentist would say: this means that if you roll 100 dice again and again (i.e. performing a kind of meta-experiment, where each individual experiment is already 100 "atomic experiments" averaged), then the relative frequency of ... is greater than the relative frequency of ... . Maybe a Bayesian would say: well, it means that if you have some money to spare, and somebody gives you the choice to bet on this or that outcome, you should bet on this, and not that. Etc.
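As an illustration of Step 2, here is a minimal Monte Carlo sketch (entirely my own; the `mean_of_rolls` helper is hypothetical, and I assume a fair six-sided die) that estimates those two model-internal probabilities by brute force:

```python
import random

def mean_of_rolls(n=100):
    """Sample mean of n rolls of an (assumed) fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

trials = 50_000  # arbitrary choice; more trials sharpen the estimates
means = [mean_of_rolls() for _ in range(trials)]
p_low = sum(3.45 <= m <= 3.5 for m in means) / trials
p_high = sum(3.7 <= m <= 3.8 for m in means) / trials
print(f"P(mean in [3.45, 3.50]) ~ {p_low:.4f}")   # comes out near 0.11
print(f"P(mean in [3.70, 3.80]) ~ {p_high:.4f}")  # comes out near 0.08
```

Of course, even running this simulation and reading off the numbers already smuggles in an interpretation (Steps 1 and 3); the code only dramatizes what the model computes internally.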


III. Now consider the following statement, which I claim would be accepted by almost everyone:

( $\ast$ ) "If you repeat a real life experiment of the above kind many times, then the sample means should converge to (become a better and better approximation of) the ideal mean".

A frequentist might smirkingly accept ($\ast$), but quip that it is true by definition, because he might claim that any definition of such an "ideal mean" beyond "what the sample means converge to" is meaningless. A Bayesian might explain the "ideal mean" as, well you know, the average -- like if you put it in a histogram, see, here is the centre of gravity -- the outcome you would bet on -- you know! And she might be content with that. And she would say, yes, of course that is related to relative frequencies exactly in the sense of ($\ast$).

I want to stress that ($\ast$) is not a mathematical statement. It is a statement about real life experiments, which we claim to be true, although we might not agree on why we do so: depending on your philosophical background, you can see it as a tautology or not; but even if you do, it is not a mathematical tautology (it's not a mathematical statement at all), just maybe a philosophical one.

And now let's say we do want a model-plus-translation-table for our experiments from paragraph II. Such a model should contain an object which models [i.e. whose "real life translation" is] one "atomic" experiment: that is the random variable $X$, or to be precise, an infinite collection of i.i.d. random variables $X_1, X_2, ...$.

It contains something which models "the actual sample mean after $100, 1000, ..., n$ trials": that is $\bar X_n := \frac{1}{n}\sum_{i=1}^n X_i$.

And it contains something which models "an ideal mean": that is $\mu=EX$.

So with that model-plus-translation, we can now formulate, within such a model, a statement (or set of related statements) which, under the standard translation, appears to say something akin to ($\ast$).

And that is the (or are the various forms of the) Law of Large Numbers. And they are true within the model, and they can be derived from the axioms of that model.
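For concreteness, the two standard forms, written in the notation above, are the textbook statements

$$\text{(weak LLN)} \qquad \forall \epsilon > 0: \quad \lim_{n\to\infty} P\bigl(|\bar X_n - \mu| > \epsilon\bigr) = 0,$$

$$\text{(strong LLN)} \qquad P\Bigl(\lim_{n\to\infty} \bar X_n = \mu\Bigr) = 1.$$

Note that in both, the outermost object is still the model-internal $P$; neither statement mentions real life frequencies.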

So I would say: The fact that they hold true e.g. in Kolmogorov's axioms means that these axioms pass one of the most basic tests they should pass: We have a philosophical statement about the real world, ($\ast$), which we believe to be true, and of the various ways we can translate it into the mathematical model, those translations are true in the model. The LLN is not a surprising statement on a meta-mathematical level for the following reason: Any kind of model for probability which, when used as a model for the above real life experiments, would not give out a result which is the mathematical analogue of statement ($\ast$), should be thrown out!

In other words: Of course good probability axioms give out the Law of Large Numbers. They are made so that they give it out. If somebody proposed a set of mathematical axioms, and a real-life-translation-guideline for the objects in there, and every model-internal version of ($\ast$) turned out to be false -- then that model should be deemed useless (both by frequentists and Bayesians, just for different reasons) for modelling the above real life experiments.


IV. I want to finish by pointing out one instance where your argument seems contradictory, which, when exposed, might make what I write above more plausible to you.

Let me simplify an argument of yours like this:

(A) A mathematical statement like the LLN in itself can never make any statement about real life frequencies.

(B) Many sources claim that LLN does make statements about real life frequencies. So they must be implicitly assuming more.

(C) As an example, you exhibit a Kolmogorov quote about applying probability theory to the real world, and say that it "seems equivalent to introducing the weak law of large numbers in a particular, slightly different form, as an additional axiom."

I agree with (A) and (B). But (C) is where I want you to pause and think: Were we not in agreement, cf. (A), that no mathematical statement can ever tell us anything about real life frequencies? Then what kind of "additional axiom" could say that? Whatever the otherwise mistaken sources in (B) are implicitly assuming, and Kolmogorov himself talks about in (C), it cannot just be an "additional axiom", at least not a mathematical one: one can throw in as many mathematical axioms as one wants; they will never bridge the fundamental gap in (A).

I claim the thing that all the sources in (B) are implicitly assuming, and what Kolmogorov talks about in (C), is not an additional axiom within the mathematical theory. It is the meta-mathematical translation / interpretation that I talk about above, which in itself is not mathematical, and in particular cannot be introduced as an additional axiom within the theory.

I claim, indeed, that most sources are very careless, in that they totally forget the translation / interpretation part between real life and mathematical model, i.e. the bridge we need to cross the gap in (A); i.e. steps 1 and 3 of the routine explained in paragraph II. Of course it is taught in any beginner's class that any model in itself (i.e. without a translation, without steps 1 and 3) is useless, but this is commonly forgotten already in the non-statistical sciences, and more so in statistics, which leads to all kinds of confusion. We spend so much time and effort on step 2 that we often forget steps 1 and 3; also, step 2 can be taught and learned and put on exams, but steps 1 and 3 not so well: they go beyond mathematics, and seem to fit better into a science or philosophy class (although I doubt they get a good enough treatment there either). However, if we forget them, we are left with a bunch of axioms linking together almost meaningless symbols; and the remnants of meaning which we, as humans, cannot help applying to these symbols quickly seem to be nothing but circular arguments.

Solution 2:

Kolmogorov's axioms, if one were to make an assumption about the distribution of the random variables $X_i$, could be used to derive the distribution of the random variable $\bar{X}_n$. Notice in the last statement that since the $X_i$ are random variables, $\bar{X}_n$ is also a random variable, and therefore has a probability measure of its own. The beauty of the WLLN is that, so long as both $\mu$ and $\sigma^2$ are finite, no further assumptions about the measure $P()$ must be made in order to derive that $\bar{X}_n$ converges in probability to $\mu$. I agree with Hurkyl. Perhaps this post will help with the concept of a random variable: https://stats.stackexchange.com/questions/50/what-is-meant-by-a-random-variable
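For reference, the short route to that conclusion is the standard Chebyshev argument (assuming i.i.d. $X_i$ with mean $\mu$ and finite variance $\sigma^2$): since $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$, Chebyshev's inequality gives, for every $\epsilon > 0$,

$$P\bigl(|\bar{X}_n - \mu| \ge \epsilon\bigr) \le \frac{\operatorname{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \quad (n \to \infty),$$

which is exactly convergence in probability, with no further assumptions on the shape of the distribution.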

You do make a good point, however, about whether the assumption that the $X$'s are independent and identically distributed random variables actually holds in practice, which is the problem alluded to in the Keynes example.

The example regarding dice appears to rely on the assumption that the die is fair, which may or may not be reasonable depending on how the die is constructed and rolled. However, it seems reasonable to assume that there exist appropriate setups of a dice-rolling experiment for which the rolls are i.i.d. random variables with a probability measure $P$. In such a case, it does follow from the WLLN that $\bar{X}_n$ would indeed converge in probability to $\mu$.
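A minimal simulation sketch of that conclusion (my own illustration, assuming a fair die so that $\mu = 3.5$; the sample sizes are arbitrary choices):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible
running_sum, n = 0, 0
for target in (10, 100, 1_000, 10_000, 100_000):
    while n < target:
        running_sum += random.randint(1, 6)  # one (assumed) i.i.d. fair-die roll
        n += 1
    print(f"n = {n:>6}: sample mean = {running_sum / n:.4f}  (mu = 3.5)")
```

The sample mean drifts toward 3.5 in the way the WLLN's translation promises, though any single run of this script is of course just one more finite sample.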

Solution 3:

You are correct. The Law of Large Numbers does not actually say as much as we would like to believe. Confusion arises because we try to ascribe too much philosophical importance to it. There is a reason the Wikipedia article puts quotes around 'guarantees': nobody actually believes that some formal theory (on its own) guarantees anything about the real world. All the LLN says is that some notion of probability, without interpretation, approaches 1 -- nothing more, nothing less. It certainly doesn't prove for a fact that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, actually uses the term $P()$ in its own statement. I will use this version of the LLN:

"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."

Interpreting "probability" in the frequentist sense, it becomes this:

Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."

Interpret the statement: "... Now if we were given enough instances of n-numbered samplings, the ratio of those that closely resemble (within $\epsilon$) the original frequency distribution vs. those that don't approaches $1 : 0$. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both $n$ and the number of instances."

You can imagine it like a table. Suppose for example that our coin has T-H with 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're kind of doing several samplings in parallel. Now add more columns, i.e. add more tosses to each sequence, and add more rows, increasing the number of sequences themselves. As we do so, count the number of rows which have a near 50-50 frequency distribution (within some $\epsilon$), and divide by the total number of rows. This number should certainly approach 1, according to the theorem.
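Here is a quick sketch of that table (my own illustration; the function name, the $\epsilon$, and the sizes are arbitrary choices):

```python
import random

def fraction_near_half(n_rows, n_tosses, eps=0.05):
    """Fraction of rows (toss-sequences) whose heads-frequency is within eps of 1/2."""
    good = 0
    for _ in range(n_rows):
        heads = sum(random.random() < 0.5 for _ in range(n_tosses))
        if abs(heads / n_tosses - 0.5) <= eps:
            good += 1
    return good / n_rows

# Grow both dimensions of the table; the fraction of 'near 50-50' rows climbs toward 1.
for rows, tosses in [(100, 10), (1_000, 100), (10_000, 1_000)]:
    print(f"{rows:>6} rows x {tosses:>5} tosses: {fraction_near_half(rows, tosses):.3f}")
```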

Now some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising, because if you look closely at our frequentist interpretation example, we assumed that our coin has T-H with 50-50 relative frequency. In other words, we have already assumed that any particular sequence of tossings will, with logical certainty, approach a 50-50 frequency split. So should it be surprising when we say with logical certainty that a progressively larger proportion of these tossing-sequences will resemble 50-50 splits if we toss more in each, and recruit more tossers? It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).

So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).

First of all, with a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this. It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" to mean "... approaches certainty" in LLN but leave the inner statement "relative frequency dist. resembles probability dist." up to (different) interpretation. Then of course you get "relative frequency dist. resembles probability dist. in the limit". It's kind of like if you have a limit of an integral of an integral, but you delete the outer integral and apply the limit to the inner integral.

Interestingly, if you interpret probability as a measure of belief, you might get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects actual belief measures in its relative frequencies within $\epsilon$ error' approaches certainty as we choose bigger samples." However this is still different from "Samples, as they get larger, approach actual belief measures in their relative frequencies." As an illustration, imagine you have two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $= p$, taking $p_n$ to be the constant $p$ for simplicity), whereas this is not true for the former. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.
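To make the gap concrete, here is a standard counterexample (my own addition, not from the original post): take $p_n = 0$ for all $n$, and let the $f_n$ be independent with $P(f_n = 1) = 1/n$ and $P(f_n = 0) = 1 - 1/n$. Then for any $\epsilon \in (0,1)$,

$$\lim_{n\to\infty} P\bigl(|f_n - p_n| < \epsilon\bigr) = \lim_{n\to\infty}\Bigl(1 - \frac{1}{n}\Bigr) = 1,$$

yet by the second Borel-Cantelli lemma (independence plus $\sum_n 1/n = \infty$) we have $f_n = 1$ infinitely often with probability $1$, so $\lim_{n \to \infty} f_n$ fails to exist on almost every realization. Convergence in probability really is the weaker statement.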

In fact, you were on the right track with the "absurd belief" argument. Suppose that probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations of probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two models of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and another satisfies the statement "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is independent, since either is consistent with the theory. Thus, Kolmogorov's probability theory is not powerful enough to prove a statement of the form "frequency approaches probability". (Now, if you were to force the issue by saying "probability should equal relative frequency", you've essentially trivialized the issue by baking frequentism into the theory. The only possible model of this probability theory would be frequentism or something isomorphic to it, and the statement becomes obvious.)