The theory in probability
Consider a real-life experiment (perhaps written as a problem in a textbook): A coin is continually tossed until two consecutive heads are observed. Assume that the results of the tosses are mutually independent and the coin is fair. What is the expected number of tosses before the experiment ends?
Now, how would one solve this problem? First, we would need to define a probability space to model the experiment. The issue is that there are so many possibilities:
We could define the sample space to be the uncountable set of all infinite-length strings of $H$ and $T$. This sample space already does not seem to fulfill the experiment's condition that it terminates once two consecutive heads are observed. E.g. $HHTTTTT...$ and $HHHHHHH...$ are different outcomes in the sample space but are actually considered the same in the experiment (since we terminate after two heads). So to say that a sample space consists of all possible distinct outcomes of an experiment seems wrong to me. And yet, according to many sources, this is a natural space to use. Also, there are infinitely many possible $\sigma$-algebras, so there is the question of which one to use, too. Lastly, for the probability measure we can use the assumption that the coin is fair and the results of the tosses are independent, so $P(\text{outcome starts with }TH) = P(\text{first toss is }T)P(\text{second toss is }H) = 0.25$, for example.
We could take the outcomes of the experiment to be the entire string of heads and tails from the start until termination of the experiment. So our sample space has all finite-length strings of $T$ and $H$ ending with $HH$ (and containing no earlier $HH$), as well as infinite-length strings that do not contain $HH$ as a substring. This space is uncountable, so again there are infinitely many $\sigma$-algebras. Now how do we define a probability measure? We should have $P(\text{outcome starts with }HHH) = P(\text{first toss is }H)P(\text{second toss is }H)P(\text{third toss is }H) = 0.125$ since the coin is fair and the tosses are independent, but there are no outcomes starting with $HHH$ in this sample space, so this probability should actually be $0$. So my guess is that, instead, we assume that $P(\text{$k$-th toss is heads} \mid \text{outcome starts with $s$}) = 0.5$, where $s$ is any $(k-1)$-length string containing no $HH$. Now this is stated nowhere in the given problem, and yet it is very intuitive in the conventional sense of "independence" in the real world. Is this correct?
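For concreteness, that rule would build up the probability of a finite outcome toss by toss; e.g. $$P(THH) = P(\text{1st is }T)\,P(\text{2nd is }H \mid \text{starts with }T)\,P(\text{3rd is }H \mid \text{starts with }TH) = {1\over2}\cdot{1\over2}\cdot{1\over2} = {1\over8}\ ,$$ which is the same value that the first space above assigns to the set of infinite strings starting with $THH$.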
We could take only the finite-length strings of $T$ and $H$ that end with $HH$, and ignore the infinite-length strings. Now the question is: how do we know this is indeed the set of all outcomes? How do we know this is a valid assumption? Of course this simplifies a lot of things, since the space is now discrete.
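One way to probe "how do we know this is indeed the set of all outcomes": if we assume (this is my reading of "fair and independent", not something stated in the problem) that each finite string $s$ in this space gets probability $2^{-|s|}$, then these probabilities can be summed mechanically and already add up to $1$, so the discarded infinite strings could only ever have carried probability $0$. A small counting sketch (the cutoff length is arbitrary):

```python
# Sketch, assuming P(s) = 2**(-len(s)) for each finite string s in this space.
# A string of length n in the space has the form  prefix + 'HH', where the
# prefix has length n - 2, contains no 'HH', and does not end in 'H'
# (otherwise the experiment would have stopped earlier).
max_len = 100                 # arbitrary cutoff
end_T, end_H = 1, 0           # HH-free prefixes of the current length ending in T / H
                              # (the empty prefix counts as "ending in T")
total_prob = 0.0
for n in range(2, max_len + 1):
    total_prob += end_T * 2.0 ** (-n)       # end_T valid prefixes of length n - 2
    end_T, end_H = end_T + end_H, end_T     # extend the prefixes by one more toss

print(total_prob)   # prints 0.999999..., i.e. the finite strings already carry all the mass
```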
We could even take the last 4 (or fewer) coin flips of finite runs as well as infinite runs as the sample space, so $$\Omega = \{HH, THH, \text{ends with }TTHH, \text{ends with }HTHH\}\cup \{s : s\text{ is an infinite string of $H$ and $T$}\}.$$ Now independence is defined only in the context of a probability space. So in this space we have $$\begin{eqnarray*}P(HH) &=& P(\text{last toss is }H \cap \text{2nd last toss is }H) \\&=& P(\text{last toss is }H)P(\text{2nd last toss is }H) \\&=& 0.25\\ P(THH) &=& 0.125\\ P(\text{ends with }TTHH) = P(\text{ends with }HTHH) &=& 0.0625.\end{eqnarray*}$$ So $P(\text{infinite string of $H$ and $T$}) = 1 - 0.25 - 0.125 - 0.0625 - 0.0625 = 0.5$. But clearly this does not conform to the above 3 models. So can we use this as a probability space? What is the issue here?
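To make "does not conform" concrete: simulating the actual experiment (a rough sanity check; the sketch below and its function name are my own, not part of the problem) gives empirical frequencies of roughly $0.25$, $0.125$, $0.375$ and $0.25$ for the four finite patterns, and never an infinite run, which is quite different from the $0.25$, $0.125$, $0.0625$, $0.0625$, $0.5$ assignment above.

```python
import random
from collections import Counter

def simulate_run():
    """Toss a fair coin until two consecutive heads appear; return the whole string."""
    s = ""
    while not s.endswith("HH"):
        s += "H" if random.random() < 0.5 else "T"
    return s

trials = 100_000
counts = Counter()
for _ in range(trials):
    s = simulate_run()
    if s in ("HH", "THH"):
        counts[s] += 1
    elif s.endswith("TTHH"):
        counts["ends with TTHH"] += 1
    else:                       # any longer run ends in THH, so this one ends in HTHH
        counts["ends with HTHH"] += 1

for pattern, c in counts.items():
    print(pattern, c / trials)
# roughly: HH 0.25, THH 0.125, ends with TTHH 0.375, ends with HTHH 0.25,
# and no simulated run is ever infinite
```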
Intuitively I can see how the probability spaces 1 to 3 all compute the same expectation. There are also countless other spaces that do too. The question is: is it possible to rigorize all this? It seems that each space above has some problems that raise the question of whether it is appropriate. Am I right to say that this portion relies completely on intuition and cannot possibly be rigorized? That is, the theory and mathematical rigor only start after we have defined the probability space. Of course, the question then arises: how do we know our probability space "works", aside from intuition? Can we know that two different probability spaces will give us the same answer to a question?
Also, am I right that there are actually way more assumptions in probability aside from the probability space? For instance:
- Independence in the mathematical context is just $P(A \cap B) = P(A)P(B)$, but in real life we say that this means that the occurrence of $A$ does not affect the probability of $B$ -- is this an assumption?
- $P(A \mid B)$ is just defined as $P(A \cap B)/P(B)$, but in real life we say $P(A \mid B)$ is the chance that $A$ occurs, given that $B$ has happened -- is this an assumption?
- The expected value of a random variable $X$ is just defined as $E(X) = \sum_{\omega\in\Omega}P(\omega)X(\omega)$ in discrete spaces, but in real life we say it is the mean of $X$ if the experiment is repeated many, many times -- is this an assumption? (The small sketch after this list illustrates the gap between the two readings.)
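A tiny sketch for a fair die (my own example, not part of the problem), just to make the contrast between the definition and the "repeated many times" reading concrete:

```python
import random

# By the definition, a fair six-sided die has
#     E(X) = sum over outcomes of P(omega) * X(omega) = (1 + 2 + ... + 6) / 6 = 3.5.
# The "real life" reading, that the average of many independent rolls is close
# to 3.5, is a separate statement (it is the content of the law of large numbers).
rolls = [random.randint(1, 6) for _ in range(100_000)]
print(sum(rolls) / len(rolls))   # typically prints something close to 3.5
```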
So, all in all, there seem to be many assumptions in probability. So when does the theory start and end?
EDIT: I've read a bit more about this issue and found that the reason why the probability spaces give the same answer is likely that they are extensions of one another. Terence Tao wrote about this once, and if I'm not mistaken, the extension he talks about essentially shows the equivalence of the first 3 probability spaces I mentioned. Is this correct?
According to this, independence should then be preserved under extension, so it doesn't make sense that independence ($P(A \cap B) = P(A)P(B)$) fails under probability space 2 but holds under space 1. What I'm thinking might be the resolution is that when we state that "the results of some coin tosses are mutually independent", we are assuming that every single one of these independent coin tosses is performed in every single outcome. So in probability space 2, even though an outcome string might be of finite length, we assume that the other infinitely many coin tosses have also been performed (although their results are irrelevant for this outcome). Is this way of thinking correct?
But now it seems strange that $P(\text{the $100$-th toss is heads}) = 0.5$, since after all the $100$-th toss is only performed in the event that the $99$ tosses before it do not contain consecutive heads. What is wrong here?
Solution 1:
For any question about coin tossing that involves an unbounded number of tosses, the proper sample space $\Omega$ consists of all infinite binary sequences ${\bf b}=(b_1,b_2,\ldots)$. This space obtains a bona fide probability measure $\mu$ through a measure-theoretic construction, starting from the requirement that the events $$A_n:=\{{\bf b}\in\Omega\>|\>b_n=1\}\qquad(n\geq1)$$ should have probability ${1\over2}$ and that $\mu$ behaves as it should for finite unions, complements, and intersections of such sets $A_n$. This construction is presented in detail in probability theory books, but in a first course on probability (where only simple problems about coin tossing sequences are considered) we may make use of this $\mu$ without asking further questions.
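Concretely, these requirements give every "cylinder" event determined by finitely many coordinates the obvious probability, $$\mu\bigl(\{{\bf b}\in\Omega\>|\>b_1=\epsilon_1,\ldots,b_n=\epsilon_n\}\bigr)=2^{-n}\qquad(\epsilon_1,\ldots,\epsilon_n\in\{0,1\})\ ,$$ and the measure-theoretic construction (the Carathéodory/Kolmogorov extension alluded to above) extends this assignment uniquely to the $\sigma$-algebra generated by the $A_n$.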
Now to your experiment: Before you actually start it, you don't know how many tosses will have to be made. In other words: the number of tosses is random as well. Therefore any careful analysis will have to work within $\Omega$. For any $r\geq2$ we consider the event $H_r$ consisting of all sequences ${\bf b}$ such that $b_{r-1}=b_r=1$, but $(b_1,\ldots,b_{r-1})$ does not contain the subword $(1,1)$. Each $H_r$ is defined by a finite number of conditions involving only the first $r$ coordinates of ${\bf b}$; therefore the probability $\mu(H_r)$ can be computed using careful counting. The $H_r$ are disjoint, and after putting $H_\infty:=\Omega\setminus\bigcup_{r\geq2} H_r$ we obtain a partition of $\Omega$. The expectation we are interested in is then given by $$E=\sum_{r=2}^\infty r\>\mu(H_r)\leq\infty\ ,\tag{1}$$ and a priori the value $\infty$ cannot be excluded. That $\mu(H_\infty)=0$ is a good sign $\ldots$
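Since each $H_r$ is determined by the first $r$ coordinates only, $\mu(H_r)$ equals (number of admissible words of length $r$) $\cdot\,2^{-r}$, and the careful counting can even be delegated to a machine. A small brute-force sketch (my own illustration; the range of $r$ is arbitrary):

```python
from itertools import product

def mu_H(r):
    """mu(H_r): the first occurrence of '11' ends exactly at position r.

    H_r is determined by the first r coordinates only, so its measure is
    (number of admissible words of length r) * 2**(-r)."""
    count = 0
    for word in product("01", repeat=r):
        w = "".join(word)
        if w.endswith("11") and "11" not in w[:-1]:
            count += 1
    return count / 2 ** r

for r in range(2, 11):
    print(r, mu_H(r))
# prints 2 0.25, 3 0.125, 4 0.125, 5 0.09375, ... (Fibonacci numbers over powers of two);
# pushing r far enough and feeding these values into the sum (1) gives E = 6.
```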
Now, when it comes to actually computing $E$, we could set up a recursion for the probabilities $p_r:=\mu(H_r)$ and then compute the sum $(1)$. But there is a shortcut (using Fubini's theorem): Denote by $E_1$ the expected number of additional tosses when we have seen a $1$ immediately before, and by $E_0$ the expected number of additional tosses if this is not the case. Then we have the equations $$E_0=1+{1\over2} E_0+{1\over2} E_1,\qquad E_1=1+{1\over2}E_0\ ,$$ from which we immediately obtain $E_0=6$.
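Indeed, substituting the second equation into the first gives $$E_0=1+{1\over2}E_0+{1\over2}\Bigl(1+{1\over2}E_0\Bigr)={3\over2}+{3\over4}E_0\ ,$$ hence ${1\over4}E_0={3\over2}$, i.e. $E_0=6$ (and $E_1=4$). Since at the start of the experiment no toss has been made and no $1$ has been seen, the expectation $(1)$ is $E=E_0=6$.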