Expected Number of Flips for a Sequence of 4 to Repeat
I recently had this question in an interview:
You are flipping a fair coin until a sequence of four flips repeats itself. The sequences are allowed to overlap. What is the expected number of flips?
For example, if you flip HTHTHT, the sequence HTHT appears in flips 1-4 and 3-6. In this case, we are done after 6 flips. As another example, if you flip HHTTHHTT, the sequence HHTT repeats in flips 1-4 and 5-8, and we are done after 8 flips. What is the expected number of flips?
I've been thinking about this question for a few days now, and I haven't come to an answer. I've tried the simpler problem where we consider a sequence of two flips instead of four flips, but it is still rather difficult. I suspect that there is a nice recursive way to solve this problem, but I can't figure it out.
I am also interested in generalizations of this problem. For example, what is the expected number of flips for a sequence of $n$ to repeat itself? What happens if the coin isn't fair?
I have another interview in a few days, so I would very much like to see how to solve this problem in advance. Any help is appreciated.
Edit:
Based on computational evidence (assuming I didn't make a mistake in the code), it appears that the expected number of flips is about 9.81. I would like to know the exact answer, as well as an analytic solution to this problem.
Edit 2:
Another piece of information that may be of use: I had 30 seconds to answer the question. This makes me believe that there is some "easy" way to get the answer, or they were looking for an approximate answer.
Edit 3:
@r.e.s. has kindly provided exact solutions for $n=1,2,3,4,5$. For $n=6$, numerical computation seems to indicate that the answer is around $18.977$ or $18.978$. I will be updating periodically with approximations for other values of $n$.
Edit 4:
$$ \begin{array}{|c|c|} \hline n & E(L_{n}) \\ \hline 1 & \frac{5}{2}\\ \hline 2 & \frac{35}{8}\\ \hline 3 & \frac{435}{64}\\ \hline 4 & \frac{2513}{256}\\ \hline 5 & \frac{57922047}{4194304}\\ \hline 6 & \approx 18.9775\\ \hline 7 & \approx 25.928\\ \hline 8 & \approx 35.288\\ \hline \end{array} $$
I would like to know the exact answer, as well as an analytic solution to this problem.
Here's the exact answer as a function of $p=\Pr(\text{H})$, for several values of $n$. Let $L_n$ denote the length of the sequence when it first contains a repeated word of length $n$. By directly enumerating the possible sequences, we find (using Sage) the following exact results:
n E(L_n | p=1/2) E(L_n | p)
- ---------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/2 -2*p^2 + 2*p + 2
2 35/8 2*p^4 - 4*p^3 - 3*p^2 + 5*p + 3
3 435/64 16*p^8 - 64*p^7 + 61*p^6 + 41*p^5 - 90*p^4 + 37*p^3 - 10*p^2 + 9*p + 4
4 2513/256 -64*p^18 + 576*p^17 - 1880*p^16 + 1984*p^15 + 3094*p^14 - 10682*p^13 + 9560*p^12 + 2050*p^11 - 9950*p^10 + 6696*p^9 - 841*p^8 - 898*p^7 + 420*p^6 + 8*p^5 - 129*p^4 + 54*p^3 - 12*p^2 + 14*p + 5
5 57922047/4194304 -13312*p^34 + 226304*p^33 - 1690752*p^32 + 7137280*p^31 - 17544144*p^30 + 18701616*p^29 + 29742468*p^28 - 163840744*p^27 + 339947916*p^26 - 413989056*p^25 + 268118367*p^24 + 32316996*p^23 - 278576119*p^22 + 333006671*p^21 - 241401935*p^20 + 128354791*p^19 - 59474305*p^18 + 28069920*p^17 - 12055855*p^16 + 2638562*p^15 + 1497144*p^14 - 1982176*p^13 + 1101961*p^12 - 307798*p^11 - 27907*p^10 + 68859*p^9 - 31985*p^8 + 9681*p^7 - 3530*p^6 + 1571*p^5 - 678*p^4 + 195*p^3 - 26*p^2 + 20*p + 6
In particular, we have the exact expectation $\text{E}(L_4 | p={1\over 2})=2513/256 = 9.8164...$
(Cross-check: Monte Carlo simulations for $n=1,2,3,4,5$ are consistent with these numerical values to three three significant figures.)
The "direct enumeration" method calculates the expectation as follows, using $1$ and $0$ to represent $\text{H}$ and $\text{T}$ respectively:
Let $S_n$ denote the set of all the possible binary sequences obtainable by flipping the coin until a length-$n$ repetition occurs. Now $L_n$ is just the length of such a sequence, so $$\begin{align}\text{E}(L_n\mid p)&=\sum_{s\in S_n}\text{length}(s)\Pr(s)\\[2ex] \end{align}$$ where $$\Pr(s) = p^{\text{sum}(s)}(1-p)^{\text{length}(s)-\text{sum}(s)}$$ with $\text{sum}(s)$ being the number of $1$s in $s$, and $\text{length}(s)-\text{sum}(s)$ being the number of $0$s in $s$.
The program at the above-given link implements this by starting with a list of all length-$n$ sequences, then "growing" each of them in all possible ways by appending bits, until they stop growing when a repetition occurs. The result (stop_seqs(n)
) is a listing of the set $S_n$, from which the expectation is calculated by the above summation.
Apparently, no closed-form solution is known for arbitrary $n$, even in the case of a fair coin; however, asymptotic results for this case have been obtained, e.g., by Karnin, E. (1983), who showed that $$E\left(L_n\,{\Large\vert}\,p={\small{1\over 2}}\right)\sim\sqrt{\pi\,2^n},\ \text{ as }n\to\infty,$$ which is just to say $$\lim_{n\to\infty}{E(L_n\mid p={1\over 2}) \over \sqrt{\pi\,2^n}}=1.$$
But this is a poor approximation in your case of $n=4$, since $\sqrt{\pi\,2^4}=7.0898\ldots,$ whereas the exact value (by direct enumeration) is $\text{E}(L_4\mid p={1\over 2})=2513/256 = 9.8164\ldots$
Following is a plot of the ratio $E(L_n\mid p={1\over 2}) \over \sqrt{\pi\,2^n}$ for $1\le n\le 20$, using expected values correct to three significant figures from Monte Carlo simulation when $n>5$:
Ironically, the ratio equals $0.997...$ when $n=1$, but does not again get this close to $1$ for $n\le 20.$
NB: I strongly suspect that the interviewer's intended question was the following remarkable variant, obtained by changing just one word of the quoted paragraph (especially since in both examples it is the initial sequence of four flips that repeats):
You are flipping a fair coin until the initial sequence of four flips repeats itself. The sequences are allowed to overlap. For example, if you flip HTHTHT, the sequence HTHT appears in flips 1-4 and 3-6. In this case, we are done after 6 flips. As another example, if you flip HHTTHHTT, the sequence HHTT repeats in flips 1-4 and 5-8, and we are done after 8 flips. What is the expected number of flips?
The answer to this revised version is the integer $n+2^n$ (where $n$ is the initial-segment length, e.g. $n=4$), no matter what may be the value of $p=\Pr(\text{H})$ -- assuming that $p$ is constant, $0<p<1$, and that the tosses are independent!
This general result can be proved by a minor variation of the proof of Theorem 2 in this online article. The special case of $p={1\over 2}$ can be proved by the following argument given in the above-linked article by Karnin: Considering the words to be states of a Markov chain, the transition matrix is seen to be doubly stochastic, which implies that the invariant distribution assigns to each state the same probability $2^{-n}$; consequently, the expected waiting time (after the initial segment is formed) is the reciprocal of this, i.e., $2^n$, and the expected total waiting time is therefore $n+2^n.$