Probability of a winning consecutive $k$-subset out of $n$ coin flips

Assume we flip a coin $n$ times. A $k$-sequence is defined as any consecutive sequence of coin flips of length $k$. Call a $k$-sequence "winning" if there are strictly more heads than tails. What is the probability of there being a winning consecutive $k$-sequence in $n$ tosses?


In order to clarify some of the definitions, here are some examples. Assume we have flipped the sequence $\rm HHTHT$.

$2$-sequences: $\rm \underbrace{HH}THT, H\underbrace{HT}HT, HH\underbrace{TH}T, HHT\underbrace{HT}$, so $\rm HH, HT, TH, HT$ are consecutive $2$-sequences.

Winning $2$-sequences: $\rm HH$.


All $4$-flips with winning $2$-sequences highlighted, if there are multiples, I have selected the first one:

  • $\rm \underbrace{HH}HH $

  • $\rm \underbrace{HH}HT $

  • $\rm \underbrace{HH}TH $

  • $\rm \underbrace{HH}TT $

  • $\rm HT\underbrace{HH} $

  • $\rm HTHT$ (no winning $2$-sequences)

  • $\rm HTTH$ (no winning $2$-sequences)

  • $\rm HTTT$ (no winning $2$-sequences)

  • $\rm T\underbrace{HH}H$

  • $\rm T\underbrace{HH}T$

  • $\rm THTH$ (no winning $2$-sequences)

  • $\rm THTT$ (no winning $2$-sequences)

  • $\rm TT\underbrace{HH}$

  • $\rm TTHT$ (no winning $2$-sequences)

  • $\rm TTTH$ (no winning $2$-sequences)

  • $\rm TTTT$ (no winning $2$-sequences)

Out of the $8$ combinations, there are $4$ which have winning $2$-sequences (we can choose $2$ consecutive flips from the sequence which have more heads than tails). So $p(4, 2) = 8/16 = 1/2$.


Solution 1:

I suspect there is no closed form, but here's how you can calculate this for small values of $k$. I will be talking about binary strings, but it's the same problem.

If $S_k$ is the set of winning binary $k$-strings, consider the generalized question of, given a finite set $S$, deciding whether any given string contains a string from $S$. This is the problem of recognizing a regular language, and is solved by constructing a finite-state automaton corresponding to the set $S$. (See any reference on FSAs and the problem of matching regular expressions, etc.)

Following an idea I learnt from "Analytic Combinatorics" by Flajolet and Sedgewick, it is very easy to construct a generating function for a regular language (a language described by an FSA). Let $L_\alpha$ be the generating function corresponding to the state $\alpha$, so that the $n$-th coefficient counts the number of length-$n$ input strings such that the FSA, started from state $\alpha$, will finish in a terminal state. Let $\bar Q$ be the set of terminal states, and let $q(\alpha, a)$ be the transition function that gives the next state, where $\alpha$ is the current state and $a$ is the next input letter (belonging to the FSA's alphabet). Then the generating functions satisfy $$ L_\alpha(z) = [\alpha\in\bar Q] + z\sum_{\text{letter }a} L_{q(\alpha, a)}(z). $$ If we denote by $T$ the adjacency matrix of the directed graph of transitions of the FSA ($T_{\alpha\beta} = \sum_{a}[q(\alpha,a)=\beta]$), we get $$ \mathbf{L}(z) = v + z T \mathbf{L}(z), $$ where $\mathbf{L}=(L_\alpha)_\alpha$ and $v_\alpha=[\alpha\in\bar Q]$. This is a system of linear equations in $L$, which can be solved for $L_0$, the generating function for the initial state. Then $L_0$ is the generating function for the number of strings of length $n$ that belong to the regular language recognized by the FSA.

A short implementation (hopefully correct) leads me to the following generating functions for small $k$. Here $g_k(z)$ is the generating function for the set of binary strings that contain a winning $k$-string.

$$ g_2(z) = \frac{1}{1-2z}+\frac{1+z}{-1+z+z^2}. $$ $$ g_3(z) = -z^2+\frac{1}{1-2z}+\frac{1+z+z^2}{-1+z+z^3}. $$ $$ g_4(z) = -z^3+\frac{1}{1-2z}+\frac{-1-z-z^2-z^3+z^4+z^5}{1-z-z^2-z^4+z^6}. $$ $$ g_5(z) = -z^3-5 z^4+\frac{1}{1-2z}+\frac{-1-z-2 z^2-2 z^3-2 z^4+z^5+z^6+2 z^7+z^8+z^9}{1-z-z^3-2 z^5+z^8+z^{10}}. $$ $$ g_6(z) = \frac{1}{8 (-1+z)}-z^4-6 z^5+\frac{1}{8 (1+z)}+\frac{1}{1-2z}+\frac{\left(4+3 z+9 z^2+11 z^3+22 z^4+19 z^5+8 z^6+6 z^7+6 z^8-2 z^9-9 z^{10}-10 z^{11}-z^{13}+3 z^{15}+4 z^{16}+3 z^{17}\right)}{\left(4 \left(-1+z+z^3+2 z^5+3 z^6+2 z^7+z^9+z^{11}-3 z^{12}+z^{18}\right)\right)}. $$ $$ g_7(z) = -z^4-6 z^5-22 z^6+\frac{1}{1-2z}+\left(\begin{aligned}1&+z+2 z^2+3 z^3+5 z^4+6 z^5+5 z^6-4 z^7-5 z^8-9 z^9-10 z^{10}\\&-17 z^{11}-17 z^{12}-14 z^{13}-4 z^{14}+4 z^{15}+8 z^{16}+11 z^{17}+19 z^{18}\\&+13 z^{19}+12 z^{20}+2 z^{21}-4 z^{22}-5 z^{23}-8 z^{24}-9 z^{25}-5 z^{26}\\&-5 z^{27}+z^{29}+z^{30}+2 z^{31}+2 z^{32}+z^{33}+z^{34}\end{aligned}\right)/\left(\begin{aligned}-1&+z+z^3+z^5+z^6+5 z^7+z^8-3 z^{10}-z^{11}-4 z^{12}-z^{13}-10 z^{14}\\&-5 z^{15}-z^{16}+3 z^{17}-z^{18}+6 z^{19}+10 z^{21}+3 z^{22}\\&-z^{24}-4 z^{26}-5 z^{28}+z^{33}+z^{35}\end{aligned}\right)$$

For $n\geq k$, $k=3,4,5,6$, the coefficients of $g_k-\frac{1}{1-2z}$ reproduce A000930, A118647, A120118 and A133551. Also, $g_2$ matches the answer by Jean-Sebastien (A008466).

Edit. Here is also $g_8$ (I realize it's a mess): $$ g_8(z) = -z^5-7 z^6-29 z^7+\frac{1}{1-2 z} + \frac{\sum_{j\geq0}p_{8,j}z^j}{\sum_{j\geq0}q_{8,j}z^j}, $$ $$ \begin{aligned}(p_{8,j})_{0\leq j\leq 69} = \{&-1,-1,-1,-2,-3,-5,-7,-2,19,22, 16,15,8,17,40,10,-42,-76,\\&-73,-39,-12,-24,-52,-18,54,141,100,33,-16,-1,37,35,-41,\\&-146,-56,-19,14,14,-24,-38,45,104,12,14,-18,-21,14,13,-36,\\&-49,1,-3,18,17,-4,0,15,15,0,-1,-8,-7,0,0,-2,-2,0,0,1,1\}.\end{aligned} $$ $$ \begin{aligned}(q_{8,j})_{0\leq j\leq 70} = \{&1,-1,-1,0,-1,0,1,-3,-8,0,8, 8,10,5,-8,2,28,15,-25,-24,-28,\\&-24,19,18,-51,-40,55,16,55,45,-51,-36,61,45,-70,16,-67,\\&-40,70,19,-56,-24,58,-24,56,15,-56,2,28,5,-28,8,-28,0,28,\\&-3,-8,0,8,0,8,-1,-8,0,1,0,-1,0,-1,0,1\}.\end{aligned} $$ Here $g_8-\frac1{1-2z}$ matches A212398.

Solution 2:

Throwing in some observations that could help someone go further.

  1. Some notation, I use $N(k,n)$ for the number of length $n$ sequences that have at least one $k$ winning sequence. Define this to be $0$ when $k>n$.
  2. Remark that $N(n,n)=2^{n-1} - ((1+(-1)^n)/4){n\choose n/2}$. This is the number of binary sequence with more $1$ than $0$. See A058622
  3. Observe that $N(k,n)=2N(k,n-1)+ $Offset since any "good" sequence of length $n-1$ will be "good" at length $n$ whether we add $H$ or $T$. The problem is then to characterize the offset.
  4. Let's look at the offset for the case $k=2$

    • First for $n=2$, we have $$ \color{green}{HH}\\ HT\\ \color{blue}{TH}\\ TT $$ The green one is a good one and the blue one is one that can become winnning at the next step.

    • We keep going on with $n=3$ $$ \color{green}{HHH}\\ \color{green}{HHT}\\ \color{blue}{HTH}\\ HTT\\ \color{green}{THH}\\ THT\\ \color{blue}{TTH}\\ TTT $$ Colors are as above. We have some sort of characterization of the offset for $N(k,n)$. It is the number of blue colored sequence in the sets of length $n-1$. Call this number $F(n-1)$, so that $N(2,n)=2N(2,n-1)+F(n-1)$. Black at time $n$ is blue at time $n-1$ plus black at time $n-1$ (Those that will add a $T$). Blue at time $n$ is black at time $n-1$ (Those that will ad a $H$). From this, we see that $F(n-1)$ is the (n-1)^{th} Fibonacci number.

    • We finally get that $N(2,n)= 2N(2,n-1)+Fibo(n-1)$. This is A008466

I believe some similar offset exists for other $k$, but I haven't been able to figure it out. For $k=3$, I believe the formula is given by this sequence. It is not explicitely written as $N(3,n)=2N(3,n-1)+$ offset,however if we write it this way, the offset (the number of blue) seems to be this one. The reccurence relation for Blue and Black in this case seems to be $$ Blue(n)=Black(n-1)+Black(n-2)\\ Black(n)=Black(n-2)+Blue(n-2) $$ I conjectured that the blue/black reccurence for $k$ would be $$ Blue(n)=\sum_{i=1}^{k-1} Black(n-i)\\ Black(n)=Black(n-(k-1))+Blue(n-(k-1)) $$ However both seems to fail for $k=4$. The number $N(4,n)$ is given by this, and the first Blue and Black numbers are, starting at Blue$(4)$ and Black$(4)$, Blue: $\{3,5,9,17,28,47,81,\cdots\}$ and the Black: $\{8,14,24,40,69,119,204\}$. Found no pattern on those yet.