Expected length of the largest repeating substring given a distinct digit/character [duplicate]
Solution 1:
This problem was solved using generating functions by de Moivre in 1738. The formula you want is $$\mathbb{P}(\ell_n \geq m)=\sum_{j=1}^{\lfloor n/m\rfloor} (-1)^{j+1}\left(p+\left({n-jm+1\over j}\right)(1-p)\right){n-jm\choose j-1}p^{jm}(1-p)^{j-1}.$$
References
Section 14.1 of Problems and Snapshots from the World of Probability by Blom, Holst, and Sandell.
Chapter V, Section 3 of Introduction to Mathematical Probability by Uspensky.
Section 22.6 of A History of Probability and Statistics and Their Applications before 1750 by Hald, which gives solutions by de Moivre (1738), Simpson (1740), Laplace (1812), and Todhunter (1865).
Added: The combinatorial class of all coin toss sequences without a run of $m$ heads in a row is $$\sum_{k\geq 0}(\mbox{seq}_{<m}(H)\,T)^k \,\mbox{seq}_{<m}(H), $$ with corresponding counting generating function $$H(h,t)={\sum_{0\leq j<m}h^j\over 1-(\sum_{0\leq j<m}h^j)t}={1-h^m\over 1-h-(1-h^m)t}.$$ We introduce probability by replacing $h$ with $ps$ and $t$ by $qs$, where $q=1-p$: $$G(s)={1-p^ms^m\over 1-s+p^ms^{m+1}q}.$$ The coefficient of $s^n$ in $G(s)$ is $\mathbb{P}(\ell_n<m).$
The function $1/(1-s(1-p^ms^mq))$ can be rewritten as
\begin{eqnarray*}
\sum_{k\geq 0}s^k(1-p^ms^mq)^k &=&\sum_{k\geq 0}\sum_{j\geq 0} {k\choose j} (-p^mq)^j s^{k+jm}.
\end{eqnarray*}
The coefficient of $s^n$ in this function is $c(n)=\sum_{j\geq 0}{n-jm\choose j}(-p^mq)^j$. Therefore the coefficient of $s^n$ in $G(s)$ is $c(n)-p^m c(n-m).$ Finally,
\begin{eqnarray*}
\mathbb{P}(\ell_n\geq m)&=&1-\mathbb{P}(\ell_n<m)\\[8pt]
&=&p^m c(n-m)+1-c(n)\\[8pt]
&=&p^m\sum_{j\geq 0}(-1)^j{n-(j+1)m\choose j}(p^mq)^j+\sum_{j\geq 1}(-1)^{j+1}{n-jm\choose j}(p^mq)^j\\[8pt]
&=&p^m\sum_{j\geq 1}(-1)^{j-1}{n-jm\choose j-1}(p^mq)^{j-1}+\sum_{j\geq 1}(-1)^{j+1}{n-jm\choose j}(p^mq)^j\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}+{n-jm\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}p+{n-jm\choose j-1}q+{n-jm\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}p+{n-jm+1\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[p+{n-jm+1\over j}\, q\right] {n-jm\choose j-1}\,p^{jm} q^{j-1}.
\end{eqnarray*}
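As a quick numerical sanity check, the alternating sum can be compared with coefficient extraction from $G(s)$. The following Maple sketch is only illustrative (the names PGE and PGEgf are made up for it):

PGE := proc(n, m, p)
    # de Moivre's alternating sum for P(ell_n >= m)
    add((-1)^(j+1)*(p + (n-j*m+1)/j*(1-p))*binomial(n-j*m, j-1)
        *p^(j*m)*(1-p)^(j-1), j = 1 .. floor(n/m));
end;
PGEgf := proc(n, m, p)
local G;
    # the same probability computed as 1 - [s^n] G(s)
    G := (1-p^m*s^m)/(1-s+p^m*s^(m+1)*(1-p));
    1 - coeftayl(G, s = 0, n);
end;
PGE(20, 3, 1/2) - PGEgf(20, 3, 1/2);   # should return 0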
Solution 2:
Define a Markov chain with states $0, 1, \ldots, m$: the chain moves from $m$ to $m$ with probability $1$, and for $i<m$ it moves from $i$ to $i+1$ with probability $p$ and from $i$ to $0$ with probability $1-p$. Looking at the $n$th power of the transition matrix of this chain, you can read off the probability that $n$ flips contain a run of at least $m$ consecutive heads.
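Here is a minimal Maple sketch of this approach (the procedure name RunProb and the state-to-index shift are just illustrative, not from the answer): state $i$ records the length of the current run of heads, and state $m$ is absorbing.

RunProb := proc(n, m, p)
local T, i;
    # states 0..m are stored as rows/columns 1..m+1
    T := Matrix(m+1, m+1);
    for i from 1 to m do
        T[i, i+1] := p;       # head: the current run grows by one
        T[i, 1] := 1-p;       # tail: the run resets to zero
    od;
    T[m+1, m+1] := 1;         # a run of m heads has been seen; stay there
    # start in state 0 and read off the mass absorbed in state m after n flips
    LinearAlgebra:-MatrixPower(T, n)[1, m+1];
end;
RunProb(20, 3, 1/2);          # probability of a run of at least 3 heads in 20 flips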
Solution 3:
You can find a limiting distribution; otherwise it is a difficult problem, and the closed-form solution will not have much practical value. See this for an elementary approach. [Update] The previous link has moved to this new address: "Longest Run of Heads", M. F. Schilling.
Solution 4:
I don't think you'll get a simple analytic formula. The problem is essentially equivalent to this one, see my answer there: it involves the $n$th power of an $m \times m$ stochastic matrix (note that there we are interested in runs of length at least $m$), using a Markov chain (as suggested in the comments by Shawn). You can also find some asymptotics there.
Solution 5:
Remark. The following answer uses the same methodology as the accepted leading answer. It is self-contained, includes Maple code and some asymptotics. Perhaps it can serve a didactic purpose and show that the basic computation is very simple.
With $z$ marking successes, $w$ marking failures, and $u$ marking every maximal run of at least $m$ successes, we get the generating function
$$\left(1+z+\cdots+z^{m-1}+u\frac{z^m}{1-z}\right) \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} \left(z+\cdots+z^{m-1}+u\frac{z^m}{1-z}\right)^q \right)\frac{1}{1-w}.$$
Now, as a sanity check, when we put $w=z$ and $u=1$ we should get all binary strings. Doing the calculation we find
$$\frac{1}{1-z} \frac{1}{1-wz/(1-w)/(1-z)} \frac{1}{1-w} \\ = \frac{1}{(1-w)(1-z)-wz} = \frac{1}{1-w-z}.$$
Putting $z=w$ yields
$$\frac{1}{(1-z)^2} \frac{1}{1-z^2/(1-z)^2} = \frac{1}{(1-z)^2-z^2} = \frac{1}{1-2z}$$
and the sanity check goes through.
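The same check can be run in Maple, with the geometric series over $q$ already summed in closed form; the sketch below fixes $m=4$, but any value works (the name F is just for this check):

F := ((1-z^m)/(1-z) + u*z^m/(1-z))
     /(1 - w/(1-w)*(z*(1-z^(m-1))/(1-z) + u*z^m/(1-z)))
     /(1-w):
simplify(eval(F, [u = 1, m = 4]) - 1/(1-w-z));          # 0
simplify(eval(F, [u = 1, w = z, m = 4]) - 1/(1-2*z));   # 0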
Now for the probabilities we must subtract the value for $u=0$ from the generating function and then set $u=1.$ We obtain for the zero term
$$\frac{1-z^m}{1-z} \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} z^q \frac{(1-z^{m-1})^q}{(1-z)^q}\right) \frac{1}{1-w} \\ = \frac{1-z^m}{1-z} \frac{1}{1-wz(1-z^{m-1})/(1-w)/(1-z)} \frac{1}{1-w} = \frac{1-z^m}{(1-w)(1-z)-wz(1-z^{m-1})} = \frac{1-z^m}{1-w-z+wz^{m}}.$$
To obtain the probabilities set $z=pv$ and $w=(1-p)v$ to get for the one term
$$\frac{1}{1-w-z} = \frac{1}{1-(1-p)v-pv} = \frac{1}{1-v}$$
so that $$[v^n] \frac{1}{1-v} = 1.$$
The subtracted contribution from the zero term is
$$\frac{1- p^m v^m}{1-v+(1-p)p^m v^{m+1}}.$$
Now for coefficient extraction we write
$$[v^Q] \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = [v^Q] \sum_{q\ge 0} (v-(1-p)p^m v^{m+1})^q \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{q-r} (-1)^r (1-p)^r p^{mr} v^{(m+1)r} \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{mr+q} (-1)^r (1-p)^r p^{mr} \\ = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.$$
We thus get the closed form
$$\bbox[5px,border:2px solid #00A000]{ 1 - a_N + p^m a_{N-m} \quad \text{where} \quad a_Q = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.}$$
Note that for $N\lt m$ we get
$$1- {N\choose 0} \times 1 \times 1\times 1 = 0$$
which is the correct result. Also for $N=m$ we obtain
$$1 - 1 - {0\choose 1} \times (-1) \times (1-p) \times p^m + p^m \times 1 = p^m $$
which is correct as well.
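As a further concrete check (this example is not in the original answer), take $N=4$, $m=2$ and $p=q=\frac12$. Then $$a_4 = {4\choose 0} - {2\choose 1}\cdot\frac12\cdot\frac14 = \frac34, \qquad a_2 = {2\choose 0} = 1$$ (the $r=2$ term of $a_4$ vanishes), so that $$1 - a_4 + p^2 a_2 = 1 - \frac34 + \frac14 = \frac12,$$ in agreement with direct enumeration: exactly $8$ of the $16$ binary strings of length $4$ contain two consecutive heads.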
Now for an asymptotic we note that the root $\rho$ of $1-v+(1-p)p^m v^{m+1}$ with the smallest modulus is the one close to $v_0=1$, and Newton-Raphson gives the first approximation
$$v_1 = v_0 - \frac{1-v_0 + (1-p)p^m v_0^{m+1}}{-1+(m+1)(1-p)p^m v_0^{m}}.$$
This works out to
$$1+ \frac{(1-p)p^m}{1-(m+1)(1-p)p^m} = \frac{1-m(1-p)p^m}{1-(m+1)(1-p)p^m} = \rho.$$
The corresponding term from the partial fraction decomposition is
$$\frac{1}{v-\rho} \mathrm{Res}_{v=\rho} \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = \frac{1}{v-\rho} \times \left.\frac{1}{-1+(m+1)(1-p)p^m v^{m}}\right|_{v=\rho} \\ = - \frac{1}{\rho} \frac{1}{1-v/\rho} \frac{1}{-1+(m+1)(1-p)p^m \rho^{m}}.$$
We thus obtain
$$\bbox[5px,border:2px solid #00A000]{ a_Q \sim \frac{1}{\rho^{Q+1}} \frac{1}{1-(m+1)(1-p)p^m \rho^{m}}.}$$
We may replace $\rho$ by a better approximation from Newton-Raphson in a setting where we require numerics, as is done in the sample code that follows.
MRUNPROB := proc(N, m)
option remember;
local ind, d, pos, cur, run, runs, prob, zcnt, wcnt;
    # brute force: enumerate all 2^N toss sequences (heads = 1, tails = 0),
    # decompose each into maximal runs, and add up the probabilities of the
    # sequences containing a run of at least m heads
    prob := 0;
    for ind from 2^N to 2*2^N-1 do
        d := convert(ind, base, 2);
        cur := -1; pos := 1; run := []; runs := [];
        while pos <= N do
            if d[pos] <> cur then
                if nops(run) > 0 then
                    runs := [op(runs), [run[1], nops(run)]];
                fi;
                cur := d[pos]; run := [cur];
            else
                run := [op(run), cur];
            fi;
            pos := pos + 1;
        od;
        runs := [op(runs), [run[1], nops(run)]];
        if nops(select(r -> (r[1] = 1 and r[2] >= m), runs)) > 0 then
            wcnt := add(`if`(r[1] = 0, r[2], 0), r in runs);
            zcnt := add(`if`(r[1] = 1, r[2], 0), r in runs);
            prob := prob + p^zcnt * (1-p)^wcnt;
        fi;
    od;
    expand(prob);
end;

# coefficient extraction from the generating function
V1 := proc(N, m)
option remember;
local gf;
    gf := (1-p^m*v^m)/(1-v+(1-p)*p^m*v^(m+1));
    expand(1-coeftayl(gf, v=0, N));
end;

# the closed form a_Q and the boxed formula 1 - a_N + p^m a_{N-m}
a := (Q, m) -> add(binomial(Q-m*r,r)*(-1)^r*(1-p)^r*p^(m*r), r = 0 .. floor(Q/m));

V2 := proc(N, m)
option remember;
    expand(1-a(N,m)+p^m*a(N-m,m));
end;

# asymptotics of a_Q with rho from the Newton-Raphson approximation
a2 := proc(Q, m)
local rho;
    rho := (1-m*(1-p)*p^m)/(1-(m+1)*(1-p)*p^m);
    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;

# asymptotics of a_Q with rho computed numerically (root closest to one)
a3 := proc(Q, m, p)
local rho;
    rho := sort([fsolve(1-v + (1-p)*p^m*v^(m+1), v)],
                (r1, r2) -> abs(r1-1) < abs(r2-1));
    rho := op(1, rho);
    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;
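For example (illustrative calls only; the parameters are arbitrary), the exact procedures can be cross-checked against each other and against the asymptotics:

expand(MRUNPROB(8, 3) - V1(8, 3));                      # 0: brute force agrees with the GF
expand(V1(8, 3) - V2(8, 3));                            # 0: GF agrees with the closed form
evalf(subs(p = 1/2, V2(50, 4)));                        # exact P(run of >= 4 heads in 50 flips)
evalf(subs(p = 1/2, 1 - a2(50, 4) + p^4*a2(46, 4)));    # asymptotic, Newton-Raphson root
1 - a3(50, 4, 0.5) + 0.5^4*a3(46, 4, 0.5);              # asymptotic, fsolve root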