Expected length of the largest repeating substring given a distinct digit/character [duplicate]

Solution 1:

This problem was solved using generating functions by de Moivre in 1738. The formula you want is $$\mathbb{P}(\ell_n \geq m)=\sum_{j=1}^{\lfloor n/m\rfloor} (-1)^{j+1}\left(p+\left({n-jm+1\over j}\right)(1-p)\right){n-jm\choose j-1}p^{jm}(1-p)^{j-1}.$$

References

  1. Section 14.1 Problems and Snapshots from the World of Probability by Blom, Holst, and Sandell

  2. Chapter V, Section 3 Introduction to Mathematical Probability by Uspensky

  3. Section 22.6 A History of Probability and Statistics and Their Applications before 1750 by Hald gives solutions by de Moivre (1738), Simpson (1740), Laplace (1812), and Todhunter (1865)


Added: The combinatorial class of all coin toss sequences without a run of $ m $ heads in a row is $$\sum_{k\geq 0}(\mbox{seq}_{< m }(H)\,T)^k \,\mbox{seq}_{< m }(H), $$ with corresponding counting generating function $$H(h,t)={\sum_{0\leq j< m }h^j\over 1-(\sum_{0\leq j< m }h^j)t}={1-h^ m \over 1-h-(1-h^ m )t}.$$ We introduce probability by replacing $h$ with $ps$ and $t$ by $qs$, where $q=1-p$: $$G(s)={1-p^ m s^ m \over1-s+p^ m s^{ m +1}q}.$$ The coefficient of $s^n$ in $G(s)$ is $\mathbb{P}(\ell_n<m).$

The function $1/(1-s(1-p^ m s^ m q ))$ can be rewritten as \begin{eqnarray*} \sum_{k\geq 0}s^k(1-p^ m s^ m q )^k &=&\sum_{k\geq 0}\sum_{j\geq 0} {k\choose j} (-p^ m q)^js^{k+j m }\\ %&=&\sum_{j\geq 0}\sum_{k\geq 0} {k\choose j} (-p^ m q )^js^{k+j m }. \end{eqnarray*} The coefficient of $s^n$ in this function is $c(n)=\sum_{j\geq 0}{n-j m \choose j}(-p^ m q)^j$. Therefore the coefficient of $s^n$ in $G(s)$ is $c(n)-p^ m c(n- m ).$ Finally, \begin{eqnarray*} \mathbb{P}(\ell_n\geq m)&=&1-\mathbb{P}(\ell_n<m)\\[8pt] &=&p^ m c(n- m )+1-c(n)\\[8pt] &=&p^ m \sum_{j\geq 0}(-1)^j{n-(j+1) m \choose j}(p^ m q)^j+\sum_{j\geq 1}(-1)^{j+1}{n-j m \choose j}(p^ m q)^j\\[8pt] &=&p^ m \sum_{j\geq 1}(-1)^{j-1}{n-j m \choose j-1}(p^m q)^{j-1}+\sum_{j\geq 1}(-1)^{j+1}{n-j m \choose j}(p^mq )^j\\[8pt] &=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-j m \choose j-1}+{n-j m \choose j}q\right]p^{ jm } q^{j-1}\\[8pt] &=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-j m \choose j-1}p+{n-j m \choose j-1}q+{n-j m \choose j}q\right]p^{ jm } q^{j-1}\\[8pt] &=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-j m \choose j-1}p+{n-j m +1\choose j}q \right]p^{ jm} q^{j-1}\\[8pt] &=&\sum_{j\geq 1}(-1)^{j+1} \left[p+{n-j m +1\over j}\, q\right] {n-j m \choose j-1}\,p^{ jm} q^{j-1}. \end{eqnarray*}

Solution 2:

Define a Markov chain with states $0, 1, \ldots m$ so that with probability $1$ the chain moves from $m$ to $m$ and for $i<m$ with probability $p$ the chain moves from $i$ to $i+1$ and with probability $1-p$ the chain moves from $i$ to $0$. If you look at the $n$th power of the transition matrix for this chain you can read off the probability that in $n$ flips you have a sequence of at least $m$ consecutive heads.

Solution 3:

You can find a limiting distribution, otherwise it's a difficult problem and the closed form solution won't have much practical value. See this for an elementary approach. [Update] Previous link moved to this new address. "Longest Run of Heads", M.F.Schilling.

Solution 4:

I don't think you'll get a simple analytic formula. The problem is essentially equivalent to this one, see my answer there: it involves the $n$-power of a $m \times m$ stochastic matrix (notice that there we are interested in the runs equal or greater than $m$), using a Markov chain (as suggested in the comments by Shawn). You can find also there some asymptotics.

Solution 5:

Remark. The following answer uses the same methodology as the accepted leading answer. It is self-contained, includes Maple code and some asymptotics. Perhaps it can serve a didactic purpose and show that the basic computation is very simple.

With $z$ for successes and $w$ for failures we get the generating function

$$\left(1+z+\cdots+u\frac{z^m}{1-z}\right) \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} \left(z+\cdots+u\frac{z^m}{1-z}\right)^q \right)\frac{1}{1-w}.$$

Now as a sanity check when we put $w=z$ and $u=1$ we should get all binary strings. Doing the calculation we find

$$\frac{1}{1-z} \frac{1}{1-wz/(1-w)/(1-z)} \frac{1}{1-w} \\ = \frac{1}{(1-w)(1-z)-wz} = \frac{1}{1-w-z}.$$

Putting $z=w$ yields

$$\frac{1}{(1-z)^2} \frac{1}{1-z^2/(1-z)^2} = \frac{1}{(1-z)^2-z^2} = \frac{1}{1-2z}$$

and the sanity check goes through.

Now for the probabilities we must subtract the value for $u=0$ from the generating function and then set $u=1.$ We obtain for the zero term

$$\frac{1-z^m}{1-z} \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} z^q \frac{(1-z^{m-1})^q}{(1-z)^q}\right) \frac{1}{1-w} \\ = \frac{1-z^m}{1-z} \frac{1}{1-wz(1-z^{m-1})/(1-w)/(1-z)} \frac{1}{1-w} = \frac{1-z^m}{(1-w)(1-z)-wz(1-z^{m-1})} = \frac{1-z^m}{1-w-z+wz^{m}}.$$

To obtain the probabilities set $z=pv$ and $w=(1-p)v$ to get for the one term

$$\frac{1}{1-w-z} = \frac{1}{1-(1-p)v-pv} = \frac{1}{1-v}$$

so that $$[v^n] \frac{1}{1-v} = 1.$$

The subtracted contribution from the zero term is

$$\frac{1- p^m v^m}{1-v+(1-p)p^m v^{m+1}}.$$

Now for coefficient extraction we write

$$[v^Q] \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = [v^Q] \sum_{q\ge 0} (v-(1-p)p^m v^{m+1})^q \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{q-r} (-1)^r (1-p)^r p^{mr} v^{(m+1)r}. \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{mr+q} (-1)^r (1-p)^r p^{mr} \\ = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.$$

We thus get the closed form

$$\bbox[5px,border:2px solid #00A000]{ 1 - a_N + p^m a_{N-m} \quad \text{where} \quad a_Q = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.}$$

Note that for $N\lt m$ we get

$$1- {N\choose 0} \times 1 \times 1\times 1 = 0$$

which is the correct result. Also for $N=m$ we obtain

$$1 - 1 - {0\choose 1} \times -1 \times (1-p) \times p^m + p^m \times 1 = p^m $$

which is correct as well.

Now for an asymptotic we note that the root $\rho$ with the smalltest modulus of $1-v+(1-p)p^m v^{m+1}$ is the one close to $v_0=1$ and we get from Newton-Raphson the first approximation

$$v_1 = v_0 - \frac{1-v_0 + (1-p)p^m v_0^{m+1}}{-1+(m+1)(1-p)p^m v_0^{m}}.$$

This works out to

$$1+ \frac{(1-p)p^m}{1-(m+1)(1-p)p^m} = \frac{1-m(1-p)p^m}{1-(m+1)(1-p)p^m} = \rho.$$

The corresponding term from the partial fraction decomposition is

$$\frac{1}{v-\rho} \mathrm{Res}_{v=\rho} \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = \frac{1}{v-\rho} \times \left.\frac{1}{-1+(m+1)(1-p)p^m v^{m}}\right|_{v=\rho} \\ = - \frac{1}{\rho} \frac{1}{1-v/\rho} \frac{1}{-1+(m+1)(1-p)p^m \rho^{m}}.$$

We thus obtain

$$\bbox[5px,border:2px solid #00A000]{ a_Q \sim \frac{1}{\rho^{Q+1}} \frac{1}{1-(m+1)(1-p)p^m \rho^{m}}.}$$

We may replace $\rho$ by a better approximation from Newton-Raphson in a setting where we require numerics, which is done in the sample code which now follows.

MRUNPROB :=
proc(N, m)
option remember;
local ind, d, pos, cur, run, runs, prob,
    zcnt, wcnt;

    prob := 0;

    for ind from 2^N to 2*2^N-1 do
        d := convert(ind, base, 2);

        cur := -1; pos := 1;
        run := []; runs := [];


        while pos <= N do
            if d[pos] <> cur then
                if nops(run) > 0 then
                    runs :=
                    [op(runs), [run[1], nops(run)]];
                fi;

                cur := d[pos];
                run := [cur];
            else
                run := [op(run), cur];
            fi;

            pos := pos + 1;
        od;

        runs := [op(runs), [run[1], nops(run)]];

        if nops(select(r -> (r[1] = 1 and r[2] >= m), runs)) > 0
        then
            wcnt := add(`if`(r[1] = 0, r[2], 0), r in runs);
            zcnt := add(`if`(r[1] = 1, r[2], 0), r in runs);

            prob := prob + p^zcnt * (1-p)^wcnt;
        fi;
    od;

    expand(prob);
end;

V1 :=
proc(N, m)
option remember;
local gf;

    gf := (1-p^m*v^m)/(1-v+(1-p)*p^m*v^(m+1));
    expand(1-coeftayl(gf, v=0, N));
end;

a := (Q, m) ->
add(binomial(Q-m*r,r)*(-1)^r*(1-p)^r*p^(m*r),
    r = 0 .. floor(Q/m));


V2 :=
proc(N, m)
option remember;
    expand(1-a(N,m)+p^m*a(N-m,m));
end;


a2 :=
proc(Q, m)
local rho;

    rho := (1-m*(1-p)*p^m)/(1-(m+1)*(1-p)*p^m);

    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;

a3 :=
proc(Q, m, p)
local rho;

    rho :=
    sort([fsolve(1-v + (1-p)*p^m*v^(m+1), v)],
         (r1, r2) -> abs(r1-1) < abs(r2-1));
    rho := op(1, rho);


    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;