Expected length of the largest repeating substring given a distinct digit/character [duplicate]
Solution 1:
This problem was solved using generating functions by de Moivre in 1738. The formula you want is $$\mathbb{P}(\ell_n \geq m)=\sum_{j=1}^{\lfloor n/m\rfloor} (-1)^{j+1}\left(p+\left({n-jm+1\over j}\right)(1-p)\right){n-jm\choose j-1}p^{jm}(1-p)^{j-1}.$$
References
Section 14.1 of Problems and Snapshots from the World of Probability by Blom, Holst, and Sandell.
Chapter V, Section 3 of Introduction to Mathematical Probability by Uspensky.
Section 22.6 of A History of Probability and Statistics and Their Applications before 1750 by Hald, which gives solutions by de Moivre (1738), Simpson (1740), Laplace (1812), and Todhunter (1865).
Added: The combinatorial class of all coin toss sequences without a run of $m$ heads in a row is $$\sum_{k\geq 0}(\mbox{seq}_{<m}(H)\,T)^k \,\mbox{seq}_{<m}(H), $$ with corresponding counting generating function $$H(h,t)={\sum_{0\leq j<m}h^j\over 1-(\sum_{0\leq j<m}h^j)t}={1-h^m\over 1-h-(1-h^m)t}.$$ We introduce probability by replacing $h$ with $ps$ and $t$ by $qs$, where $q=1-p$: $$G(s)={1-p^ms^m\over 1-s+p^ms^{m+1}q}.$$ The coefficient of $s^n$ in $G(s)$ is $\mathbb{P}(\ell_n<m).$
The function $1/(1-s(1-p^ms^mq))$ can be rewritten as
\begin{eqnarray*}
\sum_{k\geq 0}s^k(1-p^ms^mq)^k &=&\sum_{k\geq 0}\sum_{j\geq 0} {k\choose j} (-p^mq)^j s^{k+jm}.
\end{eqnarray*}
The coefficient of $s^n$ in this function is $c(n)=\sum_{j\geq 0}{n-jm\choose j}(-p^mq)^j$. Therefore the coefficient of $s^n$ in $G(s)$ is $c(n)-p^m c(n-m).$ Finally,
\begin{eqnarray*}
\mathbb{P}(\ell_n\geq m)&=&1-\mathbb{P}(\ell_n<m)\\[8pt]
&=&p^m c(n-m)+1-c(n)\\[8pt]
&=&p^m\sum_{j\geq 0}(-1)^j{n-(j+1)m\choose j}(p^mq)^j+\sum_{j\geq 1}(-1)^{j+1}{n-jm\choose j}(p^mq)^j\\[8pt]
&=&p^m\sum_{j\geq 1}(-1)^{j-1}{n-jm\choose j-1}(p^mq)^{j-1}+\sum_{j\geq 1}(-1)^{j+1}{n-jm\choose j}(p^mq)^j\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}+{n-jm\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}p+{n-jm\choose j-1}q+{n-jm\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[{n-jm\choose j-1}p+{n-jm+1\choose j}q\right]p^{jm} q^{j-1}\\[8pt]
&=&\sum_{j\geq 1}(-1)^{j+1} \left[p+{n-jm+1\over j}\, q\right] {n-jm\choose j-1}\,p^{jm} q^{j-1}.
\end{eqnarray*}
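As a quick numerical sanity check, the alternating sum can be compared with coefficient extraction from $G(s)$. The following Maple sketch is only illustrative (the names PGE and PGEgf are made up for it):

PGE := proc(n, m, p)
    # de Moivre's alternating sum for P(ell_n >= m)
    add((-1)^(j+1)*(p + (n-j*m+1)/j*(1-p))*binomial(n-j*m, j-1)
        *p^(j*m)*(1-p)^(j-1), j = 1 .. floor(n/m));
end;
PGEgf := proc(n, m, p)
local G;
    # the same probability computed as 1 - [s^n] G(s)
    G := (1-p^m*s^m)/(1-s+p^m*s^(m+1)*(1-p));
    1 - coeftayl(G, s = 0, n);
end;
PGE(20, 3, 1/2) - PGEgf(20, 3, 1/2);   # should return 0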
Solution 2:
Define a Markov chain with states $0, 1, \ldots, m$: the chain moves from $m$ to $m$ with probability $1$, and for $i<m$ it moves from $i$ to $i+1$ with probability $p$ and from $i$ to $0$ with probability $1-p$. Looking at the $n$th power of the transition matrix of this chain, you can read off the probability that $n$ flips contain a run of at least $m$ consecutive heads.
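Here is a minimal Maple sketch of this approach (the procedure name RunProb and the state-to-index shift are just illustrative, not from the answer): state $i$ records the length of the current run of heads, and state $m$ is absorbing.

RunProb := proc(n, m, p)
local T, i;
    # states 0..m are stored as rows/columns 1..m+1
    T := Matrix(m+1, m+1);
    for i from 1 to m do
        T[i, i+1] := p;       # head: the current run grows by one
        T[i, 1] := 1-p;       # tail: the run resets to zero
    od;
    T[m+1, m+1] := 1;         # a run of m heads has been seen; stay there
    # start in state 0 and read off the mass absorbed in state m after n flips
    LinearAlgebra:-MatrixPower(T, n)[1, m+1];
end;
RunProb(20, 3, 1/2);          # probability of a run of at least 3 heads in 20 flips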
Solution 3:
You can find a limiting distribution; otherwise it is a difficult problem, and the closed-form solution will not have much practical value. See this for an elementary approach. [Update] The previous link has moved to this new address: "Longest Run of Heads", M. F. Schilling.
Solution 4:
I don't think you'll get a simple analytic formula. The problem is essentially equivalent to this one, see my answer there: it involves the $n$th power of an $m \times m$ stochastic matrix (note that there we are interested in runs of length at least $m$), using a Markov chain (as suggested in the comments by Shawn). You can also find some asymptotics there.
Solution 5:
Remark. The following answer uses the same methodology as the accepted leading answer. It is self-contained, includes Maple code and some asymptotics. Perhaps it can serve a didactic purpose and show that the basic computation is very simple.
With $z$ marking successes, $w$ marking failures, and $u$ marking every maximal run of at least $m$ successes, we get the generating function
$$\left(1+z+\cdots+z^{m-1}+u\frac{z^m}{1-z}\right) \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} \left(z+\cdots+z^{m-1}+u\frac{z^m}{1-z}\right)^q \right)\frac{1}{1-w}.$$
Now, as a sanity check, when we put $w=z$ and $u=1$ we should get all binary strings. Doing the calculation we find
$$\frac{1}{1-z} \frac{1}{1-wz/(1-w)/(1-z)} \frac{1}{1-w} \\ = \frac{1}{(1-w)(1-z)-wz} = \frac{1}{1-w-z}.$$
Putting $z=w$ yields
$$\frac{1}{(1-z)^2} \frac{1}{1-z^2/(1-z)^2} = \frac{1}{(1-z)^2-z^2} = \frac{1}{1-2z}$$
and the sanity check goes through.
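The same check can be run in Maple, with the geometric series over $q$ already summed in closed form; the sketch below fixes $m=4$, but any value works (the name F is just for this check):

F := ((1-z^m)/(1-z) + u*z^m/(1-z))
     /(1 - w/(1-w)*(z*(1-z^(m-1))/(1-z) + u*z^m/(1-z)))
     /(1-w):
simplify(eval(F, [u = 1, m = 4]) - 1/(1-w-z));          # 0
simplify(eval(F, [u = 1, w = z, m = 4]) - 1/(1-2*z));   # 0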
Now for the probabilities we must subtract the value for $u=0$ from the generating function and then set $u=1.$ We obtain for the zero term
$$\frac{1-z^m}{1-z} \left(\sum_{q\ge 0} \frac{w^q}{(1-w)^q} z^q \frac{(1-z^{m-1})^q}{(1-z)^q}\right) \frac{1}{1-w} \\ = \frac{1-z^m}{1-z} \frac{1}{1-wz(1-z^{m-1})/(1-w)/(1-z)} \frac{1}{1-w} = \frac{1-z^m}{(1-w)(1-z)-wz(1-z^{m-1})} = \frac{1-z^m}{1-w-z+wz^{m}}.$$
To obtain the probabilities set $z=pv$ and $w=(1-p)v$ to get for the one term
$$\frac{1}{1-w-z} = \frac{1}{1-(1-p)v-pv} = \frac{1}{1-v}$$
so that $$[v^n] \frac{1}{1-v} = 1.$$
The subtracted contribution from the zero term is
$$\frac{1- p^m v^m}{1-v+(1-p)p^m v^{m+1}}.$$
Now for coefficient extraction we write
$$[v^Q] \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = [v^Q] \sum_{q\ge 0} (v-(1-p)p^m v^{m+1})^q \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{q-r} (-1)^r (1-p)^r p^{mr} v^{(m+1)r} \\ = [v^Q] \sum_{q\ge 0} \sum_{r=0}^q {q\choose r} v^{mr+q} (-1)^r (1-p)^r p^{mr} \\ = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.$$
We thus get the closed form
$$\bbox[5px,border:2px solid #00A000]{ 1 - a_N + p^m a_{N-m} \quad \text{where} \quad a_Q = \sum_{r=0}^{\lfloor Q/m\rfloor} {Q-mr\choose r} (-1)^r (1-p)^r p^{mr}.}$$
Note that for $N\lt m$ we get
$$1- {N\choose 0} \times 1 \times 1\times 1 = 0$$
which is the correct result. Also for $N=m$ we obtain
$$1 - 1 - {0\choose 1} \times (-1) \times (1-p) \times p^m + p^m \times 1 = p^m $$
which is correct as well.
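As a further concrete check (this example is not in the original answer), take $N=4$, $m=2$ and $p=q=\frac12$. Then $$a_4 = {4\choose 0} - {2\choose 1}\cdot\frac12\cdot\frac14 = \frac34, \qquad a_2 = {2\choose 0} = 1$$ (the $r=2$ term of $a_4$ vanishes), so that $$1 - a_4 + p^2 a_2 = 1 - \frac34 + \frac14 = \frac12,$$ in agreement with direct enumeration: exactly $8$ of the $16$ binary strings of length $4$ contain two consecutive heads.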
Now for an asymptotic we note that the root $\rho$ of $1-v+(1-p)p^m v^{m+1}$ with the smallest modulus is the one close to $v_0=1$, and Newton-Raphson gives the first approximation
$$v_1 = v_0 - \frac{1-v_0 + (1-p)p^m v_0^{m+1}}{-1+(m+1)(1-p)p^m v_0^{m}}.$$
This works out to
$$1+ \frac{(1-p)p^m}{1-(m+1)(1-p)p^m} = \frac{1-m(1-p)p^m}{1-(m+1)(1-p)p^m} = \rho.$$
The corresponding term from the partial fraction decomposition is
$$\frac{1}{v-\rho} \mathrm{Res}_{v=\rho} \frac{1}{1-v+(1-p)p^m v^{m+1}} \\ = \frac{1}{v-\rho} \times \left.\frac{1}{-1+(m+1)(1-p)p^m v^{m}}\right|_{v=\rho} \\ = - \frac{1}{\rho} \frac{1}{1-v/\rho} \frac{1}{-1+(m+1)(1-p)p^m \rho^{m}}.$$
We thus obtain
$$\bbox[5px,border:2px solid #00A000]{ a_Q \sim \frac{1}{\rho^{Q+1}} \frac{1}{1-(m+1)(1-p)p^m \rho^{m}}.}$$
We may replace $\rho$ by a better approximation from Newton-Raphson in a setting where we require numerics, as is done in the sample code that follows.
MRUNPROB := proc(N, m)
option remember;
local ind, d, pos, cur, run, runs, prob, zcnt, wcnt;
    # brute force: enumerate all 2^N toss sequences (heads = 1, tails = 0),
    # decompose each into maximal runs, and add up the probabilities of the
    # sequences containing a run of at least m heads
    prob := 0;
    for ind from 2^N to 2*2^N-1 do
        d := convert(ind, base, 2);
        cur := -1; pos := 1; run := []; runs := [];
        while pos <= N do
            if d[pos] <> cur then
                if nops(run) > 0 then
                    runs := [op(runs), [run[1], nops(run)]];
                fi;
                cur := d[pos]; run := [cur];
            else
                run := [op(run), cur];
            fi;
            pos := pos + 1;
        od;
        runs := [op(runs), [run[1], nops(run)]];
        if nops(select(r -> (r[1] = 1 and r[2] >= m), runs)) > 0 then
            wcnt := add(`if`(r[1] = 0, r[2], 0), r in runs);
            zcnt := add(`if`(r[1] = 1, r[2], 0), r in runs);
            prob := prob + p^zcnt * (1-p)^wcnt;
        fi;
    od;
    expand(prob);
end;

# coefficient extraction from the generating function
V1 := proc(N, m)
option remember;
local gf;
    gf := (1-p^m*v^m)/(1-v+(1-p)*p^m*v^(m+1));
    expand(1-coeftayl(gf, v=0, N));
end;

# the closed form a_Q and the boxed formula 1 - a_N + p^m a_{N-m}
a := (Q, m) -> add(binomial(Q-m*r,r)*(-1)^r*(1-p)^r*p^(m*r), r = 0 .. floor(Q/m));

V2 := proc(N, m)
option remember;
    expand(1-a(N,m)+p^m*a(N-m,m));
end;

# asymptotics of a_Q with rho from the Newton-Raphson approximation
a2 := proc(Q, m)
local rho;
    rho := (1-m*(1-p)*p^m)/(1-(m+1)*(1-p)*p^m);
    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;

# asymptotics of a_Q with rho computed numerically (root closest to one)
a3 := proc(Q, m, p)
local rho;
    rho := sort([fsolve(1-v + (1-p)*p^m*v^(m+1), v)],
                (r1, r2) -> abs(r1-1) < abs(r2-1));
    rho := op(1, rho);
    1/rho^(Q+1)*1/(1-(m+1)*(1-p)*p^m*rho^m);
end;
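For example (illustrative calls only; the parameters are arbitrary), the exact procedures can be cross-checked against each other and against the asymptotics:

expand(MRUNPROB(8, 3) - V1(8, 3));                      # 0: brute force agrees with the GF
expand(V1(8, 3) - V2(8, 3));                            # 0: GF agrees with the closed form
evalf(subs(p = 1/2, V2(50, 4)));                        # exact P(run of >= 4 heads in 50 flips)
evalf(subs(p = 1/2, 1 - a2(50, 4) + p^4*a2(46, 4)));    # asymptotic, Newton-Raphson root
1 - a3(50, 4, 0.5) + 0.5^4*a3(46, 4, 0.5);              # asymptotic, fsolve root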