Probability of a substring occurring in a string

Solution 1:

This sounds like a job for the Goulden-Jackson cluster method. Doron Zeilberger and John Noonan have a gem of a paper (and Maple programs) about it. In short, this method takes a collection of "bad" words over a finite alphabet and efficiently produces a two-variable generating function for words of various lengths containing specified numbers of bad words. In other words, if $s$ is the variable representing the lengths of strings and $t$ is the variable representing the numbers of bad words and $f$ is the generating function, then the coefficient of $s^mt^n$ in $f$ gives the number of strings of length $m$ that have exactly $n$ bad words. These generating functions always turn out to be rational.

In your example, you want to calculate the probability of selecting a string of length $n$ from the alphabet with ten letters $\{0,\dots,9\}$ containing at least two occurrences of $0000$. In the method, this corresponds to the ten-letter alphabet with a single bad word $0000$. Zeilberger and Noonan's Maple program gives the corresponding generating function $$f(s,t)=-{\frac {{s}^{3}t-{s}^{3}+{s}^{2}t-{s}^{2}+st-s-1}{9\,t{s}^{4}-9\,{s}^ {4}+9\,{s}^{3}t-9\,{s}^{3}+9\,{s}^{2}t-9\,{s}^{2}-st-9\,s+1}} $$ To compute the probability of selecting such a string, you would need to compute the number of strings of length $n$ with absolutely no bad words, i.e. $[s^nt^0]f(s,t)=[s^n]f(s,0)$, and the number of strings of length $n$ with exactly one bad word, i.e. $[s^nt^1]f(s,t)=[s^n]\partial_t f(s,t)_{|t=0}$. Granted, computing these coefficients will take some work, but I think the problem is very do-able from this stage.

Probability of a substring occurring in a string

Solution 1:

Related

Recent Posts