Average number of strings with edit distance at most 4
Consider a binary string of length $n \geq 4$. An edit operation is a single bit insert, delete or substitution. The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other one. Given a string $S$, my question relates to the number of distinct strings of length $n$ which are edit distance at most $4$ from $S$.
Let us write $g_k(S)$ for the number of distinct strings of length $n$ which are edit distance at most $k$ from $S$.
Let $X_n$ be a random variable representing a random binary string of length $n$, with the bits chosen uniformly and independently. We can compute $\mathbb{E}(g_k(X_n))$ for $k = 0, 1, 2, 3$ explicitly.
- $\mathbb{E}(g_0(X_n)) = 1$
- $\mathbb{E}(g_1(X_n)) = n+1$
- $\mathbb{E}(g_2(X_n)) = \frac{13}{2} - \frac{5n}{2} + n^2 - 6\cdot2^{-n}$
- $\mathbb{E}(g_3(X_n)) = -\frac{83}{2 }+ \frac{331n}{12} -6 n^2 + \frac{2n^3}{3} + 2^{-n}(40 + 6n -4n^2)$ (Ref 1 and Ref 2)
This leads directly to my question:
Let $X_n$ be a random variable representing a random binary string of length $n$, with the bits chosen uniformly and independently. What is:
$$\mathbb{E}(g_4(X_n))\;?$$
For small $n$ we can compute the value exactly:
- $\mathbb{E}(g_4(X_4)) = 16$.
- $\mathbb{E}(g_4(X_5)) = 31 \frac{11}{16}$.
- $\mathbb{E}(g_4(X_6)) = 61 \frac{21}{32}$.
- $\mathbb{E}(g_4(X_7)) = 116 \frac{7}{8}$.
- $\mathbb{E}(g_4(X_8)) = 214 \frac{43}{128}$.
- $\mathbb{E}(g_4(X_9)) = 378 \frac{49}{246}$.
- $\mathbb{E}(g_4(X_{10})) = 640 \frac{301}{512}$.
- $\mathbb{E}(g_4(X_{11})) = 1042 \frac{1}{16}$.
- $\mathbb{E}(g_4(X_{12})) = 1631 \frac{1345}{2048}$.
- $\mathbb{E}(g_4(X_{13})) = 2466 \frac{3909}{4096}$.
- $\mathbb{E}(g_4(X_{14})) = 3614 \frac{563}{8192}$
It seems tempting to guess the general form of $\mathbb{E}(g_4(X_n))$ from the examples of $\mathbb{E}(g_2(X_n))$ and $\mathbb{E}(g_3(X_n))$ but I have not succeeded in getting that to work.
Solution 1:
Instead of considering one string and ask about strings close to it, it's easier to take a more symmetric approach and ask for the number of pairs of strings of given length $n$ that have levenshtein distance at most 4. Then the average number of strings of distance at most 4 is that number divided by $2^n$. The counting will be done using finite automata. To motivate the automaton, let's consider a nondeterministic algorithm to compute the levenshtein distance.
I imagine a machine with two input tapes, initially at the first letter, and a counter, initially set to 0. The machine compares the letters at the position of the heads of the tapes, and does one of the following steps: move both reading heads forward and, if the read letters were different, increment the counter, or just move one head forward and increment the counter. (A head cannot move forward when it has already passed the end of the word.) When both heads are behind the words, the computation ends and the value of the counter is the result. I claim that this result is an upper bound to the levenshtein distance of the input words, and that there is a computation that returns exactly the distance.
I'll formalize this description with pseudocode where I replace the tapes with
variables containing the remaining, unprocessed string. Note that I only
look at its first element (head
), and only replace it with a version
where this element is removed (tail
):
algorithm levdist(l,r):
# nondeterministically compute an upper bound for the levenshtein distance
# in a why where the real distance is among the possible results
c := 0;
while l and r are not both empty
do one of these operations:
- a:=head(l); b:=head(r); l:=tail(l); r:=tail(r); if a<>b then c:=c+1;
- l:=tail(l); c:=c+1;
- r:=tail(r); c:=c+1;
return c;
Let's now simulate this computation with a nondeterministic (infinite) automaton over the alphabet of pairs of bits, such that the input $(l_1,r_1),(l_2,r_2)\dots(l_n,r_n)$ corresponds to input $(l_1l_2\dots l_n,r_1r_2\dots r_n)$ to the original machine. Note that now the words are of the same length, a restriction we didn't need before.
The first machine can compare letters $a_i$ and $b_j$ at different positions. If e.g. $i<j$, the automaton has to read up to $(a_j,b_j)$ to know $b_j$, and then still must remember $a_i$ (and the following bits) in its state. So to be able to simulate the machine, the automaton will have states that know the counter, which (if any) tape is behind, and what the unprocessed bits on that tape were.
So we have states $N_c$ for $c\geq0$ for when the tapes are in sync, and states $L_{c,w}$ and $R_{c,w}$ with nonempty words $w$ and $c\geq |w|$, the length of $w$. To simplify the following, let's allow $L_{c,\epsilon}$ and $R_{c,\epsilon}$ as alternate names for $N_c$, where $\epsilon$ is the empty word.
For input $(l,r)$, the states $N_c$ have transitions to itself if $l=r$, otherwise to $N_{c+1}$, and to $L_{c+1,l}$ and $R_{c+1,r}$. If $\omega$ is a bit and $w=\omega w'$, then for that same input $L_{c,w}$ has transitions to $L_{c',w'l}$ with $c'=c$ if $\omega=r$, otherwise $c'=c+1$, and to $L_{c+1,wl}$. It also has an $\epsilon$-transition to $L_{c+1,w'}$. The transitions of the $R$ state are likewise, with the roles of $l$ and $r$ swapped.
When the automaton starts with initial state $N_0$ and processes some input, it will end in some state from which an $N_c$ state can be reached. This $c$ is an upper bound for the levenshtein distance of the input. Each computation of the first machine corresponds to a computation of the automaton that ends in the $N_c$ state with the $c$ that the machine computes.
Now we don't really want to compute the distance, we only want to know if it is at most 4, which we'll generalize to $k$. (And for real computations, we prefer finite automata.) To achieve this, we can restrict the states to those that have $c\leq k$. Now some of the $L_{c,w}$ and $R_{c,w}$ states can no longer reach an $N$ state. When we remove them, we keep those with $1\leq c\leq |w|\leq k-|w|$. When this automaton processes some input with initial state $N_0$ and doesn't get stuck, we know that it could reach a state $N_c$ (with $c\leq k$), so we can just declare all states as accepting states.
The following GAP code which uses the automata package (of which we will use more below) formalizes the construction of this nondeterministic finite automaton:
LoadPackage("automata");
levnfa := function(k)
local kh, d, c, states, l, w, AddT, dir, hd, tl, c0, c1,
t00, t01, t10, t11, te, e00, e01, x01, e10, x10, e11, ee;
kh:=QuoInt(k,2);
d:=NewDictionary([],true);
# prepare states
for c in [0..k] do
AddDictionary(d,[0,c],c+1);
AddDictionary(d,[1,c],c+1);
od;
states:=k+1;
for l in [1..kh] do
for w in IteratorOfTuples([0,1],l) do
for c in [l..k-l] do
states:=states+1;
AddDictionary(d,Concatenation([0,c],w),states);
states:=states+1;
AddDictionary(d,Concatenation([1,c],w),states);
od;
od;
od;
AddT:=function(sl,arg)
# add state id of state described by arg to sl if it exists,
# arg is [dir, c, w, x] where x is an optional extra bit appended to w
local x,l,val;
if Length(arg)=4 then
x:=[arg[4]];
else
x:=[];
fi;
l:=Concatenation(arg{[1,2]},arg[3],x);
val := LookupDictionary(d,l);
if val<>fail then
Add(sl,val);
fi;
end;
t00:=[]; t01:=[]; t10:=[]; t11:=[]; te:=[];
for c in [0..k] do
e00:=[]; e01:=[]; e10:=[]; e11:=[];
AddT(e00,[0,c,[]]); AddT(e11,[0,c,[]]);
AddT(e01,[0,c+1,[]]); AddT(e10,[0,c+1,[]]);
AddT(e00,[0,c+1,[0]]); AddT(e00,[1,c+1,[0]]);
AddT(e01,[0,c+1,[0]]); AddT(e01,[1,c+1,[1]]);
AddT(e10,[0,c+1,[1]]); AddT(e10,[1,c+1,[0]]);
AddT(e11,[0,c+1,[1]]); AddT(e11,[1,c+1,[1]]);
Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,[]);
od;
for l in [1..kh] do
for w in IteratorOfTuples([0,1],l) do
for c in [l..k-l] do
for dir in [0,1] do
e00:=[]; x01:=[]; x10:=[]; e11:=[]; ee:=[];
hd:=w[1];
tl:=w{[2..Length(w)]};
if hd=0 then
c0:=0; c1:=1;
else
c0:=1; c1:=0;
fi;
AddT(e00,[dir,c+c0,tl,0]);
AddT(x01,[dir,c+c1,tl,0]);
AddT(x10,[dir,c+c0,tl,1]);
AddT(e11,[dir,c+c1,tl,1]);
AddT(e00,[dir,c+1,w,0]);
AddT(x01,[dir,c+1,w,0]);
AddT(x10,[dir,c+1,w,1]);
AddT(e11,[dir,c+1,w,1]);
AddT(ee, [dir,c+1,tl]);
if dir=0 then
e01:=x01; e10:=x10;
else
e01:=x10; e10:=x01;
fi;
Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,ee);
od;
od;
od;
od;
return Automaton( "epsilon", states, 5,
[t00, t01, t10, t11, te],
[1], [1..states] );
end;
The automaton that we get can be transformed into a minimal deterministic one. From its transition function we can get a matrix $M$ that contains the number of ways we can get from one state to another in one step. From the matrix $M^n$ we can see how many ways there are to get from one state to another in $n$ steps. The number of ways to get from the initial state to a terminal one in $n$ steps equals the number of accepted words, i.e. the number of pairs of words of length $n$ with levenshtein distance $n$, because the automaton is deterministic.
The automaton has a sink state, that we may remove to very slightly simplify the calculation, since it does not contribute to the accepted words. Then all the remaining states are accepting.
Here is GAP code for these calculations:
nfa := levnfa(4); # for k=4
dfa := RemovedSinkStates(MinimalizedAut(nfa));
size := NumberStatesOfAutomaton(dfa);
mat := NullMat(size, size);;
for row in TransitionMatrixOfAutomaton(dfa) do
for i in [1..size] do
if row[i]<>0 then
mat[i][row[i]] := mat[i][row[i]]+1;
fi;
od;
od;
init := ListWithIdenticalEntries(size, 0);;
init[InitialStatesOfAutomaton(dfa)[1]] := 1;;
Assert(0, FinalStatesOfAutomaton(dfa)=[1..size]);
fin := ListWithIdenticalEntries(size, 1);;
Now we can compute the number of pairs of words of length $n$
with distance at most $k$ with init * mat^n * fin
, but if we want that
for a lot of $n$, and especially if $k$ and therefore the size of the matrix
gets bigger, it is more efficient to do something like this:
res:=[];; v:=init;;
for i in [1..150] do
v:=v*mat;
Add(res, v*fin);
od;
To find a closed formula, we need the eigenvalues of the matrix $M$. From the GAP computation
gap> Set(Factors(CharacteristicPolynomial(Rationals,mat)));
[ x_1-2, x_1-1, x_1, x_1+1, x_1^2+1, x_1^2+x_1+1 ]
(GeneralizedEigenvalues(Rationals,mat);
computes the same but takes longer)
we see that the eigenvalues are
$2, 1, 0, -1, \pm i$ and the third roots of unity.
We can use the index of the nilpotent part of the Jordan-Chevalley decomposition minus one as bound for the degree of the polynomials that appear in the closed formula, and as bound for the values for which it works, but for bigger $k$ that seems to be too much work and we can just guess some number. Also, if we do the work to compute the decomposition, it should be possible to directly get the formula from it. Anyway, GAP allows us to compute:
gap> j:=JordanDecomposition(mat);;
gap> IsZero(j[2]^4);
false
gap> IsZero(j[2]^5);
true
Now let's find out how to express the function as linear combination of exponential functions (for nonzero real eigenvalues) and sine and cosine functions (for complex eigenvalues), multiplied with powers of $n$. Let's use powers up to the fifth, even though we expect only fourth powers. The sine function corresponding to the third root of unity is multiplied with $\sqrt{3}$ to get rational results. We need to compute 42 coefficients, to be sure we do the fitting with all the values of $n$ between 10 and 100:
gap> e := x -> n -> x^n;;
gap> ci := n -> RealPart(E(4)^n);;
gap> si := n -> ImaginaryPart(E(4)^n);;
gap> c3 := n -> RealPart(E(3)^n);;
gap> s3 := n -> ImaginaryPart(E(3)^n) * Sqrt(3);;
gap> SolutionMat(TransposedMat(List([10..100],n->Concatenation(
> List([e(2),e(1),e(-1),ci,si,c3,s3], fn ->
> List([0..5], exp -> n^exp*fn(n)))))),
> res{[10..100]});
[ 168264301/345744, -4115011/16464, 19597/336, -161/24, 17/48, 0,
-309817/648, -7058/81, -856/81, 449/81, -515/324, 0, 1/72, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -50920/194481, 2378/27783, 22/3969, 0,
0, 0, -61456/583443, 2566/83349, -10/3969, 0, 0, 0 ]
We succeeded, and we see that the eigenvalues $\pm i$ do not contribute to the result. After division by $2^n$, we get the result
$$\mathbb{E}(g_4(X_n))=p_2(n)+2^{-n}\Biggl(p_1(n)+(-1)^n p_{-1}(n)+ \cos\left(\frac{2\pi n}{3}\right)p_c(n)+\sqrt{3}\sin\left(\frac{2\pi n}{3}\right)p_s(n)\Biggr) $$
where
\begin{align} p_2(n) &= \frac{17}{48}n^4-\frac{161}{24}n^3+\frac{19597}{336}n^2- \frac{4115011}{16464}n+\frac{168264301}{345744}, \\ p_1(n) &= -\frac{515}{324}n^4+\frac{449}{81}n^3-\frac{856}{81}n^2- \frac{7058}{81}n-\frac{309817}{648}, \\ p_{-1}(n) &= \frac{1}{72}, \\ p_c(n) &= \frac{22}{3969}n^2+\frac{2378}{27783}n-\frac{50920}{194481}, \\ p_s(n) &= -\frac{10}{3969}n^2+\frac{2566}{83349}n-\frac{61456}{583443}. \end{align}
By comparison we see that the formula works for $n\geq 5$.
Bigger $k$
When we do the same for $k=5$, we get no new eigenvalues, but now we get a contribution from the $\pm i$ eigenvalues. Amazingly, the polynomial corresponding to the eigenvalue 1 has degree 6. And the formula only works for $n\geq 8$, so it misses some of the nontrivial cases.
For even bigger $k$ we get more roots of unity as eigenvalues. The computations get more difficult: the size of the NFA grows faster than $2^{k/2}$, and the DFA is much bigger. The matrix $M$ has dimension $N\times N$ where $N$ is the size of the DFA. But it is a sparse matrix with less than $4N$ nonzero entries.
The following GAP function computes a sparse representation of the matrix directly from the transition matrix of the automaton and writes it to a file named sparse
:
save_sparse := function(tm)
local os, size, row, i, st, states;
size:=Length(tm[1]);
os:=OutputTextFile("sparse",false);
AppendTo(os, size, " ", size, " ", "M\n");
for i in [1..size] do
states := Filtered(List([1..4], n -> tm[n][i]), s -> s<>0);
for st in Set(states) do
AppendTo(os, i, " ", st, " ", Number(states, s-> s=st), "\n");
od;
od;
AppendTo(os, "0 0 0\n");
CloseStream(os);
end;
It can be used like this:
save_sparse(TransitionMatrixOfAutomaton(dfa));
I could find the eigenvalues for bigger $k$ by computing the minimal polynomial of the matrix using LinBox.
The following table gives the sizes of the automata and the new eigenvalues for $k\leq 10$:
k #NFA #DFA new eigenvalues
0 1 1 2
1 2 2 -
2 7 15 1,0,-1
3 12 38 -
4 25 265 E(3),E(4)
5 38 700 -
6 67 4389 E(5),E(6)
7 96 11856 -
8 157 64905 E(7),E(8)
9 218 175766 -
10 343 859265 E(9),E(10)
where E(n)
denotes all primitive $n$-th roots of unity.