Words that agree on the count of all subwords of length $\leq k$

I'm working with a two letter alphabet $\{0,1\}$, and I'm talking about generalized sub-word i.e. letters don't need to be adjacent, $|01010|_{00} = 3$

For example, the two words $u=1001$ and $v=0110$ agree on all subwords of length $\leq$ 2 (1, 0, 10, 01, 11 and 00), and are both of length 4, which happens to be the shortest length where this property will be true with $k=2$. I've found the minimal lengths $\{2,4,7,12,16,22\}$ for $k=\{1,2,3,4,5,6\}$ respectively, through bruteforcing.

What I'm basically looking for is, for two words $|u|=|v|=n$, what bound you need on $k$ such that if $u$ and $v$ agree on all subwords of length $\leq k$, then $u=v$.

I'm aiming for a $\mathcal{O}(\log n)$ bound, so I've tried looking for an inductive proof on k where the length of the word becomes a multiple after each iteration.

Also, as a side proof, so far experimentally, I've found that if two words agree on all subwords of length $k$, they also agree on all of length $\leq k$, but can't quite find a proof or counter-proof for this.


Solution 1:

It is a natural problem of combinatorics on words which has already been studied.

We don't actually know a good asymptotic equivalent for $k(n)=1,3,6,11,\dots$. We have the upper bound $k(n)=O(\sqrt n)$, or more precisely: $$k(n)\le 5+\left\lfloor\tfrac{16}{7}\sqrt{n}\right\rfloor\qquad\text{(Krasikov1997)}$$ $$k(n)\le 3+2\left\lfloor\sqrt{n\log 2}\right\rfloor\qquad\text{(Krasikov2000)}$$ (Krasikov2000) does not appear to be online (it is cited in (Ligeti2007)), and for some reason this better upper bound is not cited in (Dudik2002), so caveat emptor.

There is an easy $\Omega(\log n)$ lower bound from considering e.g. Thue-Morse words and their complement, but it is not sharp. In fact, $k(n)$ grows faster than any polynomial in $\log n$, so the bound you were hoping for is not satisfied: $$\log k(n)=\Omega(\sqrt{\log n})$$ See (Dudik2002).

In a way these two bounds can be reconciled to give the admittedly vague equivalent $$\log \log k(n)=\Theta(\log \log n)$$ but this is still precise enough to reject logarithmic growth, which would be $\Theta(\log \log \log n)$.


References:

  • (Krasikov1997) Krasikov, Roditty, On a Reconstruction Problem for Sequences. J. Combin. Theory Ser. A, 77 (2) (1997), pp. 344–348.
  • (Krasikov2000) Foster, Krasikov, An improvement of a Borwein-Erdélyi-Kós result. Methods Appl. Anal., 7(4):605–614, 2000.
  • (Dudik2002) Dudı́k, Schulman, Reconstruction from subsequences, J. Combin. Theory Ser. A, 103 (2) (2003), pp. 337-348.
  • (Ligeti2007) Ligeti, Combinatorics on words and its applications. Doctoral thesis.