Subwords of the Thue-Morse Sequence
In addition to Complexity of Thue-Morse Sequence, I have the following question:
Has anyone found a characterization for subwords of Thue–Morse sequence? I.e., for a given binary word, can I (easily for some definition of easy) decide whether it is a subword of the Thue–Morse sequence or not? Any references (in addition to Brlek and de Luca–Varricchio) are very welcome!
If you want an algorithm then here it is: the Prouhet-Morse-Thue sequence is given by the iteration of the map $1\mapsto 10, 0\mapsto 01$ starting with $1$. If you want to check if a word $w$ is a subword of the sequence, start with $1$ and iterate the map sufficient number of times. The sequence is uniformly recurrent (Morse), so "sufficient" is not large, at most $|w|$ in this case because every subword $U$ of length $2^{n}$ contains all subwords of length $n$. I am sure that the $2^n$ bound can be made polynomial and that the original problem is in $P$.
Edit. In fact by Theorem 8.2 in Morse, Marston; Hedlund, Gustav A. Symbolic Dynamics. Amer. J. Math. 60 (1938), no. 4, 815–866 one can replace $2^n$ above by $10n$ and so the number of iterations needed is $O(\log n)$ and the time complexity of the problem is $O(n^2)$ (or $O(n)$ depending on the mode of computation).
The Thue-Morse sequence can be generated by repeatedly applying the replacement rule $(0 \to 01), (1 \to 10)$. To check if a word $w$ is a subword of the Thue-Morse sequence, we can apply this rule in reverse: try to "undo" the replacement rule, replacing $01$ by $0$ and $10$ by $1$. This gives us a new word half the length of $w$, and repeat until we're at a short word we know the answer to.
That's the high-level idea; here are the details.
If $w$ appears in the Thue-Morse sequence, it does not have to appear at an even position. For example, suppose we want to know if $001100$ appears in the Thue-Morse sequence:
- Naively, we might break this up into blocks as $[00][11][00]$ and then fail, because we can only replace $[01]$ by $0$ and $[10]$ by $1$; $[00]$ and $[11]$ are never the outputs of the replacement rule.
- But we can also break this up into blocks as $[.0][01][10][0.]$; in this case, the values replaced by $.$ must be $[10][01][10][01]$, and undoing the replacement rule gives us $1010$.
So which one should we do? We try both! If $w$ contains $00$ or $11$ as a subword, then those two symbols must be in separate blocks, so there's at most one way to continue. If $w$ does not contain $00$ or $11$ as a subword, then the $0$'s and $1$'s must alternate; such a $w$ is not contained in the Thue-Morse sequence when $|w| \ge 5$.
This is an $O(|w|)$ algorithm; in linear time, we can either undo the replacement rule or realize that we cannot do it, and then we get a word of length about $\frac12|w|$ to recurse on.
An example which is a subword
Suppose we want to know if 0010110100101100110100110010
appears in the Thue-Morse sequence.
- Separating it into
00 10 11 01 00 10 11 00 11 01 00 11 00 10
does not work; there are00
's and11
's. So we separate it as.0 01 01 10 10 01 01 10 01 10 10 01 10 01 0.
and undo the replacement rule to get100110010110100
. - Separating this into
10 01 10 01 01 10 10 0.
works (but separating it into.1 00 11 00 10 11 01 00
would not). We undo the replacement rule to get10100110
. - Separating this into
10 10 01 10
works (but separating it into.1 01 00 11 0.
would not). We undo the replacement rule to get1101
. - Separating this into
11 01
does not work, so we separate it into.1 10 1.
. We undo the replacement rule to get011
. - All words of length $3$ except for
000
and111
occur in the Thue-Morse sequence; therefore, so does our original word. (We could compute all the words of some longer lengths in advance, and stop earlier.)
An example which is not a subword
Suppose we want to know if 10110011010011001011010010110100101
appears in the Thue-Morse sequence.
- Separating it into
10 11 00 11 01 00 11 00 10 11 01 00 10 11 01 00 10 1.
does not work, so we separate it into.1 01 10 01 10 10 01 10 01 01 10 10 01 01 10 10 01 01
. We undo the replacement rule to get001011010011001100
. - Separating this into
00 10 11 01 00 11 00 11 00
does not work, so we separate it into.0 01 01 10 10 01 10 01 10 0.
. We undo the replacement rule to get1001101010
. - Separating this into
10 01 10 10 10
works (but separating it into.1 00 11 01 01 0.
would not). We undo the replacement rule to get10111
. - Now neither
10 11 1.
nor.1 01 11
work: both contain a11
block, which cannot be the result of the replacement rule. So this word does not appear in the Thue-Morse sequence; therefore, neither does our original word.
Here's a self-contained recursive algorithm that should run in $O(n)$ time on a word $w$ of length $n$.
- If $n\leq 2$, accept. If $3\leq n\leq 4$, accept only if $w$ contains neither $000$ nor $111$ as a substring.
- If $w$ contains $01010$ or $10101$ as a subword, reject.
- Construct two words $w_1$ and $w_2$ from $w$ in the following way: If $n$ is even, let $w_1=w$ and let $w_2$ be $awb$, where $a,b\in\{0,1\}$ are chosen such that $a$ is different from the first character of $w$ and such that $b$ is different from the last character of $w$. If $n$ is odd, let $w_1=wb$ and $w_2=aw$, where $a$ and $b$ are chosen identically to the even case.
- Check whether $w_1$ or $w_2$ is in $\{01,10\}^*$ (i.e. whether either is a string formed from concatenating some copies of either $01$ or $10$). If neither is, reject. If one of them, say $w_i$, is, then let $x$ to be the string formed by replacing each $01$ with $0$ and each $10$ with $1$. Note that $|x|\leq \frac{n}2+1$. (At most one of $w_1$ and $w_2$ is of this form, which I'll prove later.)
- Run the algorithm on $x$. If it accepts, accept; if not, reject.
This algorithm spends $O(n)$ time to reduce the problem to one of input length at most $n/2+1$, and so its full runtime is $O(n)$.
We now show correctness. Let $\mathcal W$ be the set of all subwords of the Thue-Morse sequence. Firstly, it is easy to show that $\mathcal W$ contains every word of length at most $2$, that $\mathcal W$ does not contain $000$ or $111$ (as any $0$ in the Thue-Morse sequence must have a $1$ next to it, and any $1$ must have a $0$ next to it), and that it contains every $3$- and $4$-character word without a $000$ or $111$ subword. It may not contain $01010$ or $10101$, as these can only be generated using the rules $0\to 01$ and $1\to 10$ via starting with a $000$ or $111$ string. So, the algorithm behaves correctly on those words accepted or rejected in steps 0 and 1.
For steps 2 and 3, we see that, for a word $w$ to be in $\mathcal W$, it must occur at an even index or at an odd index in the Thue-Morse sequence. If it occurs at an even index, then it must begin with $01$ or $10$, then follow with a $01$ or $10$, et cetera; if $n$ is odd, a character distinct from its final character must follow it. Similarly, if $w$ occurs at an odd index, it must be preceded by a character distinct from its first character; after this, there must be a $01$ or $10$, et cetera. So, $w_1$ and $w_2$ as constructed represent the possible ways for $w$ to be contained in a minimal even-length even-index subword of the Thue-Morse sequence. If these are both in $\{01,10\}^*$, then $w$ must consist of alternating $0$s and $1$s, and so it would already have been dealt with in step 1. On the other hand, if some $w_i$ for $i\in\{1,2\}$ is formed by $01$s and $10$s, then the only way for $w$ to be in some iteration $k$ of the Thue-Morse sequence is for the word $x$ as constructed to be in iteration $k-1$ of the sequence. So, $w\in\mathcal W$ if and only if $x\in\mathcal W$, and we can apply the algorithm recursively to get a correct result.