Subwords of the Thue-Morse Sequence

In addition to Complexity of Thue-Morse Sequence, I have the following question:

Has anyone found a characterization for subwords of Thue–Morse sequence? I.e., for a given binary word, can I (easily for some definition of easy) decide whether it is a subword of the Thue–Morse sequence or not? Any references (in addition to Brlek and de Luca–Varricchio) are very welcome!

If you want an algorithm then here it is: the Prouhet-Morse-Thue sequence is given by the iteration of the map $1\mapsto 10, 0\mapsto 01$ starting with $1$. If you want to check if a word $w$ is a subword of the sequence, start with $1$ and iterate the map sufficient number of times. The sequence is uniformly recurrent (Morse), so "sufficient" is not large, at most $|w|$ in this case because every subword $U$ of length $2^{n}$ contains all subwords of length $n$. I am sure that the $2^n$ bound can be made polynomial and that the original problem is in $P$.

Edit. In fact by Theorem 8.2 in Morse, Marston; Hedlund, Gustav A. Symbolic Dynamics. Amer. J. Math. 60 (1938), no. 4, 815–866 one can replace $2^n$ above by $10n$ and so the number of iterations needed is $O(\log n)$ and the time complexity of the problem is $O(n^2)$ (or $O(n)$ depending on the mode of computation).

The Thue-Morse sequence can be generated by repeatedly applying the replacement rule $(0 \to 01), (1 \to 10)$. To check if a word $w$ is a subword of the Thue-Morse sequence, we can apply this rule in reverse: try to "undo" the replacement rule, replacing $01$ by $0$ and $10$ by $1$. This gives us a new word half the length of $w$, and repeat until we're at a short word we know the answer to.

That's the high-level idea; here are the details.

If $w$ appears in the Thue-Morse sequence, it does not have to appear at an even position. For example, suppose we want to know if $001100$ appears in the Thue-Morse sequence:

Naively, we might break this up into blocks as $[00][11][00]$ and then fail, because we can only replace $[01]$ by $0$ and $[10]$ by $1$; $[00]$ and $[11]$ are never the outputs of the replacement rule.
But we can also break this up into blocks as $[.0][01][10][0.]$; in this case, the values replaced by $.$ must be $[10][01][10][01]$, and undoing the replacement rule gives us $1010$.

So which one should we do? We try both! If $w$ contains $00$ or $11$ as a subword, then those two symbols must be in separate blocks, so there's at most one way to continue. If $w$ does not contain $00$ or $11$ as a subword, then the $0$'s and $1$'s must alternate; such a $w$ is not contained in the Thue-Morse sequence when $|w| \ge 5$.

This is an $O(|w|)$ algorithm; in linear time, we can either undo the replacement rule or realize that we cannot do it, and then we get a word of length about $\frac12|w|$ to recurse on.

An example which is a subword

Suppose we want to know if 0010110100101100110100110010 appears in the Thue-Morse sequence.

Separating it into 00 10 11 01 00 10 11 00 11 01 00 11 00 10 does not work; there are 00's and 11's. So we separate it as .0 01 01 10 10 01 01 10 01 10 10 01 10 01 0. and undo the replacement rule to get 100110010110100.
Separating this into 10 01 10 01 01 10 10 0. works (but separating it into .1 00 11 00 10 11 01 00 would not). We undo the replacement rule to get 10100110.
Separating this into 10 10 01 10 works (but separating it into .1 01 00 11 0. would not). We undo the replacement rule to get 1101.
Separating this into 11 01 does not work, so we separate it into .1 10 1.. We undo the replacement rule to get 011.
All words of length $3$ except for 000 and 111 occur in the Thue-Morse sequence; therefore, so does our original word. (We could compute all the words of some longer lengths in advance, and stop earlier.)

An example which is not a subword

Suppose we want to know if 10110011010011001011010010110100101 appears in the Thue-Morse sequence.

Separating it into 10 11 00 11 01 00 11 00 10 11 01 00 10 11 01 00 10 1. does not work, so we separate it into .1 01 10 01 10 10 01 10 01 01 10 10 01 01 10 10 01 01. We undo the replacement rule to get 001011010011001100.
Separating this into 00 10 11 01 00 11 00 11 00 does not work, so we separate it into .0 01 01 10 10 01 10 01 10 0.. We undo the replacement rule to get 1001101010.
Separating this into 10 01 10 10 10 works (but separating it into .1 00 11 01 01 0. would not). We undo the replacement rule to get 10111.
Now neither 10 11 1. nor .1 01 11 work: both contain a 11 block, which cannot be the result of the replacement rule. So this word does not appear in the Thue-Morse sequence; therefore, neither does our original word.

Here's a self-contained recursive algorithm that should run in $O(n)$ time on a word $w$ of length $n$.

If $n\leq 2$, accept. If $3\leq n\leq 4$, accept only if $w$ contains neither $000$ nor $111$ as a substring.
If $w$ contains $01010$ or $10101$ as a subword, reject.
Construct two words $w_1$ and $w_2$ from $w$ in the following way: If $n$ is even, let $w_1=w$ and let $w_2$ be $awb$, where $a,b\in\{0,1\}$ are chosen such that $a$ is different from the first character of $w$ and such that $b$ is different from the last character of $w$. If $n$ is odd, let $w_1=wb$ and $w_2=aw$, where $a$ and $b$ are chosen identically to the even case.
Check whether $w_1$ or $w_2$ is in $\{01,10\}^*$ (i.e. whether either is a string formed from concatenating some copies of either $01$ or $10$). If neither is, reject. If one of them, say $w_i$, is, then let $x$ to be the string formed by replacing each $01$ with $0$ and each $10$ with $1$. Note that $|x|\leq \frac{n}2+1$. (At most one of $w_1$ and $w_2$ is of this form, which I'll prove later.)
Run the algorithm on $x$. If it accepts, accept; if not, reject.

This algorithm spends $O(n)$ time to reduce the problem to one of input length at most $n/2+1$, and so its full runtime is $O(n)$.

We now show correctness. Let $\mathcal W$ be the set of all subwords of the Thue-Morse sequence. Firstly, it is easy to show that $\mathcal W$ contains every word of length at most $2$, that $\mathcal W$ does not contain $000$ or $111$ (as any $0$ in the Thue-Morse sequence must have a $1$ next to it, and any $1$ must have a $0$ next to it), and that it contains every $3$- and $4$-character word without a $000$ or $111$ subword. It may not contain $01010$ or $10101$, as these can only be generated using the rules $0\to 01$ and $1\to 10$ via starting with a $000$ or $111$ string. So, the algorithm behaves correctly on those words accepted or rejected in steps 0 and 1.

For steps 2 and 3, we see that, for a word $w$ to be in $\mathcal W$, it must occur at an even index or at an odd index in the Thue-Morse sequence. If it occurs at an even index, then it must begin with $01$ or $10$, then follow with a $01$ or $10$, et cetera; if $n$ is odd, a character distinct from its final character must follow it. Similarly, if $w$ occurs at an odd index, it must be preceded by a character distinct from its first character; after this, there must be a $01$ or $10$, et cetera. So, $w_1$ and $w_2$ as constructed represent the possible ways for $w$ to be contained in a minimal even-length even-index subword of the Thue-Morse sequence. If these are both in $\{01,10\}^*$, then $w$ must consist of alternating $0$s and $1$s, and so it would already have been dealt with in step 1. On the other hand, if some $w_i$ for $i\in\{1,2\}$ is formed by $01$s and $10$s, then the only way for $w$ to be in some iteration $k$ of the Thue-Morse sequence is for the word $x$ as constructed to be in iteration $k-1$ of the sequence. So, $w\in\mathcal W$ if and only if $x\in\mathcal W$, and we can apply the algorithm recursively to get a correct result.