Do runs of every length occur in this string?
In reference to the strings defined here (constructed by repeatedly appending the last "half" of the current string), consider the particular infinite string $s$ generated by starting with $\text{abc}$:
$$\begin{align} \quad &\text{abc}\\ &\text{abcbc}\\ &\text{abcbccbc}\\ &\text{abcbccbcccbc}\\ &\cdots\\ &\text{______________________________}\\ s = \ &\text{abcbccbcccbcbcccbccbcbcccbccccb...} \end{align} $$
More formally, the general rewriting rule is $$a_0 a_1 \cdots a_{n-1} \ \ \to \ \ a_0 a_1 \cdots a_{n-1} a_{\left\lfloor\frac{n}{2}\right\rfloor } a_{\left\lfloor\frac{n}{2}\right\rfloor+1} \cdots a_{n-1}. $$ Clearly,
- $\text{a}$ occurs only in the initial position,
- $\text{b}^k$ occurs infinitely often for $k=1$, but never occurs for $k\ge2$,
and one may conjecture that
- $\text{c}^k$ occurs for every $k\ge 1$ (and hence, infinitely often for each $k\ge 1$).
How to prove or disprove the conjecture?
Some possibly-relevant facts:
Computations show that the index of the first occurrence of $\text{c}^k$ is as follows, for some small $k$: $$\begin{align} &\text{substring} \quad & \text{index}\\ &\text{c} &\text{2}\\ &\text{cc} &\text{4}\\ &\text{ccc} &\text{7}\\ &\text{cccc} &\text{26}\\ &\text{ccccc} &\text{27308}\\ &\text{cccccc} &\approx 10^{519}\\ &\text{ccccccc} &? \ (\gt 10^{40677})\\ \end{align} $$ (The exact index for the first occurrence of $\text{c}^6$ is a $520$-digit number, and $\text{c}^7$ does not occur in the first $10^{40677}$ terms of $s$.)
Let $L_n$ be the length of the $n$th intermediate string in the generating process illustrated above. Then $$L_{n+1} = L_n + \left\lfloor\frac{L_n + 1}{2}\right\rfloor,\ \ L_0 = 3.$$ Hence, $L_n$ grows exponentially: $$L_n \gtrsim 3(\frac{3}{2})^n.$$
Let $s_n$ be the $n$th intermediate string, and let $t_n$ be the $n$th appended string (so $s_n = s_{n-1} t_n$). Now, every intermediate string ends with $\text{bc}$, so $\text{c}^k$ would first occur only when some $t_n$ begins with $\text{c}^{k-1}$, in which case $s_n= s_{n-1} t_n$ will contain the first occurrence of $\text{c}^k$ beginning at index $L_{n-1}-1$.
After $\text{c}^k$ first occurs, the number of instances of $\text{c}^k$ grows approximately exponentially in the number of iterations, as does the length, and the ratio $$p_k = \frac{\text{number of instances of c}^k\text{ on the }n\text{th iteration}}{L_n} \approx \frac{1}{\text{index of the first occurrence of c}^k} $$ is approximately a constant independent of $n$. Since $\text{c}^{k+1}$ first occurs when one of these instances of $\text{c}^k$ happens to begin the "last half" of an intermediate string, this may be compared to a sequence of Bernoulli trials, each with success probability $p_k$. For such a process, the expected number of trials to get the first success is just $1/p_k$, so the index of the first occurrence of $\text{c}^{k+1}$ would be compared to $$\frac{2}{3} L_{n_k + 1/p_k} \approx 2(\frac{3}{2})^{n_k + i_k} $$ where $n_k$ is the number of iterations to get the first occurrence of $\text{c}^k$, and $i_k$ is the corresponding index. E.g., the first-occurrence index of $\text{c}^6$ would be compared to $2(\frac{3}{2})^{n_5 + i_5} = 2(\frac{3}{2})^{22 + 27308} \approx 10^{4813}$ (when in fact it is approximately $10^{519}$). Similarly, the first-occurrence index for $\text{c}^7$ (if it exists) would be compared to $2(\frac{3}{2})^{n_6 + i_6} \approx 10^{10^{518}}$. These comparisons are quite poor, but may help to understand how the first-occurrence indices can be so enormous.
Solution 1:
Okay, to the OP of this question. Since you requested a computer program.
Below is a string based program written in Javascript that is able to calculate index of $cccc$ in your string. See for yourself It does construct your string.
input = 'abc';
m=20;
for (i = 0; i < m; i++) {
l = input.length;
n=(l/2);
input = input + input.substring(n, l);
}
var index=input.search(/cccc/i);
alert(index)
For $m=20$ it constructs a string where $cccc$ is at the index $26$,$ccc$ is at $7$,$cc$ is at $4$.
For $m=30$ it constructs a string where $ccccc$ is at the index of $27308$. (which follows your conjecture).
For $m=40$ it constructs a string, where $cccccc$ is not there and it returns $-1$, to denote that search found nothing.
Intrestingly, when I increase m to $45$, it exceeds the computational power and does not return anything on the browser (I am using chrome on win 7 64-bit). And this is because the browser storage for string as per this answer is about 5 MB which is a string of 2,621,440 characters.
You can however, will get success in re-writing this code in Java because it supports maximum string length of $2^{31}-1$ possibly over 2 billion as per this answer.
But, for that you need at least $1024 MB$ of heap size. I firmly believe your conjecture can be proved :).
Good luck!
PS: I did not had success on recreating this on Java, since I am not an expert. However, I will return to repost the solution in Java.