Count of near matching sequences

Consider all pairs of binary strings $P$ and $T$. Let the length of $P$ be $n$ and the length of $T$ be $2n-1$. For each such pair, we can compute the Hamming distance between $P$ and each of the $n$ substrings of $T$ in order from left to right and output a sequence representing these results. For example:

$$P = 11011, T = 110011001$$

gives an output sequence:

$$1,1,4,4,1.$$

For my problem I just want to record where the Hamming distance is at most 1. We write a $Y$ where this is the case and an $N$ otherwise. So in this case I would get:

$$Y,Y,N,N,Y$$

where $Y$ represents Hamming distance at most $1$ and $N$ represents Hamming distance greater than $1$

If we iterate over all possible pairs $P$, $T$ we can count how many distinct sequences of $Y$s and $N$s we get from the outputs. For $n = 1, \dots, 9$ we get the following interesting result:

$$1 ,4, 8, 16, 32, 63, 120, 216, 368$$

Is it possible to give a closed form formula for the number of distinct sequences of $Y$s and $N$s?

Related In Count number of exact matching sequences a very nice solution is given by Smylic which corresponds to this problem with a Hamming distance bound of $0$ rather $1$.


Solution 1:

I'll show that the number $F_1(n)$ of near matching sequences is in $\Omega(n^2 \log n)$. I will use my previous result that the number $F_0(n)$ of exact matching sequences is $\frac12 n^2 \ln n (1 + O(1))$. (The only difference in definition of this type of sequences is that Y corresponds to zero Hamming distance that is reflected in the lower index.)

  1. $F_1(2n) \ge F_0(n)$.
    Proof. Let $P = p_1p_2\ldots p_n$ and $T = t_1t_2\ldots t_{2n-1}$ produce exact matching sequence $m_1m_2\ldots m_n$. Then $P' = p_1p_1p_2p_2\ldots p_np_n$ and $T' = t_1t_1t_2t_2\ldots t_{2n-1}t_{2n-1}0$ produce near matching sequence $m_1x_1m_2x_2\ldots m_nx_n$ for some $x_1, x_2, \ldots x_n$. The reason is that distance $d$ between $P$ and $t_i\ldots t_{i + n - 1}$ gives distance exactly $2d$ between $P'$ and $t_it_i\ldots t_{i + n - 1}t_{i + n - 1}$. $\square$

  2. $F_1(2n+1) \ge F_0(n)$.
    Proof. Let $P = p_1p_2\ldots p_n$ and $T = t_1t_2\ldots t_{2n-1}$ produce near matching sequence $m_1m_2\ldots m_n$. Then $P' = p_1p_1p_2p_2\ldots p_np_n0$ and $T' = t_1t_1t_2t_2\ldots t_{2n-1}t_{2n-1}000$ produce near matching sequence $m_1x_1m_2x_2\ldots x_{n-1}m_nx_nx_{n+1}$ for some $x_1, x_2, \ldots x_{n+1}$. The reason is that distance $d$ between $P$ and $t_i\ldots t_{i + n - 1}$ gives distance either $2d$ or $2d + 1$ between $P'$ and $t_it_i\ldots t_{i + n - 1}t_{i + n - 1}t_{i + n}$, where $t_{2n} = 0$. $\square$

Therefore $F_1(n) \ge F_0(\lfloor n/2 \rfloor) \sim \frac{n^2}{8} \ln n$ and $F_1(n) = \Omega(n^2 \log n)$. This bound may be rather weak.


Conjecture. Each near matching sequence with at least two Y's has the following form: $$N \times n_1, (Y, N \times n_2) \times n_3, Y, N \times (n_2n_4 + n_4 - 1), Y, (N \times n_2, Y) \times n_5, N \times n_6,$$ where $S \times k$ means sequence $S$ repeated $k$ times and $n_i$ are non-negative integers. Also each such sequence is a near matching sequence.

This would imply that $F_1(n) = \Theta(n^4)$.

It was easy to find that this conjecture is wrong comparing the number of described sequences with the number of near matching sequences. Even for $n = 6$ there are near matching sequences YNNYNY, YNYNNY, YNYNYY and YYNYNY that have other pattern. However it may be possible to use the number of described sequences as a lower bound for the number of near matching sequences.


Computational results: $$\begin{array}{c|c|c|c} n & F_1(n) & \frac{F_1(n)}{n^4} & \frac{F_1(n)}{n^4 \ln n}\\\hline 1 & 1 & 2 & -\\ 2 & 4 & 0.25 & 0.361\\ 3 & 8 & 0.099 & 0.090\\ 4 & 16 & 0.063 & 0.045\\ 5 & 32 & 0.051 & 0.032\\ 6 & 63 & 0.049 & 0.027\\ 7 & 120 & 0.050 & 0.026\\ 8 & 216 & 0.054 & 0.025\\ 9 & 368 & 0.056 & 0.026\\ 10 & 596 & 0.060 & 0.026\\ 11 & 930 & 0.064 & 0.026\\ 12 & 1387 & 0.067 & 0.027\\ 13 & 2009 & 0.070 & 0.027\\ 14 & 2818 & 0.073 & 0.028\\ 15 & 3872 & 0.076 & 0.028\\ 16 & 5191 & 0.079 & 0.029\\ 17 & 6850 & 0.082 & 0.029\\ 18 & 8838 & 0.084 & 0.029\\ 19 & 11310 & 0.087 & 0.029\\ 20 & 14209 & 0.089 & 0.030\\ 21 & 17687 & 0.091 & 0.030\\ 22 & 21719 & 0.093 & 0.030\\ 23 & 26512 & 0.095 & 0.030\\ 24 & 31938 & 0.096 & 0.030\\ \end{array}$$

EDIT. Looking at these results I guess that $F_1(n)$ is close to $\frac1{30}n^4 \ln n$.