A communication complexity problem for finding near matches

I am stuck on this problem, which looks like it should be solvable by a counting argument. Consider two binary strings $A$ and $B$: string $A$ has length $n$ and $B$ has length $2n$.

Alice has both strings $A$ and $B$ and wants to send a single message to Bob (who doesn't get to see the strings at all) telling him approximately how close $A$ is to every substring of $B$ of length $n$. In particular, for every substring of $B$ of length $n$, Bob wants to be able to work out the Hamming distance between $A$ and that substring to within a multiplicative factor of $2$ just from this single message.

As an example, let $A = 101$ and $B = 011001$. The true Hamming distances are $2, 2, 1, 1$. Bob, having received the message from Alice, has to output $n + 1 = 4$ numbers. Bob's output values must be in the range 2 to 4 for the first two numbers and 1 to 2 for the last two.
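
For concreteness, here is a quick sanity check of these numbers in Python (the helper name `dist_sequence` is just something I made up):

```python
# Hamming distance from A to every length-n substring of B.
def dist_sequence(A, B):
    n = len(A)
    return [sum(a != b for a, b in zip(A, B[i:i + n]))
            for i in range(len(B) - n + 1)]

print(dist_sequence("101", "011001"))  # [2, 2, 1, 1]
```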

Assuming that Alice and Bob can agree on a deterministic protocol beforehand, how many bits does Alice have to send to Bob?

Now posted to https://mathoverflow.net/questions/264959/a-communication-complexity-problem-for-finding-near-matches .


This is not a complete answer with a closed form or a simple summation; it is only a reduction to a purely combinatorial problem.

Is it possible to convey the data at all? Yes: Alice can simply send the concatenation of $A$ and $B$, and Bob computes the distances himself. This takes $3n$ bits. There is a trivial way to save one bit. Note that $A$ and $B$ produce the same distance sequence as $\overline{A}$ and $\overline{B}$, where $\overline{S}$ denotes the string $S$ with every character flipped (i.e., 0 becomes 1 and 1 becomes 0). So we can drop the first character $a_0$ of $A$ and assume it is zero (otherwise replace $A$ and $B$ by $\overline{A}$ and $\overline{B}$, respectively). This leaves $3n - 1$ bits. I call this approach trivial.
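
A minimal sketch of this trivial encoding, assuming the normalization described above (the helper names are mine):

```python
# Flipping both strings preserves every Hamming distance, so Alice can
# normalize the pair so that A starts with 0 and then omit that known bit.
def flip(s):
    return s.translate(str.maketrans("01", "10"))

def trivial_encode(A, B):
    if A[0] == "1":
        A, B = flip(A), flip(B)
    return A[1:] + B  # 3n - 1 bits; Bob prepends '0' and computes D himself
```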

Is it possible to save a significant number of bits with no information loss? The best way to do so is the following. Let $D = D(A, B)$ be the desired sequence of distances. Alice and Bob simply enumerate all possible such sequences, i.e., all sequences $D_i$ for which there exists at least one pair $(A_i, B_i)$ with $D_i = D(A_i, B_i)$. Then Alice sends Bob the zero-based serial number of $D$ in this enumeration, which takes $\lceil \log_2(\text{number of sequences}) \rceil$ bits. So the new question is: how many different sequences can be produced by all pairs $(A, B)$ of lengths $n$ and $2n$, respectively? There is a related question for the case when the length of $B$ is $2n - 1$. Note that if Bob does not know $n$, Alice can pad the serial number with leading zeros so that the length of her message encodes $n$. I have reason to believe that the number of distinct sequences is greater than $2^n$, so the length of the message carrying the serial number indeed reflects the value of $n$.
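
Here is a brute-force sketch of this count (the function name `count_sequences` is mine); it is exponential in $n$, so only small $n$ are feasible, and the optional `transform` argument is for the rounded variants below:

```python
from itertools import product

# Count the distinct sequences transform(D(A, B)) over all pairs (A, B)
# with |A| = n and |B| = 2n; the default transform is the identity (#D).
def count_sequences(n, transform=lambda d: d):
    seen = set()
    for A in product((0, 1), repeat=n):
        for B in product((0, 1), repeat=2 * n):
            D = [sum(a != b for a, b in zip(A, B[i:i + n]))
                 for i in range(n + 1)]
            seen.add(tuple(transform(d) for d in D))
    return len(seen)

print(count_sequences(2))  # 21, matching the table below
```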

Now let us see how to save even more using the allowed one-sided error in Bob's output. After computing the sequence $D$, let Alice transform it into a sequence $T = T(D)$ as follows and send the serial number of $T$. Each 0 stays 0. Each 1 is replaced by 2, and each 2 stays 2. Each 3, 4, 5 is replaced by 6, and each 6 stays 6, and so on. That is, each of the numbers $2^k - 1, 2^k, \ldots, 2^{k + 1} - 3$ is replaced by $2^{k + 1} - 2$, and $2^{k + 1} - 2$ itself is unchanged. Every value is rounded up, and by at most a factor of $2$, so by using the sequence $T$ instead of $D$ Bob makes only the allowed one-sided error. So the question becomes: how many different sequences $T(D(A, B))$ are there for each $n$? As before, information about $n$ may have to be added somehow.
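
A sketch of this rounding, assuming I have stated the rule correctly (the name `T` matches the sequence above):

```python
# One-sided rounding: T(d) is the smallest value of the form 2^k - 2
# that is >= d, so d <= T(d) <= 2d (in particular T(0) = 0).
def T(d):
    k = 1
    while (1 << k) - 2 < d:
        k += 1
    return (1 << k) - 2

print([T(d) for d in range(8)])  # [0, 2, 2, 6, 6, 6, 6, 14]
```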

The third version of the problem allows a two-sided error. I suggest the same idea: let Alice transform $D$ into $T' = T'(D)$ as follows and send the serial number of $T'$. Each 0 stays 0. Each 1, 2, 3, 4 becomes 2. Each 5, 6, $\ldots$, 20 becomes 10. In general, each of the numbers $\frac{4^k - 1}{3}, \frac{4^k - 1}{3} + 1, \ldots, \frac{4^{k + 1} - 4}{3}$ becomes $\frac{2\cdot 4^k - 2}{3}$. Each value changes by at most a factor of $2$ in either direction, so by using the sequence $T'$ instead of $D$ Bob makes only the allowed two-sided error. So the new question is: how many different sequences $T'(D(A, B))$ are there for each $n$? Again, information about $n$ may have to be added somehow.
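
The corresponding sketch for the two-sided rounding (I write `T2` for $T'$):

```python
# Two-sided rounding: every d in [(4^k - 1)/3, (4^(k+1) - 4)/3] is
# replaced by (2 * 4^k - 2) / 3, so d/2 <= T2(d) <= 2d (and T2(0) = 0).
def T2(d):
    if d == 0:
        return 0
    k = 1
    while (4 ** (k + 1) - 4) // 3 < d:
        k += 1
    return (2 * 4 ** k - 2) // 3

print([T2(d) for d in range(8)])  # [0, 2, 2, 2, 2, 10, 10, 10]
```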

Finally, here are some computational results, compared against the trivial $3n - 1$ bits (each fraction is $\log_2$ of the count divided by $3n - 1$):
$$\begin{array}{|c|r|l|r|l|r|l|}
\hline
n & \#D & \frac{\log_2 \#D}{3n - 1} & \#T & \frac{\log_2 \#T}{3n - 1} & \#T' & \frac{\log_2 \#T'}{3n - 1} \\\hline
1 & 4 & 1 & 4 & 1 & 4 & 1 \\\hline
2 & 21 & 0.878463 & 8 & 0.6 & 8 & 0.6 \\\hline
3 & 124 & 0.869275 & 41 & 0.669694 & 14 & 0.475919 \\\hline
4 & 825 & 0.88075 & 132 & 0.640399 & 23 & 0.411233 \\\hline
5 & 6024 & 0.896893 & 326 & 0.596338 & 109 & 0.483442 \\\hline
6 & 47391 & 0.913666 & 700 & 0.555954 & 363 & 0.500225 \\\hline
7 & 395944 & 0.929747 & 2698 & 0.569884 & 1149 & 0.508308 \\\hline
8 & 3396881 & 0.943295 & 9129 & 0.57201 & 3230 & 0.50684 \\\hline
9 & 30095408 & 0.955502 & 31348 & 0.574465 & 7209 & 0.492907 \\\hline
10 & 267147407 & 0.965278 & 102212 & 0.573835 & 14910 & 0.478069 \\\hline
11 & ? & ? & 297939 & 0.568271 & 31060 & 0.466337 \\\hline
12 & ? & ? & 742605 & 0.557207 & 62634 & 0.455276 \\\hline
\end{array}$$
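
The small rows of this table can be reproduced with the sketches above (`count_sequences`, `T`, `T2`), for example:

```python
# Expected output (per the table): rows (n, #D, #T, #T') = (1, 4, 4, 4),
# (2, 21, 8, 8), (3, 124, 41, 14), (4, 825, 132, 23).
for n in range(1, 5):
    print(n, count_sequences(n), count_sequences(n, T), count_sequences(n, T2))
```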

It looks like sending the exact sequence yields no significant savings, while the allowed two-sided error reduces the amount of information roughly by a factor of two.