Lets define:

$U=\left \{ u_j\right \} , 1 \leq j\leq N= 2^{L},$ the set of all different binary sequences of length $L$.

$V=\left \{ v_i\right \} , 1 \leq i\leq M=\binom{L}{k}2^{k},$ the set of all different gaped binary sequences with $k$ known bits and $L-k$ gaps.

$A_{M*N}=[a_{i,j}]$ is a binary matrix defined as following:

$$a_{i,j} = \left\{\begin{matrix} 1 & \text{if } v_i \text{ matches } u_j\\ 0 & \text{otherwise } \end{matrix}\right.$$

and finally,

$S_{M*M}=AA^{T}$

now, the question is that:

i) What is the rank of matrix $S$ ?

ii) What is the eigen decomposition for $S$ ?

Here is an example for $L=2, k=1$:

$$U = \left \{ 00,01,10,11\right \} $$ $$V = \left \{ 0.,1.,.0,.1\right \} ^*$$

$$ A = \begin{bmatrix} 1 & 1 & 0 &0 \\ 0 & 0 & 1 &1 \\ 1 & 0 & 1 &0 \\ 0 & 1 & 0 &1 \end{bmatrix}$$

$$ S = \begin{bmatrix} 2 & 0 & 1 &1 \\ 0 & 2 & 1 &1 \\ 1 & 1 & 2 &0 \\ 1 & 1 & 0 &2 \end{bmatrix}$$

For the special case $k=1$, this has been previously solved by joriki and the solution can be found here.

any comments or suggestion is appreciated.

$^{*}$ here dots denote gaps. a gap can take any value, and each gaped sequence with $k$ known bits and $(L−K)$ gaps in $V$, exactly matches to $2^{L−k}$ sequences in U, hence the sum of elements in each row of $A$ is $2^{L−k}$.


As M.S. pointed out, $\mathrm{rank}(AA^\mathrm{T})=\mathrm{rank}(A)$, so we can determine $\mathrm{rank}(A)$ instead.

Let $w_{jr}=2u_{jr}-1$, that is, $w_{jr}$ is $\pm1$ according as the $r$-th bit of $u_j$ is $1$ or $0$. Then the vectors $b^S$ defined by

$$b^S_j=\prod_{r\in S}w_{jr}$$

for all $S\subseteq\{1,\ldots,L\}$ form a basis of $\mathbb{R}^N$: There are $N=2^L$ of them and they are mutually orthogonal, since for $S,S'\subseteq\{1,\ldots,L\}$ with $S\neq S'$ and some $s\in S$ with $s\notin S'$ (assuming $S\neq\emptyset$ without loss of generality)

$$ \begin{eqnarray} \sum_{j=1}^N b^S_j b^{S'}_j &=& \sum_{j=1}^N\prod_{r\in S}w_{jr}\prod_{r'\in S'}w_{jr'} \\ &=& \sum_{j=1}^N w_{js}\prod_{r\in S-\{s\}}w_{jr}\prod_{r'\in S'}w_{jr'}\;, \end{eqnarray} $$

which is zero since the sum splits into two parts for $w_{js}=\pm1$ which cancel each other.

Now consider the representation of the symmetric group $S_L$ induced on $\mathbb{R}^N$ by the permutation of the bits of the $u_j$: a permutation $\pi\in S_L$ acts on the set $U$ by permuting the bits of the $u_j$, and this induces a representation $\rho$ on $\mathbb{R}^N$ in which $\pi$ acts on a vector of $\mathbb{R}^N$ by permuting the vector's entries as it permutes the $u_j$. The vectors $b^S$ with $|S|=m$ for fixed $m$ span an irreducible invariant subspace $B_m$, since they are transformed into each other by permutations and each of them can be transformed into any other by a suitable permutation. Thus, $\rho$ decomposes into $L+1$ irreducible subrepresentations $\rho_m$ over the $B_m$, of dimensions $\left({L\atop m}\right)$. The rows of $A$ also span a (reducible) invariant subspace under $\rho$, since they are transformed into each other by permutations. All rows of $A$ are orthogonal to a given $b^S$ iff $|S|> k$: If $|S|> k$, for each row $i$ at least one of the bits in $S$ is unknown in $v_i$ and thus takes on both values under the permutations and causes the inner product to split into two cancelling parts, whereas if $|S|\le k$, there are rows such that all bits indexed by $S$ are known, so that there is no cancellation in the sum.

Without using the decomposition of $\rho$, simply from orthogonality, we can infer

$$\text{rank}(A)\le\sum_{m=0}^k\left({L\atop m}\right)\;.$$

Now the row space of $A$ is not orthogonal to the subspaces $B_m$ with $m\le k$, and is orthogonal to the subspaces $B_{L-m}$ with $m<L-k$. The representations $\rho_m$ and $\rho_{L-m}$ are the only ones of their dimension (and thus potentially of the same character) in the decomposition of $\rho$. Since the row space of $A$ is an invariant subspace under $\rho$, it must decompose into irreducible subspaces, and for $m\le \min(k,L-k-1)$ it follows that $B_m$ must be contained in the decomposition. Thus

$$\text{rank}(A)\ge\sum_{m=0}^{\min(k,L-k-1)}\left({L\atop m}\right)\;,$$

and therefore

$$\text{rank}(A)=\sum_{m=0}^k\left({L\atop m}\right)\;\;\text{for $2k+1\le L$.}$$

Numerical results show that this is also true for $2k+1>L$, but I haven't been able to figure out yet how to show in this case that both $B_m$ and $B_{L-m}$ are in the row space for $L-k\le m \le k$.

There was a question on MO whether there's a closed form for this, and the answer was negative.

[Edit:] I just realized that all this representation stuff is actually complete overkill, since we can construct the $b^S$ with $|S|\le k$ explicitly from the rows of $A$: Let $a_i$ denote the row of $A$ corresponding to $v_i$, let $K_i$ be the set of the indices of the known bits of $v_i$, and in analogy to $w_{jr}$ let $z_{ir}=2v_{ir}-1$ for $r\in K_i$; then

$$\sum_{K_i\supseteq S}a_i\prod_{r\in S} z_{ir}$$

is a multiple of $b^S$, and it's non-zero iff $|S|\le k$. Thus, the rank given above is indeed correct for all $k$ and $L$.