Average Length of Longest Palindromic Subsequence

Suppose there is a finite set $A$ containing $n$ elements. One can construct a sequence of finite length:

$$\{a_i\},\ a_i \in A,\ i\in \mathbb N,\ i\le N$$

This sequence contains $2^N$ subsequences of the form:

$$\{ a_{i_k}\}, k\in \mathbb N,\ k\le M\le N,\ i_k>i_{k-1}$$

Such a subsequence is palindromic iff $a_{i_{M+1-k}} = a_{i_k}\ \forall k$. The length of the longest such palindromic subsequence can thus be defined as:

$$M_p\{a_i\} = \max\{M : a_{i_{M+1-k}} = a_{i_k}\ \forall k\le M\}$$

The algorithm to find $M_p$ is a well-known dynamic programming problem, but after implementing it I noticed an interesting behavior. The average proportional length of the longest palindromic subsequence can be written as:

$$\pi(N) = \left\langle\frac{M_p\{a_i\}}{N}\right\rangle$$

where $\left\langle\right\rangle$ denotes an average over all $\{a_i\}$ of length $N$. In principle, random sampling should provide a good approximation. Numerically, $\pi(N)$ appears to converge as $N\to\infty$ to a constant depending only on $n$, the number of elements from which the sequence is constructed. For binary sequences this limit appears to be $\approx 0.8$, while for $n=10$ it appears to be $\approx 0.45$.
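
For concreteness, here is a minimal sketch of the standard $O(N^2)$ dynamic program for $M_p$ (in Python; the function name `lps_length` and the interval-DP formulation are my own choices, not taken from the implementation mentioned above):

```python
def lps_length(seq):
    """Length of the longest palindromic subsequence of seq, via interval DP.

    dp[i][j] is the answer for the slice seq[i..j] (inclusive).
    """
    N = len(seq)
    if N == 0:
        return 0
    dp = [[0] * N for _ in range(N)]
    for i in range(N):
        dp[i][i] = 1  # every single character is a palindrome
    for length in range(2, N + 1):
        for i in range(N - length + 1):
            j = i + length - 1
            if seq[i] == seq[j]:
                # matching endpoints extend the best palindrome strictly inside (i, j)
                dp[i][j] = dp[i + 1][j - 1] + 2
            else:
                # otherwise drop one endpoint or the other
                dp[i][j] = max(dp[i + 1][j], dp[i][j - 1])
    return dp[0][N - 1]
```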

The fact that the limit is a constant other than $1$ or $0$ indicates that the longest palindrome scales linearly with the length of the sequence, and it is quite curious that the computed values appear to be rational numbers.

Here are some rough approximations of $\lim_{N\to\infty}\pi_n(N)$ (writing $\pi_n$ for $\pi$ when the alphabet has $n$ elements), generated using 256 random sequences of length 256. $\sigma$ indicates the standard deviation of the distribution of $M_p/N$, not that of the sample mean.

$$\begin{array}{c|cc}
n & \pi_n(N) & \sigma \\ \hline
2 & 0.80035 & 0.01647 \\
3 & 0.70724 & 0.01615 \\
4 & 0.64514 & 0.01544 \\
5 & 0.59718 & 0.01697 \\
6 & 0.56046 & 0.01617 \\
7 & 0.53024 & 0.01627 \\
8 & 0.50377 & 0.01527 \\
9 & 0.48380 & 0.01520 \\
10 & 0.46349 & 0.01489
\end{array}$$
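
A sketch of the sampling harness behind such a table (the sample count and sequence length match the 256/256 stated above; `lps_length` is the sketch given earlier, and the rest is an assumed reimplementation of the experiment, not the original code):

```python
import random
from statistics import mean, stdev

def estimate_pi(n, N=256, samples=256):
    """Monte Carlo estimate of pi_n: mean and standard deviation of M_p/N
    over random length-N sequences drawn from an alphabet of size n."""
    ratios = []
    for _ in range(samples):
        seq = [random.randrange(n) for _ in range(N)]
        ratios.append(lps_length(seq) / N)  # lps_length: see the sketch above
    return mean(ratios), stdev(ratios)

for n in range(2, 11):
    est, sigma = estimate_pi(n)
    print(f"n = {n}: pi ~ {est:.5f}, sigma ~ {sigma:.5f}")
```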

I would like to know if there is a way to derive the asymptotic behavior observed in these numerical results. The complexity of the algorithm for $M_p$ makes it difficult to derive a probability distribution, and the rational number results hint that there may be a simpler approach. Any input would be appreciated.


Solution 1:

For two words $W_1$ and $W_2$ over a common alphabet, let $L(W_1, W_2)$ be the length of the longest common subsequence of $W_1$ and $W_2$ (i.e. the longest word appearing as a subsequence of both $W_1$ and $W_2$). Define $$\gamma_n = \lim_{N \rightarrow \infty} \frac{E[L(W_1, W_2)]}{N},$$ where $W_1$ and $W_2$ are two independent words of length $N$ chosen uniformly at random from an alphabet of size $n$. This limit exists (by Fekete's lemma applied to $E[L]$) because the longest common subsequence is superadditive when we concatenate words: we have $$L(W_1 X_1, W_2 X_2) \geq L(W_1, W_2) + L(X_1, X_2).$$
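
A sketch of $L(W_1, W_2)$ and of a crude finite-$N$ estimate of $\gamma_n$ (the names `lcs_length` and `estimate_gamma` are mine; by superadditivity the finite-$N$ average can only underestimate the limit):

```python
import random

def lcs_length(w1, w2):
    """Length of the longest common subsequence of w1 and w2 (standard DP)."""
    m, n = len(w1), len(w2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if w1[i - 1] == w2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def estimate_gamma(n, N=256, samples=64):
    """Average L(W1, W2)/N over independent uniform words of length N;
    a lower estimate of gamma_n."""
    total = 0
    for _ in range(samples):
        w1 = [random.randrange(n) for _ in range(N)]
        w2 = [random.randrange(n) for _ in range(N)]
        total += lcs_length(w1, w2)
    return total / (samples * N)
```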

Determining $\gamma_n$ is a longstanding open problem, and it seems likely that $\gamma_n$ is also the limiting fraction of the longest palindrome.

Why $\gamma_n$ is a lower bound for the limit of $\pi(N)$: Divide your sequence into two equal halves. Any common subsequence of length $\ell$ between the first half and the reversal of the second half can be glued to its mirror image in the second half to form a palindromic subsequence of length $2\ell$, and the expected length of the longest such common subsequence is $\sim \frac{N}{2}\gamma_n$.
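
A quick numerical sanity check of this construction, reusing `lps_length` and `lcs_length` from the sketches above (the harness itself is hypothetical):

```python
import random

# Verify that gluing a common subsequence of the first half and the reversed
# second half always gives a valid palindromic subsequence, i.e. that
# 2 * L(first half, reversed second half) never exceeds M_p.
for _ in range(100):
    n, N = random.choice([2, 4, 10]), 200
    w = [random.randrange(n) for _ in range(N)]
    half = N // 2
    lower = 2 * lcs_length(w[:half], w[half:][::-1])
    assert lower <= lps_length(w)
```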

Why $\gamma_n$ is an upper bound for $\pi(N)$ (NOT RIGOROUS): Let $L_k$ be the length of the longest common subsequence between the first $k$ characters and the reversal of the last $N-k$ characters of your word. The longest palindrome can be thought of as the maximum of $2L_k$ (up to the possible middle character of an odd-length palindrome) over all $k$ between $1$ and $N$ ($k$ corresponds to the midpoint of your word).

By the definition of $\gamma_n$, the expected value of $L_{N/2}$ is approximately $\frac{N}{2} \gamma_n$. Furthermore, we have some amount of concentration: because changing a single character in either sequence can only change the longest common subsequence by $1$, it follows from McDiarmid's Inequality (which in turn comes from Azuma's martingale inequality) that $L_{N/2}$ is almost always within $O(\sqrt{N})$ of its expectation. In particular, it is very unlikely (probability much smaller than $1/N$) for $2L_{N/2}$ to be at least $cN$ for any fixed $c>\gamma_n$.
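
To make the concentration step concrete (a sketch of the bound; it uses $E[L_{N/2}] \le \frac{N}{2}\gamma_n$, which follows from superadditivity): $L_{N/2}$ is a function of the $N$ independent characters of the word, each of which changes it by at most $1$, so McDiarmid's inequality gives, for any fixed $c > \gamma_n$,

$$P\left(2L_{N/2} \ge cN\right) \le P\left(L_{N/2} - E[L_{N/2}] \ge \tfrac{(c-\gamma_n)N}{2}\right) \le \exp\left(-\frac{2}{N}\left(\tfrac{(c-\gamma_n)N}{2}\right)^{2}\right) = e^{-(c-\gamma_n)^2 N/2},$$

which is indeed much smaller than $1/N$ for large $N$.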

Intuitively, it should be even less likely for $L_k$ to be unusually large when the sequence is divided unevenly. If this were true, then what we wanted would follow from the union bound. I don't see an obvious way to show it though.

Wikipedia has quite a few references on the problem of determining $\gamma_n$ in its article on the Chvátal–Sankoff constants.