Are the primes compressible?

Solution 1:

It turns out that storage of primes can be compressed by an arbitrary amount, though the storage needed is exponential in the compression factor.

Note: This has been corrected and been made more precise.

Let $p_n$ be the $n$-th prime (with $p_1 = 2$), and let $P_n$ be the product of the first $n$ primes, so that $P_1 = 2, P_2 = 6, P_3 = 30, P_4 = 210$.

If we do a sieve of Eratosthenes, sieving out multiples of the first $n$ primes, all that are left are the first $n$ primes and the numbers relatively prime to $P_n$.

For example, for $n=2$, the numbers left are $2, 3$, and the forms $6m+1$ and $6m+5$. For $n=3$, the numbers left are $2, 3, 5$ and the forms $30m+1,7,11,13,17,19,23,$ and $29$.

For general $n$, the numbers remaining are of the form $P_nm+q_i$, where the $q_i$ are the numbers from $1$ to $P_n-1$ relatively prime to $P_n$.

To illustrate, I will use the case $n=3$.

Each block of 30 numbers in a range $30m+1$ to $30m+29$ can have as most $8$ primes as shown above. Therefore, only one bit is needed for each of these $8$ possibilities, to indicate whether or not that value is actually prime. Therefore, storage of primes can be compressed by a factor of $\frac{8}{30} \approx 0.267$.

Here is what happens for general $n$.

The $P_n$ values in the range from $mP_n$ to $(m+1)P_n-1$ are compressed to $\phi(P_n)$ bits, where $\phi(m)$ is Eu;er's phi function (though he lets me use it) that counts the number of integers from $1$ to $m$ relatively prime to $m$.

Since $\phi(P_n) = \prod_{i=1}^n (p_i-1) $, the compression factor is $\dfrac{\phi(P_n)}{P_n} =\prod_{i=1}^n (1-\dfrac1{p_i}) $. Since this goes to zero (because $\sum_{i=1}^n \dfrac1{p_i} \sim \ln \ln n $ - this is Merten's theorem), the amount of compression can be arbitrarily large, though about $e^n/n$ bits are needed.

To see this, by one of the corollaries of the prime number theorem, $P_n \sim e^n$, and $\dfrac{\phi(P_n)}{P_n} =\prod_{i=1}^n (1-\dfrac1{p_i}) \approx e^{-\ln \ln n} = 1/\ln n $. Therefore, each block of $P_n$ values can be represented by about $\frac{P_n}{n}$ bits.

Therefore, using a sieve with the first $n$ primes and the numbers relatively prime to $P_n$ can compress the primes by about a factor of $n$.

Therefore the primes can be compressed by an arbitrary amount, though the amount of storage needed is exponential in $n$.

A number of years ago, I used this idea with $n=4$, so I got a compression of $\frac{1\cdot 2\cdot 4\cdot 6}{2\cdot 3\cdot 5\cdot 7} =\frac{8}{35} $.

Solution 2:

Primes are always odd, so the last bit is always 1 (except for the initial prime 2, of course). So there is definitely some redundancy.

Another consideration is that your description scheme (listing all the bits of every number) is not optimal. To describe any subset of $\{1,2,\ldots,n\}$ takes at most $n$ bits, since there are $2^n$ such subsets. We just have one bit corresponding to each number telling us whether that number is in or out of the subset.

For the particular subset you are considering (the primes), under the description scheme you are using (listing the bits in succession), there are about $\frac{n}{\ln n}$ elements (Prime Number Theorem), and each one has about $\log_2{n}$ bits. So the number of bits in your description is approximately $\frac{n}{\ln n}(\log_2{n}) = n (\log_2{e})\approx 1.44 n.$

Solution 3:

The first 23,163,298 primes can be considered compression-friendly. It is the maximum number of primes for which every gap between them is <= 255, i.e. fits into a single byte. So for those primes, you can use just a single byte per prime, storing the gap, rather than the actual prime as number, which consumes 8 bytes. This way, you end up using 8 times less memory.

I used this useful fact in my solution for caching primes that way.

UPDATE

This can be extended even further, to 303,371,455,241 - first prime with gap > 511, by taking advantage of the fact that all prime gaps are even numbers (except between 2 and 3), and so we can use bit 0 for storing bit 8, which in turn doubles the supported gap to 511, and thus increasing the range of primes significantly, to 303,371,455,241.

I implemented the complete 511 compression solution in my library.