Best compression algorithm for a sequence of integers
I have a large array of integers that fall into mostly consecutive runs, e.g. 1-100, 110-160, and so on. All integers are positive.
What would be the best algorithm to compress this?
I tried the Deflate algorithm, but it only gives me 50% compression.
Note that the algorithm cannot be lossy.
All numbers are unique and strictly increasing.
Also, if you could point me to a Java implementation of such an algorithm, that would be great.
Solution 1:
We have written recent research papers that survey the best schemes for this problem. Please see:
Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software: Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137
Daniel Lemire, Nathan Kurz and Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice & Experience (to appear). http://arxiv.org/abs/1401.6399
They include an extensive experimental evaluation.
You can find a complete implementation of all techniques in C++11 online: https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection
There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte
If you prefer Java, please see https://github.com/lemire/JavaFastPFOR
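As a quick orientation, here is a minimal usage sketch based on the IntegratedIntCompressor convenience class shown in the JavaFastPFOR README; treat the exact package name and method signatures as assumptions and check the repository for the current API.

    // Sketch only: assumes the IntegratedIntCompressor wrapper from the
    // JavaFastPFOR README; the package/API may differ in the version you use.
    import me.lemire.integercompression.differential.IntegratedIntCompressor;

    public class SortedIdsDemo {
        public static void main(String[] args) {
            // A sorted list of unique ids, as in the question: 1-100 and 110-160.
            int[] data = new int[151];
            int k = 0;
            for (int i = 1; i <= 100; i++) data[k++] = i;
            for (int i = 110; i <= 160; i++) data[k++] = i;

            // "Integrated" means delta coding is applied internally,
            // which is exactly what sorted sequences benefit from.
            IntegratedIntCompressor iic = new IntegratedIntCompressor();
            int[] compressed = iic.compress(data);
            int[] restored = iic.uncompress(compressed);

            System.out.println("input ints: " + data.length
                    + ", compressed ints: " + compressed.length
                    + ", lossless: " + java.util.Arrays.equals(data, restored));
        }
    }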
Solution 2:
First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). In your case this should give mostly a sequence of ones, which most compression algorithms can compress far more easily.
This is how the PNG format improves its compression: it applies one of several difference filters, then the same compression algorithm used by gzip.
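To make this concrete, here is a minimal, self-contained sketch (JDK only, no third-party libraries) that delta-encodes the example ranges from the question and then runs them through java.util.zip.Deflater. The four-byte big-endian packing of the deltas is just one arbitrary layout chosen for illustration.

    import java.util.zip.Deflater;

    public class DeltaDeflate {
        // Replace each value with the gap to its predecessor
        // (the predecessor of the first value is taken to be zero).
        static int[] deltaEncode(int[] sorted) {
            int[] deltas = new int[sorted.length];
            int prev = 0;
            for (int i = 0; i < sorted.length; i++) {
                deltas[i] = sorted[i] - prev;
                prev = sorted[i];
            }
            return deltas;
        }

        public static void main(String[] args) {
            int[] data = new int[151];
            int k = 0;
            for (int i = 1; i <= 100; i++) data[k++] = i;    // 1-100
            for (int i = 110; i <= 160; i++) data[k++] = i;  // 110-160

            int[] deltas = deltaEncode(data);  // mostly 1s, plus one gap of 10

            // Pack the deltas into bytes (big-endian) and run them through Deflate.
            byte[] raw = new byte[deltas.length * 4];
            for (int i = 0; i < deltas.length; i++) {
                raw[4 * i]     = (byte) (deltas[i] >>> 24);
                raw[4 * i + 1] = (byte) (deltas[i] >>> 16);
                raw[4 * i + 2] = (byte) (deltas[i] >>> 8);
                raw[4 * i + 3] = (byte) deltas[i];
            }
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(raw);
            deflater.finish();
            byte[] out = new byte[raw.length];
            int compressedBytes = deflater.deflate(out);
            System.out.println(raw.length + " bytes -> " + compressedBytes + " bytes");
        }
    }

Because the delta stream is dominated by a single repeated value, Deflate compresses it far better than the original values.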
Solution 3:
Well, I'm voting for the smarter way. All you have to store is [int: start number][int/byte/whatever: run length]. In this case, you'll turn your example array into four int values. After that, you can compress it however you want :)
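A minimal sketch of this idea (the toRuns helper and class name are made up for illustration): it collapses a sorted array of unique ints into (start, length) pairs, turning the question's example into exactly two runs, i.e. four ints.

    import java.util.ArrayList;
    import java.util.List;

    public class RunEncoder {
        // Collapse a sorted array of unique ints into (start, length) pairs.
        static List<int[]> toRuns(int[] sorted) {
            List<int[]> runs = new ArrayList<>();
            int i = 0;
            while (i < sorted.length) {
                int start = sorted[i];
                int len = 1;
                while (i + len < sorted.length && sorted[i + len] == start + len) {
                    len++;
                }
                runs.add(new int[] { start, len });
                i += len;
            }
            return runs;
        }

        public static void main(String[] args) {
            int[] data = new int[151];
            int k = 0;
            for (int v = 1; v <= 100; v++) data[k++] = v;
            for (int v = 110; v <= 160; v++) data[k++] = v;

            for (int[] run : toRuns(data)) {
                System.out.println("start=" + run[0] + ", length=" + run[1]);
            }
            // Prints two runs: (1, 100) and (110, 51) -- i.e. four ints total.
        }
    }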
Solution 4:
While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following size ratios (compressed size relative to original) for a sequence of one million consecutive integers:
Algorithm   Size ratio (compressed / original)
None        1.00
Deflate     0.50
Filtered    0.34
BZip2       0.11
LZMA        0.06
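The answer does not show how these ratios were measured; the following is a hypothetical way to reproduce just the Deflate row with only the JDK (the BZip2 and LZMA rows would need third-party libraries such as Apache Commons Compress or XZ for Java).

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.zip.DeflaterOutputStream;

    public class RatioBenchmark {
        public static void main(String[] args) throws IOException {
            // One million consecutive integers, 4 bytes each.
            int n = 1_000_000;
            ByteBuffer buf = ByteBuffer.allocate(n * 4);
            for (int i = 0; i < n; i++) {
                buf.putInt(i);
            }
            byte[] raw = buf.array();

            // Compress with Deflate and report the size ratio.
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(sink)) {
                dos.write(raw);
            }
            double ratio = sink.size() / (double) raw.length;
            System.out.printf("Deflate ratio: %.2f%n", ratio);
        }
    }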