Why doesn't Gzip compression eliminate duplicate chunks of data?
I just did a little experiment: I created a tar archive containing duplicate files to see whether the duplication would be compressed away, and to my surprise, it was not! Details follow:
$ dd if=/dev/urandom bs=1M count=1 of=a
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.114354 s, 9.2 MB/s
$ cp a b
$ ln a c
$ ll
total 3072
-rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 a
-rw-r--r-- 1 guido guido 1048576 Sep 24 15:51 b
-rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 c
$ tar -c * -f test.tar
$ ls -l test.tar
-rw-r--r-- 1 guido guido 2109440 Sep 24 15:51 test.tar
$ gzip test.tar
$ ls -l test.tar.gz
-rw-r--r-- 1 guido guido 2097921 Sep 24 15:51 test.tar.gz
$
First I created a 1 MiB file of random data (a). Then I copied it to a file b and also hard-linked it to c. When creating the tarball, tar was apparently aware of the hard link, since the tarball was only ~2 MiB and not ~3 MiB.
Now I expected gzip to reduce the size of the tarball to ~1 MiB, since a and b are duplicates and there should be 1 MiB of contiguous data repeated inside the tarball, yet this didn't occur.
Why is this? And how could I compress the tarball efficiently in these cases?
Solution 1:
gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. It's a lossless data compression algorithm that works by transforming the input stream into compressed symbols using a dictionary built on the fly, watching for duplicates. But its sliding window is only 32 KiB, so it can't find duplicates separated by more than 32 KiB. Expecting it to spot duplicates 1 MiB apart isn't realistic.
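You can see the window limit directly from the shell. This is a rough sketch (the file name small is just an example, and exact byte counts vary from run to run): a repeat 16 KiB back is found, a repeat 1 MiB back is not.
$ dd if=/dev/urandom bs=16k count=1 of=small
$ cat small small | gzip -c | wc -c    # ~16 KiB: the second copy sits inside the 32 KiB window
$ cat a a | gzip -c | wc -c            # ~2 MiB: the second copy is 1 MiB back, outside the window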
Solution 2:
Nicole Hamilton correctly notes that gzip won't find distant duplicate data due to its small dictionary size. bzip2 is similar, because it's limited to 900 KB of memory.
Instead, try:
LZMA/LZMA2 algorithm (xz, 7z)
The LZMA algorithm is in the same family as DEFLATE, but uses a much larger dictionary (configurable, and far bigger than gzip's 32 KiB window; xz -9 uses a 64 MiB dictionary). The xz utility, which should be installed by default on most recent Linux distros, is similar to gzip and uses LZMA.
As LZMA detects longer-range redundancy, it will be able to deduplicate your data here. However, it is slower than gzip.
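For the tarball from the question, a plain invocation should be enough, since the duplicate is only about 1 MiB back (a sketch; the size in the comment is an expectation, not a measured result):
$ xz -9 -k test.tar     # -k keeps the original; -9 uses a 64 MiB dictionary
$ ls -l test.tar.xz     # expect roughly ~1 MiB: the second copy becomes a back-reference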
Another option is 7-zip (7z, in the p7zip package), which is an archiver (rather than a single-stream compressor) that uses LZMA by default (it was written by the author of LZMA). The 7-zip archiver runs its own deduplication at the file level (looking at files with the same extension) when archiving to its .7z format. This means that if you're willing to replace tar with 7z, you get identical files deduplicated. However, 7z does not preserve nanosecond timestamps, permissions, or xattrs, so it may not suit your needs.
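A sketch of that approach, using the files from the question (the archive name test.7z is arbitrary):
$ 7z a test.7z a b c    # 'a' adds the files to a new .7z archive; the identical files a and b get deduplicated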
lrzip
lrzip is a compressor that preprocesses the data to remove long-distance redundancy before feeding it to a conventional algorithm like gzip/DEFLATE, bzip2, lzop, or LZMA. For the sample data you give here, it's not necessary; it's useful when the input data is larger than what fits in memory.
For this sort of data (duplicated incompressible chunks), you should use lzop compression (very fast) with lrzip, because there's no benefit in trying harder to compress completely random data once it's been deduplicated.
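A minimal invocation along those lines (a sketch, assuming lrzip's default output naming, which appends .lrz):
$ lrzip -l test.tar     # -l selects the fast LZO backend; the long-range redundancy pass runs first
$ ls -l test.tar.lrz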
Bup and Obnam
Since you tagged the question backup, if your goal here is backing up data, consider using a deduplicating backup program like Bup or Obnam.
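For example, a basic bup workflow looks roughly like this (a sketch; ~/data and the save name mybackup are placeholders):
$ bup init                      # create the local repository (~/.bup by default)
$ bup index ~/data              # scan the directory to back up
$ bup save -n mybackup ~/data   # store it; duplicate chunks are written only once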
Solution 3:
gzip won't find duplicates, and even xz with a huge dictionary size won't. What you can do is use mksquashfs - this will indeed save the space of duplicates.
Some quick test results with xz and mksquashfs on three 64 MB random binary files, two of which are identical:
Setup:
mkdir test
cd test
dd if=/dev/urandom of=test1.bin count=64k bs=1k
dd if=/dev/urandom of=test2.bin count=64k bs=1k
cp test{2,3}.bin
cd ..
Squashfs:
mksquashfs test/ test.squash
> test.squash - 129M
xz:
XZ_OPT='-v --memlimit-compress=6G --memlimit-decompress=512M --lzma2=preset=9e,dict=512M --extreme -T4 ' tar -cJvf test.tar.xz test/
> test.tar.xz - 193M
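To get the files back out of the squashfs image, either extract it with unsquashfs or mount it read-only (a sketch; mounting typically needs root):
unsquashfs -d extracted test.squash      # extract the image into ./extracted
mount -o loop -t squashfs test.squash /mnt   # or mount it read-only at /mnt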