Compressing many similar big files
I have hundreds of similar big files (30 megabytes each) which I want to compress. Every pair of files has 99% of the same data (less than 1% difference), so I expect an archive of no more than 40-50 megabytes.
A single file can be compressed from 30 MB to 13-15 MB (with `xz -1`, `gz -1`, or `bzip2 -1`), but when compressing two or more files I want an archive of roughly 13-15 MB + N*0.3 MB, where N is the number of files.
When using `tar` (to create a solid archive) and `xz -6` (expecting the compression dictionary to be bigger than one file; update: this was not enough!), I still get an archive of size N*13 MB.
I think that neither `gzip` nor `bzip2` will help me, because their dictionaries are smaller than 1 MB and my tar stream has repetitions every 30 MB.
How can I solve my problem on modern Linux using standard tools?
Is it possible to tune `xz` to compress quickly but use a dictionary bigger than 30-60 MB?
Update: Did the trick with `tar c input_directory | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G > compressed.tar.xz`. I am not sure the `mf=hc4` and `--memory=2G` options are necessary, but `dict=128M` sets the dictionary to be big enough (bigger than one file), and `mode=fast` makes the process a bit faster than `-e`.
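For reference, here is the same pipeline spelled out, together with one way to inspect and unpack the result (`input_directory` and the archive name are just the names used above):

```
# Solid tar stream compressed with a 128 MiB LZMA2 dictionary:
tar c input_directory \
    | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G \
    > compressed.tar.xz

# Check compressed/uncompressed sizes and the ratio:
xz --list compressed.tar.xz

# Unpack (decompression needs roughly the dictionary size in RAM):
xz -dc compressed.tar.xz | tar x
```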
Solution 1:
Given your details, I assume you have verified that your files really do have 99% of their data in common, with a contiguous (or almost contiguous) 1% of difference between them.
First, you should use tar to make one archive with all your files inside it. For tests, I would create a .tar with 10 files, giving a size of about 300 MB.
Then, using xz, you have to set it so that the dictionary is bigger than one file. Since you don't say whether you have memory restrictions, I'd go with `xz -9`: there's no point in not using all available memory.
I'd also try the `--extreme` option, to test whether it makes a difference.
Dictionary size
In one piece of documentation that I have available (site), it is said that the dictionary size is roughly equal to the decompressor memory usage. The `-1` preset means a 1 MiB dictionary, and `-6` means 10 MiB (or 8 MiB in another part of the same manual). That's why you're not getting any advantage by tarring those files together. Using `-9` would make the decompressor memory usage (and, so, the dictionary) 64 MiB, and I think that is what you want.
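As a sketch under those assumptions (the directory name `similar_files/` is a placeholder; the 64 MiB dictionary of `-9` must be larger than any single input file for this to help):

```
tar cf files.tar similar_files/    # solid archive of the similar files
xz -9 --extreme --keep files.tar   # -9 = 64 MiB dictionary; produces files.tar.xz

# Alternative: set the dictionary size explicitly, larger than one input file
# (--force overwrites the files.tar.xz created above):
xz --force --keep --lzma2=preset=9e,dict=192MiB files.tar
```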
Edit
Another possibility would be to use another compressor. I'd go with 7zip, but I would tar those files first and then 7zip the result.
Depending on the content of your files, perhaps you could use 7zip with the PPMd method (instead of LZMA or LZMA2, which is the default and the same algorithm that xz uses).
Not good: Zip (dict = 32 kB), bzip2 (dict = 900 kB).
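A sketch of that route, assuming a placeholder directory `similar_files/`; `-m0=PPMd` switches the .7z container from its LZMA2 default to PPMd:

```
tar cf files.tar similar_files/       # pack the similar files into one solid stream
7z a -m0=PPMd archive.7z files.tar    # compress the tarball with PPMd instead of LZMA2
```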
Solution 2:
If they are truly 99% similar as you say, you should be able to use bsdiff or a similar algorithm to calculate differences between the files. Is the difference cumulative (i.e., each file differs a little more from the first), or is the difference between any two files pretty much the same?
If it's not cumulative, you should be able to:
- Take any arbitrary file as the "baseline"
- Run `bsdiff` comparing the baseline file to each additional file
- Store each diff as a separate file, alongside the baseline file
- Run a compressor like `xz` across the results (the baseline + the diffs).
The result should be much smaller than just `xz`-ing the entire archive.
You can then "reconstitute" the original files by "applying" the diff on top of the baseline to get each of the other files out.
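A minimal sketch of that workflow, assuming the files match a `file*.bin` pattern (the pattern, the `diffs/` directory, and `fileNNN.bin` are all placeholders):

```
baseline=file001.bin                 # pick any arbitrary file as the baseline
mkdir -p diffs
for f in file*.bin; do
    [ "$f" = "$baseline" ] && continue
    bsdiff "$baseline" "$f" "diffs/$f.bsdiff"    # one small diff per file
done

# Compress the baseline together with all the diffs:
tar c "$baseline" diffs | xz -6 > similar-files.tar.xz

# Later, reconstitute any original by applying its diff to the baseline:
bspatch "$baseline" restored-file "diffs/fileNNN.bin.bsdiff"
```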
Solution 3:
You (I) may use tar with an archiver capable of long-range pattern detection, for example rzip or lrzip (Readme). Both use long-range redundancy detection/deduplication; rzip then compresses with bzip2 and lrzip with xz (LZMA) or ZPAQ:
rzip is a compression program, similar in functionality to gzip or bzip2, but able to take advantage of long distance redundancies in files, which can sometimes allow rzip to produce much better compression ratios than other programs. ... The principal advantage of rzip is that it has an effective history buffer of 900 Mbyte. This means it can find matching pieces of the input file over huge distances compared to other commonly used compression programs. The gzip program by comparison uses a history buffer of 32 kbyte and bzip2 uses a history buffer of 900 kbyte.
lrzip has a larger buffer and may use several compression algorithms (very fast, fast, good, and one of the best, ZPAQ) after deduplication:
Lrzip uses an extended version of rzip which does a first pass long distance redundancy reduction. The lrzip modifications make it scale according to memory size.
The data is then either: 1. Compressed by lzma (default) which gives excellent compression at approximately twice the speed of bzip2 compression ...
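For example, with a placeholder directory name (lrzip does the long-range pass first and then LZMA by default; `-z` switches the backend to ZPAQ):

```
tar cf files.tar similar_files/   # solid archive first
lrzip files.tar                   # writes files.tar.lrz: long-range pass, then LZMA
# lrzip -z files.tar              # alternative: ZPAQ backend, best ratio, much slower
```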
Another way is to use bup, a backup program with block-/segment-level deduplication based on the git packfile format:
It uses a rolling checksum algorithm (similar to rsync) to split large files into chunks.
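A minimal bup session could look like this; the repository path and the save name are placeholders:

```
export BUP_DIR=/path/to/bup-repo             # where the deduplicated packfiles will live
bup init                                     # create the repository
bup index similar_files/                     # scan the files
bup save -n similar-files similar_files/     # store them, deduplicating shared chunks
```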