Which archive/compression format?

Zip, Rar, 7z, Gzip, BZip2, Tar, etc. I'm hearing that 7z is the flavor of the month. Why? Is it best for all situations, or are there better choices for specific situations?

Or maybe the actual file archiver, i.e. WinZip, WinRAR, 7-Zip, etc. (as opposed to the format), has a bigger effect?

In your answer, could you describe what sort of speed/compression trade-off the format you mention makes?

Please provide links to any empirical tests that back up your answer.

Background: I need to back up a custom search index that creates about 3000 relatively small files (less than 10 MB each), each containing a lot of repetitive data.

(As usual, Wikipedia has a relevant article, but the section on performance comparison is brief.)

Thanks


Compress, Gzip, Bzip, and Bzip2 are not for archiving multiple files; they only compress a single file. For archiving, they are usually used together with TAR. The problem with TAR is that it has no index table, so it's only good if you're planning to restore the whole thing. If you expect you'll ever need to restore only a limited number of selected files, forget about TAR: to get the last file out of a tar.gz or tar.bz2 archive, you have to decompress and process all of it. With ZIP, RAR, or 7z, the tool consults the index table, seeks to the relevant position in the archive, and processes only the relevant files.
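
To make the difference concrete, here is a minimal Python sketch using the standard library's tarfile and zipfile modules (the archive and member names are hypothetical). The tar extraction has to decompress the stream from the start to find the member; the zip extraction seeks straight to it via the central directory:

    import tarfile
    import zipfile

    # tar.gz: tarfile walks (and gunzips) the archive from the beginning
    # until it finds the requested member -- for the last file, that
    # means processing the whole archive.
    with tarfile.open("backup.tar.gz", "r:gz") as tar:
        tar.extract("index/segment_2999.dat", path="restore")

    # zip: zipfile reads the central directory at the end of the file,
    # seeks directly to the member's offset, and decompresses only it.
    with zipfile.ZipFile("backup.zip") as zf:
        zf.extract("index/segment_2999.dat", path="restore")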

OK, TAR is out, so that leaves you with ZIP, RAR, and 7z. Of these three, ZIP is the most widespread: almost anything supports it, many applications have built-in support, and it's fast. On the other hand, 7z is also portable, the library is LGPL, and its compression ratios are much better than the other two's, at the cost of being more CPU-intensive. RAR is the real loser here: neither great compression, nor really portable, nor fast.

EDIT: it seems the best option would be the 7z container, but with the bzip2 compression method. This way you avoid the disadvantages of TAR, but you can still take advantage of bzip2's multi-core support. See this article.
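
7-Zip handles the multi-threading internally, but as a rough illustration of why bzip2-style compression parallelizes well for this workload: each of the ~3000 small index files is an independent unit of work. A minimal sketch using only Python's standard library (the directory layout is made up):

    import bz2
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def compress_one(path: Path) -> Path:
        """bzip2-compress a single file; returns the .bz2 output path."""
        out = path.with_suffix(path.suffix + ".bz2")
        out.write_bytes(bz2.compress(path.read_bytes(), compresslevel=9))
        return out

    if __name__ == "__main__":
        files = sorted(Path("search_index").glob("*.dat"))  # hypothetical
        # One worker process per core: since every file is compressed
        # independently, throughput scales with the number of cores.
        with ProcessPoolExecutor() as pool:
            for out in pool.map(compress_one, files):
                print("wrote", out)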


Recommended reading:

File Compression in the Multi-Core Era (Jeff Atwood a.k.a. CodingHorror, February 2009)

I've been playing around a bit with file compression again, as we generate some very large backup files daily on Stack Overflow.

We're using the latest 64-bit version of 7zip (4.64) on our database server. I'm not a big fan of more than dual core on the desktop, but it's a no brainer for servers. The more CPU cores the merrier! This server has two quad-core CPUs, a total of 8 cores, and I was a little disheartened to discover that neither RAR nor 7zip seemed to make much use of more than 2.

Still, even if it does only use 2 cores to compress, the 7zip algorithm is amazingly effective, and has evolved over the last few years to be respectably fast. I used to recommend RAR over Zip, but given the increased efficiency of 7zip and the fact that it's free and RAR isn't, it's the logical choice now.

And regarding algorithms:

Why is bzip2 able to work so much faster than 7zip? [...] Bzip2 uses more than 2 CPU cores to parallelize its work.


It isn't all about efficiency and speed. Sure, they're important, and you can look at the benchmarks for those and choose wisely from the options (though I'd recommend some simple benchmarking of your own, with your own data, on your own server). But archiving inevitably leads, at some point, to accessing your data again (otherwise, why not just delete it?). Or maybe years down the road it won't be you accessing the data at all, but some third party. Pick something that will still be around when you need the data, and something that people recognize. I personally use 7-Zip, but when I archive files others might need, I use ZIP. They know it, and lots of tools can handle it. It may not be quite as fast or quite as small, but it helps with the human factor.
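
In that spirit, here is a rough sketch of a do-it-yourself benchmark using Python's standard library (the directory name is hypothetical; zlib's deflate stands in for ZIP's usual method, and the lzma module for 7z's default):

    import bz2
    import lzma
    import time
    import zlib
    from pathlib import Path

    # Hypothetical directory holding the ~3000 small index files.
    FILES = [f for f in sorted(Path("search_index").glob("*")) if f.is_file()]

    CODECS = {
        "zip-style (deflate)": lambda d: zlib.compress(d, 9),
        "bzip2": lambda d: bz2.compress(d, 9),
        "7z-style (lzma)": lambda d: lzma.compress(d, preset=9),
    }

    for name, compress in CODECS.items():
        raw = packed = 0
        start = time.perf_counter()
        for f in FILES:
            data = f.read_bytes()
            raw += len(data)
            packed += len(compress(data))
        elapsed = time.perf_counter() - start
        print(f"{name}: {packed / raw:6.2%} of original in {elapsed:.1f}s")

Note that this compresses each file separately; a real 7z archive compresses files "solid" (as one stream), which tends to help repetitive data even more.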


lzma seems to perform very well in both compression ratio and speed.

In the benchmarks at http://tukaani.org/lzma/benchmarks, the fastest setting for lzma compressed considerably faster than the fastest bzip2 option, while still achieving a better ratio than the slowest bzip2 option:

    ratio     bzip2    lzmash
    fastest   35.8%    31.7%
    slowest   34.0%    25.4%

    time      bzip2    lzmash
    fastest   1m 26s   0m 58s
    slowest   2m 37s   12m 20s

    * Compressing a full installation of OpenOffice.org 1.1.4 for Linux (203 MB)

It performs especially well with binary data, but I think I read some benchmarks of plain text where bzip2 outperformed it.
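
You can reproduce that fastest-vs-slowest comparison in miniature with Python's bz2 and lzma modules (the payload below is a made-up stand-in for repetitive index data):

    import bz2
    import lzma
    import time

    # Made-up repetitive payload, standing in for one of the index files.
    data = b"doc=1234 term=foo positions=1,5,9,44\n" * 200_000

    for name, compress in [
        ("bzip2 fastest (level 1)", lambda d: bz2.compress(d, 1)),
        ("bzip2 slowest (level 9)", lambda d: bz2.compress(d, 9)),
        ("lzma fastest (preset 0)", lambda d: lzma.compress(d, preset=0)),
        ("lzma slowest (preset 9)", lambda d: lzma.compress(d, preset=9)),
    ]:
        start = time.perf_counter()
        out = compress(data)
        print(f"{name}: {len(out) / len(data):6.2%} of original, "
              f"{time.perf_counter() - start:.2f}s")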

The lzma man page is worth reading:

   lzma  provides  notably  better compression ratio than bzip2 especially
   with files having other than plain text content. The other advantage of
   lzma  is fast decompression which is many times quicker than bzip2. The
   major disadvantage is that achieving  the  highest  compression  ratios
   requires  extensive  amount of system resources, both CPU time and RAM.
   Also software to handle LZMA  compressed  files  is  not  installed  by
   default on most distributions.
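
The decompression claim is easy to sanity-check on your own data; a quick stand-alone sketch (the payload is again a made-up stand-in):

    import bz2
    import lzma
    import time

    data = b"term:frequency:positions\n" * 400_000  # made-up payload

    blobs = {"bzip2": bz2.compress(data, 9), "lzma": lzma.compress(data, preset=9)}
    codecs = {"bzip2": bz2.decompress, "lzma": lzma.decompress}

    for name, blob in blobs.items():
        start = time.perf_counter()
        for _ in range(20):  # repeat to get a measurable duration
            codecs[name](blob)
        ms = (time.perf_counter() - start) / 20 * 1000
        print(f"{name}: {ms:.1f} ms per decompression")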

Take a look at this Wikipedia entry. Towards the bottom is a "Comparison of efficiency" section, which gives approximate compression percentages and times taken. All those numbers will vary (speed-wise) with the speed of the machine being used, the amount of memory, etc.

More compression benchmarks:

  • Maximum Compression.
  • Lossless Data Compression Benchmarks.