Is it better to compress everything into one archive or to compress each directory separately?

I'm archiving some projects; each of them has its own directory:

projects
 |- project-1
 |- project-2
 |- project-3

I started compressing them as follows:

==== SITUATION 1 ====

projects
 |- project-1.zip
 |- project-2.zip
 |- project-3.zip

and then I started wondering whether it wouldn't be better to compress all the data into one zip file:

==== SITUATION 2 ====

projects.zip
 |- project-1
 |- project-2
 |- project-3

or maybe compress the already-compressed files:

==== SITUATION 3 ====

projects.zip
 |- project-1.zip
 |- project-2.zip
 |- project-3.zip

Which situation is best (occupies the least space)? Why? Does it depend on the compression algorithm? I know that compressing a single compressed file can't help much, but what about, say, 20 of them? To me, situation 1 doesn't look like a good idea.


Solution 1:

To be honest, I doubt the different schemes will make much difference, since compression algorithms typically only look at a limited window of data at a time in order to control memory use.

The exception is situation 3, which would most likely end up larger, since compressing an already-compressed file adds container overhead but finds almost no redundancy left to remove.
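
You can see this effect yourself with a quick test. The sketch below uses Python's standard zlib module (DEFLATE, the same algorithm ordinary .zip files use) on some made-up repetitive data; the exact numbers will differ for your projects, but the second pass should gain little or nothing:

    import zlib

    # Repetitive sample data standing in for a project's source files.
    original = b"def handler(request):\n    return render(request)\n" * 20000

    once = zlib.compress(original, level=9)   # like zipping the project
    twice = zlib.compress(once, level=9)      # like zipping the zip (situation 3)

    print(f"original: {len(original):>9,} bytes")
    print(f"1st pass: {len(once):>9,} bytes")
    print(f"2nd pass: {len(twice):>9,} bytes  # little or no further gain")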

If you want better compression, look at newer archiving tools with better algorithms; 7-Zip, for example, generally compresses better than zip.

As for the difference between situations 1 and 2, I would say it depends on how you are most likely to use the archives in the future and how big they end up.

Really big archives are a pain to handle (moving, opening, etc.), and that is likely to matter more than saving a few kB.

Additionally, when thinking about long-term storage, don't ignore "bit rot". A small error in a large archive can be devastating; losing one project is much better than losing them all.

You might, however, look at something like RAR, which supports recovery records and split archives. This is a bit like RAID 5: you create multiple archive volumes, each with built-in redundancy, so that you can lose a file and still recreate the original data.
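
To illustrate the principle (this is only a sketch of the RAID 5-style idea, not RAR's actual recovery-record format): split the archive into equal chunks, store one extra XOR-parity chunk, and any single lost chunk can be rebuilt from the rest.

    # Sketch of RAID 5-style XOR parity: N data chunks plus one parity chunk;
    # losing any single chunk is survivable. Not RAR's real algorithm.
    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(chunks: list) -> bytes:
        parity = bytes(len(chunks[0]))
        for chunk in chunks:
            parity = xor_bytes(parity, chunk)
        return parity

    def recover_missing(chunks: list, parity: bytes) -> bytes:
        # XOR the parity with every surviving chunk to rebuild the lost one.
        rebuilt = parity
        for chunk in chunks:
            if chunk is not None:
                rebuilt = xor_bytes(rebuilt, chunk)
        return rebuilt

    # Stand-in for the bytes of an archive file.
    data = b"archive contents, pretend this is projects.zip " * 1000
    size = -(-len(data) // 4)                            # ceil(len / 4)
    chunks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(4)]
    parity = make_parity(chunks)

    chunks[2] = None                                     # simulate a lost part
    chunks[2] = recover_missing(chunks, parity)
    assert b"".join(chunks)[:len(data)] == data
    print("recovered the archive despite losing one chunk")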

Solution 2:

First of all, keep the excellent arguments of @Julian Knight in mind. Even the best compression is useless if your archive is either too big to handle or gets corrupted by some flipped bits.

If space is your main concern, it might be worthwhile to do some experiments with your particular data and different compression algorithms.
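
A small script makes such experiments easy. The following is only a sketch using the compressors in Python's standard library (zlib corresponds roughly to zip's DEFLATE, bz2 to bzip2, lzma to the algorithm behind 7-Zip/xz); point it at one of your own project archives, e.g. an uncompressed tarball, to see which one wins on your data:

    import bz2
    import lzma
    import sys
    import zlib
    from pathlib import Path

    data = Path(sys.argv[1]).read_bytes()    # e.g. an uncompressed tar of project-1

    results = {
        "zlib (deflate)": len(zlib.compress(data, level=9)),
        "bzip2":          len(bz2.compress(data, compresslevel=9)),
        "lzma/xz":        len(lzma.compress(data, preset=9)),
    }

    print(f"original       : {len(data):,} bytes")
    for name, size in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:<15}: {size:,} bytes ({size / len(data):.1%} of original)")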

Also, your third approach can indeed lead to a further reduction in size. I remember a discussion (see here) about compressing files multiple times with different algorithms: the author was compressing highly redundant text files and, after enough experimenting, went from 100 GB down to a few MB. His case was somewhat special, but the general idea is that iterated compression can occasionally be worthwhile.

If you are willing to try different compression algorithms, here are some benchmarks that compare speed and compression ratio:

  • http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
  • http://binfalse.de/2011/04/04/comparison-of-compression