Why do people use tarballs?

bzip2 and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of the archive built in.

The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there are two tools here with specific tasks, designed to fit well together. It also means you can use tar to group files, and then you have a choice of compression tool (bzip2, gzip, etc.).
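
To make the chaining concrete, here is the same job done as an explicit pipeline (file and directory names are just examples):

tar -cf - directory1 | gzip > archive.tar.gz
gzip -dc archive.tar.gz | tar -xf -

The '-' tells tar to write the archive to stdout (or read it from stdin), and gzip simply compresses or decompresses the stream passing through.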


An important distinction is in the nature of the two kinds of archives.

TAR files are little more than a concatenation of the file contents with some headers, while gzip and bzip2 are stream compressors that, in tarballs, are applied to the whole concatenation.

ZIP files are a concatenation of individually compressed files, with some headers. In fact, the DEFLATE algorithm is used by both zip and gzip, so with appropriate adjustment of the surrounding bytes, you could take the payload of a gzip stream and put it in a zip file with suitable header and central directory entries.

This means that the two archive types make different trade-offs. For large collections of small files, TAR followed by a stream compressor will normally achieve a higher compression ratio than ZIP, because the stream compressor has more data from which to build its dictionary and symbol frequencies, and can thus squeeze out more of the redundancy. On the other hand, a (file-length-preserving) error in a ZIP file will only corrupt the files whose compressed data was affected, whereas stream compressors normally cannot meaningfully recover from an error mid-stream. ZIP files are thus more resilient to corruption, as part of the archive will still be accessible.
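
A quick (and unscientific) way to see the size difference yourself, assuming the 'zip' utility is installed and using throwaway names:

mkdir demo
for i in $(seq 1 1000); do cp /etc/hosts demo/file$i; done
tar -czf demo.tar.gz demo
zip -qr demo.zip demo
ls -l demo.tar.gz demo.zip

With many small, similar files, the tarball usually comes out noticeably smaller, because gzip compresses the whole concatenation as one stream while zip starts afresh for each file.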


It's odd that no one else has mentioned that modern versions of GNU tar allow you to compress as you bundle:

tar -czf output.tar.gz directory1 ...

tar -cjf output.tar.bz2 directory2 ...

You can also use the compressor of your choosing, provided it behaves as a filter (reading from stdin and writing to stdout, as with '-c') and supports the '-d' (decompress) option:

tar -cf output.tar.xxx --use-compress-program=xxx directory1 ...

This would allow you to specify any alternative compressor.
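
For instance, assuming the 'xz' compressor is installed (it reads stdin and writes stdout by default, and accepts '-d'):

tar -cf output.tar.xz --use-compress-program=xz directory1 ...

(Recent versions of GNU tar also have '-J' / '--xz' built in, but the mechanism above works for any filter-style compressor.)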

[Added: If you are extracting from gzip or bzip2 compressed files, GNU tar auto-detects these and runs the appropriate program. That is, you can use:

tar -xf output.tar.gz
tar -xf output.tgz        # A synonym for the .tar.gz extension
tar -xf output.tar.bz2

and these will be handled properly. If you use a non-standard compressor, then you need to specify that when you do the extraction.]
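
For example, with a compressor that your version of tar does not recognise from the file extension, something like this (assuming 'zstd' is installed):

tar -xf output.tar.zst --use-compress-program=zstd

tar runs the named program with '-d' to decompress the stream before unpacking it.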

The reason for the separation is, as in the selected answer, the separation of duties. Amongst other things, it means that people could use the 'cpio' program for packaging the files (instead of tar) and then use the compressor of choice. Once upon a time, the preferred compressor was pack; later it was compress (which was much more effective than pack); then came gzip, which ran rings around both its predecessors and is entirely competitive with zip (which has been ported to Unix, but is not native there); and now there's bzip2, which, in my experience, usually has a 10-20% advantage over gzip.
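
The classic cpio pairing looks like this (the archive name is illustrative); cpio reads the list of file names from its standard input:

find . -print | cpio -o | gzip > archive.cpio.gz
gzip -dc archive.cpio.gz | cpio -id

Here '-o' writes an archive of the named files to stdout, and '-i' reads one back, with '-d' creating directories as needed.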

[Added: someone noted in their answer that cpio has funny conventions. That's true, but until GNU tar got the relevant option ('-T -', illustrated after this note), cpio was the better command when you did not want to archive everything underneath a given directory -- you could choose exactly which files were archived. The downside of cpio was that you not only could choose the files -- you had to choose them. There's still one place where cpio scores: it can do a pass-through copy from one directory hierarchy to another without any intermediate storage:

cd /old/location; find . -depth -print | cpio -pvdumB /new/place

Incidentally, the '-depth' option on find is important in this context - it ensures the contents of directories are copied before the permissions on the directories themselves are set. When I checked the command before entering the addition to this answer, I copied some read-only directories (555 permission); when I went to delete the copy, I had to relax the permissions on the directories before 'rm -fr /new/place' could finish. Without the -depth option, the cpio command would have failed: it would have created the read-only directories first and then been unable to write their contents into them. I only re-remembered this when I went to do the cleanup - the formula quoted is so automatic to me (mainly by virtue of many repetitions over many years). ]
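
For completeness, the GNU tar option mentioned above gives the same pick-exactly-these-files behaviour; for example, to archive only the C source files under the current directory:

find . -name '*.c' -print | tar -cf sources.tar -T -

'-T -' tells tar to read the list of files to archive from its standard input.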