What is the difference between different "compression" systems?
tar stands for tape archive. All it does is pack files and their metadata (permissions, ownership, etc.) into a stream of bytes that can be stored on a tape drive (or a file) and restored later. Compression is an entirely separate matter: you used to have to pipe the output through an external utility if you wanted that. GNU tar was nice enough to add switches that tell it to automatically filter the output through the appropriate utility as a shortcut.
Zip and 7z combine archiving and compression in their own container formats, and they were meant to pack files on a DOS/Windows system, so they do not store Unix permissions and ownership. Thus, if you want to store permissions for proper backups, you need to stick with tar. If you plan on exchanging files with Windows users, then zip or 7z is good. The actual compression algorithms zip and 7zip use can be used with tar, by using gzip and lzma respectively.
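To make that concrete, both of the following produce the same kind of compressed archive (mydir/ is just a placeholder name; the second form assumes GNU tar):

$ tar -cf - mydir/ | gzip > mydir.tar.gz    # pipe tar's output through an external compressor
$ tar -czf mydir.tar.gz mydir/              # GNU tar's -z switch runs gzip for you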
lzma (aka *.xz) has one of the best compression ratios and is quite fast at decompression, making it a top choice these days. It does, however, require a ton of RAM and CPU time to compress. The venerable gzip is quite a bit faster at compression, so it may be used if you don't want to dedicate that much CPU time. It also has an even faster variant called lzop. bzip2 is still fairly popular, as it largely replaced gzip for a time before 7zip/lzma came about since it got better compression ratios, but it is falling out of favor these days since 7z/lzma is faster at decompression and gets better compression ratios. The compress utility, which normally names files *.Z, is ancient and long forgotten.
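As a rough guide, GNU tar has a switch for each of these (archive and directory names here are only examples):

$ tar -cJf backup.tar.xz  mydir/    # xz/lzma: best ratio, slow and RAM-hungry to compress
$ tar -czf backup.tar.gz  mydir/    # gzip: fast
$ tar -cjf backup.tar.bz2 mydir/    # bzip2
$ tar -cZf backup.tar.Z   mydir/    # compress: ancient, mostly of historical interest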
One of the other important differences between zip and tar is that zip compresses the data in small chunks, whereas when you compress a tar file, you compress the whole thing at once. The latter gives better compression ratios, but in order to extract a single file at the end of the archive, you must decompress the whole thing to get to it. Thus the zip format is better at extracting a single file or two from a large archive. 7z and dar allow you to choose between compressing the whole thing (called "solid" mode) and small chunks for easy piecemeal extraction.
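You can see the practical difference by pulling one file out of each kind of archive (archive and member names are placeholders):

$ unzip archive.zip docs/readme.txt          # reads only the chunks belonging to that file
$ tar -xzf archive.tar.gz docs/readme.txt    # still has to decompress the stream up to that point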
The details of the algorithms are off topic here¹ since they are not in any way specific to Linux, let alone Ubuntu. You will, however, find some nice info here.
Now on to tar. As you said, tar is not and never has been a compression program. Instead, it is an archiver; its primary purpose is to make one big file out of a lot of small ones. Historically this was to facilitate storing on tape drives, hence the name: Tape ARchive.
Today, the primary reason to use tar is to decrease the number of files on your system. Each file on a Unix file system takes up an inode; the more files you have, the fewer inodes are available, and when you run out of inodes, you can no longer create new files. To put it simply, the same amount of data stored as thousands of files will take up more of your hard drive than those same files in a single tar archive.
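If you want to check this on your own system, df -i reports the total, used and free inode counts per filesystem:

$ df -i /    # columns: Inodes, IUsed, IFree, IUse%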
To illustrate, since this has been contested in the comments, on my 68G / partition I have the following number of total and free inodes (bear in mind that inode count depends on the file system type and the size of the partition):
Inode count: 393216
Free inodes: 171421
If I now proceed to attempt to create more files than I have inodes:
$ touch {1..171422}
touch: cannot touch ‘171388’: No space left on device
touch: cannot touch ‘171389’: No space left on device
touch: cannot touch ‘171390’: No space left on device
touch: cannot touch ‘171391’: No space left on device
touch: cannot touch ‘171392’: No space left on device
touch: cannot touch ‘171393’: No space left on device
touch: cannot touch ‘171394’: No space left on device
touch: cannot touch ‘171395’: No space left on device
touch: cannot touch ‘171396’: No space left on device
touch: cannot touch ‘171397’: No space left on device
No space? But I have loads of space:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 5,8G 4,3G 1,2G 79% /
As you can see above, creating a few hundred thousand empty files rapidly depletes my inodes and I can no longer create new ones. If I were to tar these, I would be able to start creating files again.
Having fewer files also greatly speeds up file system I/O, especially on NFS-mounted filesystems. I always tar my old work directories when a project is finished, since the fewer files I have, the faster programs like find will work.
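A minimal sketch of that workflow (the directory name is only an example):

$ tar -czf project-2019.tar.gz project-2019/ && rm -rf project-2019/

One compressed file now holds what used to be thousands of inodes, and tools like find no longer have to walk the old tree.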
There is a great answer on Super User that goes into far more detail, but in addition to the above, the other basic reasons why tar is still popular today are:

1. Efficiency: piping tar through a compression program like gzip is more efficient, since it avoids the creation of intermediate files.

2. tar comes with all sorts of bells and whistles, features that have been designed over its long history that make it particularly useful for *nix backups (think permissions, file ownership, the ability to pipe data straight to STDOUT and over an SSH link, as sketched after this list...).

3. Inertia. We're used to tar. It's safe to assume it will be available on any *nix you might happen to use, which makes it very portable and handy for source code tarballs.
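A sketch of that SSH trick, streaming a backup straight to another machine without ever writing a local archive (the host name and paths are placeholders):

$ tar -czf - /home/me | ssh backuphost 'cat > me-home.tar.gz'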
¹ This is absolutely true and has nothing to do with the fact that I don't know enough about them to explain :)
There are two distinct but related tasks. Packing a tree of files (including filenames, directory structure, filesystem permissions, ownership and any other metadata) into a byte stream is called archiving. Removing redundancy in a byte stream to produce a smaller byte stream is called compression.
On Unix, the two operations are separated, with distinct tools for each. On most other platforms (current and historical) combined tools perform both archiving and compression.
(gzip and other programs that mimic gzip's interface often have the option to store the original filename in the compressed output, but this, along with a CRC or other check to detect corruption, is the only metadata they can store.)
There are advantages to separating compression from archiving. Archiving is platform-specific (the filesystem metadata that needs preserving varies widely), but the implementation is straightforward, largely I/O-bound, and changes little over time. Compression is platform-independent, but implementations are CPU-bound and algorithms are constantly improving to take advantage of the increased resources that modern hardware can bring to bear on the problem.
The most popular Unix archiver is tar, although there exist others such as cpio and ar. (Debian packages are ar archives, while cpio is often used for initial ramdisks.) tar is or has often been combined with compression tools such as compress (.Z), gzip (.gz), bzip2 (.bz2) and xz (.xz), from oldest to youngest, and not coincidentally from worst to best compression.
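For instance, you can peek inside those other archive formats with their own tools (file names here are just examples):

$ ar t mypackage.deb        # list the members of a Debian package (an ar archive)
$ cpio -it < files.cpio     # list the contents of a cpio archive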
Making a tar archive and compressing it are distinct steps: the compressor knows nothing about the tar file format. This means that extracting a single file from a compressed tar archive requires decompressing all of the preceding files. This is often called a "solid" archive.
Equally, since tar is a "streaming" format--required for it to be useful in a pipeline--there is no global index in a tar archive, and listing the contents of a tar archive is just as expensive as extracting it.
By contrast, Zip and RAR and 7-zip (the most popular archivers on modern Windows platforms) usually compress each file separately, and compress metadata lightly if at all. This allows for cheap listing of the files in an archive and extraction of individual files, but means that redundancy between multiple files in the same archive cannot be exploited to increase compression. While in general compressing an already-compressed file does not reduce file size further, occasionally you might see a zip file within a zip file: the first zipping turned lots of small files into one big file (probably with compression disabled), which the second zipping then compressed as a single entity.
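You can see this difference in practice when listing archive contents (archive names are placeholders): listing a zip only reads its index, while listing a compressed tar has to decompress the whole stream.

$ unzip -l big.zip        # fast: reads the central directory at the end of the file
$ tar -tzf big.tar.gz     # slow on large archives: decompresses everything just to print names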
There is cross-pollination between the differing platforms and philosophies: gzip is essentially zip's compressor without its archiver, and xz is essentially 7-zip's compressor without its archiver.
There are other, specialized compressors. PPM variants and their
successor ZPAQ
are optimized for maximum compression without regard to
resource consumption. They can easily chew up as much CPU and RAM as
you can throw at them, and decompression is just as taxing as
compression (for contrast, most widely-used compression tools are
asymmetric: decompressing is cheaper than compressing).
On the other end of the spectrum, lzo, snappy and LZ4 are "light" compressors designed for maximum speed and minimum resource consumption, at the cost of compression. They're widely used within filesystems and other object stores, but less so as standalone tools.
So which should you pick?
Archiving:
Since you're on Ubuntu there's no real reason to use anything other than tar for archiving, unless you're trying to make files that are easily readable elsewhere.
zip is hard to beat for ubiquity, but it's not Unix-centric and will not keep your filesystem permissions and ownership information, and its baked-in compression is antiquated. 7-zip and RAR (and ZPAQ) have more modern compression but are equally unsuited to archiving Unix filesystems (although there's nothing stopping you using them just as compressors); RAR is also proprietary.
Compression:
For maximum compression you can have a look at a benchmark, such as the enormous one at http://mattmahoney.net/dc/text.html. This should give you a better idea of the tradeoffs involved.
You probably don't want maximum compression, though. It's way too expensive.
xz is the most popular general-purpose compression tool on modern Unix systems. I believe 7-zip can read xz files too, as they are closely related.
Finally: if you're archiving data for anything other than short-term storage you should pick something open-source and preferably widespread, to minimize headaches later on.
lzo, gz, bz2, lzma (lzma2 = .xz) are "stream" compressors: they compress a stream of bytes and don't know and don't care about files, directories and metadata like permissions. You have to use an archiver like tar to bundle all that data into a stream of bytes (a tar file) and compress that with a compressor. If it is the data of a single file you care about, you could also feed that file alone to one of these compressors.
Tar, cpio and pax are archivers: they take a bunch of files and directories and encode the data and metadata in a single file. tar is the most popular and most compatible, though the technical differences between the three are minimal enough that there were religious wars about them during the dawn of time.
7z and zip are compressors AND archivers: they store all the data and metadata and compress it. However, AFAICT, neither of them saves Unix permissions.
Zip uses the same algorithm as gzip, called DEFLATE. 7z uses the LZMA algorithm.
To read a single file from a tar.gz or the like, you will need to decompress the gz stream until enough of the tar file is exposed to extract it. Zip allows you to compress and pull out each file individually. 7z can have either behavior.
Compression ratios and speeds: gzip and lzo have very, very fast compression and decompression speeds but low compression ratios. They also do not take much memory to compress. gzip is a little slower and gives a little better compression ratio than lzo.
They are so fast that it can be quicker to read a gz- or lzo-compressed file from the disk and decompress it on the fly than to read the uncompressed file directly from the disk.
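That is why helpers like zcat and zgrep exist; for example (the log file name is only an example):

$ zcat access.log.gz | grep 'ERROR'    # decompress on the fly and search
$ zgrep 'ERROR' access.log.gz          # same thing in one command

This often finishes sooner than grepping the uncompressed file, because far fewer bytes have to come off the disk.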
LZMA (xz) gives excellent compression on general data but is very slow to compress and needs significant amounts of memory to do so; decompression is quicker, though still slower than gzip.
bz2 used to be the high-compression algorithm of choice but fell out of favour, as lzma both gets better compression ratios and decompresses faster. However, for certain kinds of data (DNA sequences, files with very large runs of the same byte, etc.) bzip2 can beat everything else hands down. As an example, I once had to compress a 4 GB file of 1's, and bz2 reduced it to a few tens of kB while lzma took some tens of MBs, if I remember correctly.