What is the advantage of using 'tar' today?

I know that tar was made for tape archives back in the day, but today we have archive file formats that both aggregate files and perform compression within the same logical file format.

Questions:

  • Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressors being compared is identical (e.g. gzip and ZIP both use Deflate).

  • Are there features of the tar file format that other file formats, such as .7z and .zip do not have?

  • Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?


Solution 1:

Part 1: Performance

Here is a comparison of two separate workflows and what they do.

You have a file on disk blah.tar.gz which is, say, 1 GB of gzip-compressed data which, when uncompressed, occupies 2 GB (so a compression ratio of 50%).

The way that you would create this, if you were to do archiving and compression separately, would be:

tar cf blah.tar files ...

This would result in blah.tar which is a mere aggregation of the files ... in uncompressed form.

Then you would do

gzip blah.tar

This would read the contents of blah.tar from disk, compress them through the gzip compression algorithm, write the contents to blah.tar.gz, then unlink (delete) the file blah.tar.
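
For the record, you would normally do this in a single step, streaming tar's output straight into gzip so that the intermediate blah.tar never touches the disk:

tar czf blah.tar.gz files ...

or, written as an explicit pipe:

tar cf - files ... | gzip > blah.tar.gz

The two-step version above is spelled out only to make the I/O accounting explicit.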

Now, let's decompress!

Way 1

You have blah.tar.gz, one way or another.

You decide to run:

gunzip blah.tar.gz

This will

  • READ the 1 GB compressed data contents of blah.tar.gz.
  • PROCESS the compressed data through the gzip decompressor in memory.
  • As the memory buffer fills up with "a block" worth of data, WRITE the uncompressed data into the file blah.tar on disk and repeat until all the compressed data is read.
  • Unlink (delete) the file blah.tar.gz.

Now, you have blah.tar on disk, which is uncompressed but contains one or more files within it, with very low data structure overhead. tar adds a 512-byte header per archive member and pads each member's data to a 512-byte boundary, so the file is only slightly larger than the sum of all the file contents.

You run:

tar xvf blah.tar

This will

  • READ the 2 GB of uncompressed data contents of blah.tar and the tar file format's data structures, including information about file permissions, file names, directories, etc.
  • WRITE the 2 GB of data plus the metadata to disk. This involves: translating the data structure / metadata information into creating new files and directories on disk as appropriate, or rewriting existing files and directories with new data contents.

The total data we READ from disk in this process was 1 GB (for gunzip) + 2 GB (for tar) = 3 GB.

The total data we WROTE to disk in this process was 2 GB (for gunzip) + 2 GB (for tar) + a few bytes for metadata = about 4 GB.
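
Incidentally, now that blah.tar is sitting on disk, you can inspect that metadata without extracting anything:

tar tvf blah.tar

which lists the names, permissions, owners and sizes of the archive members.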

Way 2

You have blah.tar.gz, one way or another.

You decide to run:

tar xvzf blah.tar.gz

This will

  • READ the 1 GB compressed data contents of blah.tar.gz, a block at a time, into memory.
  • PROCESS the compressed data through the gzip decompressor in memory.
  • As the memory buffer fills up, it will pipe that data, in memory, through to the tar file format parser, which will read the information about metadata, etc. and the uncompressed file data.
  • As the memory buffer fills up in the tar file parser, it will WRITE the uncompressed data to disk, by creating files and directories and filling them up with the uncompressed contents.

The total data we READ from disk in this process was 1 GB of compressed data, period.

The total data we WROTE to disk in this process was 2 GB of uncompressed data + a few bytes for metadata = about 2 GB.
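
The z flag is just a convenience; the same streaming pipeline can be written as an explicit pipe, which makes it clear where the gzip decompressor ends and the tar parser begins:

gunzip -c blah.tar.gz | tar xvf -

Here gunzip -c writes the decompressed stream to standard output, and tar xvf - reads the archive from standard input.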

Notice that the amount of disk I/O in Way 2 is identical to the disk I/O performed by, say, the Zip or 7-Zip programs, adjusting for any differences in compression ratio.

And if compression ratio is your concern, compress the tar archive with xz instead, and you have an LZMA2-compressed tar archive that is just as efficient as the most advanced algorithm available to 7-Zip :-)
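
With GNU tar this is a one-liner in each direction; the J flag selects xz, just as z selects gzip:

tar cJf blah.tar.xz files ...
tar xJf blah.tar.xz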

Part 2: Features

tar stores Unix permissions within its file metadata, and is very well known and tested for successfully packing up a directory with all kinds of different permissions, symbolic links, etc. There are more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it (although compression is useful and often used).
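
A classic example of the uncompressed-stream case is copying a directory tree to another machine with its metadata intact (user@remotehost and /destination are placeholders here):

tar cf - somedir | ssh user@remotehost 'tar xf - -C /destination'

No compression is involved; tar serves purely as a container for the byte stream flowing over the ssh connection.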

Part 3: Compatibility

Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmware has access to these tools.

New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz (using the xz (LZMA2) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.

You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM: compatibility with people running ancient or very basic systems matters.

Solution 2:

This has been answered on Stack Overflow.

bzip2 and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of the archive built in.
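
You can see the difference on the command line (the file names here are just placeholders): gzip replaces a single file with its compressed form, while zip gathers several files into one archive:

gzip blah.txt

zip archive.zip a.txt b.txt

The first produces blah.txt.gz, one file in and one file out; the second produces archive.zip containing both inputs.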

The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there are two tools here that have specific tasks, and they're designed to fit well together. It also means you can use tar to group files and then you have a choice of compression tool (bzip2, gzip, etc).
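
In practice that chaining is a one-word change in a pipeline; the archiving side stays the same no matter which compressor you pick:

tar cf - files ... | gzip > archive.tar.gz

tar cf - files ... | bzip2 > archive.tar.bz2

tar cf - files ... | xz > archive.tar.xz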
