Time to zip very large (100G) files

You can change the speed of gzip using --fast, --best, or -# where # is a number between 1 and 9 (1 is fastest with the least compression, 9 is slowest with the most compression). By default gzip runs at level 6.
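
For example, either of these (bigfile.tar is just a placeholder name):

gzip -1 bigfile.tar   # fastest, least compression
gzip -9 bigfile.tar   # slowest, most compression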


The reason tar takes so little time compared to gzip is that there's very little computational overhead in copying your files into a single archive (which is what it does). gzip, on the other hand, is actually running a compression algorithm to shrink the tar file.
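
You can see the split yourself by timing the two steps separately (dir and big.tar are placeholder names):

time tar -cf big.tar dir
time gzip big.tar

The tar step is mostly I/O-bound, while the gzip step is CPU-bound.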

The problem is that gzip is constrained (as you discovered) to a single thread.

Enter pigz, which can use multiple threads to perform the compression. An example of how to use this would be:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip
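
If your tar doesn't support --use-compress-program, piping tar's output through pigz yourself gives an equivalent result:

tar -cf - dir_to_zip | pigz > tar.file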

There is a nice succinct summary of the --use-compress-program option over on a sister site.


I seem to be using a single CPU at approximately 100%.

That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).

If you can get permission to install other tools, then 7zip also supports multiple threads to take advantage of multi-core CPUs, though I'm not sure if that extends to the gzip format as well as its own.
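
As a rough sketch, assuming the p7zip package provides a 7z binary, the -mmt switch sets the thread count (4 here is arbitrary, and archive.7z is a placeholder):

7z a -mmt=4 archive.7z dir_to_zip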

If you are stuck with just gzip for the time being and have multiple files to compress, you could try compressing them individually; that way you'll use more of that multi-core CPU by running more than one process in parallel. Be careful not to overdo it, though: as soon as you get anywhere near the capacity of your I/O subsystem, performance will drop off precipitously (to lower than with a single process/thread) as the latency of head movements becomes a significant bottleneck.
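
A minimal sketch of that, using xargs to keep four gzip processes running (the *.log pattern and the count of 4 are just examples to adjust):

find . -name '*.log' -print0 | xargs -0 -P 4 -n 1 gzip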


One can also exploit the number of processes available with pigz, which usually gives faster performance, as shown in the following command:

tar cf - directory_to_archive | pigz -0 -p large_number > mydir.tar.gz

Example: tar cf - patha | pigz -0 -p 32 > patha.tar.gz

This is probably faster than the methods suggested above, since -p sets the number of processes pigz may run. In my personal experience, setting a very large value doesn't hurt performance if the directory to be archived consists of a large number of small files. If you don't set -p, pigz defaults to the number of online processors (or 8 if that can't be determined). For large files, my recommendation would be to set this value to the total number of threads supported on the system.

For example, setting -p 32 on a 32-CPU machine helps.

-0 gives the fastest pigz run because it doesn't compress the archive at all; it just stores it, focusing entirely on speed. The default compression level is 6.
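
If you'd rather not hard-code the count, and assuming GNU coreutils' nproc is available, you can ask the system for it:

tar cf - patha | pigz -0 -p "$(nproc)" > patha.tar.gz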