What is the maximum compression ratio of gzip?

Solution 1:

Update 2020-02-06: As mentioned in the comments, I have been unable to reproduce the original result with gzip. Working on the assumption that I accidentally used a different compression format in that original quick test, I've repeated it with gzip and updated the figures below accordingly. This new result agrees with the theoretical maximum compression stated in other answers/comments.


It very much depends on the data being compressed. A quick test with a 1 GB file full of zeros using a standard version of gzip (with default options or specifying -9) gives a compressed size of ~1018 KB, so your 10 KB file could potentially expand into ~10 MB.
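A minimal sketch for reproducing this kind of test in Python, assuming the standard-library gzip module and a 10 MiB buffer of zeros rather than a 1 GB file (an assumption made only to keep memory use small). The exact ratio will vary a little with the input size and the zlib version in use.

```python
import gzip

# Assumption: 10 MiB of zero bytes stands in for the 1 GB test file.
raw_size = 10 * 1024 * 1024
compressed = gzip.compress(bytes(raw_size), compresslevel=9)

# Ratio of uncompressed size to compressed size (header and trailer included).
ratio = raw_size / len(compressed)
print(f"{raw_size} bytes -> {len(compressed)} bytes, about {ratio:.0f}:1")
```

The printed ratio should land close to, but never above, the 1032:1 limit of the deflate format.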

If the data has low redundancy to start with, for instance, the archive contains image files in a format that is compressed natively (gif, jpg, png, ...), then gzip may add no further compression at all. For binary files like program executables you might see up to 2:1 compression; for plain text, HTML or other markup, 3:1 or 4:1 or more is not unlikely. You might see 10:1 in some cases, but the ~1030:1 seen with a file filled with a single symbol is something you are not going to see outside similarly artificial circumstances.

You can check how much data would result from unpacking a gzip file, without actually writing its uncompressed content to disk, with gunzip -c file.gz | wc --bytes. This uncompresses the file but does not store the result; the output is passed to wc, which counts the bytes as they pass and then discards them. If the compressed content is a tar file containing many small files, you might find that noticeably more disk space is required to unpack the full archive, but in most circumstances the count returned from piping gunzip output through wc is going to be as accurate as you need.
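The same count can be obtained without the shell. This is a rough Python equivalent of the gunzip/wc pipeline, assuming the standard-library gzip module; the function name count_uncompressed_bytes and the 1 MiB chunk size are illustrative choices, not anything from the original answer.

```python
import gzip
import io

def count_uncompressed_bytes(source, chunk_size=1 << 20):
    """Return the uncompressed size of a gzip stream without storing the data.

    `source` may be a file name or a binary file object (hypothetical helper).
    """
    total = 0
    with gzip.open(source, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # Only the length is kept; the decompressed chunk is discarded.
            total += len(chunk)
    return total

# Demo on an in-memory gzip stream; a path such as "file.gz" works the same way.
demo = io.BytesIO(gzip.compress(b"hello world\n" * 1000))
demo_size = count_uncompressed_bytes(demo)
print(demo_size)
```

Because the data is read in fixed-size chunks, memory use stays bounded even when the archive expands to something very large.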

Solution 2:

Quoted verbatim from https://stackoverflow.com/a/16794960/293815

The maximum compression ratio of the deflate format is 1032:1. This is because the longest run that can be encoded is 258 bytes. At least two bits are required for each such run (one bit for the length code and one bit for the distance code), hence 4*258 = 1032 uncompressed bytes can be encoded per one compressed byte.

You can get more compression by gzipping the result of gzip. Normally that doesn't improve compression, but for very long runs it can.
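A quick way to see the effect of compressing twice, sketched in Python with the standard-library gzip module. The 10 MiB run of zeros is an assumed test input, chosen because it is the kind of very long run where a second pass helps.

```python
import gzip

# Assumption: a 10 MiB run of zero bytes as the "very long run" test input.
raw = bytes(10 * 1024 * 1024)

once = gzip.compress(raw)    # first pass: bounded by the deflate limit
twice = gzip.compress(once)  # second pass: compresses the repetitive first output

print(len(raw), len(once), len(twice))
```

The first pass emits a long, highly repetitive stream of codes, which is why the second pass still finds redundancy to remove.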

By the way, the LZ77 approach used by deflate is more general than run-length encoding. Instead of just a length, a length/distance pair is used. This allows copying a string from some distance back, or replicating a byte as in run-length for a distance of one, or replicating triples of bytes with a distance of three, etc.
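The length/distance idea can be illustrated with a toy decoder step. This is a simplified sketch of an LZ77-style copy, not the actual zlib implementation; the helper name lz77_copy is hypothetical.

```python
def lz77_copy(output: bytearray, length: int, distance: int) -> None:
    """Append `length` bytes copied from `distance` bytes back in `output`.

    Deflate restricts length to 3..258 and distance to 1..32768; this toy
    version only checks that the distance points inside the existing output.
    """
    if not 1 <= distance <= len(output):
        raise ValueError("distance must point inside the existing output")
    start = len(output) - distance
    for i in range(length):
        # Copy byte by byte so the source may overlap bytes just written.
        output.append(output[start + i])

# Distance 1 replicates the last byte, which behaves like run-length encoding.
run = bytearray(b"a")
lz77_copy(run, 258, 1)

# Distance 3 replicates a triple of bytes.
triples = bytearray(b"abc")
lz77_copy(triples, 6, 3)
print(len(run), bytes(triples))
```

The byte-by-byte loop matters: when the length exceeds the distance, the copy reads bytes it has only just appended, which is what makes a single pair able to describe a long run.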

Solution 3:

Usually you don't get more than 95% compression (so that 10 kB of gzipped data would decompress to ~200 kB), but there are specially crafted files that expand exponentially. Look for 42.zip; it decompresses to a few petabytes of (meaningless) data.