Why does zipping a zipped file not reduce its size?

Based on the idea that a zipped file is a new binary file, why can't I reduce a Zip's size by zipping it again and again – up to a very small resulting file?


Because compression works on the basis of finding patterns and reducing data that is similar.

For example, RLE (Run-length Encoding) is a simple compression method where the data is examined and runs of identical data are collapsed like so:

AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA

becomes

3ABC3EJ2F10YOO4A5G3A

As you can see, by replacing repeated data with just the data and a count of how many times it occurs, you can reduce this specific example from 35 bytes, down to 20 bytes. That’s not a huge reduction, but it’s still 42% smaller. Moreover, this is a small, contrived example; larger, real-life examples could have even better compression. (The OO was left alone because replacing it with 2O would not save anything.)
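
Here is a minimal sketch of that scheme in Python (the function name and the rule of only writing a count for runs of three or more characters are my own choices; runs of two are left alone for the same reason the OO was, so the output shows FF rather than 2F but still comes to 20 bytes):

    def rle_encode(data: str) -> str:
        """Collapse runs of 3+ identical characters into count + character."""
        out = []
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i]:
                run += 1
            if run >= 3:
                out.append(f"{run}{data[i]}")   # e.g. "AAA" -> "3A"
            else:
                out.append(data[i] * run)       # a count would not save anything here
            i += run
        return "".join(out)

    original = "AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA"
    encoded = rle_encode(original)
    print(len(original), len(encoded), encoded)   # 35 20 3ABC3EJFF10YOO4A5G3A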

Text files often compress really well because they tend to have a lot of patterns that can be compressed. For example, the word the is very common in English, so you could replace every single instance of the word with an identifier that is just a single byte (or even less). You can also compress repeated parts of words that are similar, like cAKE, bAKE, shAKE, undertAKE, and so on.
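
A toy sketch of that word-substitution idea, assuming a replacement byte (0x01 here, an arbitrary choice) that does not otherwise occur in the text:

    # Replace the very common word "the" with a single unused byte.
    text = b"the cat sat on the mat and the dog ate the hat"
    token = b"\x01"

    packed = text.replace(b"the", token)
    restored = packed.replace(token, b"the")

    print(len(text), len(packed), restored == text)   # 46 -> 38, and it round-trips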

So why can’t you compress a file that’s already compressed? Because when you did the initial compression, you removed the patterns.

Look at the compressed RLE example. How can you compress that further? There are no runs of identical data to compress. In fact, often when you try to compress a file that’s already compressed, you could end up with a larger file. For example, if you forced the above example to be re-encoded, you might end up with something like this:

131A1B1C131E1J121F11101Y2O141A151G131A

Now, the compression data (the run-counts) are themselves being treated like data, so you end up with a larger file than you started with.
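
That blow-up is easy to reproduce with a variant of the sketch above that forces a count onto every run, even runs of a single character (again, purely an illustration):

    def rle_encode_forced(data: str) -> str:
        """Write count + character for every run, even runs of length 1."""
        out = []
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i]:
                run += 1
            out.append(f"{run}{data[i]}")
            i += run
        return "".join(out)

    print(rle_encode_forced("3ABC3EJ2F10YOO4A5G3A"))
    # 131A1B1C131E1J121F11101Y2O141A151G131A  (38 characters, up from 20)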

What you could try is to use a different compression algorithm, because it is possible that the output of one compression algorithm happens to be a good fit for a different one; however, that is usually pretty unlikely.
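
As a rough illustration, here is a sketch using two general-purpose compressors from Python's standard library, zlib (DEFLATE, the algorithm ZIP files use) and bz2; the sample data is made up for the demonstration:

    import bz2
    import zlib

    # Highly repetitive input compresses very well on the first pass.
    data = b"the quick brown fox jumps over the lazy dog " * 1000

    once = zlib.compress(data)   # first pass: DEFLATE
    twice = bz2.compress(once)   # second pass with a different algorithm

    # The second pass typically saves nothing and often adds a little overhead.
    print(len(data), len(once), len(twice))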

Of course, this is all about lossless compression where the decompressed data must be exactly identical to the original data. With lossy compression, you can usually remove more data, but the quality goes down. Also, lossy compression usually uses some sort of pattern-based scheme (it doesn’t only discard data), so you will still eventually reach a point where there are simply no patterns to find.


A file that has been optimally compressed will have no patterns or anything that can be reduced.

Let's imagine a simple file that contains this:

AAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCC

If we compress it, we could say that it is 20 A's, a newline, followed by 20 B's, a newline, followed by 20 C's, or something like 20xA\n20xB\n20xC\n. Once we have done the first compression, there are no new patterns to compress. Every bit of information is unique.
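
You can see the same thing with a real compressor; here is a sketch using Python's zlib on that exact file (the byte counts will vary slightly with the zlib version and settings):

    import zlib

    # The repetitive three-line file from the example above.
    data = b"A" * 20 + b"\n" + b"B" * 20 + b"\n" + b"C" * 20 + b"\n"

    first = zlib.compress(data)    # the three runs collapse to almost nothing
    second = zlib.compress(first)  # no patterns left; only container overhead is added

    print(len(data), len(first), len(second))   # e.g. 63 -> ~20 -> slightly larger than the first pass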


If compressing an already compressed file always reduced its size, then at some point the size would reach 0, which can't be true. If it were, we would hardly need any storage at all.

Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any lossless data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm, and for any lossless data compression algorithm that makes at least one file smaller, there will be at least one file that it makes larger. This is easily proven with elementary mathematics using a counting argument, as follows:

  • Assume that each file is represented as a string of bits of some arbitrary length.
  • Suppose that there is a compression algorithm that transforms every file into an output file that is no longer than the original file, and that at least one file will be compressed into an output file that is shorter than the original file.
  • Let M be the least number such that there is a file F with length M bits that compresses to something shorter. Let N be the length (in bits) of the compressed version of F.
  • Because N < M, every file of length N keeps its size during compression. There are 2^N such files. Together with F, this makes 2^N + 1 files that all compress into one of the 2^N files of length N.
  • But 2^N is smaller than 2^N + 1, so by the pigeonhole principle there must be some file of length N that is simultaneously the output of the compression function on two different inputs. That file cannot be decompressed reliably (which of the two originals should that yield?), which contradicts the assumption that the algorithm was lossless.
  • We must therefore conclude that our original hypothesis (that the compression function makes no file longer) is necessarily untrue.

https://en.wikipedia.org/wiki/Lossless_compression#Limitations
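
To make the counting concrete, here is a tiny numeric check of the argument above, with N = 3 chosen arbitrarily: there are more 3-bit files than there are strings shorter than 3 bits, so any scheme that shrinks them all must map two different inputs to the same output:

    N = 3
    files_of_length_n = 2 ** N                       # 8 distinct 3-bit files
    shorter_strings = sum(2 ** k for k in range(N))  # empty + 1-bit + 2-bit = 1 + 2 + 4 = 7

    print(files_of_length_n, shorter_strings)        # 8 > 7, so two inputs must collide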


I'd say you can't compress arbitrary binary files to a great extent -- think of JPEG images, x264 videos and so on. Especially since you want to reconstruct your original file exactly (i.e. bit-by-bit), you need a lossless compression.1

The reason for this limited compression is stated in this Wikipedia article about Entropy, which quantifies the expected value of the information contained in a message:

Entropy effectively bounds the performance of the strongest lossless (or nearly lossless) compression possible, which can be realized in theory by using the typical set or in practice using Huffman, Lempel-Ziv or arithmetic coding. (...)
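
As a sketch of what that bound means in practice, the empirical byte-level entropy of already-compressed data (measured here under the simplifying assumption that bytes are independent) sits close to the 8-bits-per-byte maximum, leaving essentially nothing for a further lossless pass to exploit:

    import math
    import zlib
    from collections import Counter

    def entropy_bits_per_byte(data: bytes) -> float:
        """Empirical Shannon entropy of the byte distribution in the data."""
        counts = Counter(data)
        total = len(data)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    # Made-up, mildly repetitive input: lots of byte-level redundancy.
    text = b"".join(b"line %06d of a mildly repetitive log file\n" % i for i in range(5000))
    compressed = zlib.compress(text)

    print(entropy_bits_per_byte(text))        # well below 8 bits/byte: compressible
    print(entropy_bits_per_byte(compressed))  # close to 8 bits/byte: little left to squeeze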


1 The very strong "compression" of JPEG images is only possible because some information is discarded (in a way that the human eye can't recognize at first glance; lossy compression).