Why doesn't ZIP Compression compress anything?

If you're compressing files that are already compressed (AVI, JPEG, MP3), you won't gain much beyond packing everything into a single file.
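You can see this with a quick sketch using Python's standard-library zlib module, which implements the same DEFLATE algorithm that ZIP archives typically use. Random bytes stand in here for the high-entropy contents of a JPEG or MP3:

```python
import os
import zlib

text = b"aaaa" * 10_000      # 40,000 bytes of pure repetition
noise = os.urandom(40_000)   # 40,000 random bytes, standing in for a JPEG/MP3

print(len(zlib.compress(text)))   # tiny -- repetition compresses extremely well
print(len(zlib.compress(noise)))  # ~40,000 or slightly more: nothing gained
```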


Compression works by looking for repetitive patterns in the data being compressed. And because you do not want to lose any data while compressing your files, the compression must be lossless(*).
With that in the back of your head, think about the way files are stored on a computer: at the lowest level, they are all just a bunch of 0's and 1's.

The question can thus be rephrased as: "How can I represent a bunch of 1's and 0's more compactly than their original representation?"

So let's start from the beginning: how can you compact the representation of a single bit (a single 1 or a single 0)?
The answer is really easy: you can't! A single bit is already represented in the most compact manner possible.

Fair enough. Let's take a bigger example: how would you compress a binary string like 0111 0111 0100 0111?
Since we already know that looking at individual bits won't help us at all, we have to look at a bigger scale; for example, 4 bits at a time. We now see that the chunk "0111" occurs 3 times in the example, so why not represent it with a single bit, "0"? That still leaves "0100" unaccounted for, so let's represent that with "1".
We have now compressed the original 16 bits down to 4: "0010".

That's really good! However, this is just the most basic form of Huffman coding, and in the real world it is a little more complicated than that (for one thing, you would also need to store a table with the encoding information alongside the compressed data, which eats into the savings).
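For the curious, here is a minimal sketch of that idea in Python, treating each 4-bit chunk as a symbol. It is illustrative only; real ZIP/DEFLATE combines Huffman coding with LZ77 matching and stores the code table in the output:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix code where frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (total frequency, unique tiebreaker, {symbol: code so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)    # least frequent subtree
        f2, _, high = heapq.heappop(heap)   # next least frequent
        # Prepend one bit to every code in each subtree. The 0/1 assignment is
        # arbitrary; giving "0" to the more frequent side matches the
        # hand-worked example above (0111 -> "0").
        merged = {s: "1" + c for s, c in low.items()}
        merged.update({s: "0" + c for s, c in high.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

chunks = "0111 0111 0100 0111".split()
codes = huffman_codes(chunks)
print(codes)                               # {'0100': '1', '0111': '0'}
print("".join(codes[c] for c in chunks))   # '0010', as in the example above
```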

Now to really answer your question: why can't all data be compressed that well? Let's take another example: "0001 0110 1000 1111". If we used the same technique as above, we would find no repetition at all, and thus would not benefit from compression...
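A back-of-the-envelope check of that example, again chunking 4 bits at a time:

```python
from collections import Counter

chunks = "0001 0110 1000 1111".split()
print(Counter(chunks))   # every 4-bit chunk occurs exactly once

# With four distinct chunks, any prefix code needs at least 2 bits per chunk,
# so the 16 original bits become 4 * 2 = 8 coded bits -- but you must also
# ship the table mapping codes back to 4-bit chunks, which by itself costs
# more than the 8 bits "saved". In practice: no gain.
```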


(*) There are of course exceptions to this. The best-known example is the compression used for MP3 files: some information about the sound is lost when converting the raw, original file to the MP3 format, so this compression is lossy. Another example is the .JPG format for images.


The process of compressing takes repeated patterns and tokenizes them into shorter patterns. The output is then mostly free of repetition and therefore cannot be compressed by much, if at all.
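You can verify this with Python's zlib (DEFLATE, as used by ZIP): running the compressor over its own output gains nothing, and usually costs a few bytes:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 1000
once = zlib.compress(data)
twice = zlib.compress(once)

print(len(data))    # 44000
print(len(once))    # a few hundred bytes: plenty of repetition to tokenize
print(len(twice))   # about the same as len(once), often slightly larger
```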


From the Limitations section of the Wikipedia article on Lossless Compression:

Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any (lossless) data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm. This is easily proven with elementary mathematics using a counting argument. ...

Basically, it's theoretically impossible to compress all possible input data losslessly.
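To make the counting argument concrete: a lossless compressor must be reversible, so no two inputs may compress to the same output. There are 2^n bit strings of length n, but only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 strings that are strictly shorter. By the pigeonhole principle, at least one n-bit input must map to an output that is no shorter than the input itself.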