Why does a zip file appear larger than the source file, especially when it is text?

Solution 1:

As @kinokijuf said, there is a file header. But to expand upon that, there are a few other things to understand about file compression.

The zip header contains all the necessary info for identifying the file type (the magic number), the zip version, and finally a listing of all the files included in the archive.

Your file probably wasn't compressed anyway. If you run unzip -l example.zip, you will probably see that the file size is unchanged. A 19-byte file generates more header overhead than DEFLATE (the main compression method used by zip) could ever save, even if the contents were compressible at all.
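
To see the overhead directly, here is a minimal sketch using Python's standard zipfile module (the file names and contents are made up): a 19-byte text file produces an archive many times its own size, because the local header, central directory, and end-of-central-directory record all have to be stored.

```python
# Minimal sketch: a tiny file inside a zip; the container costs more than the payload.
import os
import zipfile

with open("example.txt", "w") as f:
    f.write("0123456789012345678")            # 19 bytes of text

with zipfile.ZipFile("example.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("example.txt")

with zipfile.ZipFile("example.zip") as zf:
    info = zf.getinfo("example.txt")

print("original file :", os.path.getsize("example.txt"), "bytes")   # 19
print("stored in zip :", info.compress_size, "bytes")               # not smaller
print("whole archive :", os.path.getsize("example.zip"), "bytes")   # much larger
```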

Other files, PNG images for example, are already compressed, so zip will simply store them; DEFLATE gains nothing from data that is already compressed.
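
A rough way to see this with the standard zlib module (Python's DEFLATE implementation): compressing data a second time saves nothing, because the first pass already removed the redundancy.

```python
import zlib

text = ("hello compression world\n" * 1000).encode()

once  = zlib.compress(text, 9)   # first DEFLATE pass removes the redundancy
twice = zlib.compress(once, 9)   # second pass has nothing left to remove

print("original        :", len(text), "bytes")
print("compressed once :", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")   # no smaller, usually a bit larger
```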

If, on the other hand, you had a lot of text files, each more than a few kilobytes in size, you would get great savings by putting them all into a single zip archive.
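
A small sketch of that case, again with zipfile and made-up file names and contents: a handful of multi-kilobyte text files packed into one archive shrink to a fraction of their combined size, and the per-entry overhead becomes negligible.

```python
import os
import zipfile

# Create a few multi-kilobyte text files (hypothetical names and contents).
names = []
for i in range(5):
    name = f"log{i}.txt"
    with open(name, "w") as f:
        f.write("2024-01-01 12:00:00 INFO request handled in 3 ms\n" * 200)
    names.append(name)

with zipfile.ZipFile("logs.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in names:
        zf.write(name)

total = sum(os.path.getsize(n) for n in names)
print("total text size:", total, "bytes")
print("archive size   :", os.path.getsize("logs.zip"), "bytes")  # a small fraction
```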

You will get your best savings when compressing very regular, formatted data, like a text file containing a SQL dump. For example, I once had a dump of a small SQL database at around 13 MB. I ran zip -9 dump.zip dump.sql on it and ended up with a file of around 1 MB.
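
Something similar can be reproduced with zlib and a made-up, highly repetitive dump-like string; the exact ratio depends entirely on how regular the data is.

```python
import zlib

# A made-up, highly repetitive dump-like string stands in for the real file.
dump = ("INSERT INTO t (id, name) VALUES (42, 'example');\n" * 20000).encode()
packed = zlib.compress(dump, 9)

print(f"{len(dump)} bytes -> {len(packed)} bytes "
      f"(roughly {len(dump) / len(packed):.0f}:1)")
```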

Another factor is your compression level. Many archivers default to a mid-level setting, trading some size reduction for speed. When compressing with zip, try the -9 flag for maximum compression (I think the 3.x manual says that compression levels are only supported by DEFLATE at this time).
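
As an illustration of the trade-off, here is a sketch comparing zlib compression levels on the same made-up input; level 6 is commonly the default, and level 9 corresponds to the idea behind the -9 flag: spend more time for a smaller result.

```python
import zlib

# Any large, somewhat repetitive text works as input; this one is made up.
data = ("2024-01-01 GET /index.html 200 512\n" * 50000).encode()

for level in (1, 6, 9):
    print(f"level {level}: {len(zlib.compress(data, level))} bytes")
```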

TL;DR

The overhead of the archive exceeded any gains you could have gotten from compressing the file. Try putting larger text files in there and see what you get. Use the -v flag when zipping to see your savings as you go.

Solution 2:

Compression removes redundant information, which appears when the data is highly structured.

From this it should be apparent that already-compressed files cannot be compressed much further, because their redundancy is already gone, and also that random data won't compress well, because it never had any structure or redundancy to begin with.
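
One way to make that concrete is an order-0 Shannon entropy estimate (a very rough proxy; real compressors also exploit longer-range structure): structured text sits well below 8 bits per byte, while random bytes sit right at it.

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy: bits needed per byte if bytes were independent."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

structured = ("the quick brown fox jumps over the lazy dog\n" * 1000).encode()
random_data = os.urandom(len(structured))

print("structured text:", round(entropy_bits_per_byte(structured), 2), "bits/byte")
print("random bytes   :", round(entropy_bits_per_byte(random_data), 2), "bits/byte")
```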

There is a whole science, information theory, that deals with measuring the density of information (and mutual information) and uses redundancy and structure to perform compression, mount attacks on encryption, and detect and recover from errors.