Why is a 7zipped file larger than the raw file? [duplicate]

Solution 1:

It comes down to a concept called entropy. See Wikipedia.

The basic idea is this: if there existed a compression operation that could always make a file smaller, then by applying it over and over you could reduce any file to 0 bytes and still retain all the data. But that is absurd, because 0 bytes cannot convey any information at all; every file would compress down to the same empty output, and decompression would have no way to tell them apart. So there cannot exist a compression algorithm that always makes its input smaller.

Because of this theoretical limit, every compression program you ever use will increase the size of (or, at best, keep the same size as) some input. That is, for any compression algorithm you design or use, there will be certain inputs that come out smaller, and some that do not.
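
You can see this directly with a rough sketch using Python's built-in lzma module (the same LZMA family of algorithms 7-Zip uses): feed it data that is already as random as possible, and data that is highly redundant.

    import lzma
    import os

    random_data = os.urandom(1_000_000)            # high-entropy, essentially incompressible
    text_data = b"the quick brown fox " * 50_000   # highly redundant, very compressible

    print(len(lzma.compress(random_data)))   # slightly LARGER than 1,000,000 bytes
    print(len(lzma.compress(text_data)))     # a tiny fraction of the original size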

Already-compressed data is generally a terrible candidate for further compression, because most lossless compression algorithms are based on the same theoretical principles. It is possible to compress poorly-compressed data even further; but this is less efficient than simply compressing it with the best-available algorithm from the original data to begin with.

For example, if you had a 100 MB text file and compressed it with the regular Zip algorithm, it might come out at 50 MB. If you then compressed that Zip file with LZMA2, you might get it down to 40 or 45 MB, because LZMA has a higher compression ratio than Zip for most compressible data, so it can still find redundancy that Zip left behind. But if you drop the Zip container entirely and compress the raw text with LZMA2 directly, you may be able to get it even smaller, perhaps on the order of 30-35 MB (these are just made-up numbers to illustrate the concept).
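
Here is a hedged sketch of that idea using Python's zlib (DEFLATE, the algorithm inside ordinary Zip files) and lzma modules. The input is synthetic and the MB figures above are only illustrative, so look at the ordering of the sizes rather than the exact numbers.

    import lzma
    import random
    import zlib

    random.seed(0)
    words = [b"entropy", b"compression", b"data", b"file", b"algorithm",
             b"lossless", b"dictionary", b"header", b"symbol", b"output"]
    text = b" ".join(random.choice(words) for _ in range(200_000))  # stand-in for a text file

    zipped = zlib.compress(text, 9)          # the "regular Zip" stage
    zip_then_lzma = lzma.compress(zipped)    # recompress the already-zipped data
    lzma_direct = lzma.compress(text)        # compress the raw text once

    print(len(text), len(zipped), len(zip_then_lzma), len(lzma_direct))
    # Compressing the raw text directly with LZMA gives the smallest result;
    # recompressing the zipped data gains far less (and may even add a few bytes).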

In the case of the binary you're trying to compress, it gets larger because the 7-Zip file format has to create its own internal structure and pack the already-compressed executable's data into it: things like a dictionary, file headers, and so on. This extra data is usually more than offset by the savings from compressing the payload itself, but it appears that the executable you're trying to compress is already compressed with some form of LZMA. If it weren't, 7-Zip would most likely shrink it, or at worst grow it very slightly, rather than adding 2 MB (which is a lot).

Solution 2:

The underlying compression algorithms used in 7z are lossless, which means you can compress and decompress a file as many times as you like and, after every round trip, get back exactly the same file.
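
A minimal check of that property, assuming Python's lzma module as a stand-in for 7-Zip's LZMA implementation:

    import lzma

    original = b"any bytes at all \x00\x01\x02" * 1_000
    data = original
    for _ in range(5):                          # five compress/decompress round trips
        data = lzma.decompress(lzma.compress(data))
    assert data == original                     # every round trip returns identical bytes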

Unfortunately, you cannot expect to apply a lossless compression algorithm many times and always get a smaller result. There is a strict boundary it cannot cross, and roughly speaking that boundary depends on how closely the input resembles random data. Lossless algorithms are used above all for file compression, transferring HTML and other data over the Internet, backups, and other operations where the output must decompress into exactly the original input.
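
You can watch that boundary appear by compressing the same data over and over (a sketch with Python's lzma module): the first pass helps enormously, and after that the output already looks like random data, so further passes only add overhead.

    import lzma

    data = b"abcdefgh" * 100_000          # 800,000 bytes of very redundant input
    for i in range(4):
        data = lzma.compress(data)        # compress the previous output again
        print(f"pass {i + 1}: {len(data)} bytes")
    # Typical result: a huge drop on pass 1, then slight growth on every later pass.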

In contrast to lossless compression, with lossy compression algorithms you can generally expect the file size to decrease. The downside is that you cannot exactly restore the original file after even a single compress-decompress cycle. These algorithms are best known for audio, video and image transmission and storage.
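
As a toy illustration (not a real audio or image codec), quantizing samples throws information away, which makes the data compress better but makes exact recovery impossible:

    import lzma
    import random

    random.seed(0)
    samples = bytes(random.randrange(256) for _ in range(100_000))   # fake 8-bit samples
    quantized = bytes(b & 0xF0 for b in samples)                     # lossy step: drop the low 4 bits

    print(len(lzma.compress(samples)))      # nearly incompressible
    print(len(lzma.compress(quantized)))    # noticeably smaller: less information left
    print(quantized == samples)             # False -- the discarded detail is gone for good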

bzip2, LZMA, LZMA2 and the other algorithms used by the 7z format are all lossless, so there is a limit beyond which they can no longer compress. On top of that, executable images (.exe) are often already highly compressed. Like many other compression tools, 7-Zip also embeds some metadata, which can make the output file larger than the input.

Brain teaser: what if we did have a lossless algorithm that could always decrease a file's size?

Then every compressed file would be smaller than its input, and by feeding the output back through the compressor you could shrink any file down to nothing. That cannot work: there are fewer possible short files than longer ones, so some distinct inputs would have to share the same compressed output, and decompression could no longer tell them apart.
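
The counting argument can be made concrete: there are more distinct files of a given length than there are distinct shorter files, so a lossless compressor cannot give every one of them a strictly shorter output.

    n = 8                                                  # file length in bits, small enough to enumerate
    files_of_length_n = 2 ** n                             # 256 distinct inputs
    files_shorter_than_n = sum(2 ** k for k in range(n))   # 255 possible shorter outputs (lengths 0..n-1)
    print(files_of_length_n, files_shorter_than_n)         # at least two inputs must share an output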

Solution 3:

If the original executable was already compressed (or contained heavily compressed or incompressible data), then compressing it again will increase the size.

Solution 4:

Most compression algorithms use what's called a symbol table: basically, pieces of the file that the algorithm uses as the elements it CAN compress. This, of course, creates some overhead in the file, but usually results in a much smaller file overall.

With an already-compressed file, it still builds a set of symbols, but there is very little left whose size can be reduced. In your case, the symbol table of the already-compressed file is probably in the neighborhood of 2 MB, or perhaps more if it did manage to do some compressing.
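
A hedged sketch of that overhead, assuming Python's lzma module as a stand-in for 7-Zip: compressing data that is already LZMA output removes nothing, so the second container's headers and bookkeeping only add bytes.

    import lzma

    already_compressed = lzma.compress(b"some executable-like payload " * 50_000)
    recompressed = lzma.compress(already_compressed)   # nothing left to squeeze out
    print(len(already_compressed), len(recompressed))  # the second file comes out larger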