How does file compression work?

Lossless Compression

Lossless compression is where no data is lost. Everything that is entered can be retrieved perfectly. This works well for text or binary files where the smallest error will be noticed.

File compression works by taking the file and scanning for patterns, and translating those patterns into something else which takes up less space.

For example "AAAAAAAA" could be turned into "8A".

Granted that's not how it works exactly because then you have the problem what if "8A" was in the plaintext. You would uncompress the file and it would be wrong. A good place to start is either Wikipedia or the LZW Data Compression Algorithm.

There is some simply psuedo-code for this copied below:

STRING = get input character
WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table then
        STRING = STRING+character
    ELSE
        output the code for STRING
        add STRING+CHARACTER to the string table
        STRING = CHARACTER
    END of IF
END of WHILE
output the code for STRING

All compression uses a lookup dictionary which is used to compress and decompress the file. The bigger the dictionary, the more you can compress it, although you do run into the Law of Diminishing Returns.

It's also worth noting that compression doesn't always yield a smaller file. There are situations (with small files, or when compressing random data) that you will not get a smaller file after compression. There have been some fun challenges pertaining to the ability to compress random data.

"Lossy" Compression

The above mostly pertains to lossless compression. Other types of compression used in video / audio applications such as MP3, JPG, and h.264 are examples lossy compression.

Lossy compression works by discarding data that is least likely to be noticed. In audio this is sounds about 30,000 Hrz and below 100 Hrz, along with other various things. In picture (static) it removes various things and merges pixles together, along with discarding data.

Lossy compression is a form of transform coding. It averages out data to reduce the overall size. For example a block of 10 pixels in an image, all of slightly different colors may be merged together to one color and thus compressed.

In video compression often instructions will be placed to only redraw pixels that have changed since the last frame, or keyframe.


Compression works by finding patterns in the data, then replacing these patterns with a special smaller patterns. Decompression is the inverse: find the special patterns, and replace them with the larger patterns they represent. Knowing what patterns are probable is important; for example, patterns found in text may be quite different than those found in images. Some compression techniques are lossy; they do not guarantee the expansion will recover the input exactly. This is usually fine for analog data, such as music and images, if the loss is small enough. But data such as text must be compressed with lossless techniques.

It is important to realize that it is impossible to compress, without loss, random data by even a single bit. Consider a file with N bits of binary data. There are 2^N possible files. If you compress any of these files by a single bit, so the compressed file is N-1 bits in size, there are only 2^(N-1) possible compressed representations. In other words, each possible compressed file must represent more than one possible uncompressed file. Without a unique compressed representation, the decompression algorithm cannot guarantee lossless decompression.