Compress Similar Files Efficiently

Solution 1:

The Lempel-Ziv-Welch (LZW) compression algorithm is inherently computationally intensive, and the majority of that work is computing the dictionary itself; that is simply how LZW works.

The algorithm adds one new dictionary entry for each "symbol" it scans, so every iteration extends the dictionary. In effect, the dictionary becomes the compressed copy of the file, and is therefore the only thing LZW compression spends any significant time computing in the first place.
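To make that concrete, here is a minimal LZW encoder sketch in Python (a toy illustration of the technique, not code from any particular tool; the input string is arbitrary). Note how exactly one new dictionary entry is created for each code that is emitted, so the dictionary is rebuilt from scratch for every input and cannot simply be reused across files:

    def lzw_compress(data: bytes) -> list[int]:
        """Toy LZW encoder: returns the list of dictionary codes for `data`."""
        # Seed the dictionary with every single-byte sequence (codes 0-255).
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256

        output = []
        current = b""
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                # Keep extending the current match.
                current = candidate
            else:
                # Emit the code for the longest known match...
                output.append(dictionary[current])
                # ...and add exactly one new entry to the dictionary.
                dictionary[candidate] = next_code
                next_code += 1
                current = bytes([byte])
        if current:
            output.append(dictionary[current])
        return output

    codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
    print(len(codes), "codes emitted for", len(b"TOBEORNOTTOBEORTOBEORNOT"), "input bytes")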


If you used something like Huffman encoding, dictionary re-use would indeed be possible (at the expense of a possibly sub-optimal compression ratio). However, most modern compression algorithms and tools use LZW for efficiency and speed: Huffman compression requires two passes over the data (one to generate the Huffman tree/table, another to actually compress the data), whereas LZW can be completed in a single pass.
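For contrast, here is a rough sketch of those two passes (again a toy illustration with my own naming, not any real tool's implementation): the first pass only counts symbol frequencies and derives the code table, and only the second pass produces compressed output.

    import heapq
    from collections import Counter

    def huffman_codes(data: bytes) -> dict[int, str]:
        """First pass: scan the data to build the frequency and code tables."""
        freq = Counter(data)
        # Heap entries: (frequency, tie-breaker, {symbol: code_so_far}).
        heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            n1, _, left = heapq.heappop(heap)
            n2, i, right = heapq.heappop(heap)
            # Merge the two least frequent subtrees, prefixing their codes with 0/1.
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (n1 + n2, i, merged))
        return heap[0][2]

    def huffman_encode(data: bytes) -> str:
        codes = huffman_codes(data)               # pass 1 over the data
        return "".join(codes[b] for b in data)    # pass 2 over the data

    bits = huffman_encode(b"TOBEORNOTTOBEORTOBEORNOT")
    print(len(bits), "bits of compressed output")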

Solution 2:

Unlike the DEFLATE algorithm, 7-Zip's LZMA uses solid compression by default, which takes advantage of inter-file redundancy. This works with the default settings as long as the files are small enough.

With the default Solid Block size of 2 GB, a 16 GB file is actually compressed as 8 separate chunks.

As @Breakthrough already said, the dictionary gets generated on the fly. You can verify this empirically by comparing the results of setting Solid Block size to Solid (compress all files at once) versus Non-solid (compress each file separately).

Increasing the Solid Block size will slow things down, but it can yield a much better compression ratio. For example, compressing two identical files produces an archive almost twice as big with non-solid compression.
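You can see the same effect with Python's built-in lzma module (the same LZMA family 7-Zip uses, although this is a sketch using in-memory streams rather than a real 7z archive, and the 1 MB of random data is an arbitrary choice):

    import lzma
    import os

    # Two "files" with identical contents; random bytes are individually
    # incompressible, but the two copies are fully redundant with each other.
    file_a = os.urandom(1_000_000)
    file_b = file_a

    # "Solid": both files pass through one LZMA stream, so the second copy
    # can be encoded as a back-reference into the first.
    solid = len(lzma.compress(file_a + file_b))

    # "Non-solid": each file is compressed in its own independent stream.
    non_solid = len(lzma.compress(file_a)) + len(lzma.compress(file_b))

    print("solid:    ", solid, "bytes")
    print("non-solid:", non_solid, "bytes")  # roughly twice the solid size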