Is there a compression or archiver program for Windows that also does deduplication? [closed]
I'm looking for an archiver program that can perform deduplication (dedupe) on the files being archived. Upon unpacking the archive, the software would put back any files it removed during the compression process.
So far I've found:
- http://www.exdupe.com/
- http://archiver.reasonables.com/
Anyone aware of any others?
This would probably be an awesome addition to 7-zip.
Almost all modern archivers do exactly this; the only difference is that they call it a "solid" archive, meaning all of the files are concatenated into a single stream before being fed to the compression algorithm. This is different from standard zip compression, which compresses each file one by one and adds each compressed file to the archive.
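To make the difference concrete, here is a minimal Python sketch (not how any real archiver is implemented) comparing per-file compression with compressing the same two files as one solid stream; the file contents are made up for illustration:

```python
import os
import zlib

file_a = os.urandom(4096) * 4   # a 16 KiB "file" with some internal repetition
file_b = file_a                 # a second file that duplicates the first

# zip-style: each file is compressed on its own
per_file = len(zlib.compress(file_a)) + len(zlib.compress(file_b))

# solid-style: both files are fed to the compressor as one stream
solid = len(zlib.compress(file_a + file_b))

print(per_file, solid)
# The solid stream comes out much smaller, because the second copy is
# encoded as back-references to data the compressor has just seen.
```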
7-Zip by its very nature effectively achieves de-duplication. For example, 7-Zip will scan the files to be archived and sort them by similar file types and file names, so two files of the same type and similar data end up side by side in the stream going to the compressor. The compressor then sees a lot of data it has seen very recently, and those two files compress far better than they would if compressed one by one.
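The exact heuristics are 7-Zip's own and are not reproduced here; the following is only a rough sketch of the idea, grouping files by extension and then by name so that similar content ends up adjacent in the solid stream:

```python
# A rough approximation (not 7-Zip's actual algorithm) of ordering files
# so that similar content sits next to each other before compression.
from pathlib import Path

def solid_order(paths):
    # Group by extension first, then by file name.
    return sorted(paths, key=lambda p: (Path(p).suffix.lower(), Path(p).name.lower()))

print(solid_order(["b.txt", "a.doc", "c.doc", "a.txt"]))
# ['a.doc', 'c.doc', 'a.txt', 'b.txt']
```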
Linux has shown similar behaviour for a long time through the prevalence of the ".tgz" format (or ".tar.gz", to use its full form): tar simply merges all of the files into a single stream (albeit without sorting and grouping the files) and then compresses that stream with gzip. What this misses is the sorting that 7-Zip does, which may slightly reduce efficiency, but it is still a lot better than blobbing together a lot of individually compressed files in the way that zip does.
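For comparison, the tar-then-gzip pipeline is easy to reproduce with Python's standard library; the file names below are hypothetical:

```python
# tar concatenates the files into one stream in the order given (no
# sorting or grouping), and gzip then compresses that single stream.
import tarfile

with tarfile.open("backup.tar.gz", "w:gz") as tar:
    for path in ["report.doc", "report_copy.doc", "notes.txt"]:
        tar.add(path)
```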
However, 7-Zip, zip, gzip and other conventional archivers do not detect identical regions that are far apart from each other, for example a few megabytes or more apart, whether inside the same file or at different positions in different files. Such regions fall outside the compressor's match window or dictionary (only 32 KB for zip and gzip, and still bounded for 7-Zip).
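A quick sketch of this limitation using zlib (deflate, whose match window is only 32 KiB); the sizes here are arbitrary:

```python
import os
import zlib

block = os.urandom(1024 * 1024)     # 1 MiB of incompressible data
gap = os.urandom(4 * 1024 * 1024)   # 4 MiB separating the two copies

together = len(zlib.compress(block + gap + block))
separate = len(zlib.compress(block)) + len(zlib.compress(gap)) + len(zlib.compress(block))

print(together, separate)
# 'together' is roughly the same size as 'separate': the duplicate block
# was not detected, because it lies far outside the 32 KiB match window.
```

A dedupe-aware tool such as eXdupe works at a different level, recognising large duplicate chunks regardless of how far apart they are.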
So no, in situations like these, normal archivers do not perform as well as eXdupe and similar tools. You can see this if you compress, for example, a set of virtual machine images.
There is no point in using deduplication with a compression process. Most compression algorithms build what is called a "dictionary" of the most common or reused chunks of data; from then on, the compressor just writes a reference to the dictionary entry instead of writing the whole "word" out again. In this way most compression processes already cut out redundant or duplicate data across all of the files.
For example, if you take a 1 MB file and copy it 100 times with a different name each time (totalling 100 MB of disk space) and then compress the copies into a solid 7-Zip archive, the archive ends up only marginally larger than a single compressed copy. That is because the data effectively enters the dictionary once and is referenced again for each duplicate, which takes up very little space. (A plain zip would not show this, since it compresses each file separately.)
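As a rough illustration of that example, here is a sketch using Python's lzma module (the same LZMA family of compressors that 7-Zip uses), with made-up data; it is not meant to reproduce 7-Zip's exact behaviour:

```python
import os
import lzma

original = os.urandom(1024 * 1024)   # one 1 MiB "file" of random data
hundred_copies = original * 100      # 100 MiB fed to the compressor as one solid stream

print(len(lzma.compress(original)))          # roughly 1 MiB (random data does not compress)
print(len(lzma.compress(hundred_copies)))    # only a little larger than one copy, not 100x
# Compressing 100 MiB may take a little while; copies 2..100 are stored
# as back-references into the compressor's dictionary.
```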
This is a very simplified explanation of what happens, but it conveys the basic idea.