How does one check the identity of huge files if hashing is CPU bound?
Solution 1:
My own best solution at the moment is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' \
-k -j …NUMofProcessesSay4… md5sum | md5sum
It should be noted that:
- The resulting md5 hash isn't of the file itself, but of the md5s of its parts; still, it lets you compare whether a replica is identical to the origin.
- It also doesn't perform very well, especially when you use a pipe and not a file as input; parallel's --pipepart, as I found out, doesn't support disk partitions.
So I'd love to hear about other approaches as well.
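For reference, a rough single-process Python sketch of the same "hash of chunk hashes" idea, only to make explicit what value is being compared. It is not byte-for-byte identical to the pipeline's output (md5sum adds its own per-line suffixes), and the chunk size has to match on both sides for the results to be comparable:

import hashlib

def md5_of_chunk_md5s(path, chunk_size=512 * 1024 * 1024):
    # MD5 each fixed-size chunk, then MD5 the concatenation of the digests.
    # The result is not the MD5 of the file, but for a given chunk size it is
    # stable, so origin and replica can be compared by this value alone.
    outer = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)  # 512M per read; lower this if memory is tight
            if not chunk:
                break
            outer.update(hashlib.md5(chunk).hexdigest().encode('ascii'))
    return outer.hexdigest()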
Solution 2:
Unfortunately, MD5 is a linear process whose state depends on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any mainstream hash algorithm that doesn't operate this way.
What you can do (and, based on your answer, are already doing) is split the source file and concurrently calculate each chunk's md5sum.
If you can't or won't do that, you'll have to use a faster hash function such as xxHash, CityHash or SpookyHash.
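For example, with the third-party xxhash package for Python (an assumption here; any binding with a streaming update() interface works the same way), a fast non-cryptographic file hash looks like this:

import xxhash  # third-party: pip install xxhash

def xxh64_file(path, block_size=1 << 20):
    # xxHash is far faster per byte than MD5, but it is not cryptographic;
    # use it for corruption/identity checks, not where an attacker is involved.
    h = xxhash.xxh64()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()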
Another idea (maybe applicable to your intended usage): if you need something faster than MD5 (albeit still single-threaded), you can use CRC32 (which is hardware-accelerated on recent CPUs) for a fast first pass, resorting to MD5/SHA1 for a second pass on seemingly identical files.
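A minimal sketch of that two-pass idea, assuming zlib.crc32 for the first pass (whether it actually hits the CPU's CRC32 instructions depends on the zlib build) and MD5 only for files whose CRC32 already matches:

import hashlib
import zlib

def crc32_file(path, block_size=1 << 20):
    # Cheap first pass; collisions are far more likely than with MD5.
    crc = 0
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def md5_file(path, block_size=1 << 20):
    # Slower second pass, run only when the CRC32 values already agree.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

def probably_identical(path_a, path_b):
    if crc32_file(path_a) != crc32_file(path_b):
        return False
    return md5_file(path_a) == md5_file(path_b)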
Solution 3:
There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).
Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.
If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (this doesn't need to be done on disk; you'd just start reading from specific offsets) and processing each chunk separately, though this will cause serious disk thrashing and degrade performance, especially on mechanical disks. You'll end up with a separate hash for each chunk (which has other advantages, like pointing you toward the broken chunk), but you can always hash them together for one final value.
This Gist could be a good start for something in Python.
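A rough sketch of what that might look like in Python (not the Gist itself; just an illustration of reading chunks at fixed offsets in parallel and then hashing the per-chunk hashes together, with arbitrary chunk size and worker count):

import hashlib
import os
from multiprocessing import Pool

CHUNK_SIZE = 256 * 1024 * 1024  # arbitrary; tune for your disks and core count

def _hash_chunk(args):
    path, offset, length = args
    h = hashlib.md5()
    with open(path, 'rb') as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(1 << 20, remaining))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    return offset, h.hexdigest()

def parallel_chunk_hashes(path, workers=4):
    size = os.path.getsize(path)
    tasks = [(path, off, min(CHUNK_SIZE, size - off))
             for off in range(0, size, CHUNK_SIZE)]
    with Pool(workers) as pool:  # guard with if __name__ == '__main__' on Windows
        results = pool.map(_hash_chunk, tasks)  # map preserves chunk order
    # Per-chunk hashes can point at a broken chunk; hash them together
    # for a single final value.
    final = hashlib.md5(''.join(d for _, d in results).encode('ascii'))
    return [d for _, d in results], final.hexdigest()

As noted above, running the chunk reads concurrently on a mechanical disk will cause heavy seeking, so this mostly pays off when the storage is fast relative to a single hashing core.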