md5sum on large files

Solution 1:

Verifying contents by sampling only the first megabyte of a file will likely fail to detect larger files that have been corrupted, damaged or altered. The reason is that you're giving the hashing algorithm only one megabyte of data when there may be hundreds of other megabytes that could be off. Even a single flipped bit produces a different signature, but only if that bit falls within the data the algorithm actually sees.
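
To make that concrete, here's a small demo (a sketch assuming GNU coreutils dd, head and md5sum; the file name big.bin is made up) showing that a corrupted byte past the first megabyte leaves the sampled hash unchanged:

# Create a 10 MB test file, then hash only its first megabyte
dd if=/dev/urandom of=big.bin bs=1M count=10 2>/dev/null
head -c 1M big.bin | md5sum     # sample hash before corruption

# Overwrite one byte at offset ~5 MB, well past the sampled region
printf 'X' | dd of=big.bin bs=1 seek=5000000 conv=notrunc 2>/dev/null

head -c 1M big.bin | md5sum     # same hash, yet the file is now corrupt
md5sum big.bin                  # the full-file hash does change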

If data integrity is what you want to verify, you're better off with the CRC32 algorithm, which is faster than MD5. Although it is possible to deliberately forge or modify a file so that it still appears to have the correct CRC32 signature, random corruption is extremely unlikely to ever do that.
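
As a sketch, a full-file CRC pass over a tree could look like this (cksum is the POSIX CRC tool; /tmp/checksums.crc is just an example path, kept outside the tree being checked, and the sort keeps the two runs comparable):

# Record a CRC, byte count and name for every file
find . -type f -exec cksum {} + | sort -k3 > /tmp/checksums.crc

# Later: recompute and compare; diff prints nothing if all files still match
find . -type f -exec cksum {} + | sort -k3 | diff /tmp/checksums.crc -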

Update:

Here's a nice one-liner that computes the 1-megabyte-sample md5 checksum of every file:

find ./ -type f -print0 | xargs -0 -I{} sh -c 'echo "$1" >> output.md5 && head -c 1M "$1" | md5sum >> output.md5' sh {}

Replace md5sum with cksum if you feel like it. Notice that I chose to include the filename in the output, and that the filename is passed to the inner shell as a positional argument rather than spliced into the command string, so names containing quotes or shell metacharacters can't break it. The echo is needed because md5sum only prints a filename when it reads the file itself; when the data arrives on stdin, as it does here, it prints - instead.
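
For illustration (the file name here is hypothetical):

head -c 1M somefile | md5sum    # prints: <hash>  -
md5sum somefile                 # prints: <hash>  somefile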