How do I efficiently generate and validate file checksums?

I'd like to be able to capture and validate checksums for large-scale collections of files, typically nested within a complex directory hierarchy.

Does every single file need a checksum? Are there ways to leverage the existing directory structure to, say, validate only a node in the file tree and not necessarily every file within?


Solution 1:

The most efficient way to use checksums is to make the computer do it all. Use a filesystem such as ZFS, which checksums (actually it uses hashes, which are stronger than a checksum) all data when it's written and verifies it every time it's read. The downside is that ZFS doesn't know when deleting or overwriting a file is a mistake and when it's normal operation, but because ZFS uses copy-on-write semantics for everything, you can use its snapshotting feature to mitigate the risk.

ZFS can also automatically repair data that fails a hash check by using whatever redundancy you've set up, whether RAID-5-style parity, drive mirrors or duplicate copies (set the copies=N property on any ZFS filesystem and it'll store N copies of any data you write). It also stores the hashes in a Merkle tree, where the hash of a file depends on the hashes of its blocks, the hash of a directory entry depends on the hashes of the files and directories it contains, the hash of a filesystem depends on the hash of the root directory, and so on.
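To make the Merkle idea concrete, here is a minimal Python sketch (this is just an illustration of the concept, not ZFS's on-disk format, and the path is a placeholder): a directory's hash is derived from its children's hashes, so a matching hash at any node vouches for every file beneath it, which is exactly the "validate only a node in the file tree" behaviour the question asks about.

    import hashlib
    from pathlib import Path

    def tree_hash(path: Path) -> str:
        """Merkle-style hash: a file hashes its contents; a directory hashes
        its children's names and hashes, so the root hash covers everything."""
        h = hashlib.sha256()
        if path.is_file():
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
        else:
            for child in sorted(path.iterdir()):
                h.update(child.name.encode())
                h.update(tree_hash(child).encode())
        return h.hexdigest()

    # Hash of the whole tree; hash any subdirectory to validate just that node.
    print(tree_hash(Path("/path/to/collection")))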

Regardless of what solution you end up with, you'll invariably find that the process is limited by the speed of your disks, not by the speed of your CPU.

Also, don't forget to take the bit error rate (BER) of your disks into account. They are, after all, mere plates of spinning rust. A consumer-level drive has an error rate of about 1 incorrectly-read bit per 10^14 bits read, which works out to roughly 1 bit in every 11 terabytes you read. If you have an 11-terabyte data set and compute the hash of every file in it, you will, on average, have computed one of those checksums incorrectly, and one block of one of the files in the data set will be silently corrupt. ZFS, however, knows the hash of every block it wrote to every disk in your pool, and therefore knows exactly which block was lost. It can then use the redundancy (parity, mirrors or extra copies) in your pool to rewrite the data in that block with the correct values. These safety features also apply when you use zfs send or receive to copy data from your primary system to the backups.
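For reference, the arithmetic behind that figure, as a rough Python sketch assuming the usual 1-error-per-10^14-bits spec:

    # Back-of-envelope: how much data you read, on average, per bit error
    bits_per_error = 10**14                      # typical consumer-drive spec
    tib_per_error = bits_per_error / 8 / 2**40   # bits -> bytes -> TiB
    print(f"~{tib_per_error:.1f} TiB read per expected bit error")  # ~11.4 TiB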

Ben brings up a good point in the comments, however: ZFS doesn't expose any of the hash values it computes to the user, so data that enters or leaves a ZFS system should be accompanied by hashes of its own. I like the way the Internet Archive does this, with an XML file that accompanies every item in the archive. See https://ia801605.us.archive.org/13/items/fakebook_the-firehouse-jazz-band-fake-book/fakebook_the-firehouse-jazz-band-fake-book_files.xml as an example.
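A minimal Python sketch of that idea, assuming a hypothetical source tree and writing a plain "digest  relative/path" manifest instead of the Archive's XML:

    import hashlib
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    root = Path("/data/collection")   # placeholder for your tree
    with open("manifest-sha256.txt", "w") as manifest:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                manifest.write(f"{sha256_file(p)}  {p.relative_to(root)}\n")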

Solution 2:

I would generate a checksum for each file. Checksums are very small, and generating a single checksum for the whole directory would require you to process every file anyway (unless you mean a checksum built only from the directory entries themselves; I would create those as well, to ensure no file has been deleted).

Assume you have one checksum for the whole archive. When it fails, you know the data is corrupted, but you don't know whether the damage is confined to one file or, more importantly, which file it is. Having separate checksums gives you more flexibility: you can detect the single file that is corrupted and replace it with the copy from another backup (which may, in turn, have a different file corrupted).

That way, your data is more likely to survive.
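As a sketch of that workflow in Python, assuming a hypothetical manifest-sha256.txt with one "digest  relative/path" line per file, verification can report exactly which files need to be restored from another backup:

    import hashlib
    from pathlib import Path

    root = Path("/data/collection")   # placeholder; must match the manifest
    corrupted = []
    for line in open("manifest-sha256.txt"):
        digest, rel = line.rstrip("\n").split("  ", 1)
        h = hashlib.sha256()
        with (root / rel).open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != digest:
            corrupted.append(rel)
    print(f"{len(corrupted)} corrupted file(s): {corrupted}")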

Solution 3:

Maybe this is a good time to bring up BagIt. It is a very simple yet powerful file packaging format intended for archiving, long-term preservation, and transfer of digital objects. Users include the Library of Congress and the California Digital Library.

A BagIt tool (implementations exist in several programming languages) puts your files into a standard directory structure and does the checksumming/hashing for you. That is all.

PS: Of course, BagIt tools can also verify bags against the included checksums/hashes, and you can add some metadata to bags. But that's as complex as bags get.
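For example, with the Library of Congress's bagit-python library (pip install bagit; the path and metadata below are placeholders), creating and later validating a bag is only a few lines:

    import bagit

    # Create a bag in place: files are moved into a data/ subdirectory and
    # checksum manifests plus bag metadata are written alongside them.
    bag = bagit.make_bag("/path/to/collection",
                         {"Source-Organization": "Example Org"})

    # Later: re-open the bag and verify every payload file against the manifests.
    bag = bagit.Bag("/path/to/collection")
    bag.validate()   # raises bagit.BagValidationError if any checksum no longer matches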