How reliable are SHA-1 and MD5 sums on very large files?

Solution 1:

MD5 and SHA-1 are both fine for detecting accidental damage or changes to files. The probability of an accidentally changed file having the same MD5 digest is one in 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456. The probability of an accidental SHA-1 collision is even smaller, one in 2^160. If we're talking about finding accidental matches among a collection of files (known as the birthday problem), you'd need about 2^64 = 18 billion billion files before an MD5 collision becomes likely. Note that the size of the files does not matter; it's the number of files involved that matters.
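To get a feel for the birthday numbers, here is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)). The function name `collision_probability` is just something I made up for illustration, and the formula assumes the digests behave like uniformly random values, which only holds for accidental (non-malicious) inputs.

```python
import math

def collision_probability(num_files: int, digest_bits: int) -> float:
    """Approximate chance that at least two of num_files random digests collide.

    Hypothetical helper; assumes digests are uniformly random (no attacker).
    """
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    exponent = -num_files * (num_files - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

if __name__ == "__main__":
    # A billion files hashed with MD5 (128 bits) or SHA-1 (160 bits).
    print(collision_probability(10**9, 128))
    print(collision_probability(10**9, 160))
    # Around 2^64 files, an MD5 collision is no longer far-fetched.
    print(collision_probability(2**64, 128))
```

Note that file size appears nowhere in the calculation; only the number of files and the digest width do.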

But neither MD5 nor SHA-1 is sufficient to protect against malicious substitution of files, or to provide a reliable unique ID for files, because colliding files can be constructed deliberately for both. For example, if you use either one, someone could give you one file, have you calculate the hash digest, then trick you by swapping it for another file with the same hash. Or they could submit two files with the same hash, which might confuse your system.

BTW, the accidental/malicious distinction is a bit loose. Suppose someone found the two PDFs that Google produced with the same SHA-1 hash, thought "That's cool! I should save these for later", and then tried to use your system to store and distribute them... thus breaking the system sort-of by accident. If something like that is conceivable, you're better off going with SHA-256 instead.
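If you do go with SHA-256, a minimal sketch in Python could look like the following. It uses the standard `hashlib` module and reads the file in chunks so that very large files don't have to fit in memory. The function name `sha256_of_file` and the 1 MiB chunk size are my own arbitrary choices, not anything prescribed.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at path.

    Hypothetical helper; the 1 MiB default chunk size is an arbitrary choice.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding the data in chunks gives the same digest as hashing it all at once, so the result does not depend on the chunk size you pick.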

EDIT: BitErrant is similar to the scenario with the colliding PDFs described above: it's an exploit against BitTorrent, taking advantage of the fact that BitTorrent uses SHA-1 checksums as IDs for chunks of files.