How accurate is `md5sum`?

MD5 is broken for this purpose against an intelligent adversary. It is possible to maliciously construct two different blocks of data that produce the same MD5 hash.

However, MD5 is entirely suitable (though there are almost certainly better choices) for protecting against inadvertent data corruption in transit or in storage. While it is conceivable that such corruption could leave the MD5 hash unchanged, the probability is so low that it is not worth worrying about. Failures caused by background radiation, tunneling, static, and dozens of other sources would be orders of magnitude more probable.
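
As a rough illustration of that kind of integrity check, here is a minimal Python sketch that does the same job as comparing `md5sum` output by hand. The function names `md5_hex` and `is_intact` and the chunk size are my own choices for the example, not anything standard.

```python
import hashlib

def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 digest of a file as a hex string, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_intact(path, expected_hex):
    """True if the file's MD5 matches a checksum recorded earlier."""
    return md5_hex(path) == expected_hex.strip().lower()
```

You would record `md5_hex(path)` before the transfer and call `is_intact(path, recorded_value)` afterwards. A `False` result means the file definitely changed; a `True` result means accidental corruption is extremely unlikely, not that tampering is ruled out.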

Even if you had a quadrillion units of data, the probability that a corrupted unit would produce an MD5 hash belonging to one of those quadrillion units is much less than one in a quadrillion.
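
The arithmetic behind that claim can be checked directly. This sketch assumes the hash of accidentally corrupted data behaves like a uniformly random 128-bit value, which is the usual working assumption for non-adversarial corruption.

```python
from fractions import Fraction

units = 10**15        # a quadrillion stored units of data
hash_space = 2**128   # number of possible MD5 values

# Upper bound on the chance that one corrupted unit's hash
# happens to equal the hash of any of the stored units.
p_match = Fraction(units, hash_space)

print(float(p_match))                  # on the order of 1e-24
print(p_match < Fraction(1, 10**15))   # True: far below one in a quadrillion
```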


MD5 is a hash. It maps the entire content of a file to a short, fixed-size value that is 16 bytes (128 bits) long.
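
A quick way to confirm the digest size, using Python's standard `hashlib` (the sample inputs are arbitrary):

```python
import hashlib

small = hashlib.md5(b"a").digest()
large = hashlib.md5(b"a" * 1_000_000).digest()

# The digest length does not depend on the input length.
print(len(small), len(large))               # 16 16
print(hashlib.md5().digest_size * 8)        # 128 bits
print(len(hashlib.md5(b"a").hexdigest()))   # 32 hex characters, the form md5sum prints
```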

Since there are far more possible files than possible 16-byte values, there will obviously be multiple files which hash to the same MD5 sum. Therefore, a matching MD5 sum is no guarantee of an exact match between files.

There is no threshold as such because of the way hashes work: an MD5 sum can detect even a single-bit change. However, a particular combination of many bit changes may leave the MD5 hash the same. It is therefore quite reasonable to use MD5 to validate file integrity against random corruption, but not if malicious intent is possible, as an attacker can craft two different files with the same MD5 hash.


An MD5 hash consists of 128 bits. A single flipped bit in the source flips (on average) 64 bits in the hash.
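
You can observe this avalanche behaviour yourself. The sketch below flips each bit of a sample message in turn and averages how many of the 128 hash bits change; the helper names and the sample message are made up for the example.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the bit at position index inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)

def average_flipped_bits(data: bytes) -> float:
    """Average number of MD5 output bits that change per single-bit input flip."""
    base = hashlib.md5(data).digest()
    n_bits = len(data) * 8
    total = sum(
        bit_difference(base, hashlib.md5(flip_bit(data, i)).digest())
        for i in range(n_bits)
    )
    return total / n_bits

message = b"The quick brown fox jumps over the lazy dog"
print(average_flipped_bits(message))  # should come out close to 64
```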

The probability of two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However, if you keep all hashes, then thanks to the birthday paradox the probability is a bit higher. To have a 50% chance of any two hashes colliding you need roughly 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
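
These figures can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−n(n−1)/(2N)) for n hashes drawn uniformly from N possible values. The sketch below assumes MD5 outputs are uniformly distributed over 2^128 values; the function names are mine. It suggests 2^64 is an order-of-magnitude figure: that many hashes gives roughly a 39% chance, and 50% needs about 1.18 × 2^64.

```python
import math

SPACE = 2**128  # number of possible MD5 values

def collision_probability(n: int) -> float:
    """Approximate chance that at least two of n random hashes collide."""
    return -math.expm1(-n * (n - 1) / (2 * SPACE))

def hashes_for_probability(p: float) -> float:
    """Approximate number of hashes needed to reach collision probability p."""
    return math.sqrt(2 * SPACE * math.log(1 / (1 - p)))

seconds_in_100_years = 100 * 365.25 * 24 * 3600

print(collision_probability(2**64))          # roughly 0.39
print(hashes_for_probability(0.5) / 2**64)   # roughly 1.18
print(2**64 / seconds_in_100_years)          # roughly 5.8e9 hashes per second
```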

Source: porneL, https://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions