Why is it good practice to compare checksums when downloading a file?

Solution 1:

As others have noted, data can be corrupted in many ways that no transport-layer checksum can catch: corruption that occurs before the sender calculates the checksum, a MITM intercepting and modifying the stream (data as well as checksums), corruption that occurs after the receiver has validated the checksum, and so on.

If we set these other possibilities aside and focus on the TCP checksum itself and what it actually does to validate data integrity, it turns out that this checksum is far from comprehensive at detecting errors. The choice of algorithm reflects the requirement for speed, given the time period (late 1970s).

This is how the TCP checksum is calculated:

Checksum: 16 bits

The checksum field is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header and text. If a segment contains an odd number of header and text octets to be checksummed, the last octet is padded on the right with zeros to form a 16 bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.

This means that any corruption that balances out when the data is summed this way goes undetected. Several categories of corruption slip through, but as a trivial example: changing the order of the 16-bit words always goes undetected.
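To make the reordering point concrete, here is a minimal Python sketch of the one's complement sum described in the quoted text. The function name `internet_checksum` and the sample bytes are my own choices for illustration. A real TCP checksum also covers a pseudo-header and the TCP header, which this sketch leaves out; it only sums whatever bytes you hand it.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length input on the right with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Illustrative payload: four 16-bit words.
original = bytes.fromhex("0001f203f4f5f6f7")
# Same words in a different order: the data is different...
reordered = bytes.fromhex("f6f70001f4f5f203")

print(hex(internet_checksum(original)))   # 0x220d
print(hex(internet_checksum(reordered)))  # 0x220d, the change is invisible
```

Because addition is commutative, any permutation of the 16-bit words gives the same sum, so the receiver would accept the reordered data as valid.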


In practice, it catches many typical errors but does not at all *guarantee* integrity. It also helps that layer 2 performs its own integrity checks (e.g. CRC32 of Ethernet frames), albeit only for transmission on the local link, so many cases of corrupted data never even reach the TCP stack.
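As a rough illustration of why a CRC is a stronger check than a plain sum, the sketch below runs Python's `zlib.crc32` over data in which two adjacent 16-bit words have been swapped, which is exactly the kind of change a one's complement sum cannot see. I am assuming `zlib.crc32` is a fair stand-in for the Ethernet frame check here; the sample bytes are made up, and real frames are checked by the network hardware, not like this.

```python
import zlib

# Illustrative payload: four 16-bit words.
original = bytes.fromhex("0001f203f4f5f6f7")
# The first two 16-bit words swapped; a simple 16-bit sum would not notice.
swapped = bytes.fromhex("f2030001f4f5f6f7")

crc_original = zlib.crc32(original)
crc_swapped = zlib.crc32(swapped)

# The two values differ, so this corruption would be caught.
print(f"{crc_original:08x} {crc_swapped:08x}")
print("detected:", crc_original != crc_swapped)
```

A 32-bit CRC detects every error confined to a span of 32 bits or fewer, which is why this particular swap cannot slip through. It is still not a defence against deliberate tampering.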

Validating the data using a strong hash, or preferably a cryptographic signature, is on a whole different level in terms of ensuring data integrity. The two can barely even be compared.
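For what it's worth, checking a download against a published hash can be sketched in a few lines of Python. The helper names (`sha256_of_file`, `matches_published`) and the choice of SHA-256 are my assumptions for the example; use whatever algorithm the publisher actually lists, and get the published value from a source you trust.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large ISO does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the local file's hash with the hex string the publisher lists."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

Usage would look something like `matches_published("image.iso", "<hash copied from the download page>")`, where both arguments are placeholders. A single missing or altered byte anywhere in the file gives a completely different digest, so a truncated or corrupted download fails the check.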

Solution 2:

There are probably a zillion reasons why one should check the md5sum, but a few come to mind:

  • Malicious activity - your ISO could have been tampered with on the way from the server
  • The page itself is spoofed (it's best to have the md5sums signed as well :) )
  • Broken download (despite TCP error correction)
  • ISO burnt incorrectly

And it only takes a few seconds anyway.

Solution 3:

TCP/IP does guarantee data integrity*. But it does not guarantee that 100% of a file has been downloaded, and there are many reasons why a download can end up incomplete. For example, you may be able to mount an ISO that is missing one or two bytes somewhere in the middle. You won't notice a problem until you need the one or two particular files that are corrupt. Comparing checksums ensures that you really did download the whole file.

* see comment