Hash Code and Checksum - what's the difference?
My understanding is that a hash code and checksum are similar things - a numeric value, computed for a block of data, that is relatively unique.
i.e. The probability of two blocks of data yielding the same numeric hash/checksum value is low enough that it can be ignored for the purposes of the application.
So do we have two words for the same thing, or are there important differences between hash codes and checksums?
I would say that a checksum is necessarily a hashcode. However, not all hashcodes make good checksums.
A checksum has a special purpose --- it verifies or checks the integrity of data (some can go beyond that by allowing for error-correction). "Good" checksums are easy to compute, and can detect many types of data corruptions (e.g. one, two, three erroneous bits).
A hashcode simply describes a mathematical function that maps data to some value. When used as a means of indexing in data structures (e.g. a hash table), a low collision probability is desirable.
There is a different purpose behind each of them:
- Hash code - designed to be random across its domain (to minimize collisions in hash tables and such). Cryptographic hash codes are also designed to be computationally infeasible to reverse.
- Check sum - designed to detect the most common errors in the data and often to be fast to compute (for effective checksumming fast streams of data).
In practice, the same functions are often good for both purposes. In particular, a cryptographically strong hash code is a good checksum (it is almost impossible that a random error will break a strong hash function), if you can afford the computational cost.