MD5 and SHA1 checksum uses for downloading

I've noticed that when downloading a lot of open source tools (Eclipse, etc.) there are links for MD5 and SHA1 checksums, and I don't know what these are or what their purpose is.

I know these are hashing algorithms, and I do understand hashing, so my only guess is that these are used for hashing some component of the download targets, and to compare them with "official" hash strings stored server-side. Perhaps that way it can be determined whether or not the targets have been modified from their correct version (for security and other purposes).

Am I close or completely wrong, and if wrong, what are they?!?!

Thanks!


You're almost completely right. The only correction is that they are hashes of the whole file.

Files can sometimes be corrupted during download, whatever method is used to transfer them. Hashes are there to make sure that the file is intact. This is especially useful to users with bad Internet connections. Back when I was using a fax modem, I'd often have problems with corrupt downloads.

Some download managers (GetRight, if I remember correctly) can even automatically calculate the hash of the file and compare it to a known value.
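If your download manager doesn't do this, the check is easy to do by hand. Here is a minimal sketch in Python using the standard hashlib module. The function names file_digest and matches_published are my own, not from any particular tool, and the file is read in chunks only so that large downloads don't need to fit in memory.

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of the whole file at `path`."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read the file piece by piece until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published(path, published_hex, algorithm="sha1"):
    """Compare the computed digest with the value published on the download page."""
    # Published values are sometimes upper case or followed by whitespace.
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

You would call it as matches_published("downloaded-file.zip", "<hash copied from the download page>", "md5"), where both the file name and the hash string are placeholders for whatever you actually downloaded.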

Another interesting point is security. A potential problem with open source tools is how much you can trust the distributor. Programs such as Eclipse are often the main tool used by software companies, so it is extremely important that they get from the developer to the user intact. Since the programs are open source, it is possible, for example, to build an infected version that looks normal but leaks source code to some remote server, or infects programs built with it with a virus (I think this actually happened to some version of Delphi), or something similar. For that reason, it is important to have an official, correct hash which can be used to check that the distributed file is what it claims to be.

Some thoughts about distribution channels. Free software can often be found on a large number of sites, and the most popular sites, like SourceForge, have a large number of mirrors. Let's say there's a server in Barland which mirrors a large software distribution site. FooSoft uses a program distributed by that site, and they are in the Republic of Baz, which is right next to Barland. If someone wanted to infiltrate FooSoft, he could modify just the copy on the Barland mirror and hope that geolocation software would then make sure that FooSoft gets the modified version. Since the versions on the other mirrors are fine, the chances that the malware would be detected are lower. You could also make the malware detect the computer's IP address and activate only if it's in a certain range, lowering the chances of discovery even further, and so on.


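MD5 and SHA1 are not just checksums. They are cryptographic checksums. This means that in theory, two different files might have the same checksum, but the probability of this happening by accident is very small, almost 0. As a consequence, you use the reverse: different checksums mean you got different content, and matching checksums mean you got the same content with a probability of almost 1. So cryptographic checksums are used to detect changes in files. These can be malicious changes made on purpose or just errors that happened during download. Note that practical collision attacks are known for both MD5 and SHA1, so against deliberate tampering a stronger hash such as SHA-256 is the safer choice when one is offered.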
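To see how sensitive these checksums are, here is a small sketch in Python (standard hashlib only). The byte string is made-up stand-in data, not a real download, and flipping one bit is just my way of simulating corruption in transit.

```python
import hashlib

# Stand-in for the bytes of a downloaded file (made-up example data).
original = b"pretend this is the content of a downloaded archive"

# Flip the lowest bit of the first byte to simulate a tiny corruption.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

for name in ("md5", "sha1"):
    good = hashlib.new(name, original).hexdigest()
    bad = hashlib.new(name, corrupted).hexdigest()
    # Print both digests and whether they agree.
    print(name, good, bad, "match" if good == bad else "differ")
```

Even though the two inputs differ by a single bit, the two digests printed on each line look completely unrelated, which is why comparing the digest is enough to notice that something changed.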