md5sum of large files gives different results sometimes

I have an AMD quad core, 8 GB RAM, 1 SSD (EXT2, 2 months old) and 2 HDDs (EXT4, approximately 1 year old). I'm using Ubuntu 10.04 x86-64, and when I compute the md5sum of large files (9 GB) I sometimes get a value different from the one stored in a reference file.

After switching the PC off and restarting it, I get the expected result no matter how many times I repeat the check. But the mismatches occur at random.

I've turned on ECC (the fastest possible settings) and the issue seems to be rarer, but I've run memtest86+ for 6+ hours without a glitch (and with ECC off!).

Any ideas? Should I update the BIOS of my motherboard (Asus EVO-something... I don't remember it now)? I've tried everything else apart from this, and I genuinely don't know what to do anymore...

Any suggestion is appreciated!


Solution 1:

Is your RAM all the same? I had this happen after I bought more RAM and got some that was faster than what was already in the box. According to the specs for the mobo, it should have worked with mixed speeds, basically clocking all modules down to the slowest speed. Each set worked fine by itself if I took out the other, but together something would go wrong: the box worked for the most part, but there were clearly problems. I did the checksums just like you described and had the same mismatches. I even ran memtest overnight and got the same result as you. I eventually just took the loss on the RAM and scrapped the smaller of the two sets.
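The repeated checksum test described above can be automated. The following is only a minimal sketch, assuming Python 3 is available; the function names `md5_of_file` and `find_mismatched_runs` and the script name in the usage comment are made up for illustration. It hashes the same file several times and reports which runs disagree with the reference value.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_runs(path, expected_md5, runs=10):
    """Hash the file `runs` times; return the 1-based run numbers that mismatched."""
    expected = expected_md5.strip().lower()
    bad_runs = []
    for run in range(1, runs + 1):
        if md5_of_file(path) != expected:
            bad_runs.append(run)
    return bad_runs


# Usage (hypothetical script name): python md5_repeat.py FILE EXPECTED_MD5 [RUNS]
if __name__ == "__main__" and len(sys.argv) >= 3:
    run_count = int(sys.argv[3]) if len(sys.argv) > 3 else 10
    bad = find_mismatched_runs(sys.argv[1], sys.argv[2], run_count)
    print("mismatched runs:", bad if bad else "none")
```

Keep in mind that Linux may serve repeated reads from the page cache, so a file that fits in RAM may be re-read from memory rather than from disk. Running the test once with one set of modules and once with the other may help narrow down which combination misbehaves.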

Solution 2:

If turning the PC off and restarting helps, and ECC makes the problem rarer, I'd guess it's an overheating problem. See "Enabling hardware sensors in Linux" for how to use the sensors embedded in the motherboard (typically there is one for the CPU and one for the board itself). HDDs usually report their temperature among their SMART attributes.
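Once the sensor drivers are loaded, the kernel exposes their readings under `/sys/class/hwmon`, with temperatures stored in millidegrees Celsius in files named `temp*_input`. The sketch below is just one way to poll them, assuming Python 3 and a kernel that provides this sysfs layout; the helper names `read_hwmon_temps` and `_read_text` are invented for this example, and which chips show up depends entirely on your board and drivers.

```python
import glob
import os


def _read_text(path):
    """Return the stripped contents of a small sysfs file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN (chip) sensor": degrees_celsius} for each readable input."""
    temps = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        chip = _read_text(os.path.join(chip_dir, "name")) or "unknown"
        for input_path in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            raw = _read_text(input_path)
            try:
                millidegrees = int(raw)
            except (TypeError, ValueError):
                continue  # sensor not readable or not a number; skip it
            sensor = os.path.basename(input_path)[: -len("_input")]
            # Use the driver-provided label when there is one (e.g. temp1_label).
            label = _read_text(os.path.join(chip_dir, sensor + "_label")) or sensor
            key = "{} ({}) {}".format(os.path.basename(chip_dir), chip, label)
            temps[key] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for name, celsius in sorted(read_hwmon_temps().items()):
        print("{}: {:.1f} C".format(name, celsius))
```

Running this in a loop while the checksum is being computed should show whether temperatures climb shortly before a mismatch appears.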

DIMMs usually don't have sensors, so you have to either touch them, make guesses, or use an additional piece of hardware with sensors on wires that can be placed anywhere, like this front panel.

Solution 3:

Sometimes draining the capacitors can help. Unplug your machine and hold the power button for a few seconds. It sounds like witchcraft, but it works. (Sometimes.)

Also make sure your PSU is behaving properly; bad power supplies can cause bit errors.

Finally, start removing PCI/AGP/etc. devices and see if one of them is messing things up.