What does ECC RAM failure look like

For Non-ECC memory I have a decent idea of what a failure looks like; certain random things start going wrong (e.g. PNG checksums fail validation once and then not the next time), that sort of thing. But I'm relatively new to ECC RAM. What do I expect when ECC RAM fails? I know if there's a single-bit flipped, it should just automatically correct that, but how would I know if there are more serious issues or if the module needs to be replaced?

I found one report that suggested that the system might spontaneously shut off or fail to power on, but it's not clear to me why that would be the case.


Solution 1:

Linux kills the programs using memory pages with bits flipped beyond recovery (thus one ECC word with 2 flips), using a SIGBUS signal. Then it blacklists that page so that it won't be reused.

When encountering corrected faults repeatedly (typically not the case with transient flips, but with hard faults that persist after correction), pages are migrated transparently to another physical page, but using the same virtual addresses. This is done through a "leaky bucket" counter, that counts ECC errors per page over the last X units of time.

These approaches are respectively called hard and soft page offlining. You can read more and access error statistics/logs through mcelog, which is part of all Linux kernels starting at version 2.6. Note that you can set it so that your kernel will panic and reboot the machine at each error, if you so wish.

This also exists under the name of memory page retirement in Solaris systems, and other OSs undoubtedly have their own version of it though I don't know the names or references of the top of my head.

In short, the hardware reports the errors and the OS mitigates their effects. So chances are you won't get a lot of symptoms, but you may ask your OS or tools for statistics.

Solution 2:

I have only really seen an ECC error happen once. This was on a Dell server and the problem shows up in several places:

  • In the vCenter server (it was a VMWare host) via IPMI
  • In the BIOS log files - typically, you can enter setup upon boot to see logged errors
  • Via the Dell remote management tools or similar for other systems

Given the amount of memory present in many servers, a single error may not hurt a running program enough to cause a crash. It may only cause subtle problems or perhaps none at all if the error occurs in RAM that has not been allocated to a particular program.

Also, since it is ECC RAM, single bit errors are correctable and won't hurt running programs. The point of having ECC RAM is more or less the two things given above:

  • The ability to detect and log errors
  • The ability to correct single bit errors

So - you look for errors in management tools and generally, you should get less errors than you would using non-ECC RAM. Replacing modules is always advised if errors start cropping up, i.e. if the error rate is increasing above a (very low) background incident rate. Hope that clears it up a bit.

EDIT: Also see this question for more on the same subject.

EDIT 2: I actually just had another one - here is a screen shot for those interested in what it looks like in VMWare vCenter (Hardware IDs scrambled): Memory error as displayed in vCenter