How seriously should I take ECC correctable error warnings?

I have a pile of Sun X2200-M2 servers. These servers have ECC memory.

On some of these servers, I am getting warnings in the eLOM about "correctable ECC errors detected", e.g.:

# ssh regress11 ipmitool sel elist
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted

...some more frequently than others.

The kernel on this particular system is throwing EDAC errors as well, although far more frequently than the eLOM is recording ECC events:

EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error

Now if the server is detecting Uncorrectable ECC, the system resets, so clearly that's bad and removing/replacing the identified stick or pair corrects the issue.

But I am thinking that if the error is Correctable, then there's no immediate issue -- I can treat this as a warning and be prepared to pull the stick/pair if an uncorrectable error starts occurring?


Solution 1:

It depends on how often you get the errors. Even healthy memory takes the occasional single-bit hit, and ECC should have to correct one about once a year on average. If you're seeing them much more often than that (your log shows two on the same DIMM within 15 minutes), or if they're multi-bit errors, you should be worried; I would replace the RAM ASAP.
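
To put a number on "how often", here is a minimal Python sketch that counts asserted correctable ECC events per DIMM and computes the average gap between them. It assumes the SEL output has exactly the six pipe-separated fields shown in the question; the function names (parse_sel, count_by_dimm, mean_gap) are made up for illustration, and other BMCs may format the log differently.

```python
from collections import Counter
from datetime import datetime

# The two SEL lines quoted in the question, used here as sample input.
SAMPLE = """\
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
"""


def parse_sel(text):
    """Return a list of (timestamp, sensor) for asserted correctable ECC events."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # not in the assumed six-field format
        _, date, time, sensor, event, state = fields
        if event != "Correctable ECC" or state != "Asserted":
            continue
        try:
            stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # skip entries without a usable timestamp
        events.append((stamp, sensor))
    return events


def count_by_dimm(events):
    """Count events per sensor name (e.g. 'Memory CPU0 DIMM2')."""
    return Counter(sensor for _, sensor in events)


def mean_gap(events):
    """Average time between consecutive events, or None with fewer than two."""
    if len(events) < 2:
        return None
    stamps = sorted(stamp for stamp, _ in events)
    return (stamps[-1] - stamps[0]) / (len(stamps) - 1)
```

Feed it the text from "ipmitool sel elist" for each server; a DIMM whose average gap is minutes or hours rather than months is the one to pull first.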

Also, ECC isn't perfect. If errors accumulate, the damage can exceed what ECC is able to correct or even detect and slip through unnoticed; that would show up as an OS crash or similar problem.
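
If you decide to keep a suspect DIMM in service for now, it helps to watch for the point where errors stop being correctable. The sketch below reads the per-controller counters that the Linux EDAC subsystem exposes in sysfs. It assumes an EDAC driver is loaded and that the ce_count and ue_count files exist under /sys/devices/system/edac/mc; the function names are hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory under root."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counters missing or unreadable on this controller
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def controllers_with_uncorrectable(counts):
    """Names of memory controllers that have logged any uncorrectable error."""
    return sorted(name for name, c in counts.items() if c["ue"] > 0)
```

Run from cron and compared against the previous reading, this shows whether the correctable count is climbing and flags any controller that has logged an uncorrectable error.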