Evaluating uncorrectable ECC errors and fallback methods
I run a server that has just experienced an error I've not encountered before. It emitted a few beeps, rebooted, and got stuck at the startup screen (the part where the BIOS shows its logo and begins listing information) with the error:
Node0: DRAM uncorrectable ECC Error
Node1: HT Link SYNC Error
After a hard reset the system booted fine and has yet to report anything on edac-util.
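For context, EDAC exposes its counters both through edac-util and sysfs; this is a minimal sketch of how I'm checking them, assuming the edac-utils package is installed:

    # Confirm the EDAC driver is loaded and report current counters
    edac-util --status        # reports whether EDAC drivers are loaded
    edac-util -v              # per-controller/per-csrow CE/UE counts
    edac-util --report=ue     # uncorrectable-error counts only

    # The same counters are exposed directly in sysfs
    grep . /sys/devices/system/edac/mc/mc*/ce_count \
           /sys/devices/system/edac/mc/mc*/ue_count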
My research tells me that even with ECC memory and a system in ideal conditions, an uncorrectable error is still possible and will likely occur at some point during the lifespan of the system; some reports suggest at least once a year or sooner.
The server runs CentOS 6.5 with several ECC modules. I am already trying to diagnose which module threw the error so I can assess whether this is a hardware fault or the result of something unavoidable such as a cosmic ray.
My research also suggests that when the system halts like this, there is nowhere for a log to be written, and that the only reliable way to capture the error is to attach the system to another machine and write the log out through a serial port.
Besides the usual edac-util, memtest, stress testing, and precautionary replacement, is there anything else I should take into consideration when addressing this error?
I was unable to find any record of this crash in any of the CentOS logs I searched, which supports my belief that this error cannot be logged to a local disk. The error was only reported to me by the BIOS after an automatic reboot. Is it advisable to write system logs out to serial at all times to capture these kinds of errors?
Is this kind of failure avoidable using a single system or is this only possible using an expensive enterprise solution?
What can I do to provide fallback measures for these failure cases on a single production server? That is, the production server itself does not span multiple machines, but a fallback server can exist.
Well, this isn't a fully-integrated system like an HP, Dell or IBM server, so the monitoring and reporting of such a failure isn't going to be present or consistent.
With the systems I've managed, disks fail most often, followed by RAM, power supplies, fans, system boards and CPUs.
Memory can fail... There isn't much you can do about it.
See: Is it necessary to burn-in RAM for server-class hardware?
Since you can't really prevent ECC errors and RAM failure, just be prepared for it. Keep spares. Have physical access to your systems and maintain the warranty of your components. I definitely wouldn't introduce "precautionary replacement" into an environment. Some of this is a function of your hardware... Do you have IPMI? Sometimes hardware logs will end up there.
This is one of the value-adds of better server hardware. Here's a snippet from an HP ProLiant DL580 G4 server where the ECC threshold on the RAM was exceeded, then progressed to the DIMM being disabled... then finally the server crashing (ASR) and rebooting itself with the bad DIMM deactivated.
0004 Repaired 22:21 12/01/2008 22:21 12/01/2008 0001
LOG: Corrected Memory Error threshold exceeded (Slot 1, Memory Module 1)
0005 Repaired 20:41 12/06/2008 20:43 12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0006 Repaired 21:37 12/06/2008 21:41 12/06/2008 0002
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0007 Repaired 02:58 12/07/2008 02:58 12/07/2008 0001
LOG: POST Error: 201-Memory Error Single-bit error occured during memory initialization, Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.
0008 Repaired 19:31 12/08/2009 19:31 12/08/2009 0001
LOG: ASR Detected by System ROM
If the DIMM has an uncorrectable error, I'd recommend replacing it. If it is only throwing correctable errors at a low rate, you can probably live with it; in any case, it will be harder to get a refund for correctable errors.
If you want to see whether there is a record, try to access the IPMI SEL records with ipmitool sel elist or an equivalent tool.
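A minimal sketch of checking the SEL, assuming the ipmitool package is installed and the BMC is reachable (the address and credentials below are placeholders):

    # Read the System Event Log through the local IPMI interface
    ipmitool sel elist        # decoded listing, one event per line
    ipmitool sel list         # shorter raw listing

    # Or query the BMC over the network if it has a dedicated/shared NIC
    ipmitool -I lanplus -H BMC_ADDRESS -U USER -P PASSWORD sel elist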
A second alternative is to set up a Linux crash kernel (kdump) to boot into and save the dmesg output; this can also capture information about the hardware failure.
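A rough sketch of enabling this on CentOS 6 (the reserved memory size is an assumption and may need adjusting for your hardware):

    # Install the crash-kernel tooling
    yum install kexec-tools
    # Append "crashkernel=128M" (or crashkernel=auto) to the kernel line
    # in /boot/grub/grub.conf, then reboot so the reservation takes effect
    chkconfig kdump on
    service kdump start
    # After a crash, the vmcore (and its dmesg) lands under /var/crash/ by default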
The third alternative is to log the server's serial console somewhere persistent; it will also capture the clues for a crash, whether of the software or hardware kind.
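A minimal sketch of the CentOS 6 / GRUB legacy side of this, plus one way to capture the stream on a second machine (the device name on the capturing side is a placeholder for whatever serial adapter you use):

    # On the server: send console output to the first serial port by adding
    # the following to the kernel line in /boot/grub/grub.conf:
    #   console=tty0 console=ttyS0,115200n8
    # Optionally let GRUB itself use the serial port as well:
    #   serial --unit=0 --speed=115200
    #   terminal --timeout=5 serial console

    # On the machine capturing the output: log everything to a file
    # (screen -L writes to screenlog.0 in the current directory)
    screen -L /dev/ttyUSB0 115200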