RHEL: What happens when memory starts to fail?

I'm getting ECC warnings from some server RAM. It's a pretty old machine, so there isn't any warranty on these parts.

If this were Windows, I would expect to see a BSOD.

What can I expect from RHEL 5.x?


Solution 1:

On a RHEL system, you'll see an accumulation of errors in your kernel ring buffer output (dmesg), as well as /var/log/messages. Once the ECC threshold has been exceeded, applications may simply crash. The server could warm-boot. You may have a kernel panic. The machine check exception log will have indicators. I've even seen cases where the system reboots and disables the bad DIMM.
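As a rough way to watch that accumulation, the kernel's EDAC counters can be polled. The following is a minimal sketch, not a definitive tool: it assumes an EDAC driver is loaded for your memory controller and that the counters appear under /sys/devices/system/edac/mc/mc*/ce_count and ue_count. The function name `read_edac_counts` is made up for illustration, and the code is written for Python 3, which RHEL 5 does not ship by default, so it may need adapting.

```python
import glob
import os

# Assumed sysfs location of the EDAC memory-controller counters.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int or None, "ue": int or None}}.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, filename)) as fh:
                    entry[key] = int(fh.read().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    # Prints nothing if no EDAC controllers are exposed on this machine.
    for name, entry in read_edac_counts().items():
        print("%s: corrected=%s uncorrected=%s" % (name, entry["ce"], entry["ue"]))
```

A steadily rising corrected count is the warning described above; any non-zero uncorrected count is a strong hint that the DIMM should be replaced soon.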

If this is enterprise server hardware, the system's event log may fill up with errors. The server's watchdog may time out and force a cold boot of the system.

At this point, you know you have a problem... So the right solution is to replace the bad DIMM. In most cases, the failure mode isn't pretty, so it's best to avoid the pain. Be glad that the ECC RAM gave you warnings.

Solution 2:

The Linux equivalent of the BSOD is the kernel panic. When the kernel finds a situation it really can't deal with (e.g., a file system corruption error leading to conditions like trying to free an inode which is already free), it prints panic warnings to just about everywhere, usually via syslog, and halts the processor(s).

If the memory is failing undetectably, then sooner or later the kernel will come up against such a condition, and panic.

I googled for examples and found many; the one at https://www.virtualbox.org/raw-attachment/ticket/9305/rec.jpeg (from https://www.virtualbox.org/ticket/9305) is a nice example of the genre; you can see the line with the timestamp 7.568856 where the kernel formally announces it's given up.

Note also that it's not syncing the file systems, which is a sensible precaution when it can no longer be sure of its own integrity. This can make these conditions hard to debug, as the lack of sync means the log message will never make it into local log files. This in turn is one of the main reasons I use remote syslogging: the error will still be sent to the remote loghost, and can often be found there.
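On the loghost, the collected messages can then be searched for the tell-tale lines. Here is a minimal sketch, assuming the remote host's messages end up in a plain-text file whose path you pass on the command line. The pattern list is my own guess at useful markers ("Kernel panic - not syncing" is the string the kernel prints on a panic; the "Machine check" and "EDAC" matches are deliberately loose), and the names `scan_log` and `classify_line` are hypothetical.

```python
import re
import sys

# (label, pattern) pairs; the patterns are assumptions, adjust them to your logs.
PATTERNS = [
    ("panic", re.compile(r"Kernel panic - not syncing")),
    ("mce", re.compile(r"Machine check", re.IGNORECASE)),
    ("edac", re.compile(r"\bEDAC\b")),
]


def classify_line(line):
    """Return the labels of all patterns that match this log line."""
    return [label for label, pattern in PATTERNS if pattern.search(line)]


def scan_log(lines):
    """Return (line_number, labels, text) for every line that matches a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        labels = classify_line(line)
        if labels:
            hits.append((number, labels, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Path to the log file collected on the loghost (supplied by the caller).
    with open(sys.argv[1], errors="replace") as fh:
        for number, labels, text in scan_log(fh):
            print("%d [%s] %s" % (number, ",".join(labels), text))
```

Run against the loghost's copy rather than the failing machine's own files, since, as noted above, the panic message may never reach local disk.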