How do I get notified of ECC errors in Linux?

Solution 1:

The Linux kernel supports the error detection and correction (EDAC) features of some chipsets. On a supported system with ECC memory, the status of your memory controller is accessible via sysfs:

/sys/devices/system/edac/mc

The directory tree under that location should correspond to your hardware, e.g.:

/sys/devices/system/edac/mc/mc0/csrow2/power
/sys/devices/system/edac/mc/mc0/csrow0/power
/sys/devices/system/edac/mc/mc0/dimm2/power
/sys/devices/system/edac/mc/mc0/dimm0/power
/sys/devices/system/edac/mc/mc1/power
...
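
The per-controller error counters can be read straight from that tree. A minimal sketch, assuming each mcN directory exposes ce_count (corrected errors) and ue_count (uncorrected errors) files, as the EDAC sysfs layout normally does; the function name edac_counts and its optional root-directory argument are made up for illustration:

```shell
#!/bin/sh
# Print corrected (CE) and uncorrected (UE) error counts per memory controller.
# The optional first argument (hypothetical, for testing) overrides the sysfs root.
edac_counts() {
    edac_root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$edac_root"/mc*; do
        # Skip anything that lacks the counter files (also covers an unmatched glob).
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        printf '%s CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
```

Calling edac_counts without arguments reads the live counters; if it prints nothing, no EDAC driver is loaded for your memory controller.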

Depending on your hardware, you might have to explicitly load the right EDAC driver. To see which ones are available for your kernel:

find /lib/modules/$(uname -r) -name '*edac*'

The edac-utils package provides a command line frontend and a library for accessing that data, e.g.:

edac-util -rfull
mc0:csrow0:mc#0memory#0:CE:0
mc0:csrow2:mc#0memory#2:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0
mc1:noinfo:all:UE:0
mc1:noinfo:all:CE:0

You can set up a cron job that periodically calls edac-util and feeds the results into your monitoring system, where you can then configure notifications.
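
As a rough sketch of such a check, assuming the colon-separated output format shown above with the error count in the last field (the function name edac_nonzero is made up for illustration): it prints every line whose count is above zero and returns a non-zero status if it found any, so the caller can decide how to alert.

```shell
#!/bin/sh
# Read "edac-util -rfull" style lines on stdin, print those with a non-zero
# error count, and return 1 if any were found (0 otherwise).
# Assumes the count is the last colon-separated field, as in the sample output.
edac_nonzero() {
    awk -F: '
        $NF ~ /^[0-9]+$/ && $NF > 0 { print; found = 1 }
        END { exit found + 0 }
    '
}
```

From a cron job this could be used as `edac-util -rfull | edac_nonzero || logger -p daemon.err -t edac "ECC errors detected"`. Since cron mails any output of a job when MAILTO is set, the offending lines would also end up in your inbox.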

In addition to that, running mcelog is generally a good idea. It depends on the system, but correctable and uncorrectable ECC errors are likely to be reported as machine check exceptions (MCEs) as well. Even brief periods of CPU throttling due to high temperature are reported as MCEs.

Solution 2:

mcelog will monitor the memory controller and report memory error events to syslog, and in some configurations can offline bad memory pages. This is, of course, in addition to its usual use to monitor machine check exceptions and a variety of other hardware errors.

Most Linux distributions have a service set up to run it as a daemon, e.g. for EL 6:

chkconfig mcelog on
service mcelog start