ECC chipkill errors: which DIMM?

We often get DIMMs in our servers going bad, producing errors like the following in syslog:

May  7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
May  7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
May  7 09:15:31 nolcgi303 kernel: MC0: CE - no information available: k8_edac Error Overflow set
May  7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error

We can use the HP SmartStart CD to determine which DIMM has the error, but that requires taking the server out of production. Is there a cunning way to work out which DIMM's bust while the server is up? All our servers are HP hardware running RHEL 5.


In addition to using the EDAC codes, you can use HP's CLI-only utilities to determine this while the machine is online. The CLI versions are far more lightweight than the web-based ones and do not require you to open ports or keep a daemon running.

hpasmcli will give you the cartridge and module numbers of the failed modules, which is a little quicker than analyzing the EDAC output.

Example:

hpasmcli -s "show dimm"

DIMM Configuration
------------------
Cartridge #: 0
Module #: 1
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 2
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 3
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Cartridge #: 0
Module #: 4
Present: Yes
Form Factor: 9h
Memory Type: 13h
Size: 1024 MB
Speed: 667 MHz
Status: Ok

Status will change for failed modules.
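
On a box with a lot of DIMMs, a quicker way to scan for anything not reporting Ok is to filter the same output down to the cartridge, module, and status lines. A minimal sketch, using the field names shown above:

hpasmcli -s "show dimm" | grep -E "Cartridge|Module|Status"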


MC0, row 2, and channel 0 are significant. Try replacing DIMMA1 on CPU0.
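
To see which MC, row, and channel your own errors are hitting, you can pull the CE lines out of syslog; a rough sketch, assuming the default RHEL 5 log location:

grep "MC[0-9]: CE" /var/log/messages   # adjust the path if your syslog goes elsewhere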

By way of example, I had to identify a bad DIMM in a Linux server with 16 fully populated DIMM slots and two CPUs. These are the errors I saw on the console:

EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac
EDAC MC1: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC1: extended error code: ECC chipkill x4 error

The bad DIMM in my server was DIMMA0 on CPU1.

EDAC stands for Error Detection And Correction and is documented at http://www.kernel.org/doc/Documentation/edac.txt and /usr/share/doc/kernel-doc-2.6*/Documentation/drivers/edac/edac.txt on my system (RHEL5). CE stands for "correctable errors" and as the documentation indicates, "CEs provide early indications that a DIMM is beginning to fail."
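
The sysfs counters discussed below are only populated when the EDAC driver for your memory controller is loaded (k8_edac in the errors above). A quick sanity check, as a sketch:

lsmod | grep -i edac
ls /sys/devices/system/edac/mc/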

Going back to the EDAC errors above I saw on my server's console, MC1 (Memory Controller 1) means CPU1, row 1 is referred to as csrow1 (Chip-Select Row 1) in the Linux EDAC documentation, and channel 0 means memory channel 0. I checked the chart at http://www.kernel.org/doc/Documentation/edac.txt to see that csrow1 and Channel 0 correspond to DIMM_A0 (DIMMA0 on my system):

            Channel 0       Channel 1
    ===================================
    csrow0  | DIMM_A0       | DIMM_B0 |
    csrow1  | DIMM_A0       | DIMM_B0 |
    ===================================

    ===================================
    csrow2  | DIMM_A1       | DIMM_B1 |
    csrow3  | DIMM_A1       | DIMM_B1 |
    ===================================

(As another example, if I had seen errors on MC0, csrow4, and Channel 1, I would have replaced DIMMB2 on CPU0.)
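
The label "" in the CE lines above comes from the per-channel DIMM label files under sysfs. The EDAC documentation describes writing your board's silkscreen names into ch0_dimm_label/ch1_dimm_label so that future errors report the slot directly; a hedged sketch for the csrow/channel in my case (the label string itself is only an example):

cat /sys/devices/system/edac/mc/mc1/csrow1/ch0_dimm_label
echo "CPU1 DIMMA0" > /sys/devices/system/edac/mc/mc1/csrow1/ch0_dimm_label   # example label; use your board's silkscreen name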

Of course, there are actually two DIMM slots called DIMMA0 on my server (one for each CPU), but again the MC1 error corresponds to CPU1, which is listed under "Bank Locator" in the output of dmidecode:

[root@rce-8 ~]# dmidecode -t memory | grep DIMMA0 -B9 -A8
Handle 0x002E, DMI type 17, 27 bytes.
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU0
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer:  
        Serial Number:  
        Asset Tag:  
        Part Number:  
--
Handle 0x003E, DMI type 17, 27 bytes.
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU1
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer:  
        Serial Number:  
        Asset Tag:  
        Part Number:

(On my workstation, dmidecode actually shows the Part Number and Serial Number for my DIMMs, which is very useful.)
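
If your BIOS does populate those fields, a compact way to line the slot names up with part and serial numbers is to filter the same dmidecode output; a small sketch:

dmidecode -t memory | grep -E "Locator|Part Number|Serial Number"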

In addition to looking at errors on the console and in logs, you can also see errors per MC/CPU, row/csrow, and channel by examining /sys/devices/system/edac. In my case the errors were only on MC1, csrow1, channel 0:

[root@rce-8 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow7/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:6941652
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch1_ce_count:0
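
In addition to the per-channel counters, each memory controller directory has a ce_count total, and the EDAC documentation describes a write-only reset_counters file. After swapping the DIMM, something like the following (as root) is a reasonable way to confirm the errors have stopped; a sketch based on the documented sysfs layout:

grep "[0-9]" /sys/devices/system/edac/mc/mc*/ce_count
echo 1 > /sys/devices/system/edac/mc/mc1/reset_counters   # any written value clears the counters for that MC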

I hope this example is helpful for anyone trying to identify a bad DIMM based on EDAC errors. For more information, I highly recommend reading all of the Linux EDAC documentation at http://www.kernel.org/doc/Documentation/edac.txt.