How to find faulty memory module from MCE message?
I am trying to understand an MCE message to find out which memory module is bad on a server. The message below appears in /var/log/kern.log
on a server that froze twice today.
Apr 13 22:39:22 mbox kernel: [36247975.116860] sbridge: HANDLING MCE MEMORY ERROR
Apr 13 22:39:22 mbox kernel: [36247975.116867] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
Apr 13 22:39:22 mbox kernel: [36247975.116869] TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
Apr 13 22:39:22 mbox kernel: [36247975.951013] EDAC MC0: 1 CE memory read error
I suspect a bad memory module. The server has 2x Xeon E5-2650 with 8x 8 GB memory modules (8 memory slots per CPU).
Here is the memory module population from lshw:
*-memory:0
description: System Memory
physical id: 2d
slot: System board or motherboard
*-bank:0
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-197.A
vendor: Kingston
physical id: 0
serial: B83AE5C2
slot: P1_DIMMA1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM Synchronous [empty]
product: Dimm1_PartNum
vendor: Dimm1_Manufacturer
physical id: 1
serial: Dimm1_SerNum
slot: P1_DIMMA2
width: 64 bits
*-bank:2
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 2
serial: EC309238
slot: P1_DIMMB1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:3
description: DIMM Synchronous [empty]
product: Dimm4_PartNum
vendor: Dimm4_Manufacturer
physical id: 3
serial: Dimm4_SerNum
slot: P1_DIMMB2
width: 64 bits
*-bank:4
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 4
serial: E9305438
slot: P1_DIMMC1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:5
description: DIMM Synchronous [empty]
product: Dimm7_PartNum
vendor: Dimm7_Manufacturer
physical id: 5
serial: Dimm7_SerNum
slot: P1_DIMMC2
width: 64 bits
*-bank:6
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 6
serial: E7305738
slot: P1_DIMMD1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM Synchronous [empty]
product: Dimm10_PartNum
vendor: Dimm10_Manufacturer
physical id: 7
serial: Dimm10_SerNum
slot: P1_DIMMD2
width: 64 bits
*-memory:1
description: System Memory
physical id: 3f
slot: System board or motherboard
*-bank:0
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-197.A
vendor: Kingston
physical id: 0
serial: B63A08C3
slot: P2_DIMME1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM Synchronous [empty]
product: Dimm1_PartNum
vendor: Dimm1_Manufacturer
physical id: 1
serial: Dimm1_SerNum
slot: P2_DIMME2
width: 64 bits
*-bank:2
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 2
serial: EA309638
slot: P2_DIMMF1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:3
description: DIMM Synchronous [empty]
product: Dimm4_PartNum
vendor: Dimm4_Manufacturer
physical id: 3
serial: Dimm4_SerNum
slot: P2_DIMMF2
width: 64 bits
*-bank:4
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 4
serial: E7305938
slot: P2_DIMMG1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:5
description: DIMM Synchronous [empty]
product: Dimm7_PartNum
vendor: Dimm7_Manufacturer
physical id: 5
serial: Dimm7_SerNum
slot: P2_DIMMG2
width: 64 bits
*-bank:6
description: DIMM DDR3 1333 MHz (0,8 ns)
product: 9965516-048.A
vendor: Kingston
physical id: 6
serial: E7305B38
slot: P2_DIMMH1
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM Synchronous [empty]
product: Dimm10_PartNum
vendor: Dimm10_Manufacturer
physical id: 7
serial: Dimm10_SerNum
slot: P2_DIMMH2
width: 64 bits
*-memory:2 UNCLAIMED
physical id: 7
*-memory:3 UNCLAIMED
physical id: 9
As you can see, there is no memory module in bank #5. So my question is: do you agree that this message is about a memory failure? And if so, how can I find which module has to be replaced?
These errors come from EDAC (Error Detection And Correction), specifically the edac_mc class of device.
The events you are receiving are CE (Correctable Error) events. They can indicate that a DIMM is beginning to fail.
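To check that the logged event really is a corrected memory error, the status word from the MCE line can be decoded by hand. The sketch below is only an illustration: it assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM and the log line format quoted in the question, and the function names are made up for this example. Note that "Bank 5" in the log line is a CPU machine-check bank number, not one of the bank entries in the lshw output.

```python
import re

# Matches the kernel log line format quoted in the question (an assumption:
# other kernel versions may print MCEs differently).
MCE_LINE = re.compile(
    r"Machine Check Exception:\s*\S+\s+Bank\s+(\d+):\s*([0-9a-fA-F]+)"
)

# MMM field of a memory-controller error code (pattern 0000 0000 1MMM CCCC).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def parse_mce_line(line):
    """Return (bank, status) from an MCE log line, or None if it does not match."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16)


def decode_status(status):
    """Decode the architectural fields of a 64-bit machine-check status word."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Bit 12 is treated as a flag and masked out before matching the pattern.
    if (code & 0xEF80) == 0x0080:
        info["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0xF means the channel is not specified.
        info["channel"] = None if channel == 0xF else channel
    return info


if __name__ == "__main__":
    line = "CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090"
    bank, status = parse_mce_line(line)
    print("machine-check bank:", bank)
    for key, value in decode_status(status).items():
        print(f"{key}: {value}")
```

For the status in the question this reports a valid, corrected (not uncorrected) error with a count of 1, decoded as a memory read on channel 0. The channel number is relative to the reporting memory controller; how it maps to a physical slot depends on the board, so treat it as a hint rather than a diagnosis.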
EDAC has not reported which memory row or channel the error refers to, so it is difficult to tell which module to replace until one fails outright.
Have a look at /sys/devices/system/edac/mc/mc*, though:
it might tell you a little more about which row / DIMM is the faulty one.
For example
ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count 0 csrow1 0 csrow4 0 csrow7 0 reset_counters 0 size_mb
0 ce_noinfo_count 0 csrow2 0 csrow5 0 device 0 sdram_scrub_rate 0 ue_count
0 csrow0 0 csrow3 0 csrow6 0 mc_name 0 seconds_since_reset 0 ue_noinfo_count
Look at the ce_count field.
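Reading the counters by hand gets tedious with several controllers and rows, so here is a minimal sketch that walks the tree and prints every non-zero correctable-error counter. It assumes the csrow-based layout shown in the listing above; the per-channel ch*_ce_count and ch*_dimm_label file names come from the kernel EDAC documentation and may be absent on a given kernel or driver, in which case they are simply skipped. The function name is made up for this example.

```python
from pathlib import Path

EDAC_ROOT = "/sys/devices/system/edac/mc"


def edac_ce_counts(root=EDAC_ROOT):
    """Return a list of (relative_path, count, label) for each CE counter found.

    label is the DIMM label for per-channel counters when the driver
    exposes one, otherwise None.
    """
    base = Path(root)
    results = []
    for mc in sorted(base.glob("mc[0-9]*")):
        candidates = [mc / "ce_count"]
        candidates += sorted(mc.glob("csrow*/ce_count"))
        candidates += sorted(mc.glob("csrow*/ch*_ce_count"))
        for counter in candidates:
            try:
                count = int(counter.read_text().strip())
            except (OSError, ValueError):
                continue  # missing or unreadable counter: skip it
            label = None
            if counter.name.startswith("ch"):
                label_file = counter.with_name(
                    counter.name.replace("_ce_count", "_dimm_label")
                )
                try:
                    label = label_file.read_text().strip() or None
                except OSError:
                    label = None
            results.append((counter.relative_to(base).as_posix(), count, label))
    return results


if __name__ == "__main__":
    for path, count, label in edac_ce_counts():
        if count > 0:
            suffix = f" ({label})" if label else ""
            print(f"{path}: {count}{suffix}")
```

A row or channel whose counter keeps climbing between runs is the one to watch. Whether the DIMM label matches the silkscreen on the board depends on the driver and firmware, so it is worth cross-checking against the lshw slot names.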
On a side note:
The system can continue to operate, but with a reduced safety margin. Preventive maintenance and proactive replacement of DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE (uncorrectable error) events and system panics.
More info on EDAC here:
https://www.kernel.org/doc/Documentation/edac.txt
Some vendors say that several correctable errors within a certain period of time do no harm.
For example, Oracle says to replace a DIMM when one of the following events takes place:
More than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs.
The DIMM fails memory testing under BIOS due to Uncorrectable Memory Errors (UCEs).
UCEs occur and investigation shows that the errors originated from memory.
Note the threshold: more than 24 errors in 24 hours.
https://docs.oracle.com/cd/E19150-01/820-4213-11/dimms.html
Also,
If more than one DIMM has experienced multiple CEs, other possible causes of CEs must be ruled out by a qualified Sun Support specialist before replacing any DIMMs.
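The correctable-error part of the Oracle guidance quoted above can be sketched as a small check over a list of CE events. This is only an interpretation of the quoted wording, not Oracle's tooling: the verdict names ("replace", "investigate", "monitor") and the function name are made up for this example, and the events are assumed to have already been attributed to a DIMM label.

```python
from collections import Counter
from datetime import datetime, timedelta


def assess_ce_events(events, now, window=timedelta(hours=24), threshold=24):
    """Classify correctable-error events as (verdict, per-DIMM counts).

    events is an iterable of (timestamp, dimm_label) tuples. Only events
    inside the window ending at `now` are counted.
    """
    recent = Counter(
        dimm for timestamp, dimm in events if now - window <= timestamp <= now
    )
    over = [dimm for dimm, count in recent.items() if count > threshold]
    multiple = [dimm for dimm, count in recent.items() if count > 1]

    if over and len(recent) == 1:
        # More than `threshold` CEs from a single DIMM, no other DIMM affected.
        verdict = "replace"
    elif over or len(multiple) > 1:
        # Several DIMMs involved: rule out other causes (e.g. firmware) first.
        verdict = "investigate"
    else:
        verdict = "monitor"
    return verdict, dict(recent)


if __name__ == "__main__":
    now = datetime(2015, 4, 13, 22, 39, 22)
    # Arbitrary example data: 25 errors on one slot within the last hour.
    events = [(now - timedelta(minutes=i), "P1_DIMMA1") for i in range(25)]
    print(assess_ce_events(events, now))
```

The slot label in the example is arbitrary and says nothing about which DIMM is actually at fault on the server in the question. A single logged CE, as in the question, falls under "monitor" by this reading.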
On the last point, HP says something similar: it might just be the server firmware misdetecting memory errors. They say that in many cases a firmware upgrade fixes false-positive alerts. This might be especially true if you started receiving MCEs from different DIMMs.
It can help to install mcelog and run it as a daemon; it can provide better reports. They are still cryptic, but there is slightly more information to go on when looking for the culprit DIMM.
mcelog can also handle real-time issues by disabling pages with excessive memory errors, giving you a better chance of keeping the machine running until you can recover it.