Hardware error messages from syslogd
I have a 64-core AMD server running CentOS on which I was running a long job. In the middle of the output, I see these lines. It appears to be a memory error. How severe is this and what exactly does it indicate?
Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc10410040080a13
Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: Northbridge Error (node 4): DRAM ECC error detected on the NB.
Message from syslogd@heracles at Nov 7 21:00:02 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
on the NB
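The NB is the northbridge. Old computers used many separate chips. Around the 386/486 era these were integrated into about three larger generic chips, and later into two. One of those, the northbridge, dealt with the CPU, the RAM and other high-speed devices. The other (the 'southbridge') dealt with slow peripherals. On current AMD processors the northbridge, including the memory controller, sits inside the CPU package, which is why the message names a node.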
DRAM ECC error detected
DRAM (dynamic RAM) is just main memory, as opposed to cache, which is usually made from static RAM. ECC (error-correcting code) memory stores extra check bits alongside the data so that single-bit corruption can be detected and corrected.
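To make the idea of correcting a single flipped bit concrete, here is a toy sketch using a Hamming(7,4) code. This is only an illustration of the principle. Real ECC DIMMs use wider codes (typically 8 check bits per 64 data bits), and the function names below are made up for this example.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# Illustration only; real ECC memory uses wider codes.

def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome points at the bad bit
    if position:
        c[position - 1] ^= 1
    return c, position

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1  # simulate a single-bit flip at position 5
fixed, position = hamming74_correct(corrupted)
print("flipped position:", position, "recovered:", fixed == word)
```

The memory controller does the equivalent of `hamming74_correct` in hardware on every read, which is how it can hand back good data and still report that something was wrong.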
The message you get is that the NB tried to read some memory, but detected that it was partially corrupt.
In that case it can either shut down the machine (remember the old-fashioned 'Parity error: System halted'), or it can correct the error, or it can ignore it. In this case it corrected the error and logged a warning: the CE and CECC flags in the MC4_STATUS line mark a corrected ECC error.
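If you want to check the flags against the raw MC4_STATUS value yourself, a small decoder is enough. The sketch below assumes the usual x86 machine-check status layout (Val, Over, UC, En, MiscV, AddrV, PCC in bits 63 to 57) plus AMD's CECC and UECC bits at 46 and 45. Verify the positions against AMD's documentation for your CPU family before relying on them. The names `MCA_STATUS_BITS` and `decode_mc_status` are invented for this example.

```python
# Assumed bit positions in the machine-check status register.
# Check them against the vendor documentation for your CPU family.
MCA_STATUS_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten before it was read
    "UC": 61,     # uncorrected error (clear means corrected, printed as CE)
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # the MISC register holds extra information
    "AddrV": 58,  # the ADDR register holds the failing address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error (AMD-specific, assumed)
    "UECC": 45,   # uncorrectable ECC error (AMD-specific, assumed)
}

def decode_mc_status(status):
    """Map each known flag name to True/False for the given status value."""
    return {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}

flags = decode_mc_status(0xdc10410040080a13)
for name, is_set in flags.items():
    print(f"{name:6s} {'set' if is_set else 'clear'}")
```

For the value in the question, UC and PCC come out clear and CECC set, which agrees with the `CE` and `CECC` fields the kernel printed. The Over flag being set suggests at least one earlier error was overwritten before the register was read.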
A single error on memory is no reason to panic. These things happen. Rarely, but they do happen. And with ECC you get a proper warning rather than unexplained crashes or corrupt data.
In extremely fast parts of the system (e.g. on-die cache) such errors are not even that uncommon. Usually the hardware will retry and correct itself. If that fails, it will raise an MCE (machine check exception).
If these errors keep occurring: check whether the DIMMs are seated properly. Have they collected a lot of dust? Do they pass memtest? And so on.
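To see whether the errors keep occurring, you can watch the kernel's EDAC counters. The sketch below assumes an EDAC driver is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) under `/sys/devices/system/edac/mc/mc*/`. On a system without EDAC it simply prints nothing. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path

def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller name: (corrected, uncorrected)} from EDAC sysfs files."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts

for name, (ce, ue) in read_edac_counts().items():
    print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it now and again after a few days of load. A corrected count that stays flat points to a one-off event, while a steadily climbing count on one controller is a reason to reseat or replace the DIMMs behind it.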