ECC errors in L3 cache - critical or not?
On a Linux server (8x Quad-Core AMD 8378), I'm getting the following errors:
[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c294c00001d018b
[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: Machine check events logged
This has happened three times during the last month, but never before (server running for 3 years).
From a quick Google search, this appears to be a serious matter.
However, the vendor support technician said:
I have seen these errors MANY times, and unless you are overclocking your CPU - or have had a fan failure or similar - it is VERY unlikely to be a processor problem. It is more likely that the kernel is misreporting the error.
So: is this a critical error that means I should order new parts (replace the CPU?), or can I ignore it?
Many thanks.
Best practice: Keep your own spare parts, when possible.
As for machine check exceptions, these are reported by the hardware; the kernel is just passing the message on to you, so that you can take action before the hardware problem gets out of hand and results in a real disaster.
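To illustrate that the kernel is only decoding a value the hardware wrote, here is a minimal sketch that unpacks the MC4_STATUS value from your log by hand. The bit positions follow my understanding of the x86 MCi_STATUS layout (CECC and UECC being AMD-specific), and the string tables are chosen to mirror what the kernel prints; the function and table names are my own, so treat it as an illustration rather than a reference decoder.

```python
# Bit positions in an x86 MCi_STATUS register (assumed layout;
# CECC/UECC are AMD-specific bits).
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Field names for a memory-hierarchy error code, which occupies the
# low 16 bits in the form 0000 0001 RRRR TTLL.
LEVELS = ["RESV", "L1", "L2", "L3/GEN"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    if code >> 8 == 0x01:  # memory-hierarchy error format
        rrrr = (code >> 4) & 0xF
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        result["tx"] = TX_TYPES[(code >> 2) & 0x3]
        result["level"] = LEVELS[code & 0x3]
    return result


if __name__ == "__main__":
    decoded = decode_status(0x9C294C00001D018B)
    set_flags = [name for name, is_set in decoded["flags"].items() if is_set]
    print("flags set:", ", ".join(set_flags))
    print("cache level:", decoded["level"])
    print("tx:", decoded["tx"])
    print("mem-tx:", decoded["mem_tx"])
```

For the value in your log, the UC bit is clear and CECC is set (a corrected ECC error), and the error-code fields come out as L3/GEN, GEN and SNP, which agrees with the lines the kernel printed.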
The only instance I was able to find of a kernel "misreporting" a machine check exception is the one quoted below, and even there the cause was a flaw in the processor, not in the kernel.
Intel Xeon processor E7 family processors have an issue in which some c-state transitions can cause false correctable Machine Check Exception (MCE) errors to be reported from MCE bank 6 to the user. On some E7 processor family systems, this resulted in "floods" of MCE errors. This patch disables MCE error reporting for bank 6.
Bottom line: It sounds to me like the vendor is trying to avoid replacing your defective hardware.