ECC errors in L3 cache - critical or not?

On a Linux server (8x Quad-Core AMD 8378), I'm getting the following errors:

[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c294c00001d018b
[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: Machine check events logged

This has happened three times during the last month, but never before (server running for 3 years).

From a quick Google search, it seems this is a serious matter.

However, the vendor support technician said:

"I have seen these errors MANY times, and unless you are overclocking your CPU - or have had a fan failure or similar - it is VERY unlikely to be a processor problem. It is more likely that the kernel is misreporting the error."

So, is this a critical error that calls for new parts (a replacement CPU?), or can I ignore it?

Many thanks.


Best practice: Keep your own spare parts, when possible.

As for machine check exceptions, these are reported by the hardware; the kernel is just passing the message on to you, so that you can take action before the hardware problem gets out of hand and results in a real disaster.
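To illustrate that the log lines are a direct decode of a hardware register rather than something the kernel makes up, here is a minimal Python sketch that unpacks the MC4_STATUS value from the question. The bit positions and field tables are assumptions based on my understanding of the AMD Family 10h status layout and of how the Linux decoder labels the fields; the function name decode_mc_status is hypothetical. Check the BKDG for your exact CPU before relying on it.

```python
# Sketch: decode the generic fields of an AMD machine-check status value.
# ASSUMPTION: bit positions follow the Family 10h MCi_STATUS layout.
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (0 means corrected, shown as "CE")
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}

# ASSUMPTION: label tables mirror the names seen in the kernel log lines.
LL = ["RESV", "L1", "L2", "L3/GEN"]
TT = ["INSN", "DATA", "GEN", "RESV"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_mc_status(status):
    """Return the flag bits and, for memory-hierarchy errors, the cache fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    error_code = status & 0xFFFF
    result = {"flags": flags, "error_code": error_code}
    # Memory-hierarchy error codes have the form 0000 0001 RRRR TTLL.
    if (error_code & 0xFF00) == 0x0100:
        rrrr = (error_code >> 4) & 0xF
        result["cache_level"] = LL[error_code & 0x3]
        result["tx"] = TT[(error_code >> 2) & 0x3]
        result["mem_tx"] = RRRR[rrrr] if rrrr < len(RRRR) else "RESV"
    return result


if __name__ == "__main__":
    info = decode_mc_status(0x9C294C00001D018B)
    set_flags = [name for name, on in info["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("cache level:", info["cache_level"])
    print("tx:", info["tx"], "mem-tx:", info["mem_tx"])
```

For the value in the question, this yields UC clear with CECC, MiscV and AddrV set, and cache level L3/GEN, tx GEN, mem-tx SNP, which is exactly what the kernel printed. In other words, the hardware itself flagged a corrected ECC error.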

The only instance I could find of a machine check exception being "misreported" is the one quoted below, and even there the cause was a flaw in the processor, not the kernel.

"Intel Xeon processor E7 family processors have an issue in which some c-state transitions can cause false correctable Machine Check Exception (MCE) errors to be reported from MCE bank 6 to the user. On some E7 processor family systems, this resulted in "floods" of MCE errors. This patch disables MCE error reporting for bank 6."

Bottom line: It sounds to me like the vendor is trying to avoid replacing your defective hardware.