Interpreting mcelog output for bad DIMM
Solution 1:
I never did find a clear interpretation of the mcelog data, but my best guess worked out, and I figured I should follow up for posterity.
- I assumed "CPU 1" meant the second CPU, helpfully labeled as 2 on the motherboard diagram.
- I assumed "MEMORY CONTROLLER MS_CHANNEL3_ERR" indicated channel 3 on that CPU's memory controller. As above, that channel controls slots 4, 8 and 12, and only slot 4 had a chip in it.
- I ignored everything else.
I had someone swap out that DIMM, and, presto! No more streams of Machine Check errors.
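For anyone who wants to repeat the same guesswork on another record, here is a rough Python sketch of the reasoning. It is not an mcelog feature: the function name suspect_slots and the table CHANNEL_TO_SLOTS are made up for illustration, the table holds only the one channel-to-slot mapping mentioned for this board, and the sample string is a trimmed stand-in containing just the two fields I used, not full mcelog output. It also bakes in my assumptions that mcelog counts CPUs from zero and that the channel number in the error name matches the board's channel number.

```python
import re

# Assumption from this post: on this board, channel 3 serves slots 4, 8 and 12.
# Other channels are not covered here, so they are deliberately left out.
CHANNEL_TO_SLOTS = {3: [4, 8, 12]}


def suspect_slots(record, populated_slots):
    """Guess the board CPU label and candidate DIMM slots from one record.

    Returns (board_cpu_label, [slots]) or None if either field is missing.
    """
    cpu = re.search(r"\bCPU (\d+)\b", record)
    chan = re.search(r"MS_CHANNEL(\d+)_ERR", record)
    if cpu is None or chan is None:
        return None
    # Assumption: mcelog numbers CPUs from 0, the board diagram from 1.
    board_cpu = int(cpu.group(1)) + 1
    # Assumption: the channel number in the error name is the board's channel.
    slots = CHANNEL_TO_SLOTS.get(int(chan.group(1)), [])
    # Only slots that actually hold a DIMM can be at fault.
    return board_cpu, [s for s in slots if s in populated_slots]


# Trimmed stand-in for a record, holding only the two fields used above.
sample = "CPU 1\nMEMORY CONTROLLER MS_CHANNEL3_ERR\n"
print(suspect_slots(sample, populated_slots={4}))  # (2, [4])
```

With only slot 4 populated on that channel, the sketch narrows things down to CPU 2, slot 4, which is the DIMM that got swapped.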