Interpreting mcelog output for bad DIMM

Solution 1:

I never did find a clear interpretation of the mcelog data, but my best guess worked out, and I figured I should follow up for posterity.

  • I assumed CPU 1 meant the second CPU (counting from 0), helpfully labeled as 2 on the motherboard diagram.
  • I assumed MEMORY CONTROLLER MS_CHANNEL3_ERR indicated channel 3 on that CPU's memory controller. As above, that channel controls slots 4, 8, and 12, and only slot 4 had a DIMM in it.
  • I ignored everything else.

I had someone swap out that DIMM, and, presto! No more streams of Machine Check errors.
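
For anyone who wants to script the same guesswork, here is a rough sketch of the reasoning above. It is not an authoritative decoder for mcelog output. The function name, the lookup table, and the sample text are all hypothetical. The sample is just the two fields I relied on, not a full mcelog record, and the only channel-to-slot entry is the one from my board's diagram. You would need to fill in the rest from your own motherboard manual.

```python
import re

# Hypothetical lookup table: memory channel -> DIMM slots on that channel.
# Only the channel 3 entry comes from this answer; add the others from
# your own motherboard diagram.
CHANNEL_TO_SLOTS = {3: [4, 8, 12]}


def locate_dimm(record, populated_slots):
    """Guess which DIMM slots to suspect from a machine check record.

    Assumes the record contains "CPU <n>" and "MS_CHANNEL<n>_ERR", and
    that the CPU number is zero-based while the board labels start at 1.
    Returns None if either field is missing.
    """
    cpu = re.search(r"\bCPU (\d+)\b", record)
    chan = re.search(r"MS_CHANNEL(\d+)_ERR", record)
    if not cpu or not chan:
        return None

    channel = int(chan.group(1))
    slots = CHANNEL_TO_SLOTS.get(channel, [])
    return {
        # Assumption: CPU 1 in the log is the CPU labeled 2 on the board.
        "board_cpu_label": int(cpu.group(1)) + 1,
        "channel": channel,
        # Only slots that actually hold a DIMM can be at fault.
        "suspect_slots": [s for s in slots if s in populated_slots],
    }


# Illustrative fragment built from the two fields used above.
sample = "CPU 1\nMEMORY CONTROLLER MS_CHANNEL3_ERR"

# On my board, slot 4 was the only populated slot on channel 3.
print(locate_dimm(sample, populated_slots={4}))
```

With the values from my case, this points at the CPU labeled 2, channel 3, slot 4, which is the DIMM that got swapped.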