How do I interpret MCE messages?
Solution 1:
You could try replacing the DIMM in question (CPU 0, SOCKET 8) and seeing whether the MCE messages keep appearing.
The mcelog package ships with default thresholds for various MCE events that occur over time; see /etc/mcelog/mcelog.conf for details. For memory page errors the threshold is 10 events over 24 hours. (I'm not really sure where this number comes from, but it's probably a reasonable reference point.) Your post mentions 77 correctable events over 24 hours spread across many different pages, so it's pretty likely that the DIMM has developed a problem, which may or may not turn into something more serious.
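To make the "10 events over 24 hours" comparison concrete, here is a minimal Python sketch that checks whether a list of event timestamps exceeds such a limit inside any 24-hour window. It is only an illustration: the function name exceeds_threshold is made up, the simple sliding window is an assumption (mcelog does its own accounting internally and may count differently), and you would have to extract the timestamps from your own logs yourself.

```python
from datetime import datetime, timedelta

def exceeds_threshold(timestamps, limit=10, window=timedelta(hours=24)):
    """Return True if more than `limit` events fall inside any `window`.

    Hypothetical helper for illustration; not part of mcelog.
    """
    events = sorted(timestamps)
    start = 0  # index of the oldest event still inside the window
    for end, ts in enumerate(events):
        # Drop events that are older than `window` relative to this one.
        while ts - events[start] > window:
            start += 1
        # Count of events currently inside the window.
        if end - start + 1 > limit:
            return True
    return False

if __name__ == "__main__":
    # Made-up timestamps: 77 events spread evenly across one day.
    base = datetime(2024, 1, 1)
    step = timedelta(hours=24) / 77
    sample = [base + i * step for i in range(77)]
    print(exceeds_threshold(sample))  # True
```

With the defaults this treats "more than 10 in 24 hours" as exceeding the limit; adjust limit and window to match whatever your mcelog.conf actually says.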
I wouldn't be too upset about receiving inconsistent information from different sources. In general I have found that anything at the firmware level is pretty platform specific (i.e. specific to that hardware model). My rule of thumb for firmware-related problems is that the vendor tools are usually the most accurate, but the least usable. The more generic open source tools are easier to work with, but may not provide enough information to show exactly what's going on.