MCE Error Codes/Pink Screen - Should they be a cause for concern?
I recently purchased a server-grade system along with all server-grade peripherals. I'm licensed for ESXi 6 and have all recent patches installed. The system had been running for about two weeks when, all of a sudden, it crashed completely.
I've interpreted this error code as "Internal Timer Error". I've forwarded the info to SuperMicro, but to be honest I'm not very confident in their responses so far. My expectation was that the system simply should not crash, given that it's a Xeon with ECC memory running ESXi.
Is it possible that this was a one-off error that won't happen again? How would you handle this? I'm looking for advice from those who have seen these types of errors, and what they actually ended up doing.
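You see this error (MCE, machine check exception) precisely because the system has ECC RAM. ECC corrects single-bit errors quietly, but when the hardware detects an error it cannot correct, it raises an MCE and the host halts rather than carrying on with corrupt data.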
You have some broken hardware somewhere, most likely a memory stick but possibly one or more processors (CPU 10 perhaps?) or something in between. Invoke your support contract.
It can be other parts of the hardware too, but every time I have seen this, it has been faulty ECC RAM experiencing multi-bit faults. If the MCE decoded as "Internal Timer Error", the next most likely culprit is a faulty CPU or mainboard.
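If you want to double-check the decoding yourself, here is a rough Python sketch of how a machine check status value (IA32_MCi_STATUS) breaks down, based on the flag bits and simple error codes described in the Intel SDM. The function name `decode_mci_status` and the example value are my own for illustration; the original post doesn't include the raw code, so substitute the value from your own crash screen. It only covers the simple codes, not the compound (cache/bus/memory) ones.

```python
# Simple MCA error codes found in bits 15:0 of IA32_MCi_STATUS (Intel SDM).
SIMPLE_CODES = {
    0x0000: "No Error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM Parity Error",
    0x0003: "External Error",
    0x0004: "FRC Error",
    0x0005: "Internal Parity Error",
    0x0400: "Internal Timer Error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its main fields.

    Hypothetical helper for illustration; it is not part of any ESXi tool.
    """
    code = status & 0xFFFF  # architectural MCA error code
    if code in SIMPLE_CODES:
        name = SIMPLE_CODES[code]
    elif (code & 0xFC00) == 0x0400:
        # Pattern 0000 01xx xxxx xxxx with at least one x bit set.
        name = "Internal Unclassified"
    else:
        name = "Compound or other code (see the SDM)"
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
        "mca_code_name": name,
    }


# Made-up example value, NOT taken from the original crash screen.
example = decode_mci_status(0xB200000000000400)
print(example["mca_code_name"], "| uncorrected:", example["uncorrected"])
```

An uncorrected error with the processor-context-corrupt flag set is the kind the host cannot safely recover from, which is why it ends in a crash screen rather than a log entry.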
Yes, it's a cause for concern. The server crashed!
Check your RAM and your CPU socket pins (if you hand-assembled the server).
That's about all the info you'll get. You can open a support case with VMware and they'll analyze the crash dump for you.