VMware lockup CPU spike

After a CPU usage spike, a VMware ESXi 5.5 host became completely unresponsive: the DRAC, the network, and its cluster membership all dropped out.

The host is a Dell PowerEdge M820 blade module in a Dell M1000e chassis, with 4 x Xeon E5-4620s, 128 GB RAM, and local SSDs in RAID 6.

All VMs run Server 2008 R2. There is one SQL server that uses the SSD RAID for data. Otherwise the VMs are stored on a QNAP with a 10 Gbit link.

The resources are not overcommitted.

No hardware failures have ever been logged or indicated on the blade module or the QNAP.

The server had to be cold rebooted from the M1000e DRAC before it became functional again.

This appears to be a VMware failure of some sort that hard-locked the hardware. However, the logs from before the lockup are missing: nothing was recorded for the 3 months prior to the reboot.

Since the restart, neither VMware nor the server hardware has reported or indicated any issues.

Has anyone else experienced anything like this? Any ideas, thoughts, suggestions?

Solution 1:

This is likely a problem with your Windows VM(s). Can you tell us which network driver(s) the Windows VMs are using? Intel e1000? Intel e1000e? VMware vmxnet3?

If they're not using the VMware vmxnet3 adapter, you're likely running into an awful bug that manifests itself in host crashes (PSODs). See the corresponding Knowledge Base article #2059053.

Here's a trace of a crash on an ESXi 5.5 host following heavy network activity between a Windows Server 2008 R2 and a Windows Server 2012 virtual machine.

The fix is to migrate to the vmxnet3 driver. This bites many people because e1000/e1000e are the defaults when creating Windows virtual machines.

Note the "e1000" references in the trace.
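
If you'd rather check the adapter type in bulk than open each VM's settings, a VM's .vmx file records it in lines such as ethernet0.virtualDev = "e1000". Below is a minimal Python sketch that scans the text of a .vmx file and flags adapters that are not vmxnet3. The function names and the sample text are made up for illustration. It also assumes the type is written explicitly: an adapter with no virtualDev line falls back to a default that depends on the guest OS, so the script does not report it and you should check it by hand.

```python
import re

# Matches lines such as: ethernet0.virtualDev = "e1000"
NIC_TYPE = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"', re.IGNORECASE
)


def nic_types(vmx_text):
    """Return {adapter_name: device_type} for every NIC with an explicit type."""
    found = {}
    for line in vmx_text.splitlines():
        match = NIC_TYPE.match(line)
        if match:
            found[match.group(1).lower()] = match.group(2).lower()
    return found


def needs_migration(vmx_text):
    """Return the sorted adapter names whose explicit type is not vmxnet3."""
    types = nic_types(vmx_text)
    return sorted(name for name, dev in types.items() if dev != "vmxnet3")


if __name__ == "__main__":
    # Hypothetical .vmx excerpt, for illustration only.
    sample = """
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
"""
    print(nic_types(sample))
    print(needs_migration(sample))
```

Any adapter the script lists is a candidate for the vmxnet3 migration described above.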