Intermittent 100% CPU on all VMs

We're a small shop, running a Dell T420 (dual CPU, only one present, 6 cores) w/32GB RAM as our main server. We have only 5 VMs, one of which is our WSE 2012 DC.

From time to time, and at a rate for which we've not been able to establish a reliable pattern, all of our VMs concurrently spike to 100% CPU. The host remains quiet at 4-5%. A host warm boot doesn't provide relief, but a cold boot at least puts things back in the box until the problem reoccurs.

Sometimes we can get a week or more of calm seas out of it; sometimes only a day. An unreliable pattern seems to be that it kicks off sometime during an extended idle period, e.g. overnight. An examination of the server's temperature logs first led us to suspect overheating, but further investigation into recent incidents has spoiled that lead.

We also found descriptions of similar problems on the Dell forums, with claims of resolution by installing the latest round of Dell updates. We recently engaged in a project to do just that (as an aside, it was quite an adventure getting ~700GB of VHDs safely off of and then back onto that machine), but to our utter dismay it didn't help.

We're absolutely befuddled. So is Microsoft support (or at least first tier support is, even though they try not to act like it). I'm including below our SystemInfo output.

Does anyone know where to start looking?

Thanks

===================================

Host Name:                 SERVER1
OS Name:                   Microsoft Hyper-V Server 2012 R2
OS Version:                6.3.9600 N/A Build 9600
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          Windows User
Registered Organization:   
Product ID:                06401-029-0000043-76293
Original Install Date:     4/3/2014, 4:07:15 PM
System Boot Time:          5/4/2014, 1:56:47 PM
System Manufacturer:       Dell Inc.
System Model:              PowerEdge T420
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 45 Stepping 7 GenuineIntel ~2200 Mhz
                           [Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20 GHz] (manually added)
BIOS Version:              Dell Inc. 2.1.2, 1/20/2014
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC-09:00) Alaska
Total Physical Memory:     32,723 MB
Available Physical Memory: 12,716 MB
Virtual Memory: Max Size:  37,587 MB
Virtual Memory: Available: 17,129 MB
Virtual Memory: In Use:    20,458 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    OIT
Logon Server:              \\SERVER1
Hotfix(s):                 31 Hotfix(s) Installed.
                           [01]: KB2843630
                           [02]: KB2862152
                           [03]: KB2868626
                           [04]: KB2876331
                           [05]: KB2883200
                           [06]: KB2884846
                           [07]: KB2887595
                           [08]: KB2892074
                           [09]: KB2893294
                           [10]: KB2894179
                           [11]: KB2898514
                           [12]: KB2898871
                           [13]: KB2901101
                           [14]: KB2901128
                           [15]: KB2903939
                           [16]: KB2904266
                           [17]: KB2908174
                           [18]: KB2909210
                           [19]: KB2911106
                           [20]: KB2913760
                           [21]: KB2916036
                           [22]: KB2917929
                           [23]: KB2919394
                           [24]: KB2919442
                           [25]: KB2922229
                           [26]: KB2923300
                           [27]: KB2923768
                           [28]: KB2928193
                           [29]: KB2928680
                           [30]: KB2930275
                           [31]: KB2939087
Network Card(s):           3 NIC(s) Installed.
                           [01]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC1
                                 DHCP Enabled:    No
                                 IP address(es)
                           [02]: Broadcom NetXtreme Gigabit Ethernet
                                 Connection Name: NIC2
                                 DHCP Enabled:    Yes
                                 DHCP Server:     192.168.1.12
                                 IP address(es)
                                 [01]: 192.168.1.135
                                 [02]: fe80::915b:8de0:712e:29f1
                           [03]: Hyper-V Virtual Ethernet Adapter
                                 Connection Name: vEthernet (External NIC 1_Internal)
                                 DHCP Enabled:    No
                                 IP address(es)
                                 [01]: 192.168.1.11
                                 [02]: fe80::2d35:f582:4958:9eb2
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

== EDIT ======================

I've found the solution to this issue; I waited for over a year to make sure we didn't encounter any more instances of the problem.

Moderators: I'd like to request a reopening of the question, so that I can post the answer.


After over a year of waiting so as to prove the solution as valid, I'm finally able to post this answer.

Dell's default BIOS settings have C-States enabled, which put the computer into a low-power mode during idle times. This is what causes the VMs to spiral into 100% CPU usage on a hypervisor host (VMware and Citrix included).

The solution is to set the System Profile setting in the BIOS to Performance, as opposed to Performance per watt [OS] or Performance per watt [DAPC] (the latter being the default).

The relevant Dell documentation (see p. 3):

http://en.community.dell.com/techcenter/extras/m/white_papers/20161975/download

And this reply from one of the few Dell support engineers who's familiar with the issue:

The short version is: C-States disable additional processor cores during idle times. For VMs that are tied to a core (this is OS controlled, I do not believe it's configurable), this could result in them locking up, as they're attempting to perform actions with resources that no longer exist in their eyes.

Generally speaking, C-States are used on items like backup servers and secondary role servers (backup DNS, DHCP, domain controllers, etc.), so that those servers can remain on, but in a low-power mode to save energy.

Additional documentation can be found here:

http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface

In a nutshell, power idling on a Dell server should always be turned off (set to Performance) for Hypervisor hosts.

Thanks to Eddy Simons at Kitsap Bank for helping me to find this solution.


It's unclear as to what the problem is; you already know that. We have no chance of telling you what the cause is.

However, you can run some tests:

  • Build VM 1

    • Run a CPU intensive task on this VM constantly
      (Perform millions of complex mathematical calculations per second)
  • Build VM 2

    • Run a RAM intensive task on this VM constantly
      (Create a giant array in memory, delete it, repeat)
  • Build VM 3

    • Run a DISK intensive task on this VM constantly
      (Read/write/delete millions of lines to/from a file)
  • Build VM 4

    • Run a NETWORK intensive task on this VM constantly
      (Copy files to/from an SMB share)

Wait until the problem occurs again, then observe performance data on each of these servers.
Which was most affected?
Were any not affected at all?
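
For what it's worth, here is a rough sketch of what the CPU, RAM and disk workloads could look like as a small Python script run inside each test VM. Everything in it (the function names, the loop sizes, the log format) is made up for illustration and is not tuned for any particular guest; the network test is left out because it depends on a share that only you have. Each run is timed, so a sudden jump in the per-run duration would show which kind of workload suffers when the problem hits.

```python
import os
import tempfile
import time


def cpu_task(iterations=200_000):
    """CPU-bound work: a tight loop of floating-point arithmetic."""
    total = 0.0
    for i in range(1, iterations + 1):
        total += (i ** 0.5) / (i + 1.0)
    return total


def ram_task(size=1_000_000):
    """RAM-bound work: build a large list, then drop it."""
    data = list(range(size))
    count = len(data)
    del data
    return count


def disk_task(lines=10_000):
    """Disk-bound work: write, read back, and delete a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as handle:
            for i in range(lines):
                handle.write(f"line {i}\n")
        with open(path) as handle:
            return sum(1 for _ in handle)
    finally:
        os.remove(path)


def timed(task):
    """Run one task once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = task()
    return result, time.perf_counter() - start


def run_forever(task, interval=0.0):
    """Repeat a task without end, printing a timestamped duration per run."""
    while True:
        _, elapsed = timed(task)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {task.__name__} {elapsed:.3f}s", flush=True)
        time.sleep(interval)
```

On VM 1 you would call run_forever(cpu_task), on VM 2 run_forever(ram_task), and on VM 3 run_forever(disk_task), redirecting the output to a file so the timings from before and after an incident can be compared.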

My guess is that your disks suck and the CPU is waiting for IO operations to complete before continuing, which can cause some applications to flatline the CPU.