Intermittent 100% CPU on all VMs
We're a small shop, running a Dell T420 (dual CPU, only one present, 6 cores) w/32GB RAM as our main server. We have only 5 VMs, one of which is our WSE 2012 DC.
From time to time, and at a rate for which we've not been able to establish a reliable pattern, all of our VMs concurrently spike to 100% CPU. The host remains quiet at 4-5%. A host warm boot doesn't provide relief, but a cold boot at least puts things back in the box until the problem reoccurs.
Sometimes we can get a week or more of calm seas out of it; sometimes only a day. An unreliable pattern seems to be that it kicks off sometime during an extended idle period, i.e. overnight. An examination of the server's temperature logs first led us to suspect overheating, but further investigation into recent incidents has spoiled that lead.
We also found descriptions of similar problems on the Dell forums, with claims of resolution by installing the latest round of Dell updates. We recently engaged in a project to do just that (as an aside, it was quite an adventure getting ~700GB of VHDs safely off of and then back onto that machine), but to our utter dismay it didn't help.
We're absolutely befuddled. So is Microsoft support (or at least first tier support is, even though they try not to act like it). I'm including below our SystemInfo output.
Does anyone know where to start looking?
Thanks
===================================
Host Name: SERVER1
OS Name: Microsoft Hyper-V Server 2012 R2
OS Version: 6.3.9600 N/A Build 9600
OS Manufacturer: Microsoft Corporation
OS Configuration: Standalone Server
OS Build Type: Multiprocessor Free
Registered Owner: Windows User
Registered Organization:
Product ID: 06401-029-0000043-76293
Original Install Date: 4/3/2014, 4:07:15 PM
System Boot Time: 5/4/2014, 1:56:47 PM
System Manufacturer: Dell Inc.
System Model: PowerEdge T420
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
    [01]: Intel64 Family 6 Model 45 Stepping 7 GenuineIntel ~2200 Mhz
          [Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20 GHz] (manually added)
BIOS Version: Dell Inc. 2.1.2, 1/20/2014
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume1
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC-09:00) Alaska
Total Physical Memory: 32,723 MB
Available Physical Memory: 12,716 MB
Virtual Memory: Max Size: 37,587 MB
Virtual Memory: Available: 17,129 MB
Virtual Memory: In Use: 20,458 MB
Page File Location(s): C:\pagefile.sys
Domain: OIT
Logon Server: \\SERVER1
Hotfix(s): 31 Hotfix(s) Installed.
    [01]: KB2843630
    [02]: KB2862152
    [03]: KB2868626
    [04]: KB2876331
    [05]: KB2883200
    [06]: KB2884846
    [07]: KB2887595
    [08]: KB2892074
    [09]: KB2893294
    [10]: KB2894179
    [11]: KB2898514
    [12]: KB2898871
    [13]: KB2901101
    [14]: KB2901128
    [15]: KB2903939
    [16]: KB2904266
    [17]: KB2908174
    [18]: KB2909210
    [19]: KB2911106
    [20]: KB2913760
    [21]: KB2916036
    [22]: KB2917929
    [23]: KB2919394
    [24]: KB2919442
    [25]: KB2922229
    [26]: KB2923300
    [27]: KB2923768
    [28]: KB2928193
    [29]: KB2928680
    [30]: KB2930275
    [31]: KB2939087
Network Card(s): 3 NIC(s) Installed.
    [01]: Broadcom NetXtreme Gigabit Ethernet
          Connection Name: NIC1
          DHCP Enabled: No
          IP address(es)
    [02]: Broadcom NetXtreme Gigabit Ethernet
          Connection Name: NIC2
          DHCP Enabled: Yes
          DHCP Server: 192.168.1.12
          IP address(es)
          [01]: 192.168.1.135
          [02]: fe80::915b:8de0:712e:29f1
    [03]: Hyper-V Virtual Ethernet Adapter
          Connection Name: vEthernet (External NIC 1_Internal)
          DHCP Enabled: No
          IP address(es)
          [01]: 192.168.1.11
          [02]: fe80::2d35:f582:4958:9eb2
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
== EDIT ======================
I've found the solution to this issue; I waited for over a year to make sure we didn't encounter any more instances of the problem.
Moderators: I'd like to request a reopening of the question, so that I can post the answer.
After over a year of waiting so as to prove the solution as valid, I'm finally able to post this answer.
Dell's default BIOS settings have C-States enabled, which put the processor into low-power states during idle times. This is what causes the VMs to spiral into 100% CPU usage on a hypervisor host (VMware and Citrix included).
The solution is to set the System Profile setting in the BIOS to Performance, as opposed to Performance per watt [OS] or Performance per watt [DAPC] (the latter being the default).
The relevant Dell documentation (see p. 3):
http://en.community.dell.com/techcenter/extras/m/white_papers/20161975/download
And this reply from one of the few Dell support engineers who's familiar with the issue:
The short version is: C-States disable additional processor cores during idle times. For VMs that are tied to a core (this is OS controlled, I do not believe it's configurable), this could result in them locking up, as they're attempting to perform actions with resources that no longer exist in their eyes.
Generally speaking, C-States are used on machines like backup servers and secondary-role servers (backup DNS, DHCP, domain controllers, etc.), so that those servers can remain on, but in a low-power mode to save energy.
Additional documentation can be found here:
http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface
In a nutshell, power idling on a Dell server should always be turned off (set to Performance) for Hypervisor hosts.
Thanks to Eddy Simons at Kitsap Bank for helping me to find this solution.
It's unclear as to what the problem is; you already know that. We have no chance of telling you what the cause is.
However, you can run some tests:
- Build VM 1
  - Run a CPU-intensive task on this VM constantly
    (perform millions of complex mathematical calculations per second)
- Build VM 2
  - Run a RAM-intensive task on this VM constantly
    (create a giant array in memory, delete it, repeat)
- Build VM 3
  - Run a disk-intensive task on this VM constantly
    (read/write/delete millions of lines to/from a file)
- Build VM 4
  - Run a network-intensive task on this VM constantly
    (copy files to/from an SMB share)
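As a rough illustration of the first three workloads, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (cpu_task, ram_task, disk_task, run_forever) and the default sizes are hypothetical, and any stress tool you already trust would do the same job. The network workload is left out because it depends on a share path specific to your environment.

```python
import os
import tempfile
import time


def cpu_task(iterations: int = 200_000) -> float:
    """Burn CPU with floating-point math and return the accumulated value."""
    total = 0.0
    for i in range(1, iterations + 1):
        total += (i ** 0.5) * (i % 7)
    return total


def ram_task(size_mb: int = 64) -> int:
    """Allocate a large buffer, touch one byte per 4 KiB, then release it."""
    buf = bytearray(size_mb * 1024 * 1024)
    for offset in range(0, len(buf), 4096):
        buf[offset] = 1
    size = len(buf)
    del buf  # drop the buffer so the next call allocates it again
    return size


def disk_task(lines: int = 100_000) -> int:
    """Write lines to a temp file, read them back, delete the file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as handle:
            for i in range(lines):
                handle.write(f"line {i}\n")
        with open(path) as handle:
            count = sum(1 for _ in handle)
    finally:
        os.remove(path)
    return count


def run_forever(task, pause_seconds: float = 0.0) -> None:
    """Repeat one workload until the process is stopped (hypothetical helper)."""
    while True:
        task()
        time.sleep(pause_seconds)
```

On each test VM you would call something like run_forever(cpu_task) and leave it running; raise the sizes until the VM shows a steady, recognizable load.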
Wait until the problem occurs again, observe performance data on each of these servers.
Which was most affected?
Were any not affected at all?
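One way to make those comparisons less subjective is to time a fixed chunk of work inside each VM and look for samples that take far longer than usual. The sketch below is only an assumption about how that could be done; time_fixed_workload, summarize and the factor of 3 are hypothetical choices, not part of any existing tool.

```python
import time


def time_fixed_workload(iterations: int = 100_000) -> float:
    """Return how many seconds a fixed amount of arithmetic takes."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return time.perf_counter() - start


def summarize(samples, factor: float = 3.0):
    """Return the median sample and every sample slower than factor * median."""
    baseline = sorted(samples)[len(samples) // 2]
    slow = [s for s in samples if s > factor * baseline]
    return baseline, slow
```

Collecting a sample every minute or so on each VM, then running summarize over the samples taken around an incident, would show which guests slowed down and by how much.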
My guess is that your disks suck and the CPU is waiting for IO operations to complete before continuing, which can cause some applications to flatline the CPU.