When is sustained 100% CPU utilization not a worry?
Please help refine a discussion going on in our shop.
Consider the following scenario: a Microsoft Virtual PC guest running Windows Server 2003 hosts several apps and services and has two or three critical roles. Every so often, CPU utilization goes to 100% on a sustained basis. One of the culprits is a legacy application for which the only real fix, at this time, is to restart its service; afterward, CPU utilization returns to something reasonable (60-80% on average).

Less often, however, when the server is at 100% CPU, a different service appears to be using the lion's share: a security application that parses logs. Our operations team's impulse is to restart that one as well whenever the CPU becomes pegged. Our security team counters that this is pointless, because the service runs at BelowNormal priority and therefore is not depriving any other process of CPU. Security argues that 100% CPU usage in those cases should not be treated as a critical condition: if a BelowNormal priority process is consuming most of the CPU, there is no real CPU deficit at all. Operations, on the other hand, is skeptical that sustained 100% CPU utilization could be harmless and doesn't want to ignore it.

Who is right? Is Security correct that it's nothing to worry about, or is Operations correct that we ought to do something?
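To make the dispute concrete, here is a minimal sketch (using Python's psutil on a machine with a reasonably recent Python; Server 2003 itself may be too old for current psutil builds) of how one could verify both the priority claim and the consumption figure. The name logparser.exe is a placeholder for whatever the security service's binary is actually called:

    # Minimal sketch: confirm the service really runs at BelowNormal
    # priority and see how much CPU it is using.
    # Requires: pip install psutil
    # "logparser.exe" is a placeholder, not the service's real binary name.
    import psutil

    TARGET = "logparser.exe"  # hypothetical name of the log-parsing service

    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == TARGET:
            prio = proc.nice()  # on Windows, nice() returns the priority class
            busy = proc.cpu_percent(interval=1.0)  # % over a 1-second sample
            print(f"pid={proc.pid} priority={prio} "
                  f"below_normal={prio == psutil.BELOW_NORMAL_PRIORITY_CLASS} "
                  f"cpu%={busy:.0f}")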
Solution 1:
In cases like this, you need to go beyond Task Manager and its % CPU figure, which by itself does not tell you whether anything is actually being starved. The next step is to use Performance Monitor to watch System\Processor Queue Length, which counts the threads that are ready to run but waiting for processor time. A sustained queue means processes really are being delayed; this counter plays a role similar to the load average reported by top or uptime on Unix.
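If you want to capture that counter outside of the Perfmon GUI, typeperf (which ships with Windows, including Server 2003) can sample it from a script. Here is a minimal sketch in Python, assuming typeperf's default CSV output; adjust the sample count and interval to taste:

    # Minimal sketch: sample \System\Processor Queue Length via typeperf
    # and summarize it.
    import subprocess

    COUNTER = r"\System\Processor Queue Length"

    # 10 samples, 1 second apart; typeperf emits CSV lines of "time","value"
    result = subprocess.run(
        ["typeperf", COUNTER, "-sc", "10", "-si", "1"],
        capture_output=True, text=True, check=True,
    )

    samples = []
    for line in result.stdout.splitlines():
        parts = [p.strip('"') for p in line.strip().split('","')]
        if len(parts) == 2:
            try:
                samples.append(float(parts[1]))
            except ValueError:
                pass  # skip header and footer lines

    if samples:
        print(f"queue length over {len(samples)} samples: "
              f"min={min(samples):.0f} max={max(samples):.0f} "
              f"avg={sum(samples)/len(samples):.1f}")

A common rule of thumb is that a sustained queue length above roughly 2 per core, while the CPU is pegged, indicates genuine contention. A pegged CPU with a near-zero queue is exactly what Security's argument predicts.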
This article has a good description of the performance metrics to look at when troubleshooting these issues. It was originally for NT4, but still applies to newer versions.
Here is a more recent article from the Windows Performance Team talking about how to hunt down performance issues with the CPU.
Solution 2:
How are you measuring the CPU%? If this is a virtual machine, Perfmon inside the guest may not always deliver accurate results. Is there a chance the spike is related to activity on the host machine? Virus scans, auto-updaters, and plenty of other things can load the host and make it look like 100% CPU from the guest's perspective, when it may really be 100% of a much smaller CPU slice.
Solution 3:
Processing huge amounts of log data is something that SHOULD peg the CPU; if it doesn't, the process is probably I/O-bound. As long as the meter goes back down when processing completes, and the machine stays reasonably responsive to its other duties while pegged, it's nothing to worry about.
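One quick way to tell the two cases apart is to watch the process's CPU and disk reads side by side while it chews through logs. A rough sketch with Python's psutil (PID 1234 is a placeholder for the log-processing service's actual PID):

    # Minimal sketch: judge whether a process is CPU-bound or I/O-bound
    # by sampling its CPU usage and disk reads over the same window.
    # Requires: pip install psutil
    import psutil

    proc = psutil.Process(1234)  # placeholder PID

    io_before = proc.io_counters()
    cpu = proc.cpu_percent(interval=5.0)  # % over a 5-second window
    io_after = proc.io_counters()

    read_mb = (io_after.read_bytes - io_before.read_bytes) / (1024 * 1024)
    print(f"cpu%={cpu:.0f}, read {read_mb:.1f} MB in 5 s")
    if cpu < 50:
        print("well under one core: probably waiting on the disk, not the CPU")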