Intermittent high CPU (100%) on production webserver
X-Post from StackOverflow:
https://stackoverflow.com/questions/9465123/intermittent-high-cpu-100-on-production-webserver
We have a web cluster with 3 web-servers, each with 24 cores & 24GB mem.
Our application runs on fully patched ASP.NET 4.0 with MVC3, on IIS 7.5, in its own application pool.
Very intermittently (maybe once every 2-3 days), one of the web servers will stop serving requests, and all 24 cores will show 100% CPU (memory & disk look normal).
The few times when IIS manager isn't completely frozen, the active running requests don't seem to offer any useful information, with a pretty random spread across a large number of site areas/requests.
Once a server has died, we are able to take it out of load, and after maybe 5 minutes of no longer serving requests, the CPU activity will drop back to normal, which makes us think it isn't an infinite loop.
A memory dump of the worker process (around 4GB in size!) doesn't seem to show any of our code/namespaces anywhere in any of the managed stack traces, only .NET begin-request stuff. (It's possible I'm using WinDbg wrong and not loading our symbols correctly, but the stack traces don't show any missing/unnamed method calls, so I'm quite confused.)
Our servers are normally processing 1000 req/sec quite happily, so this is all very strange.
One weird thing we noticed in Perfmon was that the Contention Rate / sec climbs to around 800. We don't have any fancy multi-threaded code in our app, and the only locks we have are in our caching code (which hasn't changed in ages).
Any advice/tips on how to further diagnose this issue would be most appreciated.
Cheers.
Solution 1:
Dave, A few thoughts to start you:
I am assuming it's w3wp.exe that is eating your resources. If not, it might be worth running some PAL reports to get better insight into the overall health of the server: http://pal.codeplex.com/ Sometimes I'll run PAL even if it is an IIS problem... PAL can spot all sorts of problems that you would never think about.
Check Performance Monitor (both before and during your spike)... try to figure out if your ASP.NET Applications Requests/Sec counter is higher during the "slow response" periods... I find that to be the fastest way to tell if you are handling more requests than normal.
Try to figure out if there is one page (or a few) taking longer to load. Be sure IIS stats are being logged, and then look for an increase in time-taken. Check out Log Analyzer (http://www.iis.net/community/default.aspx?tabid=34&g=6&i=1864).
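To make the time-taken check concrete, here is a minimal Python sketch that ranks URLs by mean time-taken from an IIS log. It assumes the log is in W3C extended format with a `#Fields:` header and that both `cs-uri-stem` and `time-taken` are enabled in your logging settings; the function name `slowest_urls` is just made up for illustration.

```python
import sys
from collections import defaultdict


def slowest_urls(lines, top=5):
    """Return (url, request_count, mean_time_taken_ms) tuples, slowest first."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # url -> [request count, total ms]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns; it can reappear mid-file.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        try:
            ms = int(row["time-taken"])
            url = row["cs-uri-stem"]
        except (KeyError, ValueError):
            continue  # skip truncated or malformed rows
        totals[url][0] += 1
        totals[url][1] += ms
    ranked = [(url, count, total / count) for url, (count, total) in totals.items()]
    ranked.sort(key=lambda r: r[2], reverse=True)
    return ranked[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for url, count, mean_ms in slowest_urls(f):
            print(f"{mean_ms:10.1f} ms  {count:8d}  {url}")
```

Run it once against a log from a normal period and once against a log covering a spike, then compare the two rankings. A mean hides outliers, so treat this as a first pass only.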
Oh, and don't forget the StackExchange mini profiler http://code.google.com/p/mvc-mini-profiler/ once you figure out what URL is causing the problem.
Also, don't overlook any .NET error catching you have in place :-)
Let us know what you see. -Chris
Solution 2:
Use DebugDiag 1.2 to perform the analysis of the dump:
https://www.microsoft.com/download/en/details.aspx?id=26798
It's useful to be aware that any process capable of using more than one thread can push utilization to 100% on all processors of a server. This includes native code and even core OS components.
When you say "latest patched", to me that means with Windows Update, which does not get a lot of the more serious bugfixes for Windows 2008 R2.
In particular, if the application is accessing any files on remote shares, it would be a good idea to have the file system hotfixes applied:
List of currently available hotfixes for the File Services technologies in Windows Server 2008 and in Windows Server 2008 R2
http://support.microsoft.com/kb/2473205
Solution 3:
Check whether it's being targeted by a HashDoS attack, and set up request limits.
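One rough way to look for this in the IIS logs is sketched below in Python. It rests on an assumption, not a confirmed signature: that a HashDoS attempt would show up as unusually large POST bodies. It also assumes W3C extended logs with `c-ip`, `cs-method` and `cs-bytes` enabled (`cs-bytes` is often not logged by default); the function name and the 1 MB threshold are arbitrary choices for illustration.

```python
from collections import Counter


def large_posts_by_client(lines, min_bytes=1_000_000):
    """Count POST requests of at least min_bytes per client IP, busiest first."""
    fields = []
    hits = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the rows that follow
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        try:
            if row["cs-method"] == "POST" and int(row["cs-bytes"]) >= min_bytes:
                hits[row["c-ip"]] += 1
        except (KeyError, ValueError):
            continue  # field not logged, or malformed row
    return hits.most_common()
```

If a handful of clients turn up with many large POSTs clustered around the time of a spike, that is worth a closer look; an empty result does not rule an attack out.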