What steps should I take to resolve an unresponsive/hung/broken IIS web site?

Solution 1:

I've found the following guidance works pretty well as a general collection guide.

Determine Symptoms

Try to establish (as quickly as possible) the surface area of the problem:

  • Connectivity? (Telnet is good; if you get an error page returned in the browser, something's obviously working - eliminate connectivity first)

  • General App Pool failure, or specific to a content type? (Do ASPX files work/not work, but .HTM work? Do you have canary files for each app and content type?)

  • Specific in-app failure, hang, or crash? (Most of this is for hangs and app failures; crashes dictate their own methodology: get a crash dump, debug it)

As a rule, always write the symptoms down: you might be dealing with several at once, and being able to refer back to your notes on an earlier incident can be invaluable.
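A minimal sketch of those triage questions in Python, assuming you keep one static and one dynamic canary file per app. The function names and the wording of the verdicts are mine, not anything IIS provides; the logic is just "eliminate connectivity first, then content type".

```python
import socket
import urllib.error
import urllib.request


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection succeeds (the Telnet-style check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code, or None on timeout/connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a success code
    except (urllib.error.URLError, OSError):
        return None


def classify(tcp_ok, static_status, dynamic_status):
    """Map canary results onto the three triage questions above."""
    if not tcp_ok:
        return "connectivity"
    if static_status != 200:
        return "site or app pool wide"
    if dynamic_status != 200:
        return "specific to dynamic content"
    return "canaries healthy; suspect a specific in-app failure"
```

You would feed `classify` the results of `port_open` and of `fetch_status` against your own canary URLs (for example a .HTM and an .ASPX file), and paste the verdict into your incident notes.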

Collect Data

aka "Collect Temporal Data" - You have a limited window to collect certain data while there's an outage. Some data - like the process memory - is ephemeral and will disappear if you take corrective action first. Other data - like logs - might take time to copy, but you could just as easily get them afterwards. So understand what data you need to collect NOW vs post-restoration.

  1. Grab whatever time-sensitive/timely data you will need to resolve the issue later. Don't worry about persistent stuff - Event Logs and IIS logs stick around, unless you're a compulsive clearer, in which case: stop it. (Those that don't have an Event Log of last week are doomed to repeat it)

  2. Determine the affected worker process (and dump it)

    • APPCMD LIST WP can help with this, or the Worker Processes GUI at the Server level.
    • If using the GUI, don't forget to look at the Current Requests by right-clicking the worker process - if the list comes back, it'll show you which module (DLL) the requests are jammed in, which can help you guess a cause early.

    • Determine the scope (i.e. just one App Pool, multiple App Pools, two with dependencies - this depends on your app and website layout)

    • Grab a memory dump of the worker process - once you've identified which App Pool has the problem, identify the relevant Worker Process, and use Task Manager to create a memory dump by right-clicking that process and choosing Create dump file. Note the filename for later.

    • Note on Task Manager bitness: use the same bitness of Task Manager as the Worker Process you're dumping - if you dump a 32-bit WP (w3wp*32) with 64-bit Task Manager, the dump is not going to be interpretable. To dump a 32-bit process on 64-bit Windows, exit Task Manager, run %WINDIR%\SYSWOW64\TaskMgr.exe to get the 32-bit version, then dump with that. (A ten-second detour, but you must do it at the time.)
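If you'd rather script the "which worker process belongs to which App Pool" step, here is a rough Python sketch around APPCMD LIST WP. It assumes the output lines look like `WP "3260" (applicationPool:DefaultAppPool)` and that appcmd.exe lives in the default inetsrv folder; check both on your own server before relying on it.

```python
import re
import subprocess

# Default location of appcmd.exe; an assumption, adjust for your server.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Assumed line shape: WP "3260" (applicationPool:DefaultAppPool)
WP_LINE = re.compile(r'^WP "(\d+)" \(applicationPool:(.+)\)$')


def parse_wp_list(text):
    """Turn 'appcmd list wp' output into {app pool name: [pids]}."""
    pools = {}
    for line in text.splitlines():
        match = WP_LINE.match(line.strip())
        if match:
            pid, pool = int(match.group(1)), match.group(2)
            # A pool can own several worker processes (web garden).
            pools.setdefault(pool, []).append(pid)
    return pools


def list_worker_processes():
    """Run appcmd on the local server and parse its output."""
    result = subprocess.run(
        [APPCMD, "list", "wp"], capture_output=True, text=True, check=True
    )
    return parse_wp_list(result.stdout)
```

Calling `list_worker_processes()` from an elevated prompt gives you the PIDs to look for in Task Manager; write them into your notes next to the dump filename.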

Restore Service

You've now got all the point-in-time info you think you need for diagnostics, so it's time to get the website customers back in business.

  1. Recycle the minimum number of Worker Processes in order to restore service.

    • Don't bother stopping and starting Websites, you generally need the App Pool to be refreshed in order to get the site working again, and that's what a Recycle does.

    • Recycling the App Pool is enough 9 times out of 10.

    • Note that recycling appears to happen on the next request to come in (even though the existing WP has been told to go away), so a worker process may not immediately reappear. That doesn't mean it hasn't worked, just that no requests are waiting.

    • IISReset is usually a tool used by people that don't know better. Don't use it unless you need every website to terminate and restart all at once. (It's like trying to hammer a nail into a wall with a brick. It might work, but you kinda look like an idiot, and there's going to be collateral damage at some point).

    • You may have other app dependencies - app pools depending on other app pools, or databases, or external systems... What you have to do to restore service tells you something about the scope of the problem. Last in the list is a full reboot. Unless a kernel-level driver really got messed up, that's typically not necessary; it's a useful catch-all for when you can't determine which thing actually needed restarting.
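As a sketch of "recycle the minimum", the Python below builds one APPCMD RECYCLE APPPOOL command per affected pool and nothing else. The pool names and the dry-run default are illustrative assumptions on my part, and the appcmd.exe path is the default install location.

```python
import subprocess

# Default location of appcmd.exe; an assumption, adjust for your server.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"


def recycle_commands(pool_names):
    """Build one recycle command per distinct pool, preserving order."""
    distinct = []
    for name in pool_names:
        if name not in distinct:
            distinct.append(name)
    return [
        [APPCMD, "recycle", "apppool", f"/apppool.name:{name}"]
        for name in distinct
    ]


def recycle(pool_names, dry_run=True):
    """Print the commands by default; run them only when dry_run is False."""
    for cmd in recycle_commands(pool_names):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Something like `recycle(["Shop"])` shows what would be run; pass `dry_run=False` once you've taken your dumps. Keeping the list explicit is the point: it forces you to name the pools you believe are affected, which is also what you want in your notes.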

Determine Cause

i.e. look at and think about the data you've collected.

  1. Take the logs and the memory dump, look for commonalities, engage the app developers, debug the dump with DebugDiag (or newer) or WinDBG, and so on.
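One cheap way to look for commonalities is to count which URLs were slow or failing around the outage. The Python sketch below reads IIS W3C log lines and uses the #Fields: header to find columns, so it only works if cs-uri-stem, sc-status and time-taken are among your logged fields; the 30-second threshold is an arbitrary assumption.

```python
from collections import Counter


def suspect_urls(lines, slow_ms=30000):
    """Count requests per URL that returned 5xx or took >= slow_ms."""
    fields = []
    hits = Counter()
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            # The header names the columns for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        status = row.get("sc-status", "")
        taken = row.get("time-taken", "")
        failed = status.isdigit() and int(status) >= 500
        slow = taken.isdigit() and int(taken) >= slow_ms
        if failed or slow:
            hits[row.get("cs-uri-stem", "?")] += 1
    return hits
```

Open the log file for the outage window, pass the file object to `suspect_urls`, and `most_common(10)` on the result gives you a shortlist to take to the app developers alongside the dump.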

Set up for next time

Do you know you've fixed it? If not, and especially if nothing else seems to have changed, think about what you could capture next time if you were better set up for it.

  1. Don't assume it's the last occurrence - develop a plan for what you'll need to collect next time, based on this time.

    • For example, if the requests are all for the same URL, implement some additional instrumentation or logging, or a Failed Request Tracing rule that'll help identify the spot on the page that experiences a problem.

    • Performance monitor logs are helpful (if in doubt, get a perfmon log too).

    • Look at other tools which might be useful - ProcDump, XPerf/WPT/WPR, and so on. If all you have is a hammer, every problem has to be a nail.

    • Think about whether "papering over" the issue is acceptable while seeking actual root cause - if the outage is really bad, something like adjusting the recycling settings for the App Pool might be acceptable to minimize the likelihood or the duration of an outage (except where that conflicts with being able to troubleshoot it)...
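Part of being set up for next time can be having the dump command ready before you need it. This Python sketch builds a ProcDump full-memory dump command (-ma) with a timestamped filename; it assumes procdump is on the PATH, and the filename scheme and output folder are hypothetical, so pick whatever suits your own layout.

```python
import ntpath
from datetime import datetime


def procdump_command(pid, pool, out_dir, when):
    """Build a ProcDump command that writes a full dump of one worker process."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    # ntpath builds Windows-style paths regardless of where this runs.
    dump_path = ntpath.join(out_dir, f"w3wp-{pool}-{pid}-{stamp}.dmp")
    # -ma: full memory dump; -accepteula: skip the first-run licence prompt.
    return ["procdump", "-accepteula", "-ma", str(pid), dump_path]
```

On the server you would pass the result to `subprocess.run(...)` with `datetime.now()` as the timestamp. Putting the App Pool name and time in the filename saves you from guessing later which dump goes with which incident.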