List of Sysadmin Troubleshooting "Fire Drills"

One of the hardest things to do is to train sysadmins to think through problems in a consistent way, especially under pressure, when the emergency bells are ringing.

For some training sessions, I'd like to come up with a collection of "Fire Drills", each with some simple but reasonable steps attached that could narrow down the issue. For example:

Web Site Down

  1. Narrow it down - down for internal network, external, or both? From one location or global?
  2. DNS - Does it resolve?
  3. Port - Is it open? Is it responding? (Use Telnet)
  4. Host Headers - Correct?
  5. Web Server - Errors in event viewer?
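
Steps 2 to 4 of this drill can be roughed out as a script, which is handy for showing trainees the order of the checks. This is only a sketch: the function names are made up for illustration, it assumes plain HTTP on port 80 unless told otherwise, and it does not replace looking at the server's own logs (step 5).

```python
import socket
import http.client


def check_dns(hostname):
    """Step 2: return the resolved addresses, or an empty list if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the address; de-duplicate and sort.
    return sorted({info[4][0] for info in infos})


def check_port(host, port, timeout=3.0):
    """Step 3: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_http(address, host_header, port=80, timeout=5.0):
    """Step 4: GET / with an explicit Host header; return the status or None."""
    conn = http.client.HTTPConnection(address, port, timeout=timeout)
    try:
        conn.request("GET", "/", headers={"Host": host_header})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def run_drill(hostname, port=80):
    """Run the checks in order and stop at the first failing step."""
    addresses = check_dns(hostname)
    if not addresses:
        return "DNS: does not resolve"
    if not check_port(addresses[0], port):
        return f"Port: {addresses[0]}:{port} not reachable"
    status = check_http(addresses[0], hostname, port)
    if status is None or status >= 500:
        return f"HTTP: bad response ({status}); check web server logs"
    return f"OK: HTTP {status}"
```

Calling `run_drill("www.example.com")` from an internal machine and again from an external one covers step 1 as well, since a difference between the two results tells you which side of the network to look at.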

It would be incredibly helpful if you could add one of your 'drills', too. Other ways of training sysadmin thinking are also welcome.


Sysadmin-ing (I made this word up) is kind of a type of 'general medicine'. You have to be strong with the OS, hardware, network, security and sometimes with development (you need to at least understand the languages you are working with).

One good way to train sysadmins is to run break-and-fix sessions. I did it once to test new applicants for a job: they had to bring a server up from scratch (so you can check their grasp of installing/partitioning), configure servers and services, and do a little basic hardening. After that I would go in and break things: minor changes to the hosts file, a corrupted or incorrect passwd or shadow, you name it. Then I would see if the candidates could solve the problem in a logical way and in good time.
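
If you want the "mess it up" part to be repeatable, a small fault injector helps. The sketch below is my own illustration, not a tool I actually used: the function name is hypothetical, it works on the text of a hosts file rather than on /etc/hosts itself, and it should only ever be pointed at a throwaway lab machine. It repoints one entry at 192.0.2.1, an address from the range reserved for documentation, and returns a note so the trainer knows what was broken.

```python
import random


def inject_hosts_fault(text, rng=random):
    """Return (broken_text, note) with one host entry pointed at a wrong IP."""
    lines = text.splitlines()
    # Only real entries are candidates: skip blank lines and comments.
    candidates = [
        i for i, line in enumerate(lines)
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not candidates:
        raise ValueError("no host entries to break")
    i = rng.choice(candidates)
    fields = lines[i].split()
    original_ip = fields[0]
    fields[0] = "192.0.2.1"  # TEST-NET-1, reserved for documentation
    lines[i] = " ".join(fields)
    note = f"line {i + 1}: {original_ip} -> 192.0.2.1"
    return "\n".join(lines) + "\n", note
```

Passing a seeded `random.Random` as `rng` gives every candidate the same fault, which makes the results easier to compare.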

I agree with your drills idea, but I think they should maybe go a little deeper. For example, if you reach step 5 in the web site down scenario, where do you go from there?

Here is a drill in the same format as yours:

Users behind a proxy/NAT can't browse the web anymore

  1. Check if it's only one user or more
  2. Check connectivity to the proxy (ping, open ports, etc)
  3. Check if the proxy machine is responsive (load problems, etc)
  4. Check the logs
  5. Check processes/disks on proxy machine (too many processes, disk full)
  6. Check proxy processes/filtering rules/NAT rules
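
Steps 3 and 5 are easy to turn into a quick first look at the proxy machine. A minimal sketch, assuming a Unix-like proxy host (os.getloadavg is not available on Windows); the 90% disk and 2.0 load-per-CPU limits are arbitrary values picked for the example, not recommendations.

```python
import os
import shutil


def disk_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_per_cpu():
    """1-minute load average divided by the CPU count (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    return load1 / (os.cpu_count() or 1)


def evaluate(disk_fraction, load, disk_limit=0.90, load_limit=2.0):
    """Return a list of findings; an empty list means nothing obvious."""
    findings = []
    if disk_fraction >= disk_limit:
        findings.append(f"disk {disk_fraction:.0%} full")
    if load >= load_limit:
        findings.append(f"load {load:.2f} per CPU")
    return findings


if __name__ == "__main__":
    result = evaluate(disk_used_fraction("/"), load_per_cpu())
    print(result or "no obvious resource problem")
```

An empty result only rules out the cheap explanations; the logs and the proxy's own rules (steps 4 and 6) still need a human.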

But as I said, after step 6 you are pretty much dealing with a non-standard problem, and that's when a sysadmin's skills shine.


I've never managed sysadmins but I am one, and many times I've had to deal with this-is-not-a-drill situations affecting hundreds of servers and losing thousands of dollars a minute. In my experience, nothing can replace an in-depth and intuitive (i.e., coming from real understanding and experience) knowledge of the entire flow-chart (so to speak) of what happens from browser to web server and back, and then specifically what happens in a particular web application from the time a request comes in to when a response goes out.

If you find your sysadmin can't give you the entire flow, generally, from browser to server and back, after training, I'd suggest he or she is not worth keeping in a sysadmin capacity.

If I were giving this "fire drill", I'd probably leave it free-form, give a time limit, and have the sysadmin write down his/her thought process and what he/she would check from top to bottom. You can't expect perfection there, but it would be a good start to find gaps in intuitive knowledge.

Also, don't let sysadmins put themselves in a box. To say, "That's the database; the DBA should troubleshoot that while I troubleshoot other things," for instance, lets a sysadmin get away with not intuitively knowing the flow of an application from start to finish and, thus, not understanding it completely. At the very least, a sysadmin should be able to eliminate most or all other possibilities and, when his/her knowledge is exhausted, know exactly whom to call for help. (Knowing when and whom to call for help is an indispensable skill of its own.)