Performing root-cause analysis
I want to learn more about how to perform root-cause analysis. More often than not, our department tells the user to try rebooting (their Windows XP system), which actually "fixes" a good number of problems. When I am in a hurry (and sometimes getting paid hourly contributes to this), I might look for a workaround to get the problem solved quickly instead of actually performing root-cause analysis.
Most of the time I am looking in log files or the Event Viewer for this information. Sometimes I will use the Sysinternals tools or occasionally run a packet sniffer. I probably don't use the Sysinternals programs as much as I should. Some specific insight on which of these tools you use, when, and why would also be helpful.
I know this is a wide open question but could you please briefly explain your methodology, tools, etc. that you use? It looks like a lot of admins on SF use a more in-depth process which I would like to learn more about. If this helps narrow down the question any, I would be most interested in tools, tips, tricks, etc. relevant to Windows servers & clients within an AD environment.
Solution 1:
Figuring out the root cause of a problem depends on the problem -- Your initial instinct to look at log files/sysinternals tools/packet sniffers is generally correct.
I would add running the MS Malicious Software Removal Tool and a good AV program on Windows systems (and ensuring that they don't have something like CyberDefender or other AV-trojan-malware installed).
The folks at Stack Exchange are proponents of the "5 Whys" method (http://en.wikipedia.org/wiki/5_Whys, also this nice short PDF that shows it in action). It is a pretty valuable tool for doing root cause analysis.
Beyond that I'll paint two broad categories and some of the questions I usually ask/things I check:
Mysterious behavior not related to the network
e.g. "Word keeps crashing on me"
Basic questions to ask:
- What Changed?
(Don't take "nothing" for an answer -- it is the first lie. New software, patches, etc. all count.)
- What were you doing when you had the problem?
(Try to extract as much detail as possible here -- in my example above "I hit the hotkey for insert initials and the program crashed")
- Did it ever work before?
(If so, start looking at stuff from (1) above)
- Can you reproduce the problem on your system?
(If so that's a good sign: A tech support call to the vendor may help. If not you'll need to look at the user's system for the rest of these questions.)
- What's different about the user's environment compared to your environment?
- Is the user's hardware suspect? (Run a memory test, look for SMART errors from the hard drive, etc.)
- If you've gotten this far (hardware checks out, software checks out, no viruses, no malware) go visit the user for a day. Observe their work habits.
My company once had a mysterious system lock-up that was related to clicking the mouse at a specific frequency. (We still don't know why, but we had to watch a user doing it and practice for a day in order to be able to reproduce it reliably.)
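The "What Changed?" question is easier to answer when you have something to compare against. Here is a minimal sketch of that idea, assuming you can get two snapshots of installed software as name-to-version mappings (how you collect them is up to whatever inventory tool you use; the names and versions below are made up for illustration):

```python
def diff_inventory(before, after):
    """Compare two {package_name: version} snapshots and report what changed."""
    added = {name: ver for name, ver in after.items() if name not in before}
    removed = {name: ver for name, ver in before.items() if name not in after}
    changed = {
        name: (before[name], after[name])
        for name in before.keys() & after.keys()
        if before[name] != after[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots, not real data.
last_week = {"Office": "11.0.5614", "Acrobat Reader": "8.1.2", "VPN Client": "4.8"}
today = {"Office": "11.0.8173", "Acrobat Reader": "8.1.2", "Toolbar Helper": "1.0"}

report = diff_inventory(last_week, today)
for kind, items in report.items():
    print(kind, items)
```

Anything that shows up under "added", "removed" or "changed" is a candidate answer when the user insists that nothing changed.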
Problems related to the network
A lot of this is similar, but with some more specific guidance.
- What Changed?
(Yeah, you always start there)
- What is broken?
- Can you reach web pages? Is it just one that's down? If so, is it down for everyone or just you?
- Can you ping stuff on the internet by name?
How about by IP? How far does the traceroute get?
- When is it broken?
- Always the same time of the day?
- For a brief period every N days?
- Randomly (is it REALLY random? Plot it on a calendar...)
- Is there something odd about the remote site?
- Look at DNS - If it's round-robin'd there could be remote-side breakage
- Are we talking about the other end of a VPN? What's up with the VPN (logs!)?
- Is there something odd about the local site?
- Check your local firewall
- Check any "filtering software"
- Check with your ISP to see if there are any known issues
- Check sites like http://www.internetpulse.net/ for known network-wide issues
- Check out the user's machine
(TCP settings, etc. - Usually not the problem, but sometimes.)
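For the "When is it broken?" questions above, plotting on a calendar can be approximated by counting outages per weekday and per hour. This is only a rough sketch: the timestamps are hypothetical, and in practice you would pull them from your logs or ticket system.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def outage_patterns(timestamps):
    """Count outages per weekday name and per hour of day."""
    by_weekday = Counter(DAYS[ts.weekday()] for ts in timestamps)
    by_hour = Counter(ts.hour for ts in timestamps)
    return by_weekday, by_hour


# Hypothetical outage times, made up for illustration.
outages = [
    datetime(2010, 3, 1, 2, 5),
    datetime(2010, 3, 8, 2, 10),
    datetime(2010, 3, 15, 2, 3),
    datetime(2010, 3, 17, 14, 40),
]

by_weekday, by_hour = outage_patterns(outages)
print(by_weekday.most_common())
print(by_hour.most_common())
```

If one weekday or hour dominates the counts, the problem is probably not random, and a scheduled job (backup, scan, replication) becomes a good suspect.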
Solution 2:
In addition to the excellent responses so far, I would add:
Identify the date/time of issue onset. This may seem obvious, but I have seen far too many issues where this was not documented and later on incorrect assumptions were made. This correlates well with the "what changed" step.
Is the issue reproducible or intermittent? This is critical, as reproducible symptoms are far easier and quicker to resolve than those that are intermittent. If it is reproducible, ensure the steps are documented.
Identify the symptom(s). Note that we distinguish between "symptom", which is a manifestation of the root cause, and the actual problem/root cause.
- Are there any other activities that can reproduce the symptom?
- What other symptoms are there?
- If the issue is intermittent, can we identify an activity that will cause it to occur?
- Under what circumstances can we prevent the symptom from occurring? Does the issue occur only when logged on using a network account, but work ok if logged on locally? Does the issue occur when logged on as a normal user, but work ok if logged on with elevated privileges? Does it occur only on one system, while another system that should be similar does not exhibit the symptom?
Localize the issue to a likely faulty functional component. If there is an error in a web application, is it in the application code, the web server, the operating system hosting the web server, the network, or the remote end? This is a best guess at this point, made so that resources are focused on the likely cause, so ensure that others know that it is theory/conjecture.
Question your assumptions, and try to gather empirical data to support assumptions and conclusions. It's a pretty bad feeling to tell someone that there isn't a problem with x, and then have it discovered later that there actually is. Usually when there is an incorrect solution, there could have been data to support a correct solution.
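The isolation questions above (network vs. local account, normal vs. elevated privileges, one system vs. another) amount to recording each attempt and checking which factor lines up with the symptom. A small sketch of that bookkeeping, assuming hypothetical factor names and made-up observations:

```python
def separating_factors(observations):
    """Return factors whose value alone predicts whether the symptom appears."""
    factors = [key for key in observations[0] if key != "symptom"]
    result = {}
    for factor in factors:
        seen_with = {obs[factor] for obs in observations if obs["symptom"]}
        seen_without = {obs[factor] for obs in observations if not obs["symptom"]}
        # A factor separates the cases only if no value shows up on both sides.
        if seen_with and seen_without and not (seen_with & seen_without):
            result[factor] = sorted(seen_with)
    return result


# Hypothetical test matrix, not real data.
observations = [
    {"account": "domain", "privilege": "user", "machine": "PC-01", "symptom": True},
    {"account": "domain", "privilege": "admin", "machine": "PC-01", "symptom": True},
    {"account": "local", "privilege": "user", "machine": "PC-01", "symptom": False},
    {"account": "local", "privilege": "admin", "machine": "PC-02", "symptom": False},
    {"account": "domain", "privilege": "user", "machine": "PC-02", "symptom": True},
]

print(separating_factors(observations))
```

With these made-up observations only the account type separates the failing and working cases, which would point toward something delivered with the domain logon rather than the machine or the privilege level. It is still only a correlation from a handful of tries, so treat it as a lead to verify and not as a conclusion.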