How do I troubleshoot when I have no clue where to start?
I am looking for hints, tips and answers on how to get started on troubleshooting when:
- The problem is intermittent
- The problem could lie literally anywhere - operating system; free source software; my own software developments; purchased software; crumbs on the keyboard; the specific combination of software I am currently running; Maxwell's demon; the little blue men actually running the machine have gone on strike; etc.
- I have expertise only in a few of the areas that are potential candidates for the cause of the problem.
The specific problem I am having is detailed below as an example, but I am not seeking answers to my current problem, but rather where and how to start on tackling such problems.
I am currently encountering a problem with my new machine. On a few occasions the machine has just frozen; not accepting keystrokes, mouseclicks, or anything except the power on/off switch. Invariably I have been merely browsing the web; I have had a few (<= 6 other applications) running. None of these applications are major; and represent a mix of commercial programs and open source programs, typically migrated from Unix of some variety.
My machine is a Windows 7 I7 quad core laptop.
EDIT:
Although I stated that the actual problem description was only an example, some of the comments are concentrating on solving this problem. Unfortunately, as it was only an example, the information given is correct but not complete. To avoid having people wasting their time on trying, remotely, to aid with the actual problem, I am giving some other information about my setup. As I originally said, I am not seeking answers to this specific problem.
My machine is a high powered laptop; is my main machine; is used for development and technical writing, communications - email, web, FTP, etc, and for photo editing and indexing. A rigorous and extensive suite of hardware test programs, including CPU tests, multiple memory tests, and tests on all other components, is run on it at least monthly. Also run at least monthly are a full virus scan; a full spyware scan; a disk cleanup; and a disk defragmentation.
The disk contains approximately 3*10^6 files; disk usage is 300 GB leaving 150 GB free. Memory is 8 GB. While the machine can get slightly warm when I am running a full complement of major development tools, I have encountered the problem only when using the machine very lightly - web browsing plus Textpad plus Graphviz plus a Firebird database plus a lightweight database browser (Flame Robin). In these circumstances even the fan is not slightly warm. I have made no changes to software, operating system or hardware over the period I have encountered the problem. A number of automatic updates have occurred - Microsoft, Adobe, and Lenovo mostly but not exclusively.
This background puts into context (I hope) my reasons for asking this question the way I did. I am now going to start investigating the various logs mentioned in the answers as a first step in trying to narrow the field of investigation. And I am going to try to exercise one of the characteristics suggested in the answers I have received so far - patience - in my investigation.
Get a better idea.
You ain't going to win a battle without sufficient field information.
Describe your problem in detail so that you have a good idea of it; who knows, maybe it only happens once.
Track back in time what happened before and together with the problem, both what you did and what your computer did.
Think of the possible causes, because sometimes the cause is not obvious.
Get more information whenever you have no idea of what's happening. This could range from Events, to SysInternals Tools, to Performance Analysis, to Debugging, to any other tool in your expertise.
Test your assumptions to be sure that your thoughts don't filter the cause away.
Divide and conquer.
Because that's how armies defeat their opponents even when outnumbered.
Eliminate the possible causes one by one, or you'll have a problem keeping track of the problem. This way you get closer and closer to the root cause, which makes the problem a lot easier to solve.
For example, with hardware, disconnect and remove anything that you don't need for fixing your problem. This way, you might disconnect the component causing the problem. Then it's a matter of putting half the components back in, checking whether the problem recurs, and splitting again until you have found the bad component...
Testing something on another computer, if available, also helps towards solving the problem.
For example, with software, rebooting into safe mode or disabling start-up entries also helps. The same applies to enabling/disabling settings, trying the default configuration and so on...
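The halving described above can be sketched in a few lines of Python. This is only an illustration, not a tool: the function name `find_culprit`, the `fails_with` callback and the list of parts are all made up for the example, and it assumes there is exactly one culprit and that you can reliably tell whether the problem occurs with a given set of components enabled.

```python
from typing import Callable, List, Sequence


def find_culprit(components: Sequence[str],
                 fails_with: Callable[[List[str]], bool]) -> str:
    """Narrow a non-empty list of components down to the one causing the problem.

    Assumption: exactly one component is at fault, and fails_with(enabled)
    returns True when the problem shows up with only `enabled` in use.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails_with(half):
            # The problem still occurs, so the culprit is in this half.
            candidates = half
        else:
            # The problem is gone, so the culprit is in the other half.
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical parts list; the lambda stands in for "plug these in and watch".
parts = ["usb hub", "external disk", "bluetooth mouse", "docking station", "webcam"]
culprit = find_culprit(parts, lambda enabled: "bluetooth mouse" in enabled)
print(culprit)
```

With an intermittent problem the check itself is the hard part: each round has to run long enough that "it did not freeze" actually means something, which is where the patience comes in.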
Let's put it to the test.
I am currently encountering a problem with my new machine. On a few occasions the machine has just frozen; not accepting keystrokes, mouseclicks, or anything except the power on/off switch. Invariably I have been merely browsing the web; I have had a few (<= 6 other applications) running. None of these applications are major; and represent a mix of commercial programs and open source programs, typically migrated from Unix of some variety.
That's a proper description by itself, it doesn't just happen once either.
-
You know what happened together with the problem,
but haven't thought of things you or your computer did before the problem. I can't tell this, but you, your event log and recently modified files/folders could tell.
-
The possible cause is most likely CPU related, because it's the component that processes things.
More specifically, this could be a process, a driver or failing hardware (perhaps temperature problems?).
-
I know it's CPU, but don't know what. Events don't show this, Process Explorer would hang on DPC.
So, as a next step, I let a trace analysis run, which I stop after the hang has occurred.
I look into the trace, and I see that driver X is causing the problem!
No real assumptions are made. The CPU assumption is handled by our Divide & Conquer approach...
So, this is where I start dividing to conquer the problem, I stop once solved:
- Problem with current version of the driver? Update the driver to the latest version.
- Problem with newest versions of the driver? Get a new trace. Update the driver to an older version different from the initial.
- Problem with the device? Configuration problem in the registry? Get a new trace. Reinstall and/or disable the device if possible.
- Problem is random, is it the processor heating up? Check the processor temperature, replace fan if needed.
- Problem is not the processor, are there other hardware and software influences? Remove hardware and disable software from running, to nail down third-party influence.
- Problem is not in a removable part, it should be replaced. In the worst case, if all else fails, you need to go for a replacement.
Getting new traces and removing hardware gives us more information, so we know where to look next.
Good logs and intuition - really.
- From day 1, keep track of everything you do to the system: app & OS updates, new installs, new or removed hardware or connections, the thunderstorm that "didn't cause a problem".
- When you first noticed the issue:
- What had you been doing?
- What else unusual happened recently?
- What have you done differently recently?
- From then on, keep aware of what you're doing so the next time it happens, you have a better handle on what had just preceded it.
- Snapshot the system logs.
- See if you can reproduce it. Until you can reproduce it, you can't find it.
- Start partitioning the system: safe mode vs. running live, new account vs. your regular account, different keyboard and mouse than your regular ones (esp. bluetooth vs. wired), does it happen within a few minutes of starting or waking vs. only after an hour more of running (think thermal).
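Keeping track of what you do to the system does not need anything fancy. As a rough sketch, assuming a plain text file is good enough, a tiny Python helper can append timestamped notes to a journal. The name `log_change`, the file name and the notes themselves are made up for the example, and the demo writes to a temporary folder only so that it leaves nothing behind.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def log_change(journal: Path, note: str, when: Optional[datetime] = None) -> str:
    """Append one timestamped line to the journal file and return that line."""
    when = when or datetime.now()
    entry = f"{when:%Y-%m-%d %H:%M:%S}  {note}\n"
    # Append mode creates the file on first use and never overwrites old notes.
    with open(journal, "a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


# Demo with made-up notes, written to a throwaway folder.
demo_journal = Path(tempfile.mkdtemp()) / "system-journal.txt"
log_change(demo_journal, "automatic update installed, rebooted")
log_change(demo_journal, "thunderstorm, lights flickered, no visible problem")
print(demo_journal.read_text(encoding="utf-8"))
```

The point is only that every entry carries a time, so that when the problem shows up you can line the journal up against the system logs.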
I usually start with the event logs and any logs that a program may create on its own. Programs will sometimes create a log in the program folder.
Once you can identify the time, search the logs for events. Naturally, the Windows logs may contain Stop errors, which will be easy to identify.
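For program logs that are plain text, searching around the time of the freeze can be as simple as the sketch below. It assumes, purely for illustration, that every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; real logs vary, so the format string would need adjusting. The function name `events_before` and the sample lines are made up and are not real Windows events.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

# Assumed timestamp layout at the start of each line (19 characters).
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def events_before(lines: Iterable[str], freeze_time: datetime,
                  window_minutes: int = 10) -> List[str]:
    """Return the lines stamped within window_minutes before freeze_time."""
    start = freeze_time - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], STAMP_FORMAT)
        except ValueError:
            continue  # Not a timestamped line, skip it.
        if start <= stamp <= freeze_time:
            hits.append(line)
    return hits


# Made-up sample lines, only to show the filtering.
sample = [
    "2011-03-01 13:40:02 some service started",
    "2011-03-01 13:55:17 an update finished installing",
    "2011-03-01 14:01:48 a driver reported a warning",
    "not a log line at all",
    "2011-03-01 14:20:00 system booted",
]
freeze = datetime(2011, 3, 1, 14, 2, 0)
recent = events_before(sample, freeze, window_minutes=10)
for line in recent:
    print(line)
```

Doing this for each freeze and comparing the results is one way to spot whatever keeps turning up shortly before the hang.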
Check all drivers and make sure they are current.
Patience will likely be required in large doses.
In addition to all the good advice already given, if log files aren't giving you a lot to go on, a proper memory test of the machine is often worthwhile - faulty memory can cause all sorts of strange intermittent freezes and crashes. The built-in power-on memory test is much more akin to a memory count; it's extremely rare for it to catch a memory fault.
Google for Windows Memory Diagnostic and burn it to a CD. It's old but it's one of the better memory tests, and it's free.