Any good strategies for dealing with 'not reproducible' bugs? [closed]

Very often you will get or submit bug reports for defects that are 'not reproducible'. They may be reproducible on your computer or software project, but not on a vendor's system. Or the user supplies steps to reproduce, but you can't see the defect locally. Many variations on this scenario of course, so to simplify I guess what I'm trying to learn is:

What is your company's policy towards 'not reproducible' bugs? Shelve them, close them, ignore? I occasionally see intermittent, non reproducible bugs in 3-rd party frameworks, and these are pretty much always closed instantly by the vendor... but they are real bugs.

Have you found any techniques that help in fixing these types of bugs? Usually what I do is get a system info report from the user, and steps to reproduce, then search on keywords, and try to see any sort of pattern.


Solution 1:

  • Verify the steps used to produce the error

Oftentimes the people reporting the error, or the people reproducing the error, will do something wrong and not end up in the same state, even if they think they are. Try to walk it through with the reporting party. I've had a user INSIST that the admin privileges were not appearing correctly. I tried reproducing the error and was unable to. When we walked it through together, it turned out he was logging in as a regular user in that case.

  • Verify the system/environment used to produce the error

I've found many 'irreproducible' bugs and only later discovered that they ARE reproducible on Mac OS (10.4) Running X version of Safari. And this doesn't apply only to browsers and rendering, it can apply to anything; the other applications that are currently being run, whether or not the user is RDP or local, admin or user, etc... Make certain you get your environment as close to theirs as possible before calling it irreproducible.

  • Gather Screenshots and Logs

Once you have verified that the user is doing everything correctly and still getting a bug, and that you're doing exactly what they do, and you are NOT getting the bug, then it's time to see what you can actually do about it. Screenshots and logs are critical. You want to know exactly what it looks like, and exactly what was going on at the time.

It is possible that the logs could contain some information that you can reproduce on your system, and once you can reproduce the exact scenario, you might be able to coax the error out of hiding.

Screenshots also help with this, because you might discover that "X piece has loaded correctly, but it shouldn't have because it is dependent on Y" and that might give you a hint. Even if the user can describe what were doing, a screen shot could help even more.

  • Gather step-by-step description from the user

It's very common to blame the users, and not trust anything that they say (because they call a 'usercontrol' a 'thingy') but even though they might not know the names of what they're seeing, they will still be able to describe some of the behaviour they are seeing. This includes some minor errors that may have occured a few minutes BEFORE the real error occurred, or possibly slowness in certain things that are usually fast. All these things can be clues to help you narrow down which aspect is causing the error on their machine and not yours.

  • Try Alternate Approachs to produce the error

If all else fails, try looking at the section of code that is causing problems, and possibly refactor or use a workaround. If it is possible for you to create a scenario where you start with half the information already there (hopefully in UAT) ask the user to try that approach, and see if the error still occurs. Do you best to create alternate but similar approaches that get the error into a different light so that you can examine it better.

Solution 2:

Short answer: Conduct a detailed code review on the suspected faulty code, with the aim of fixing any theoretical bugs, and adding code to monitor and log any future faults.

Long answer: To give a real-world example from the embedded systems world: we make industrial equipment, containing custom electronics, and embedded software running on it.

A customer reported that a number of devices on a single site were experiencing the same fault at random intervals. Their symptoms were the same in each case, but they couldn't identify an obvious cause.

Obviously our first step was to try and reproduce the fault in the same device in our lab, but we were unable to do this.

So, instead, we circulated the suspected faulty code within the department, to try and get as many ideas and suggestions as possible. We then held a number of code review meetings to discuss these ideas, and determine a theory which: (a) explained the most likely cause of the faults observed in the field; (b) explained why we were unable to reproduce it; and (c) led to improvements we could make to the code to prevent the fault happening in the future.

In addition to the (theoretical) bug fixes, we also added monitoring and logging code, so if the fault were to occur again, we could extract useful data from the device in question.

To the best of my knowledge, this improved software was subsequently deployed on site, and appears to have been successful.

Solution 3:

Error-reporting, log files, and stern demands to "Contact me immediately if this happens again."