Is there a simple, safe way to trigger a GPU lockup on a susceptible computer?

Solution 1:

Excellent question.

Workloads

The /usr/share/xdiagnose/workloads directory has a set of workloads designed to exercise your graphics system to trigger lockups.

$ ls /usr/share/xdiagnose/workloads/
README                       do_monitor_rotation_loop
do_chws_loop*                do_screensaver_loop*
do_cpu_spin_loop             do_video_loop*
do_disk_write_loop           do_vtswitch_loop*
do_glx_loop*                 repro.sh
do_kernel_compile_loop       run_workloads
do_monitor_disable_loop*     youtube-loop.html
do_monitor_resolution_loop*  youtube-reload.html

Note that to run them you need to pass 'run'. E.g.:

$ /usr/share/xdiagnose/workloads/do_glx_loop run

With no args the scripts will display usage. Partly that's for safety (in case people just blindly run the scripts), but mostly it's to keep the scripts' API tidy.

The ones I've starred are probably the best ones to start with. I would start by running just one script at a time and letting it run for a few hours. If your system survives that, try running two or more simultaneously.
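A minimal sketch of that routine, assuming GNU timeout(1) is installed and that the workload scripts stop cleanly when they receive SIGTERM (I haven't verified the latter). The names run_one and run_many are made up for this example and are not part of xdiagnose.

```shell
# Directory from the listing above; override it if your install differs.
WORKLOAD_DIR="${WORKLOAD_DIR:-/usr/share/xdiagnose/workloads}"

# run_one NAME DURATION   (hypothetical helper)
# Runs a single workload with the 'run' argument and stops it after
# DURATION, which can be any value timeout(1) accepts, e.g. 3h.
run_one() {
    timeout "$2" "$WORKLOAD_DIR/$1" run
}

# run_many DURATION NAME...   (hypothetical helper)
# Starts several workloads in the background and waits for all of them.
run_many() {
    duration="$1"
    shift
    for name in "$@"; do
        run_one "$name" "$duration" &
    done
    wait
}

# Examples, commented out so that pasting this block starts nothing:
# run_one do_glx_loop 3h
# run_many 3h do_glx_loop do_vtswitch_loop
```

Putting a time limit on each run means a workload that never triggers the lockup ends by itself, so you can leave it unattended and check the logs afterwards.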

Note that I haven't tested these heavily myself, so I can't promise they're bug-free. But they're short, simple scripts, so they should be easy to fix up, and patches are very welcome.

Also note that they may well trigger lockups unrelated to the one you're trying to solve. GPU lockups generally look identical to the untrained eye, since they have more or less the same symptoms.

Logs

If you're on Intel graphics, the file you want is /sys/kernel/debug/dri/0/i915_error_state. It is a snapshot of the register state at the time of the hang, and the top of it contains some error codes: IPEHR, PGTBL_ER, ESR, and EIR. Match those codes up against other bug reports to see if you have the same or a similar error.
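As a rough sketch, the four registers can be pulled out of a saved copy of the dump with grep. The helper name print_error_registers is made up for this example, and the exact layout of the dump is assumed rather than verified. The match is a plain substring match, so it may also print unrelated lines that happen to contain one of the names.

```shell
# print_error_registers FILE   (hypothetical helper)
# Prints the lines of a saved error-state dump that mention the
# registers named above. Substring match only; the dump layout is
# an assumption, so check the output by eye.
print_error_registers() {
    grep -E 'IPEHR|PGTBL_ER|ESR|EIR' "$1"
}

# Example (debugfs is normally readable only by root):
# sudo cat /sys/kernel/debug/dri/0/i915_error_state > error_state.txt
# print_error_registers error_state.txt
```

Saving the dump to a file first is deliberate: it gives you something to attach to a bug report and to compare against later hangs.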

If you're not on Intel graphics (as in this case you're not), or if no i915_error_state file is being generated, then dmesg and /var/log/kern.log are the places to look. With GPU lockups they sometimes indicate what caused the lockup or where it occurred.
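A quick way to skim those logs is a case-insensitive grep. The search patterns below are only illustrative guesses at how drivers word their hang messages, and scan_for_gpu_hang is a made-up name; adjust the patterns to whatever your driver actually prints.

```shell
# scan_for_gpu_hang FILE   (hypothetical helper)
# Prints log lines that look like GPU hang reports. The patterns are
# assumptions for illustration, not an exhaustive or verified list.
scan_for_gpu_hang() {
    grep -i -E 'gpu (lockup|hang|hung)|hangcheck' "$1"
}

# Examples:
# dmesg > dmesg.txt && scan_for_gpu_hang dmesg.txt
# scan_for_gpu_hang /var/log/kern.log
```

If nothing matches, it is still worth reading the last few minutes of the log before the freeze by hand, since the relevant message may be worded differently.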

The open source -ati driver has radeontool and avivotool, which capture register states. These are primarily for the open source -ati driver, but the tools should also work with -fglrx. I've never seen their output requested for an -fglrx bug, but it certainly can't hurt.

Testing

For all drivers, the next step is usually to start testing either newer or older versions of the driver. For proprietary drivers, you can check the x-updates PPA, but you'll probably have to download and manually install the driver from the vendor website (and mess up your system's packaging in the process). For FOSS drivers like -intel, -nouveau, and -ati, that means testing either newer kernels or newer mesa. We provide packaged builds of newer kernels at http://kernel.ubuntu.com/~kernel-ppa/mainline/. For mesa, there are various PPAs such as xorg-edgers. I'm also in the process of preparing an 8.0.3 update for precise, which we believe fixes a number of lockups for Intel graphics.

In any case, don't just stop when you find a version that works. Try other versions in between your working version and the broken one. If you can narrow the bracket down to two adjacent versions, that is hugely helpful to the developers in isolating which patch caused the regression.
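Narrowing the bracket goes fastest if you always test the version in the middle. Here is a small sketch of that bookkeeping, assuming you pass the versions in order with the known-good one first and the known-bad one last. The function name next_to_test and the version numbers in the example are purely illustrative.

```shell
# next_to_test GOOD [VERSION...] BAD   (hypothetical helper)
# Prints the version halfway between the first (known good) and last
# (known bad) arguments. Returns 1 when the two are already adjacent.
next_to_test() {
    if [ "$#" -lt 3 ]; then
        return 1
    fi
    # Drop the first ($# - 1) / 2 arguments; the midpoint is then $1.
    shift $(( ($# - 1) / 2 ))
    printf '%s\n' "$1"
}

# Example with made-up version numbers:
# next_to_test 3.2 3.3 3.4 3.5 3.6
```

After testing the version it prints, call it again with the shortened list: if that version worked, it becomes the new first argument; if it locked up, it becomes the new last one. When the function returns 1, you have your two adjacent versions.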

Contributing

As you go through the troubleshooting you might spot errors, or might come up with improvements for the scripts or docs. Contributions to any of these are warmly welcomed. With the wiki docs, please do just go ahead and edit! I try to update them at least once a year, but I don't always get around to it, and the next guy to visit the page will certainly appreciate your effort at improving them.

Changes to the scripts themselves are also very welcome. Send me changes however you feel comfortable: as patches, a bzr or git branch, or even just copies of the script. If you plan to make a lot of changes, a bzr branch with a merge proposal is the preferred way; tutorials on how to do this are available at code.launchpad.net, or feel free to catch me on IRC if you have questions.

Or, if you're not ready to dig into coding but would like to flag errors or areas where more functionality is needed, you can file bug reports the usual way (ubuntu-bug xdiagnose).

Quick Fixes

If you're not interested in doing any of the above debugging, here are some random tips:

For proprietary drivers, try uninstalling and purging them completely from your system, then reinstalling from scratch. This unfortunately "solves" a lot of bugs...

For the FOSS drivers, there are various kernel switches you can play around with. For 3D/mesa bugs, there is also driconf to tweak various settings.

Finally

Finally, one request... please don't file bug reports to Launchpad about "random freezes" until you've done at least a little sleuthing of the kind described above. Otherwise, you'd just be adding to the noise.

We do try to fish out well-researched bug reports; we find they give more bang for the buck and are a lot more likely to end up with an actual fix in the distro.