Black screen, nvidia-modeset : ERROR: GPU:0 Idling display engine time out
How do I troubleshoot/fix this issue?
With an NVIDIA GPU installed, the system shows the motherboard firmware logo and the GRUB screen. After an Ubuntu version is selected (either 18.04 or 20.04), the Ubuntu login screen never appears. Instead, a black screen shows the message nvidia-modeset : ERROR: GPU:0 Idling display engine time out
three times (see attached photo), followed by a completely black screen while the GPU fan ramps up to full speed, stays there, and the card becomes very hot. I had to press the power button to shut down the system.
This GPU had worked well prior to this incident. The Ubuntu system boots properly when the GPU is removed and the monitor is plugged into the Intel CPU's integrated graphics output. The iGPU is disabled.
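Since the system still boots on the integrated graphics, one way to confirm what the photo shows is to search the kernel log of the failed boot for the nvidia-modeset message. The following is only a minimal sketch: the function name `count_modeset_errors` is made up for illustration, the journalctl usage assumes the systemd journal is persistent on this install, and the kern.log path assumes the default Ubuntu rsyslog setup.

```shell
#!/bin/sh
# Count nvidia-modeset error lines in kernel log text read from stdin.
# (count_modeset_errors is a hypothetical helper name, not an NVIDIA tool.)
count_modeset_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    grep -c -i 'nvidia-modeset.*ERROR' || true
}

# Typical use, assuming a persistent journal (-b -1 selects the previous boot):
#   journalctl -k -b -1 | count_modeset_errors
# Or, assuming the kernel log is written to the usual Ubuntu location:
#   count_modeset_errors < /var/log/kern.log
```

A non-zero count for the failed boot would at least confirm that the error comes from the kernel module rather than from the display manager; it does not by itself tell a driver problem apart from a hardware fault.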
Installed nvidia packages:
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470 470.57.02-0ubuntu0.18.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVIDIA libcompute package
ii libnvidia-decode-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVENC Video Encoding runtime library
ii libnvidia-extra-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470:amd64 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii libnvidia-ifr1-470:i386 470.57.02-0ubuntu0.18.04.1 i386 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nvidia-compute-utils-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-dkms-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-470 470.57.02-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 470.57.01-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-470 470.57.02-0ubuntu0.18.04.1 amd64 NVIDIA binary Xorg driver
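A long listing like this is easier to check for mixed driver versions when it is tallied by version string. The sketch below is one way to do that; `nvidia_version_tally` is a hypothetical helper name, and it only assumes the standard `dpkg -l` column order (status, name, version). In the listing above, every package is at 470.57.02 except nvidia-settings (470.57.01) and nvidia-prime (0.8.16), which appear to be versioned independently, so that difference is not necessarily a problem.

```shell
#!/bin/sh
# Tally the version strings of installed ("ii") packages whose name contains
# "nvidia", reading dpkg -l style text from stdin.
# (nvidia_version_tally is a hypothetical helper name.)
nvidia_version_tally() {
    # Column 1 is the status, column 2 the package name, column 3 the version.
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $3 }' | sort | uniq -c
}

# Typical use:
#   dpkg -l | nvidia_version_tally
```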
Solution 1:
I had this GPU tested on a Windows system, which was able to display the boot screen, login screen and desktop. However, the display artifacts persisted. I also suspect Windows was able to step down the resolution.
I came across this YouTube video showing the same display artifacts. Using NVIDIA MODS and MATS, its author traced the issue to one of the GPU's VRAM chips. Replacing that VRAM chip fixed the display issue.
As this GPU has been well maintained, I wondered if the display fault was due to faulty interconnects. I came across another YouTube video showing that reheating the GPU board with a heat gun for 6 to 8 minutes had a 10% success rate of fixing the card. Its author recommended this treatment as a last resort. I heated the GPU side of the card with a heat gun for around 4 minutes, then flipped the card over and heated the other side for another 2 minutes or so. After the card cooled down, I tested it and found that its functionality was restored: the reheating procedure fixed the card. Earlier, the card had been cleaned but not heat treated, and cleaning alone did not fix it.