GPU server freezes during GPU idling
Solution 1:
You should not need to start a GUI session (or even have one installed!) to change a setting like this. nvidia-settings
should work fine from the framebuffer console, or from a script you write that runs at startup.
Check to be sure:
# nvidia-settings -q GpuPowerMizerMode
Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
'GPUPowerMizerMode' can use the following target types: GPU.
For eight GPUs just write a simple script, something like:
for n in $(seq 0 7); do
    nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
done
and run it at startup in whatever manner you find convenient.
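If you want the startup script to tell you when a GPU did not accept the setting, the loop can be wrapped in a small function. This is only a sketch: the function name apply_powermizer_mode and the NVIDIA_SETTINGS override variable are made up for illustration (they are not part of nvidia-settings), and the override exists only so the loop can be dry-run on a machine without the driver.

```shell
#!/bin/sh
# Sketch: apply one PowerMizer mode to GPUs 0..count-1.
# NVIDIA_SETTINGS is a hypothetical override, not an nvidia-settings feature;
# set it to "echo" to print the arguments instead of changing anything.
apply_powermizer_mode() {
    count=$1   # number of GPUs, e.g. 8
    mode=$2    # one of the valid values reported by the query, e.g. 1
    n=0
    rc=0
    while [ "$n" -lt "$count" ]; do
        # Same assignment as in the loop above, one GPU at a time.
        if ! "${NVIDIA_SETTINGS:-nvidia-settings}" -a "[gpu:$n]/GpuPowerMizerMode=$mode"; then
            echo "failed to set GpuPowerMizerMode on gpu:$n" >&2
            rc=1
        fi
        n=$((n + 1))
    done
    # Non-zero if any GPU could not be set.
    return "$rc"
}
```

For eight GPUs you would call apply_powermizer_mode 8 1 at the end of the script. Running it once with NVIDIA_SETTINGS=echo first shows exactly which commands would be issued.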
I can't say whether your crashes are caused by running with GpuPowerMizerMode set to something other than 1. If they are, you probably have defective hardware that you should track down and replace.