Relation between GPU utilization and graphics card power consumption

I wonder what the relation between GPU utilization and the graphics card's power consumption is.

E.g. in the screenshot below, GPU 2's utilization is 92%, while the power usage is 129 watts out of 250. Why isn't the power usage around 250 * 0.92 = 230 watts?

[Screenshot: GPU 2 at 92% utilization, drawing 129 W out of 250 W]


The load figure tells you how much headroom is left for more of the same kind of computation, not what fraction of the chip's total processing capability that computation is using.

For example, your 92% means that, on average, the GPU was doing something during 920,000 out of every 1 million clock cycles. It doesn't mean that 92% of every circuit in every shader processor was active, let alone 92% of every circuit on the whole board (VRAM controller, DAC, shaders, raster units, branch predictors, texture lookup units, and so on).

If your usage only takes advantage of a few GPU features, you might well run at 100% of the throughput of those features, while leaving half the chip asleep. But the half that's asleep couldn't be used for that type of work at all.
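To make the distinction concrete, here is a toy model in Python. It is only a sketch: the unit names, the per-unit wattages, and the idle power are made-up numbers chosen to add up to a 250 W card, and real hardware does not split its power budget this neatly. It shows how the same 92% time-based utilization can go with very different power draw depending on how many parts of the chip the workload actually exercises.

```python
# Toy model: time-based utilization vs. power draw.
# All unit names and wattages are hypothetical, chosen for illustration only.

def utilization(busy_cycles: int, total_cycles: int) -> float:
    """Fraction of cycles in which the GPU was doing anything at all."""
    return busy_cycles / total_cycles


def average_power(idle_watts, unit_watts, active_units, busy_fraction):
    """Idle power plus the power of the units in use, scaled by busy time."""
    dynamic = sum(unit_watts[name] for name in active_units)
    return idle_watts + busy_fraction * dynamic


# Hypothetical power budget: 30 W idle + 220 W of functional units = 250 W.
IDLE_WATTS = 30.0
UNIT_WATTS = {"alu": 90.0, "memory": 60.0, "texture": 40.0, "raster": 30.0}

busy = utilization(920_000, 1_000_000)

# Workload A keeps only the ALUs busy; workload B exercises every unit.
light = average_power(IDLE_WATTS, UNIT_WATTS, ["alu"], busy)
heavy = average_power(IDLE_WATTS, UNIT_WATTS, list(UNIT_WATTS), busy)

print(f"utilization:        {busy:.0%}")
print(f"ALU-only workload:  {light:.1f} W")
print(f"all units busy:     {heavy:.1f} W")
```

With these assumed numbers, both workloads report 92% utilization, but the ALU-only one draws about 113 W while the one that touches every unit draws about 232 W.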


Usually, it means that your CUDA program is suboptimal. I'm currently optimizing a CUDA program of my own. I wrote several iterations of it, improving the performance each time. Surprisingly, every iteration reported 100% GPU load, yet the power consumption differed between iterations. In the latest iteration, power consumption rose from 40% to 70%, and the program became 7 times faster (!!!) in terms of the wall time it takes to compute what I need.

The GPU mostly stalls on memory operations. I optimized for better caching (i.e. fewer global memory accesses), and the sensor readings changed as follows:

  • GPU load: stayed at 100%
  • Memory controller load: increased from 20% to 25%
  • Power consumption: increased from 40% to 70%
  • Wall time to perform the computation: decreased by a factor of 7

Unfortunately, the source code is proprietary, so I can't give it to you to try yourself. To give you some idea, though, the bottleneck in my program is a loop with one memory read from an array (the ith item), an addition, a multiplication, and an assignment of a float.
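A rough way to see why a loop like that stalls on memory is to count floating-point operations per byte moved. The Python sketch below does that arithmetic. It is not my actual code: the peak compute and bandwidth figures are hypothetical round numbers, and I am assuming the assigned float is written back to global memory, which may not be true of every such loop.

```python
# Back-of-the-envelope arithmetic intensity for a loop that does one float
# read, one add, one multiply, and one float assignment per iteration.
# PEAK_FLOPS and PEAK_BYTES_PER_S are hypothetical, not a real card's specs.

FLOAT_BYTES = 4


def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bytes_per_s):
    """Simple roofline bound: capped by compute or by memory bandwidth."""
    return min(peak_flops, intensity * peak_bytes_per_s)


PEAK_FLOPS = 10e12        # assumed: 10 TFLOP/s
PEAK_BYTES_PER_S = 500e9  # assumed: 500 GB/s

# 2 flops per iteration; assumed traffic is one float read + one float write.
intensity = arithmetic_intensity(flops=2, bytes_moved=2 * FLOAT_BYTES)
bound = attainable_flops(intensity, PEAK_FLOPS, PEAK_BYTES_PER_S)
fraction = bound / PEAK_FLOPS

print(f"arithmetic intensity: {intensity} flop/byte")
print(f"attainable:           {bound / 1e9:.0f} GFLOP/s")
print(f"fraction of peak:     {fraction:.2%}")
```

Under these assumptions the loop can reach only a small fraction of the peak arithmetic rate while the GPU still counts as busy. Serving more of the reads from cache lowers the global memory traffic per iteration, which raises the intensity and lets the arithmetic units do more work (and draw more power).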