Is there a reasonable/safe level of CPU load in the context of high performance computing?

I understand the meaning of load average for servers in general, but do not know what to expect for servers built and used for high performance computing.

Is the usual convention of load <= # of cores applicable in this environment?


I'm curious given my system-specific details, where load >> # of cores is typical for each node:

  • 24 physical cores, hyperthreading for 48 virtual cores (relatively new hardware)
  • load averages: typically 100-300

The nodes have high uptime and are usually under high CPU usage/load. Hardware failures are rare, especially for CPUs, but I don't know what to expect over the lifetime of the nodes given the sustained load.

Example top output:

top - 14:12:53 up 4 days,  5:45,  1 user,  load average: 313.33, 418.36, 522.87
Tasks: 501 total,   5 running, 496 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.5 us, 50.9 sy,  0.0 ni, 15.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 19650371+total, 46456320 free, 43582952 used, 10646443+buff/cache
KiB Swap: 13421772+total, 78065520 free, 56152200 used. 15164291+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 85642 user  20   0   36.5g   7.6g 245376 S  1566  4.0   1063:21 python
 97440 user  20   0   33.1g   5.3g  47460 S  1105  2.8 512:10.86 python
 97297 user  20   0   31.0g   4.0g  69828 S 986.4  2.1 430:16.32 python
181854 user  20   0   19.3g   5.0g  19944 R 100.0  2.7   2823:09 python
...

Output of iostat -x 5 3 on the same server:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          50.48    0.00   12.06    0.38    0.00   37.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda             350.41   705.68   58.12   22.24  2126.25  3393.61   137.36     6.02   74.93    9.10  246.94   1.19   9.56
dm-0              0.00     0.00    4.87    8.70   511.41   516.65   151.59     0.31   22.55   28.40   19.28   2.62   3.56
dm-1              0.00     0.00  403.67  719.23  1614.71  2876.92     8.00     8.83    7.10    7.38    6.95   0.08   9.05
dm-2              0.00     0.00    0.00    0.00     0.02     0.01    65.03     0.00    3.74    3.82    1.00   2.12   0.00
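
For context, here is roughly how I sanity-check load per core on each node. This is a minimal sketch, assuming Linux's /proc/loadavg and counting the 48 hyperthreaded cores reported by os.cpu_count():

# load_check.py: compare the 1/5/15-minute load averages to the core count
# (minimal sketch; assumes Linux and counts hyperthreads as cores)
import os

def load_per_core():
    with open("/proc/loadavg") as f:
        one, five, fifteen = (float(x) for x in f.read().split()[:3])
    cores = os.cpu_count()  # 48 here: 24 physical cores with hyperthreading
    return {"1min": one / cores, "5min": five / cores, "15min": fifteen / cores}

if __name__ == "__main__":
    for window, ratio in load_per_core().items():
        print(f"{window}: {ratio:.2f} tasks per core")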


Solution 1:

Load average shows the queue of threads ready to run. On Linux, it also includes threads in uninterruptible sleep, typically waiting on disk or NFS I/O. A broken NFS server, for example, can drive the load average to insane numbers even though the CPUs are not hogged at all.
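
To see how much of the load is I/O wait rather than CPU demand, you can count tasks in uninterruptible sleep (state D) versus runnable (state R). This is a minimal sketch assuming Linux procfs; it looks at per-process state only, so individual D-state threads inside a multi-threaded process are not broken out:

# dstate.py: split the "load" into runnable (R) and uninterruptible-sleep (D) tasks
# (sketch only; D-state tasks, e.g. stuck on a dead NFS mount, count toward
# load average without using any CPU)
import os

def task_states():
    counts = {"R": 0, "D": 0}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                # state is the first field after the "(comm)" column
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue  # process exited while we were scanning
        if state in counts:
            counts[state] += 1
    return counts

if __name__ == "__main__":
    c = task_states()
    print(f"runnable: {c['R']}, uninterruptible sleep (D state): {c['D']}")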

So the load average shows just one side of the story and cannot be taken alone; that's why I asked for the top output.

Some workloads are not parallelizable at all, which means every step runs on the same core, one after another. Real problems are usually only partially parallelizable.
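
As a rough illustration of how the serial fraction limits what extra cores buy you, here is an Amdahl's-law sketch (the parallel fractions are made-up example values, not measurements of your workload):

# amdahl.py: rough Amdahl's-law estimate of speedup for a partially parallel job
# p is the parallel fraction of the work, n is the number of cores
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for p in (0.50, 0.90, 0.99):
        print(f"parallel fraction {p:.0%}: speedup on 48 cores = {speedup(p, 48):.1f}x")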

In performance work you have goals and constraints: low latency, throughput, cost (both the initial cost and the operational ones), and so on.

If you are optimizing for throughput and low cost, a long run queue can be perfectly normal: all of your CPU cores will be at 100% usage all the time.

Solution 2:

Load average is merely a symptom, a useful metric that is easy for the operating system to report. A doctor can't diagnose what's wrong with a human patient from a fever alone; they have a lot more questions about what's going on. Similarly with computer patients, you need a lot more context on how they are performing.

Load average can vary quite a bit between systems. Some platforms do not count tasks waiting on I/O in the load average, which differs from how Linux does it. Some hosts can run with a load average per core in the dozens and not fall over. Other applications are very latency sensitive, and a load per core greater than one shows up as poor user response time.

In addition to OS-level metrics, collect application-specific performance benchmarks and trend them over time. Generic examples (a minimal trending sketch follows the list):

  • How many operations per CPU core is an HPC system doing?
  • Is user request response time acceptable?
  • How many queries per second on a database?
  • Is the system keeping up with typical batch processing?
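
As a sketch of the trending idea, here is a minimal example that appends one application-level throughput sample to a CSV; jobs_completed and interval_s are hypothetical placeholders for whatever your scheduler or application actually reports:

# trend_throughput.py: append an application-level throughput sample to a CSV
# for trending over time (illustrative only; the numbers below are placeholders)
import csv, os, time

def record_sample(path, jobs_completed, interval_s):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "jobs_per_second"])
        writer.writerow([int(time.time()), jobs_completed / interval_s])

if __name__ == "__main__":
    # e.g. 1200 batch jobs finished in the last 300 seconds
    record_sample("throughput.csv", jobs_completed=1200, interval_s=300)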

Getting some measure of the useful work a system does is necessary to put the OS metrics into context. Your system appears to be doing useful work even at a relatively high load average. Contrast this with a fork bomb, which drives load to unusably high levels but, being a denial-of-service attack, does nothing useful.