High CPU utilization but low load average

On Linux at least, the load average and CPU utilization are actually two different things. Load average is a measurement of how many tasks are waiting in a kernel run queue (not just CPU time but also disk activity) over a period of time. CPU utilization is a measure of how busy the CPU is right now. The most load that a single CPU thread pegged at 100% for one minute can "contribute" to the 1 minute load average is 1. A 4 core CPU with hyperthreading (8 virtual cores) all at 100% for 1 minute would contribute 8 to the 1 minute load average.

Often times these two numbers have patterns that correlate to each other, but you can't think of them as the same. You can have a high load with nearly 0% CPU utilization (such as when you have a lot of IO data stuck in a wait state) and you can have a load of 1 and 100% CPU, when you have a single threaded process running full tilt. Also for short periods of time you can see the CPU at close to 100% but the load is still below 1 because the average metrics haven't "caught up" yet.

I've seen a server have a load of over 15,000 (yes really that's not a typo) and a CPU % of close to 0%. It happened because a Samba share was having issues and lots and lots of clients started getting stuck in an IO wait state. Chances are if you are seeing a regular high load number with no corresponding CPU activity, you are having a storage problem of some kind. On virtual machines this can also mean that there are other VMs heavily competing for storage resources on the same VM host.

High load is also not necessarily a bad thing, most of the time it just means the system is being utilized to it's fullest capacity or maybe is beyond it's capability to keep up (if the load number is higher than the number of processor cores). At a place I used to be a sysadmin, they had someone who watched the load average on their primary system closer than Nagios did. When the load was high, they would call me 24/7 faster than you could say SMTP. Most of the time nothing was actually wrong, but they associated the load number with something being wrong and watched it like a hawk. After checking, my response was usually that the system was just doing it's job. Of course this was the same place where the load got up over 15000 (not the same server though) so sometimes it does mean something is wrong. You have to consider the purpose of your system. If it's a workhorse, then expect the load to be naturally high.


Load is a very deceptive number. Take it with a grain of salt.

If you spawn many tasks in very quick succession which complete very quickly, the number of processes in the run queue is too small to register the load for them (the kernel counts load once every five seconds).

Consider this example, on my host which has 8 logical cores, this python script will register a large CPU usage in top (about 85%), yet hardly any load.

import os, sys

while True:
  for j in range(8):
    parent = os.fork()
    if not parent:
      n = 0
      for i in range(10000):
        n += 1
      sys.exit(0)
  for j in range(8):
    os.wait()

Another implementation, this one avoids wait in groups of 8 (which would skew the test). Here the parent always attempts to keep the number of children at the number of active CPUs such it will be much busier than the first method and hopefully more accurate.

/* Compile with flags -O0 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <err.h>
#include <errno.h>

#include <sys/signal.h>
#include <sys/types.h>
#include <sys/wait.h>

#define ITERATIONS 50000

int maxchild = 0;
volatile int numspawned = 0;

void childhandle(
    int signal)
{
  int stat;
  /* Handle all exited children, until none are left to handle */
  while (waitpid(-1, &stat, WNOHANG) > 0) {
    numspawned--;
  }
}

/* Stupid task for our children to do */
void do_task(
    void)
{
  int i,j;
  for (i=0; i < ITERATIONS; i++)
    j++;
  exit(0);
}

int main() {
  pid_t pid;

  struct sigaction act;
  sigset_t sigs, old;

  maxchild = sysconf(_SC_NPROCESSORS_ONLN);

  /* Setup child handler */
  memset(&act, 0, sizeof(act));
  act.sa_handler = childhandle;
  if (sigaction(SIGCHLD, &act, NULL) < 0)
    err(EXIT_FAILURE, "sigaction");

  /* Defer the sigchild signal */
  sigemptyset(&sigs);
  sigaddset(&sigs, SIGCHLD);
  if (sigprocmask(SIG_BLOCK, &sigs, &old) < 0)
    err(EXIT_FAILURE, "sigprocmask");

  /* Create processes, where our maxchild value is not met */
  while (1) {
    while (numspawned < maxchild) {
      pid = fork();
      if (pid < 0)
        err(EXIT_FAILURE, "fork");

      else if (pid == 0) /* child process */
        do_task();
      else               /* parent */
        numspawned++;
    }
    /* Atomically unblocks signal, handler then picks it up, reblocks on finish */
    if (sigsuspend(&old) < 0 && errno != EINTR)
      err(EXIT_FAILURE, "sigsuspend");
  }
}

The reason for this behaviour is the algorithm spends more time creating child processes than it does running the actual task (counting to 10000). Tasks not yet created cannot count towards the 'runnable' state, yet will take up %sys on CPU time as they are spawned.

So, the answer could really be in your case that whatever work is being done spawns large numbers of tasks in quick succession (threads, or processes).