How can I choose a max load threshold depending on the number of available cores?

I have a pipeline that runs some computationally intensive tasks on a Linux machine. The script that launches these checks the current load average and, if it is above a certain threshold, waits until the load falls below it. This is on an Ubuntu virtual machine (running on an Ubuntu host, if that's relevant) which can have a variable number of cores assigned to it. Both our development and production machines are VMs running on the same physical server, and we manually allocate cores to each as needed.

I have noticed that even when the VM has as few as 20 cores, a load of ~60 isn't bringing the machine to its knees. My understanding of how the Linux load average works was that anything above the number of CPUs indicates a problem, but apparently things are not quite as clear-cut as that.

I'm thinking of setting the threshold at something like $(grep -c processor /proc/cpuinfo) x N where N>=1. Is there any clever way of determining the value N should take so as to both maximize performance and minimize lag?
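
For reference, the gate is roughly the sketch below, where N is the multiplier I am asking about and run_tasks.sh stands in for the real pipeline entry point:

    #!/bin/bash
    # Wait until the 1-minute load average is below CORES * N, then launch the pipeline.
    N=1
    CORES=$(grep -c processor /proc/cpuinfo)    # or simply: nproc
    THRESHOLD=$(( CORES * N ))

    # First field of /proc/loadavg is the 1-minute average.
    while read -r one _ < /proc/loadavg; do
        # Compare as floats via awk, since load averages are fractional.
        awk -v l="$one" -v t="$THRESHOLD" 'BEGIN { exit !(l < t) }' && break
        sleep 30
    done

    ./run_tasks.sh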

In other words, how can I know what maximum load average a machine can support before performance starts to degrade? I had naively expected that to be the number of CPUs (so, N=1), but that doesn't seem to hold up. Since the number of cores can vary, testing the possible combinations is both complicated and time-consuming and, since this is a machine used by various people, impractical.

So, how can I decide on an acceptable maximum load-average threshold as a function of the number of available cores?


Solution 1:

Load is a very often misunderstood value on Linux.

On Linux, it is a count of all tasks that are in the running or uninterruptible-sleep state.

Note this is tasks, not processes. Threads are included in this value.

Load is sampled by the kernel every five seconds, and each reported figure is an exponentially weighted moving average: the one-minute figure decays with a factor of e^(-5/60), the five-minute figure with e^(-5/300) and the fifteen-minute figure with e^(-5/900).
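
As a sketch of the arithmetic (ignoring the kernel's fixed-point implementation details), each figure is updated at every five-second sample roughly as below, where n_active is the number of runnable plus uninterruptible-sleep tasks at that moment and T is 60, 300 or 900 seconds:

    # Illustrative only: one five-second update of a load figure, done in awk
    # because the shell has no floating point.
    update_load() {   # usage: update_load OLD_LOAD N_ACTIVE T
        awk -v old="$1" -v n="$2" -v T="$3" \
            'BEGIN { e = exp(-5 / T); printf "%.2f\n", old * e + n * (1 - e) }'
    }

    update_load 0.50 8 60    # a 1-minute figure of 0.50 with 8 active tasks -> 1.10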

Generally speaking, load as a pure number has little value without a point of reference and I consider the value often misrepresented.

Misconception 1: Load as a Ratio

In other words, how can I know what maximum load average a machine can support before performance starts to degrade?

This is the most common mistake people make with load on Linux: assuming that it can be used to measure CPU performance against some fixed ratio. That is not what load gives you.

To elaborate: people have an easy time understanding CPU utilization. It is utility over time: you take the work done and divide it by the work possible.

Work possible in this regard is a fixed, known value, normally represented as a percentage out of 100; that's your fixed ratio.
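
To make that concrete, here is one rough way to measure utilization over a one-second window using the aggregate counters in /proc/stat (a sketch only):

    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
    sleep 1
    read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

    # Work done divided by work possible over the interval, as a percentage.
    busy=$(( (u2 + n2 + s2 + q2 + sq2 + st2) - (u1 + n1 + s1 + q1 + sq1 + st1) ))
    idle=$(( (i2 + w2) - (i1 + w1) ))
    echo "CPU utilization: $(( 100 * busy / (busy + idle) ))%"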

Load, however, has no such constraint. There is no fixed maximum, which is why you are having difficulty deciding what to measure against.

To clarify: what load samples does have a maximum at any given instant, namely the total number of tasks present on the system when the sample is taken, but that maximum is not fixed and has no real bearing on how much CPU work is being done.

Load as it is calculated has no fixed maximum, because the samples are folded into a weighted average and the number of tasks present at each sample is not recorded alongside the result.

Because I like food, an analogy you could use is that utilization is how fast you can eat the plate in front of you, and load is, on average, how many plates you have left to devour.

So, the difference between CPU utilization and load is subtle but important: CPU utilization is a measure of work being done, while load is a measure of work that needs to be done.

Misconception 2: Load is an Instant Measurement

The second fallacy is that load is a granular measurement: that you can read one number and get an understanding of the system's current state.

Load is not granular; it represents the general, longer-term condition of the system. Not only is it sampled every five seconds (so it misses short-lived tasks that start and finish within that window), it is also reported as averages over 1, 5 and 15 minutes respectively.

You can't use it as an instant measure of capacity, only as a general sense of a system's burden over a longer period.

The load can be 100 and then 10 only 30 seconds later. It's a value you have to keep watching to work with.

What can Load tell you?

It can give you an idea of the system's workload trend: is it being given more work than it can cope with, or less? (A rough way to read this off /proc/loadavg is sketched after the list below.)

  • If the load is less than the number of CPUs you have, it (normally) indicates that you have more CPU capacity than work.
  • If the load is greater than or equal to the number of CPUs and trending upwards, it's an indication that the system has more work than it can handle.
  • If the load is greater than or equal to the number of CPUs and trending downwards, it's an indication that the system is getting through the work faster than you are feeding it.
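
As an illustration, a crude way to read that trend off /proc/loadavg (the interpretation is the rough rule of thumb above, not anything the kernel itself defines):

    # Fields of /proc/loadavg: 1-, 5- and 15-minute averages, then runnable/total tasks.
    read -r one five fifteen _ < /proc/loadavg
    cpus=$(nproc)

    awk -v one="$one" -v fifteen="$fifteen" -v cpus="$cpus" 'BEGIN {
        if      (one < cpus)     print "spare capacity"
        else if (one > fifteen)  print "at or over capacity, trending up"
        else                     print "at or over capacity, trending down"
    }'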

Because the uninterruptible-sleep state is included, load is muddied as a pure measure of CPU scheduling pressure, but it does give you some indication of how much demand there is on the disk (technically, that is still work waiting to be done).

Load also offers clues to anomalies on a system. If you see the load at 50+, it suggests something is amiss.

Load can also cause people to be concerned without reason:

  1. As is commonly known, disk activity can inflate load.
  2. Load can be artificially inflated if many processes are bound to a single CPU and are waiting their turn on it.
  3. Tasks with a very low priority (a high niceness) will often wait a long time to run, inflating the load by 1 for each such task.

In Summary

I find load a very woolly value, precisely because there are no absolutes with it. The measurement you get on one system is often meaningless when set against another.

It's probably one of the first things I'd look at in top, purely to check for an obvious anomaly. Basically I'm using it almost like a thermometer: a reading of the general condition of a system, nothing more.

I find its sampling period far too long for most workloads I throw at my systems (which generally run on the order of seconds, not minutes). I suppose it makes sense for systems that execute long-running, intensive tasks, but I don't do much of that.

The other thing I use it for is long-term capacity management. It's a nice thing to graph over long periods of time (months), as you can use it to understand how much more work you are handling compared to a few months ago.


Finally, to answer your question about what to do in your scenario: quite honestly, the best suggestion I can offer is that, rather than using load as a factor in deciding when to run, you execute your process under nice, giving other processes priority over it. This is good for a few reasons.

  1. You only give a small amount of CPU time to this process when other processes are busy.
  2. If nothing else wants the CPU, or a CPU is idle, your task gets 100% of the time on it.
  3. All the processes in the process group inherit the same niceness.

With a niceness of 0 (the default), each process gets a weight of 1024. The lower the weight, the less CPU time is offered to the process. Here is a table of this behaviour.

Nice  Weight
0     1024
1     820
2     655
3     526
4     423
5     335
6     272
7     215
8     172
9     137
10    110
11    87
12    70
13    56
14    45
15    36
16    29
17    23
18    18
19    15
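
For reference, this table is just the scheduler's weight curve: each nice step scales the weight by roughly a factor of 1.25, so any row can be approximated as 1024 / 1.25^nice, for example:

    # Approximate weight for a given niceness; compare with the table above.
    awk -v nice=10 'BEGIN { printf "%.0f\n", 1024 / (1.25 ^ nice) }'    # prints 110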

So, to compare: in a scenario where you have two processes waiting to run, if you renice one of them to +10 it gets approximately a tenth of the CPU time a nice-0 process gets. If you renice it to +19 it gets roughly a hundredth.

It should be noted that you'll probably still see the load at 1 or more for the duration of your pipeline, since the niced tasks still count towards it while they are runnable.

I imagine this would be a more elegant solution to your problem.
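
Concretely, that can be as simple as the invocation below (run_tasks.sh is only a placeholder for your pipeline's entry point; the ionice variant is optional and only matters if disk I/O also needs taming and the I/O scheduler honours it, e.g. CFQ or BFQ):

    # Run the whole pipeline at the lowest CPU priority; child processes inherit it.
    nice -n 19 ./run_tasks.sh

    # Optionally also drop its disk I/O to the idle class.
    nice -n 19 ionice -c 3 ./run_tasks.sh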

Solution 2:

From Wikipedia:

However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system.

In other words, the load average reported by Linux includes any processes waiting for I/O (e.g. disk or network). This means that if your application is somewhat I/O-intensive, you will see a high load average (i.e. many processes are waiting for I/O) combined with low CPU utilization (they sleep while waiting for the I/O to complete).

This, in turn, can lead to a system that remains responsive even with a high load average.
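
A quick way to see which of the two is behind a high load figure is to count tasks by state: lots of D (uninterruptible sleep) entries point at I/O, lots of R entries point at CPU pressure.

    # One-character state per task, counted; D = uninterruptible sleep, R = runnable.
    ps -eo state= | sort | uniq -c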