What Warning and Critical values to use for check_load?
Linux load is actually simple. Each of the load average numbers is the sum of the per-core average loads, i.e.
1 min load avg = load_core_1 + load_core_2 + ... + load_core_n
5 min load avg = load_core_1 + load_core_2 + ... + load_core_n
15 min load avg = load_core_1 + load_core_2 + ... + load_core_n
where 0 <= load avg < infinity.
So if the load is 1 on a 4-core server, it means either that each core is 25% utilized or that one core is under 100% load. A load of 4 means all 4 cores are under 100% load. A load greater than 4 means the server needs more cores.
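As a quick sanity check along these lines (a minimal sketch; the output values shown are made up for illustration), you can compare the load averages against the core count directly:
$ nproc
4
$ cat /proc/loadavg
1.05 0.70 0.41 1/213 4309
The first three fields of /proc/loadavg are the 1, 5 and 15 minute load averages; as long as they stay below the number nproc reports, the machine is not oversubscribed on average.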
check_load now has
-r, --percpu
    Divide the load averages by the number of CPUs (when possible)
which means that when it is used, you can think of your server as having just one core and write the percentage fractions directly, without worrying about the number of cores. With -r the warning and critical ranges become 0 <= load avg <= 1, i.e. you don't have to adjust your warning and critical values from server to server.
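For example (a sketch only; the threshold fractions here are arbitrary), the same per-CPU fractions then work unchanged on a 1-core VM and a 32-core box:
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.7,0.6,0.5 -c 0.9,0.8,0.7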
The OP has 5,10,15 for the intervals. That is wrong; the intervals are 1, 5 and 15 minutes.
Though it's an old post, I'm replying now because I know check_load threshold values are a big headache for newbies. ;)
A warning alert if CPU load is 70% over 1 minute, 60% over 5 minutes, or 50% over 15 minutes; a critical alert if it is 90% over 1 minute, 80% over 5 minutes, or 70% over 15 minutes:
command[check_load]=/usr/local/nagios/libexec/check_load -w 0.7,0.6,0.5 -c 0.9,0.8,0.7
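On the Nagios server side that NRPE command would typically be invoked with check_nrpe, roughly like this (the host address is a placeholder, and the plugin path assumes a stock source install):
$ /usr/local/nagios/libexec/check_nrpe -H 192.168.1.10 -c check_load
which should return check_load's usual OK/WARNING/CRITICAL status line along with its performance data.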
All my findings about CPU load:
What's meant by "the load"? Wikipedia says:
All Unix and Unix-like systems generate a metric of three "load average" numbers in the kernel. Users can easily query the current result from a Unix shell by running the uptime command:
$ uptime
14:34:03 up 10:43, 4 users, load average: 0.06, 0.11, 0.09
From the above output, the load average of 0.06, 0.11, 0.09 means (on a single-CPU system):
- during the last minute, the CPU was busy only 6% of the time (idle 94%)
- during the last 5 minutes, the CPU was busy 11% of the time (idle 89%)
- during the last 15 minutes, the CPU was busy 9% of the time (idle 91%)
$ uptime
14:34:03 up 10:43, 4 users, load average: 1.73, 0.50, 7.98
The above load average of 1.73, 0.50, 7.98 can be interpreted on a single-CPU system as:
- during the last minute, the CPU was overloaded by 73% (1 CPU with 1.73 runnable processes, so that 0.73 processes had to wait for a turn)
- during the last 5 minutes, the CPU was idle 50% of the time (no processes had to wait for a turn)
- during the last 15 minutes, the CPU was overloaded 698% (1 CPU with 7.98 runnable processes, so that 6.98 processes had to wait for a turn)
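If you would rather not do that arithmetic in your head, a one-liner along these lines (a rough sketch; it only looks at the 1-minute average) converts the load into a per-core percentage:
$ awk -v cores="$(nproc)" '{ printf "1-min load per core: %.0f%%\n", $1 / cores * 100 }' /proc/loadavg
Anything under 100% means the run queue fits in the available cores on average; anything over it means processes are waiting.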
Nagios threshold value calculation:
For Nagios CPU Load setup, which includes warning and critical:
y = c * p / 100
Where:
y = nagios value
c = number of cores
p = wanted load percentage
For a 4-core system:
time      1 min   5 min   15 min
warning:  90%     70%     50%
critical: 100%    80%     60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
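If you roll the same NRPE config out to machines with different core counts, you can generate the thresholds from the formula above instead of hard-coding them. A sketch (the percentages are the ones from the 4-core example; the variable names are mine):
cores=$(nproc)   # number of CPU cores on this host
warn=$(awk -v c="$cores" 'BEGIN { printf "%.1f,%.1f,%.1f", c*0.90, c*0.70, c*0.50 }')   # y = c*p/100 for p = 90,70,50
crit=$(awk -v c="$cores" 'BEGIN { printf "%.1f,%.1f,%.1f", c*1.00, c*0.80, c*0.60 }')   # y = c*p/100 for p = 100,80,60
echo "command[check_load]=/usr/local/nagios/libexec/check_load -w $warn -c $crit"
On a 4-core box this prints exactly the command line above; on an 8-core box it scales the thresholds accordingly.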
For a single core system:
y = p / 100
Where:
y = nagios value
p = wanted load percentage
time      1 min   5 min   15 min
warning:  70%     60%     50%
critical: 90%     80%     70%
command[check_load]=/usr/local/nagios/libexec/check_load -w 0.7,0.6,0.5 -c 0.9,0.8,0.7
A great white paper about CPU load analysis is Dr. Gunther's http://www.teamquest.com/pdfs/whitepaper/ldavg1.pdf. In this article Dr. Gunther digs down into the UNIX kernel to find out how load averages (the "LA Triplets") are calculated and how appropriate they are as capacity-planning metrics.
Unless the servers in question have an asynchronous workload where queue depth is the important service metric to manage, it's honestly not even worth monitoring load average. It's just a distraction from the metrics that matter, like service time (service time, and service time).
A good complement to Nagios is a tool like Munin or Cacti; they will graph the different kinds of workload your server is experiencing, be it load average, CPU usage, disk I/O, or something else. Using that information it is easier to set good threshold values in Nagios.
Do you know at what load average your system's performance is affected? We had servers at my last job that would consistently sit at a load average of 35-40 but were still responsive. It's a measurement you have to do a bit of detective work on to get accurate numbers.
You might want to instead measure some other metrics on the system, like average connect time for SSH or http; this might be a better indicator of how much load your system is under.
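As a rough sketch of that idea (the host names, ports and thresholds here are made up; check_http and check_tcp ship with the standard Nagios plugins), you could alert on response time instead of load:
/usr/local/nagios/libexec/check_http -H www.example.com -w 1 -c 3
/usr/local/nagios/libexec/check_tcp -H 192.168.1.10 -p 22 -w 0.5 -c 2
Here -w and -c are response-time thresholds in seconds, so the checks go WARNING or CRITICAL when the service actually gets slow, regardless of what the load average says.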