Icinga - Very high check latency in distributed environment

I have a distributed Icinga setup, arranged as follows:

CENTRAL: receives passive check results only
DISTRIBUTED A: 227 hosts, 835 services
DISTRIBUTED B: 67 hosts, 243 services

The CENTRAL server stays below 1 second of average check latency at all times. DISTRIBUTED B currently sits at roughly 10 seconds of average check latency, and even that is climbing as we add more checks.

DISTRIBUTED A has serious check latency issues that I can't pin down: up to 700 seconds at times, lower just after a reload, but it builds back up. Here's the current icingastats output:

Icinga Stats 1.10.3
Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 02-11-2014
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/lib/icinga/status.dat
Status File Age:                        0d 0h 0m 3s
Status File Version:                    1.10.3

Program Running Time:                   1d 17h 30m 44s
Icinga PID:                             1160
Used/High/Total Command Buffers:        0 / 11 / 4096

Total Services:                         839
Services Checked:                       839
Services Scheduled:                     839
Services Actively Checked:              839
Services Passively Checked:             0
Total Service State Change:             0.000 / 6.250 / 0.007 %
Active Service Latency:                 644.742 / 776.293 / 729.813 sec
Active Service Execution Time:          0.010 / 20.163 / 0.720 sec
Active Service State Change:            0.000 / 6.250 / 0.007 %
Active Services Last 1/5/15/60 min:     18 / 274 / 717 / 839
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              835 / 2 / 1 / 1
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            227
Hosts Checked:                          227
Hosts Scheduled:                        227
Hosts Actively Checked:                 227
Hosts Passively Checked:                0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.000 / 772.310 / 726.904 sec
Active Host Execution Time:             0.006 / 0.338 / 0.030 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        14 / 22 / 196 / 227
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  227 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     14 / 28 / 192
   Scheduled:                           14 / 26 / 188
   On-demand:                           0 / 2 / 4
   Parallel:                            14 / 27 / 190
   Serial:                              0 / 0 / 0
   Cached:                              0 / 1 / 2
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  31 / 276 / 702
   Scheduled:                           31 / 276 / 702
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0

This doesn't seem to be an external command buffer issue, as the buffer usage is always 0. I've also played with the reaper settings, trying assorted combinations of max_check_result_reaper_time (5, 10, 30) and check_result_reaper_frequency (1, 5, 10), but nothing brings the latency down.
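For reference, this is roughly what one of those combinations looks like in icinga.cfg (option names as in the stock Icinga 1.x config; the values shown are just one of the combinations I tried):

# How often, in seconds, the core wakes up to reap check results
check_result_reaper_frequency=5

# Maximum time, in seconds, a single reaping pass is allowed to run
max_check_result_reaper_time=30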

Checking status.dat, it's not as if a handful of checks are driving the average up: every service and host check shows a latency around the average (700+ seconds). Check execution times across the board are low. The vast majority finish in under 1 second; beyond that, 143 checks take between 1 and 2 seconds, 50 take 4 seconds or more, and the worst 4 of those take 8, 10, 17 and 20 seconds respectively. These numbers don't point to an actual check execution time problem to me.
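For reference, the latency and execution time figures above can be pulled straight out of status.dat with something along these lines (assuming the stock check_latency and check_execution_time fields):

# Five worst check latencies currently recorded
awk -F= '$1 ~ /check_latency/ {print $2}' /var/lib/icinga/status.dat | sort -n | tail -5

# Execution times bucketed to the nearest second, with counts per bucket
awk -F= '$1 ~ /check_execution_time/ {print int($2)}' /var/lib/icinga/status.dat | sort -n | uniq -c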

The server itself isn't struggling resource-wise; CPU and memory usage are both fine. Also worth noting: the CENTRAL and DISTRIBUTED A servers run on the same physical infrastructure, albeit as different VMs.


I'm not sure this will fully address your issue, but here are some places to look.

You appear to be using Icinga v1, whose core is essentially single-threaded: the check plugins themselves are forked off in parallel, but everything the core does around them, scheduling, reaping results, and running per-result actions, happens sequentially in one event loop. If processing each check takes a bit too much time, latency builds up. Furthermore, if you have an action to perform after every check, such as sending the result on to CENTRAL via NSCA, that action also delays the next service check, to the point where it can completely kill your performance. This is something you won't measure directly, because it isn't a matter of machine load but of Icinga load.
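To illustrate, here is a sketch of the classic Icinga 1 distributed forwarding setup and where the blocking happens. submit_check_result is a hypothetical wrapper script around send_nsca; names and paths will differ on your setup:

# icinga.cfg on the distributed server: run a command after every service check
obsess_over_services=1
ocsp_command=submit_check_result

# commands.cfg: if this wrapper runs send_nsca synchronously, the core
# waits for it after every single check result before doing anything else
define command{
    command_name    submit_check_result
    command_line    /usr/local/bin/submit_check_result "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$"
}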

One way to take load off your Icinga instance is to use extra tools. For distributing checks, you can use mod_gearman, for instance; it is often used to make Nagios/Icinga setups scale. If you use NSCA, we developed a tool that makes the sending asynchronous, relieving Icinga of that burden.
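To give an idea, mod_gearman loads into the core as an event broker module and hands checks off to external worker processes; a minimal sketch, assuming the default install paths (yours may differ):

# icinga.cfg: load the mod_gearman NEB module
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf

# /etc/mod_gearman/module.conf: which work to hand over, and where the
# gearman job server listens
server=localhost:4730
hosts=yes
services=yes
eventhandler=yes

The workers then execute the checks and feed results back, so the core's event loop no longer spends its time forking and waiting on checks.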

I hope this will help.