High load on a nagios server -- How many service checks for a nagios server is too many?

Solution 1:

You need to figure out where your bottleneck is...

I run a Nagios monitor that checks 400+ hosts with HTTP, ping and SSH checks, along with a lot of other passive checks and nscd.

This is on a 2xQuadCore server with 4 SAS disks in RAID10.

I suspect you're hitting I/O contention, since writing to lots of RRD files is very inefficient.

You need to figure out which process is taking up your resources: Cacti, Nagios or something else.

For checking I/O, I like iotop. Install it (the Ubuntu 9.04 package works on 8.04).

But otherwise top should also help you find your load hog.
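If neither tool is handy, a rough way to tell CPU load from disk wait is to sample the aggregate "cpu" line of /proc/stat twice and look at the iowait share. This is only a minimal sketch, assuming a Linux box; the function names and the 5-second window are my own choices, not anything Nagios or Cacti provides.

```python
"""Rough CPU-vs-I/O check from /proc/stat (Linux only)."""
import time


def parse_cpu_line(line):
    """Return (iowait_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(x) for x in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal ...
    # Sum only the first eight so guest time (already in user) is not counted twice.
    return ticks[4], sum(ticks[:8])


def iowait_percent(before, after):
    """Share of CPU time spent waiting on I/O between two samples."""
    d_iowait = after[0] - before[0]
    d_total = after[1] - before[1]
    return 100.0 * d_iowait / d_total if d_total > 0 else 0.0


def sample(path="/proc/stat"):
    """Read and parse the first line of /proc/stat."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


if __name__ == "__main__":
    first = sample()
    time.sleep(5)  # arbitrary sampling window
    second = sample()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

A consistently high iowait figure alongside a high load average points at the disks rather than the CPU, which is when iotop becomes worth installing.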

Polling with Cacti once a minute is pretty aggressive; I run mine at 5-minute intervals.

One approach I've heard of for RRD write contention is to put your RRD stores on a ramdisk/tmpfs. Be sure to rsync them to persistent storage every now and then, because anything on tmpfs is lost on reboot.
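The periodic copy back to disk could be as small as the sketch below, run from cron. Both directory paths are hypothetical placeholders, and it assumes rsync is installed and on the PATH; I haven't run this setup myself.

```python
"""Copy RRD files from a tmpfs mount to persistent storage with rsync."""
import subprocess

# Hypothetical paths; adjust to your own layout.
TMPFS_DIR = "/mnt/rrd-tmpfs/"
PERSISTENT_DIR = "/var/lib/rrd-backup/"


def build_rsync_command(src, dst):
    """Build the rsync argument list for a one-way copy of src into dst."""
    # A trailing slash on src makes rsync copy the directory's contents,
    # not the directory itself.
    if not src.endswith("/"):
        src += "/"
    # -a (archive) preserves timestamps and permissions.
    return ["rsync", "-a", src, dst]


def sync_rrds(src=TMPFS_DIR, dst=PERSISTENT_DIR):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_rsync_command(src, dst), check=True)


if __name__ == "__main__":
    sync_rrds()
```

The tmpfs starts out empty after a reboot, so the same copy has to run in the other direction before the poller starts. That is also why the sketch leaves out rsync's --delete option: a sync from an empty tmpfs would otherwise wipe the backup.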

Good luck.

Solution 2:

Unless Cacti is generating most of the load, you should be able to run many more checks than that on your hardware.

I'm running nagios on a FreeBSD virtual machine running on Microsoft Virtual Server on a dog-slow old PC (Pentium 3 1GHz with a slow PATA disk). The virtual machine has only 128MB RAM, and performance is dire.

However, the load average is about 0.2 while running 158 checks across 42 hosts.

Solution 3:

On an old PIII with 256MB of RAM I'm actively monitoring about 230 different services. The same machine is also running MRTG and HylaFAX for all our incoming faxes and is doing so quite comfortably.

Solution 4:

You should be able to run a boatload of Nagios checks with that hardware. We run a similar setup with about 70 checks and Nagiosgraph; the major difference is added RAM (it's cheap, so I'd bump the box up to 2 GB).

Try running top or ps aux to see if the CPU is overloaded, but I doubt it. You may also want to check the Nagios parallelization docs to see if your install is trying to run too many checks at once rather than serializing them.
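One setting worth looking at here is max_concurrent_checks in nagios.cfg, where a value of 0 means no limit on parallel service checks. The sketch below just reads that directive out of the main config file. The helper name and the default path are my own assumptions (the file location varies by distribution), and returning the last assignment found is a guess at how duplicates are treated.

```python
"""Print the max_concurrent_checks value from a nagios.cfg file."""
import sys


def read_directive(cfg_text, name):
    """Return the last value assigned to `name` in key=value text, or None."""
    value = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comments.
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value


if __name__ == "__main__":
    # Hypothetical default path; pass the real one as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/nagios/nagios.cfg"
    with open(path) as f:
        limit = read_directive(f.read(), "max_concurrent_checks")
    if limit is None:
        print("max_concurrent_checks is not set in", path)
    elif limit == "0":
        print("max_concurrent_checks=0: no limit on parallel checks")
    else:
        print("max_concurrent_checks=" + limit)
```

If it comes back as 0 on a box that is short on RAM, setting a modest limit and watching the load average is a cheap experiment.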