Nagios server best practices?
I run a medium-sized Nagios server. It monitors roughly 40 servers with 180 services currently and is only growing by the day.
I migrated from an old Nagios setup that was configured in a very esoteric fashion, forcing me to reconfigure everything from scratch.
Now that the server is running and covers most of what we need, I'm looking into making it more scalable. Currently each host has its own file in /etc/nagios/hosts/, and each host has all of its services in the same file. This is obviously not optimal, but neither is scattering my configuration across hundreds of different files.
So my question is this: to any experienced Nagios admins out there, what is the best way to make use of hostgroups/servicegroups without over-complicating the configuration?
Hostgroups and templates.
Templates let you define classes for your hosts and services, e.g. "normal service", "critical service", "low-priority host". They also serve as a useful way to divide responsibilities if you've got multiple teams with different responsibilities, so you can have a "linux host" template and a "windows host" template, with each one defining the appropriate contact info.
You can use multiple templates on a single resource, so you can compose appropriately-orthogonal templates. For example, you can have
define host {
    host_name   foo
    use         windows-host,normal-priority-host
    ...
}
which would pull in the contact info (and escalations) for the Windows team and the polling rates and thresholds for a "normal" host.
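As a rough sketch, the two templates used above might look like the following. The contact group name (windows-admins), the check-host-alive command, the 24x7 timeperiod, and all numeric values are assumptions for illustration and would need to exist in, or be adapted to, your own configuration.

```cfg
# Template carrying the "who gets notified" settings for the Windows team.
define host {
    name                    windows-host        # template name, referenced by "use"
    contact_groups          windows-admins      # hypothetical contact group
    register                0                   # template only, not a real host
}

# Template carrying the polling and notification settings for a "normal" host.
define host {
    name                    normal-priority-host
    check_command           check-host-alive    # assumed to be defined elsewhere
    check_period            24x7                # assumed timeperiod
    check_interval          5
    retry_interval          1
    max_check_attempts      3
    notification_period     24x7
    notification_interval   60
    register                0
}
```

Keeping the two templates free of overlapping directives avoids surprises: when templates listed in the same use line do conflict, Nagios takes the value from the one listed first.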
Hostgroups let you group together all of the checks for a subset of your hosts. Have things like "baseline-linux-hosts" that check load, disk space, SSH availability, and whatever other things should be on every host you monitor. Add groups like "https-servers" with checks for HTTP connectivity, HTTPS connectivity, and SSL certificate expiration dates; "fileservers" with checks for NFS and SMB accessibility and maybe more aggressive disk checks; or "virtual-machines" with checks for whether the VM accessibility tools are running properly.
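For example, a file for the "https-servers" group could look roughly like this. It assumes a service template called normal-service and the check_http, check_https, and check_https_cert commands are defined elsewhere, and that those commands need no arguments; adjust to your own definitions.

```cfg
# hostgroups.d/https-servers.cfg (illustrative file name)
define hostgroup {
    hostgroup_name  https-servers
    alias           HTTPS servers
}

# Each service is attached to the group, so every member host gets it.
define service {
    use                  normal-service      # assumed service template
    hostgroup_name       https-servers
    service_description  HTTP
    check_command        check_http
}

define service {
    use                  normal-service
    hostgroup_name       https-servers
    service_description  HTTPS
    check_command        check_https
}

define service {
    use                  normal-service
    hostgroup_name       https-servers
    service_description  SSL certificate expiry
    check_command        check_https_cert
}
```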
Put each host and hostgroup in its own file. That file should contain the host or hostgroup definition first, followed by the definitions of the services that apply to it.
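A host file following that layout might look like the sketch below. The host name, address, and the check_smtp command are placeholders, and the templates and hostgroup are assumed to be defined elsewhere.

```cfg
# hosts.d/host1.cfg
define host {
    use         linux-host,normal-priority-host   # assumed templates
    host_name   host1
    address     192.0.2.11                        # placeholder address
    hostgroups  baseline-linux-hosts
}

# Services that apply only to this host follow its definition.
define service {
    use                  normal-service           # assumed service template
    host_name            host1
    service_description  SMTP
    check_command        check_smtp               # assumed command name
}
```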
If you use the cfg_dir directive in your nagios.cfg file, Nagios will search recursively through that directory. Make use of that. For a setting of cfg_dir=/etc/nagios/conf.d, you can have a directory tree like the following:
- /etc/nagios/conf.d/
  - commands.d/
    - http.cfg
    - nrpe.cfg
    - smtp.cfg
    - ssh.cfg
  - hosts.d/
    - host1.cfg
    - host2.cfg
    - host3.cfg
  - hostgroups.d/
    - hostgroup1.cfg
    - hostgroup2.cfg
I tend to make a directory for each resource type (commands, contactgroups, contacts, escalations, hostgroups, hosts, servicegroups, timeperiods) except for services, which get grouped in with the hosts or hostgroups that use them.
The precise structure can vary according to your organizational needs. At a past job, I used subdirectories under hosts.d for each different site. At my current job, most of the Nagios host definitions are managed by Puppet, so there's one directory for Puppet-managed hosts and a separate one for hand-managed hosts.
Note that the above also breaks out commands into multiple files, generally by protocol. Thus, the nrpe.cfg file would have the commands check_nrpe and check_nrpe_1arg, while http.cfg could have check_http, check_http_port, check_https, check_https_port, and check_https_cert. [1]
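As an illustration, an nrpe.cfg along those lines could contain something like the following. The command lines are an assumption modeled on common packaged definitions; $USER1$ is taken to point at your plugin directory, and the argument handling may differ on your system.

```cfg
# commands.d/nrpe.cfg
# Run a remote NRPE command and pass extra arguments to it.
define command {
    command_name  check_nrpe
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}

# Run a remote NRPE command that takes no arguments.
define command {
    command_name  check_nrpe_1arg
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
```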
I don't typically have a tremendous number of templates, so I usually just have a hosts.d/templates.cfg file and a services.d/templates.cfg file. If you use them more heavily, they can go into appropriately-named files in a templates.d directory.
[1] I also like to have a check_http_blindly command, which is basically check_http -H $HOSTADDRESS$ -I $HOSTADDRESS$ -e HTTP/1. ; it returns OK even if it gets a 403 response code.
Make extensive use of servicegroups, hostgroups, and templating. Create hostgroups, and assign services to the hostgroups. Use servicegroups for dependencies, escalations, and logical grouping in the web UI.
If you have groups for everything, adding a new host is just 3 or 4 lines: name, address, template(s), and (optionally) hostgroups. Everything can be templated.
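With templates and hostgroups in place, a new host definition can be as short as this sketch (the host name and address are placeholders, and the template and hostgroup names are assumed to exist already):

```cfg
define host {
    host_name   web01                                # placeholder name
    address     192.0.2.10                           # placeholder address
    use         linux-host,normal-priority-host      # assumed templates
    hostgroups  baseline-linux-hosts,https-servers   # assumed hostgroups
}
```

Every service attached to those hostgroups then applies to the new host without any further configuration.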
Be sure to read the docs on inheritance, and also the time-saving tricks page. Multiple inheritance can get tricky, but when used correctly it's a huge time-saver.