How can I make Status Information for Nagios services easier to read?
I'm running Nagios in an environment with several servers, each with several services on them. There are a few custom checks, but I prefer to use existing checks where possible. I'm using the check_disk plugin through NRPE to check the utilization of each mounted file system:
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var -C -u GB -w 200 -c 100 -r '^/mounts[^/]+$'
It's handy to have these all checked as a single service ("Disks"), but when one of these goes to warning mode, it's hard to read the output in the Status Information line:
DISK WARNING - free space: / 6 GB (9% inode=92%): /var 125 GB (67% inode=99%): /mounts/vol0 1152 GB (16% inode=99%): /mounts/vol1 1096 GB (15% inode=99%): /mounts/vol2 126 GB (1% inode=99%): /mounts/vol3 228 GB (3% inode=99%): /mounts/vol4 3245 GB (44% inode=99%): /mounts/vol5 108 GB (1% inode=99%):
In the above case, the check is warning because /, /mounts/vol2, and /mounts/vol5 are below threshold. An operator has to wade through every value to find the ones that crossed a threshold. Also, if one is critical and the others are warning, it would be nice to show them differently, either by marking them or by putting them on different lines.
Is there a straightforward way to do this, without creating a new command for every mount point? Or am I missing some other fundamental method of Nagios magic to make this friendly?
Try the --errors-only flag, which should greatly reduce the amount of text this plugin prints.
-e, --errors-only Display only devices/mountpoints with errors
This seems to do the trick for me. Note the drastic difference in the output:
# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10%
DISK WARNING - free space: / 37167 MB (96% inode=98%); /dev/shm 244 MB (100% inode=99%); /boot 84 MB (18% inode=99%); /home 21253 MB (99% inode=99%);
But with the --errors-only flag, it's now clear that my problem is with /boot:
# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK WARNING - free space: /boot 94 MB (20% inode=99%);
If there are no problems on the system, the output is very short:
# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK OK
(Note: I have removed everything after the first | for clarity. The Nagios web interface also trims this output before it is displayed on the screen.)
Also see this discussion on the Debian bug tracker: nagios2: complains about disk space in an uncomprehensible way.
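Applied to the NRPE command from the question, this would probably look like the line below. It is only a sketch: the command is copied unchanged from the question with -e (the short form of --errors-only) appended, and I have not tested how the flag interacts with the grouped thresholds there.

```cfg
# nrpe.cfg: same command as in the question, with -e (--errors-only) appended
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var -C -u GB -w 200 -c 100 -r '^/mounts[^/]+$' -e
```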
The standard behavior is to put everything on one line. You have only two options:
define a check for each disk (I know this is not what you want, but I still find it the best solution)
write your own plugin or a wrapper around check_disk that parses the output: you can then, for example, put the disks below the threshold first in the status line, or shorten the output to include only the relevant disks.
You can write the wrapper in any language but given the task I would suggest a scripting language (e.g., Perl). There are guidelines on how to develop plugins: http://nagiosplug.sourceforge.net/developer-guidelines.html
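For the first option, one way to avoid writing a separate NRPE command per mount point is to pass the path as an argument. The fragment below is a sketch under several assumptions: NRPE was built with command-argument support and has dont_blame_nrpe=1 set (which has security implications), and the names check_disk_path, check_nrpe_disk, server1, and the generic-service template are placeholders I made up for illustration.

```cfg
# nrpe.cfg on the monitored host (assumes argument passing is enabled)
dont_blame_nrpe=1
command[check_disk_path]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p $ARG1$

# Nagios server: one command definition, reused by every per-disk service
define command {
    command_name    check_nrpe_disk
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk_path -a $ARG1$
}

# One service per mount point; only the argument changes
define service {
    use                     generic-service
    host_name               server1
    service_description     Disk /var
    check_command           check_nrpe_disk!/var
}
```

Each mount point then gets its own state and its own short Status Information line, at the cost of one service definition per disk.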
Like @Matteo, I think you should define a check for each partition. But here's an example of a wrapper that sorts the disks by free-space percentage, lowest first (that is, highest usage first):
check_disk -w 20% -c 10% -p /dev/sda1 -p /dev/sdb2 -p /dev/sdb4 | \
awk -F"|" '{ print $1 }' | awk -F": " '{ print $2 }' | \
tr ";" "\n" | sed 's/^ //' | sort -t'(' -k2,2n
PS: My check_disk plugin returns a list separated by ; instead of : as you showed.
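Since the separator differs between plugin versions, a slightly more defensive variant of the same idea could accept either one and also label each disk. This is only a sketch: format_disks is a name I made up, it compares free-space percentages only (so it ignores absolute thresholds such as -w 200 with -u GB), and it assumes mount paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: format_disks WARN_PCT CRIT_PCT
# Reads one check_disk status line on stdin and prints one mount per line,
# labelled CRITICAL, WARNING, or OK by free-space percentage, lowest first.
# Steps: drop perfdata after the first "|", strip the "... free space:" prefix,
# split entries at "%)" followed by ":" or ";", label them, then sort.
format_disks() {
    warn=$1
    crit=$2
    cut -d'|' -f1 |
    sed 's/^[^:]*free space: *//' |
    sed 's/%)[;:] */%)\
/g' |
    awk -v warn="$warn" -v crit="$crit" '
        NF >= 4 {
            pct = $4                  # fourth field looks like "(18%"
            gsub(/[^0-9]/, "", pct)   # keep only the digits
            state = "OK"
            if (pct + 0 < warn) state = "WARNING"
            if (pct + 0 < crit) state = "CRITICAL"
            # Prefix the numeric percentage so sort -n can order the lines
            printf "%d %-8s %s\n", pct, state, $0
        }' |
    sort -n |
    cut -d' ' -f2-    # drop the sort key again
}

# Example (plugin path as used in the question):
#   /usr/lib/nagios/plugins/check_disk -w 20% -c 10% | format_disks 20 10
```

A real wrapper would also have to save the exit status of check_disk and exit with it, because Nagios takes the service state from the exit code and not from the text.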
You might consider check_multi. It combines a single status line with the ability to drill into details, because each disk is actually checked independently. You can see from some of the screenshots how it would work for you. For disk checks, you'd have one check_multi service that displays something like "1 warning, 2 OK". When you click on that service, you'd see three separate checks: the disk in warning with its own details, and the other two shown clearly as OK.