Nagios plugin to take process snapshot when load is high
Solution 1:
You can do it with event handlers.
First, add an event handler for your Load average definition:
define service{
use generic-service
host_name xx
service_description Load_Average
check_command check_nrpe!check_load
event_handler processes_snapshot!xx
contact_groups admin-sms
}
The processes_snapshot command is defined in commands.cfg:
define command{
command_name processes_snapshot
command_line $USER1$/eventhandlers/processes_snapshot.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}
And second, write an event handler script (processes_snapshot.sh):
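#!/bin/bash
# Arguments: $1=$SERVICESTATE$ $2=$SERVICESTATETYPE$ $3=$SERVICEATTEMPT$ $4=$HOSTADDRESS$
case "$1" in
OK)
;;
WARNING)
# Load is elevated: ask the remote host to save a process snapshot.
/usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
;;
UNKNOWN)
;;
CRITICAL)
/usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
;;
esac
exit 0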
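The handler above receives $SERVICESTATETYPE$ and $SERVICEATTEMPT$ but never looks at them, so it fires on every state change. A minimal sketch of how those two arguments could be used to limit snapshots follows. The should_snapshot helper and the policy it encodes (first soft attempt, plus every HARD state) are assumptions for illustration, not part of Nagios or of the original answer.

```shell
#!/bin/sh
# Sketch: decide whether this event-handler invocation should take a snapshot.
# should_snapshot is a hypothetical helper, not something Nagios provides.
# Arguments mirror the macros passed by the command definition:
#   $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$
should_snapshot() {
    state=$1
    state_type=$2
    attempt=$3
    case "$state" in
    WARNING|CRITICAL)
        # Assumed policy: snapshot on the first soft attempt (closest to the
        # spike) and again whenever the problem is reported as HARD.
        if [ "$state_type" = "HARD" ] || [ "$attempt" -eq 1 ]; then
            return 0
        fi
        ;;
    esac
    # OK, UNKNOWN and later soft retries: do nothing.
    return 1
}
```

In processes_snapshot.sh this could replace the case statement with something like: if should_snapshot "$1" "$2" "$3"; then run the check_nrpe line; fi.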
The command processes_snapshot is defined in nrpe.cfg on the xx host as follows:
command[processes_snapshot]=top -cSbn 1 | tail -n +8 | sort -rn -k11 | head > /tmp/proc_snap.txt
PS: I haven't tested this config.
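One thing to be aware of: the nrpe.cfg command writes to a fixed path, so each new snapshot overwrites the previous one. A rough sketch of a variant that keeps one file per snapshot is below. SNAP_DIR, SNAP_CMD and take_snapshot are hypothetical names, and like the config above this has not been tried against a real NRPE setup.

```shell
#!/bin/sh
# Sketch: write each snapshot to its own timestamped file.
# SNAP_DIR, SNAP_CMD and take_snapshot are hypothetical names.
SNAP_DIR=${SNAP_DIR:-/tmp}
SNAP_CMD=${SNAP_CMD:-top -cSbn 1}

take_snapshot() {
    out="$SNAP_DIR/proc_snap.$(date +%Y%m%d-%H%M%S).txt"
    # Same pipeline as the nrpe.cfg command, minus the fixed output path.
    # $SNAP_CMD is left unquoted on purpose so it splits into command + args.
    $SNAP_CMD | tail -n +8 | sort -rn -k11 | head > "$out"
    # Print one line so the caller (check_nrpe) has some output to return.
    echo "snapshot saved to $out"
}
```

To use it, the script would end with a call to take_snapshot, and the command[processes_snapshot] entry in nrpe.cfg would point at wherever the script is installed.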
Solution 2:
Here's what I did to get a process list snapshot directly in the notification emails, based on the idea by @quanta. It may contain paths specific to the way Nagios is installed on Debian/Ubuntu machines:
- Created a wrapper script /usr/local/sbin/check_load that calls the original and appends the process snapshot if the exit code is non-zero, i.e. 1 (WARNING) or 2 (CRITICAL); 3 (UNKNOWN) also triggers it:
#!/bin/sh
/usr/lib/nagios/plugins/check_load "$@" || {
    rc=$?
    echo
    # http://nagios.sourceforge.net/docs/3_0/pluginapi.html
    # | separates long output from perfdata
    COLUMNS=1000 top -cSbn 1|sed -e 's/|/<BAR>/g' -e 's/ \+$//'
    exit $rc
}
This sets COLUMNS to a large number so the process names/command lines won't be truncated to 40 characters, runs top in batch mode for one iteration (-bn 1), asks for full command lines (-c) and cumulative CPU times (-S) to be shown, then makes sure top's output isn't truncated at the first | character by replacing it with <BAR>.
I find top's default sort order to be adequate. Attempting to re-sort by cumulative CPU time, as was suggested in @quanta's answer, puts system daemons like init or crond at the top, which doesn't help me figure out which CGI script was responsible for the CPU spike. Also, this way I get to keep top's header.
Don't forget to chmod +x /usr/local/sbin/check_load.
- Edit /etc/nagios-plugins/config/load.cfg and replace the check_load entry
command_line /usr/lib/nagios/plugins/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
with
command_line /usr/local/sbin/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
- Edit /etc/nagios3/commands.cfg and update the notify-service-by-email entry so it includes $LONGSERVICEOUTPUT$ in the generated emails. It's too long to paste here; basically find the
Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail
bit and change it to
Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail
- Restart nagios: service nagios3 restart
I haven't tried this with NRPE.
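To see what the sed filter in the wrapper script does to a line of top output, here it is isolated as a small function. sanitize_long_output is a hypothetical name, and the trailing-space pattern is written as '  *$' (a POSIX spelling that should behave like the wrapper's ' \+$', which relies on a GNU sed extension).

```shell
#!/bin/sh
# Sketch: the output filter from the wrapper script, isolated as a function.
# sanitize_long_output is a hypothetical name.
sanitize_long_output() {
    # 1) Replace every "|" so Nagios does not treat what follows as perfdata.
    # 2) Strip trailing spaces left over from top padding lines to COLUMNS.
    sed -e 's/|/<BAR>/g' -e 's/  *$//'
}

printf 'sh -c ps aux | grep httpd   \n' | sanitize_long_output
# prints: sh -c ps aux <BAR> grep httpd
```

Without the first substitution, a command line containing a pipe would cut the process list short in the notification email.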
Solution 3:
I prefer:
command[processes_snapshot]=top -cSbn 1 | head -14 | tail -8
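For reference, head -14 | tail -8 keeps lines 7 through 14 of the input. Assuming top's usual batch layout (summary lines, a blank line, then the column header on line 7, which matches the tail -n +8 used in Solution 1), that is the column header plus the first seven processes in top's default sort order. A quick way to check the window on numbered input, with window_7_to_14 as a hypothetical name:

```shell
#!/bin/sh
# Sketch: which lines "head -14 | tail -8" keeps, shown on numbered input.
# window_7_to_14 is a hypothetical name.
window_7_to_14() {
    head -14 | tail -8
}

seq 1 20 | window_7_to_14 | paste -sd ' ' -
# prints: 7 8 9 10 11 12 13 14
```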