Nagios plugin to take a process snapshot when load is high

Solution 1:

You can do it with event handlers.

First, add an event handler to your load average service definition:

define service{
    use                     generic-service
    host_name               xx
    service_description     Load_Average
    check_command           check_nrpe!check_load
    event_handler           processes_snapshot!xx
    contact_groups          admin-sms
}

The processes_snapshot command is defined in commands.cfg:

define command{
    command_name    processes_snapshot
    command_line    $USER1$/eventhandlers/processes_snapshot.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

Second, write the event handler script and save it as $USER1$/eventhandlers/processes_snapshot.sh, the path used in the command definition above:

#!/bin/bash
# Arguments: $1=$SERVICESTATE$ $2=$SERVICESTATETYPE$ $3=$SERVICEATTEMPT$ $4=$HOSTADDRESS$

case "$1" in
    WARNING|CRITICAL)
        # Ask NRPE on the affected host to take a process snapshot
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c processes_snapshot
        ;;
    OK|UNKNOWN)
        # Nothing to do
        ;;
esac

exit 0
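
Note that Nagios runs event handlers on soft state changes as well as hard ones, so the script above can fire during retries. The command definition already passes $SERVICESTATETYPE$ as the second argument, so one possible refinement is to act only on HARD states. A minimal sketch, assuming that is what you want (for short load spikes you may prefer the opposite); should_snapshot is a hypothetical helper name, not something Nagios provides:

```shell
#!/bin/bash
# Sketch only: should_snapshot is a hypothetical helper, not part of Nagios.
# $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$ (as passed by the command definition)
should_snapshot() {
    local state="$1" state_type="$2"
    case "$state" in
        WARNING|CRITICAL)
            # Succeed only once the problem is confirmed (HARD state)
            [ "$state_type" = "HARD" ]
            ;;
        *)
            return 1
            ;;
    esac
}

# Example: prints "snapshot" for a confirmed problem, "skip" for a soft one
if should_snapshot CRITICAL HARD; then echo "snapshot"; else echo "skip"; fi
if should_snapshot WARNING SOFT; then echo "snapshot"; else echo "skip"; fi
```

In the handler you would call check_nrpe only when should_snapshot "$1" "$2" succeeds.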

The command processes_snapshot is defined in nrpe.cfg on the xx host as follows:

command[processes_snapshot]=top -cSbn 1 | tail -n +8 | sort -rn -k11 | head > /tmp/proc_snap.txt
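
Because the command always writes to /tmp/proc_snap.txt, each snapshot overwrites the previous one. If you want to keep a history, one option is to build a timestamped file name. A rough sketch; snapshot_path, the proc_snap- prefix and the timestamp format are all my own assumptions:

```shell
#!/bin/sh
# Hypothetical helper: build a timestamped snapshot path so that successive
# snapshots do not overwrite each other. Directory and prefix are assumptions.
snapshot_path() {
    dir="${1:-/tmp}"
    stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
    printf '%s/proc_snap-%s.txt\n' "$dir" "$stamp"
}

# Example with a fixed timestamp so the result is predictable
snapshot_path /tmp 20240101-120000
# prints /tmp/proc_snap-20240101-120000.txt
```

The redirect would then become > "$(snapshot_path)" inside a small script on the monitored host, with the nrpe.cfg command pointing at that script instead of the raw pipeline.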

PS: I haven't tested this config.

Solution 2:

Here's what I did to get a process list snapshot directly in the notification emails, based on the idea by @quanta. It may contain paths specific to the way Nagios is installed on Debian/Ubuntu machines:

  1. Create a wrapper script /usr/local/sbin/check_load that calls the original and appends the process snapshot whenever it exits non-zero, i.e. with 1 (WARNING) or 2 (CRITICAL), and also 3 (UNKNOWN):

    #!/bin/sh
    /usr/lib/nagios/plugins/check_load "$@" || {
        rc=$?
        echo
        # http://nagios.sourceforge.net/docs/3_0/pluginapi.html
        # | separates long output from perfdata
        COLUMNS=1000 top -cSbn 1|sed -e 's/|/<BAR>/g' -e 's/ \+$//'
        exit $rc
    }
    

    This sets COLUMNS to a large number so the process names/command lines won't be truncated to 40 characters, runs top in batch mode for one iteration (-bn 1), and asks for full command lines (-c) and cumulative CPU times (-S) to be shown. It then replaces every | with <BAR>, because Nagios treats | as the separator between plugin output and perfdata and would otherwise cut top's output off at the first one.

    I find top's default sort order to be adequate -- re-sorting by cumulative CPU time, as suggested in @quanta's answer, puts system daemons like init or crond at the top, which doesn't help me figure out which CGI script was responsible for the CPU spike. Also, this way I get to keep top's header.

    Don't forget to chmod +x /usr/local/sbin/check_load

  2. Edit /etc/nagios-plugins/config/load.cfg and replace the check_load entry

    command_line    /usr/lib/nagios/plugins/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
    

    with

    command_line    /usr/local/sbin/check_load --warning='$ARG1$,$ARG2$,$ARG3$' --critical='$ARG4$,$ARG5$,$ARG6$'
    
  3. Edit /etc/nagios3/commands.cfg and update the notify-service-by-email entry so it includes $LONGSERVICEOUTPUT$ in the generated emails. It's too long to paste here; basically find the Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail bit and change it to Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail.

  4. Restart nagios: service nagios3 restart.
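
To see what the sed stage in the wrapper from step 1 does without triggering an alert, you can feed it a line by hand. A small sketch; sanitize_long_output is a hypothetical name for that stage, and the \+ in the second expression is a GNU sed extension, so this assumes GNU sed:

```shell
#!/bin/sh
# Sketch of the sanitising step used in the wrapper; sanitize_long_output is a
# hypothetical name. It replaces "|" (the perfdata separator) with "<BAR>" and
# strips trailing spaces. "\+" in a basic regex is a GNU sed extension.
sanitize_long_output() {
    sed -e 's/|/<BAR>/g' -e 's/ \+$//'
}

printf '%s\n' 'sh -c ps aux | grep cgi   ' | sanitize_long_output
# prints: sh -c ps aux <BAR> grep cgi
```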

I haven't tried this with NRPE.

Solution 3:

I prefer:

command[processes_snapshot]=top -cSbn 1 | head -14 | tail -8
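
The head -14 | tail -8 pair keeps lines 7 through 14 of top's output. Assuming top prints six summary lines followed by the column header on line 7 (the same layout Solution 1 relies on with tail -n +8, but worth checking on your system), that is the header plus the first seven processes. A quick way to check which lines survive, using numbered lines instead of real top output; select_lines is a hypothetical stand-in for that stage:

```shell
#!/bin/sh
# select_lines is a hypothetical stand-in for the "head -14 | tail -8" stage.
select_lines() {
    head -14 | tail -8
}

# Feed numbered lines instead of real top output to see which ones survive
seq 1 20 | select_lines | tr '\n' ' '
# prints: 7 8 9 10 11 12 13 14
```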