How does an administrator generalize alerting when an event doesn't happen?

Often my users require me to be just as responsible for knowing if an event hasn't happened.

I've always had to build custom and brittle solutions with cron'ed shell scripts and lots of date edge case testing.

Centralized logging ought to allow for a better, more maintainable way to get a grip on what didn't happen in the last N hours. Something like logstash noticing and nagios alerting.

Update

toppledwagon's answer was so incredibly helpful . o O (Light. Bulb.) that I now have a dozen batch jobs being freshness checked. I wanted to do his thorough answer justice and follow up with how I've implemented his ideas.

I configured jenkins to emit syslogs; logstash catches them and sends status updates to nagios via nsca. I also use check_mk to keep everything DRY and organized in nagios.

Logstash filter

:::ruby
filter {
  if [type] == "syslog" {
    grok {
      match => [ "message", '%{SYSLOGBASE} job="%{DATA:job}"(?: repo="%{DATA:repo}")?$',
                 "message", "%{SYSLOGLINE}" ]
      break_on_match => true
    }
    date { match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ] }
  }
}

The magic is in that double pair of patterns in grok's match parameter along with break_on_match => true. Logstash will try each pattern in turn until one of them matches.
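To see the first-match-wins behaviour outside logstash, here is a minimal Python sketch. The regular expressions are simplified stand-ins for %{SYSLOGBASE} and %{SYSLOGLINE}, not the real grok definitions, and the function name parse_syslog and the sample log lines are made up for illustration.

```python
import re

# Simplified stand-in for %{SYSLOGBASE}; the real grok definition is more
# permissive than this.
SYSLOG_BASE = (
    r'(?P<timestamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) '
    r'(?P<logsource>\S+) '
    r'(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:'
)

PATTERNS = [
    # Specific pattern first: a line carrying job="..." and optionally repo="...".
    re.compile(SYSLOG_BASE + r' job="(?P<job>.*?)"(?: repo="(?P<repo>.*?)")?$'),
    # Generic fallback: any syslog line, with the remainder kept as the message.
    re.compile(SYSLOG_BASE + r' (?P<message>.*)$'),
]

def parse_syslog(line):
    """Return the fields of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Drop optional groups that did not take part in the match.
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None

job_line = 'Mar  3 04:05:06 build01 jenkins: job="Install on Cluster" repo="atlanta"'
plain_line = 'Mar  3 04:05:07 build01 sshd[1234]: Accepted publickey for root'

print(parse_syslog(job_line))
print(parse_syslog(plain_line))
```

Only the first line ends up with job and repo fields; the second falls through to the generic pattern, which is why the output section can safely test [job] without worrying about ordinary syslog traffic.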

Logstash output

We use the logstash nagios_nsca output plugin to let icinga know we saw the jenkins job in syslog.

:::ruby
output {
  if [type] == "syslog"
    and [program] == "jenkins"
    and [job] == "Install on Cluster"
    and "_grokparsefailure" not in [tags] {
      nagios_nsca {
        host => "icinga.example.com"
        port => 5667
        send_nsca_config => "/etc/send_nsca.cfg"
        message_format => "%{job} %{repo}"
        nagios_host => "jenkins"
        nagios_service => "deployed %{repo}"
        nagios_status => "2"
      }
   } # if type=syslog, program=jenkins, job="Install on Cluster"
} # output

icinga (nagios)

Finally, we have arrived at icinga (nagios) by way of nsca. Now we need a passive service check defined for each and every job whose absence we want to notice. That can be a lot of jobs, so let's use check_mk to transform python lists of jobs into nagios object definitions.

check_mk is cool like that.

/etc/check_mk/conf.d/freshness.mk

# check_mk requires local variables be prefixed with '_'

_dailies = [ 'newyork' ]
_day_stale = int(86400 * 1.5)  # 129600; int() keeps the threshold a whole number of seconds

_weeklies = [ 'atlanta', 'denver', ]
_week_stale = 86400 * 8

_monthlies = [ 'stlouis' ]
_month_stale = 86400 * 32

_service_opts = [
    ("active_checks_enabled", "0"),
    ("passive_checks_enabled", "1"),
    ("check_freshness", "1"),
    ("notification_period", "workhours"),
    ("contacts", "root"),
    ("check_period", "workhours"),
]

# Define a new command 'check-periodicaly' that sets the service to UNKNOWN (check_dummy exit code 3).
# Nagios runs it once a service's freshness_threshold seconds have passed since the service last checked in.

extra_nagios_conf += """
  define command {
    command_name check-periodicaly
    command_line $USER1$/check_dummy 3 $ARG1$
  }

  """
# Loop through all passive checks and assign the new 'check-periodicaly' command to them.

for _repo in _dailies + _weeklies + _monthlies:
    _service_name = 'deployed %s' % _repo
    legacy_checks += [(('check-periodicaly', _service_name, False), ['lead'])]


# Look before you leap - python needs the list defined before appending to it.
# We can't assume it already exists, and we must not overwrite it if it was defined earlier.

if "freshness_threshold" not in extra_service_conf:
    extra_service_conf["freshness_threshold"] = []

# Some check_mk wizardry to set when the check has passed its expiration date.
# Results in (691200, ALL_HOSTS, ['deployed atlanta', 'deployed denver']) for weeklies, etc.

extra_service_conf["freshness_threshold"] += [
    (_day_stale,   ALL_HOSTS, ["deployed %s" % _x for _x in _dailies]),
    (_week_stale,  ALL_HOSTS, ["deployed %s" % _x for _x in _weeklies]),
    (_month_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _monthlies]),
]

# Now we assign all the other nagios directives listed in _service_opts

for _k,_v in _service_opts:
    if _k not in extra_service_conf:
        extra_service_conf[_k] = []

    extra_service_conf[_k] += [(_v, ALL_HOSTS, ["deployed "]) ]
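As a sanity check on the numbers, the following standalone Python sketch rebuilds the freshness_threshold entries outside check_mk. ALL_HOSTS is replaced here by a placeholder string and the helper name freshness_rules is hypothetical; inside a real .mk file check_mk supplies ALL_HOSTS itself.

```python
# Placeholder for check_mk's ALL_HOSTS constant; only needed to run this
# sketch outside of check_mk.
ALL_HOSTS = "ALL_HOSTS"

DAY = 86400  # seconds in a day

# Threshold in seconds -> repos expected to deploy within that window.
schedules = {
    int(DAY * 1.5): ["newyork"],
    DAY * 8: ["atlanta", "denver"],
    DAY * 32: ["stlouis"],
}

def freshness_rules(schedules):
    """Build (threshold, hosts, service names) tuples, one per schedule."""
    return [
        (threshold, ALL_HOSTS, ["deployed %s" % repo for repo in repos])
        for threshold, repos in schedules.items()
    ]

for rule in freshness_rules(schedules):
    print(rule)
```

Each threshold is deliberately a bit longer than the job's period (a day and a half, eight days, thirty-two days) so that a job running slightly late does not trip the alert.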

I set up passive checks in nagios for various events. Then at the end of the event the passive check result is sent to nagios (either via a wrapper script or built into the event itself). If the passive check hasn't been received in freshness_threshold seconds, nagios will run check_command locally. check_command is set up as a simple shell script which returns critical along with the service description.

I don't have code examples handy, but I could add some if interest is shown.

EDIT ONE added code examples:

This assumes that you have done the basic setup for NSCA and send_nsca: make sure the password and the encryption method are the same on both sides (encryption_method in send_nsca.cfg on the client, decryption_method in nsca.cfg on the nagios server), then start the nsca daemon on the nagios server.

First we define a template that other passive checks can use. This goes into services.cfg.

define service {
    name                    standard-passive-service-template
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    max_check_attempts      1
    normal_check_interval   10
    retry_check_interval    5
    contact_groups          sysadmins
    notification_interval   0
    notification_options    w,u,c,r
    notification_period     24x7
    check_period            24x7
    check_command           check_failed!$SERVICEDESC$
    register                0
}

This says that if a passive check result hasn't come in, run check_failed with $SERVICEDESC$ as an argument. Let's define the check_failed command in commands.cfg.

define command {
    command_name     check_failed
    command_line     /usr/lib/nagios/plugins/check_failed $ARG1$
}

Here is the /usr/lib/nagios/plugins/check_failed script.

#!/bin/bash
/bin/echo "No update from $*. Is NSCA running?"
exit 2

Having an exit of 2 makes this service critical according to nagios (see below for all nagios service states.) Sourcing /usr/lib/nagios/plugins/utils.sh is another way, then you could exit $STATE_CRITICAL. But the above works even if you don't have that.

This gives the added notice of "Is NSCA running" because it might be the case that the service didn't check in properly OR it might be the case that NSCA has failed. This is more common than one might think. If multiple passive checks go critical at once, check for a problem with NSCA.
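For comparison, the same plugin logic could be sketched in Python. This is an assumed equivalent, not part of the setup described here; the constant names mirror the STATE_* values that utils.sh exports, and the function names are made up for illustration.

```python
import sys

# Nagios plugin exit codes (the same values utils.sh exports as STATE_*).
STATE_OK, STATE_WARNING, STATE_CRITICAL, STATE_UNKNOWN = 0, 1, 2, 3

def check_failed(args):
    """Return (exit_code, status_line) for a passive check that went stale."""
    description = " ".join(args)
    return STATE_CRITICAL, "No update from %s. Is NSCA running?" % description

def main(argv):
    """Print the status line and return the exit code nagios should see."""
    code, message = check_failed(argv[1:])
    print(message)
    return code

# Installed as a plugin, the script would end with: sys.exit(main(sys.argv))
print(main(["check_failed", "raidcheck"]))
```

The first line of output is what nagios shows as the status info; the exit code is what decides the service state.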

Now we need a passive check to accept the results. In this example I have a specially crafted cron job that knows about all of the different types of raid controllers in our environment. When it runs it submits a result to this passive check. I don't want to be woken up in the middle of the night for this one (edit notification_period as needed).

define service {
    use                     standard-passive-service-template
    hostgroup_name          all
    service_description     raidcheck
    notification_period     daytime
    flap_detection_enabled  1
    freshness_threshold     7500 ; 125 minutes
    notification_options    c
    is_volatile             0
    servicegroups           raidcheck
}

Now there's the cronjob that sends info back to the nagios server. Here's the line in /etc/cron.d/raidcheck:

0 * * * *  root  /usr/local/bin/raidcheck --cron | /usr/sbin/send_nsca -H nagios -to 1000 >> /dev/null 2>&1

See man send_nsca for options, but the important parts are that 'nagios' is the name of my nagios server and that the string printed at the end of the raidcheck script is what send_nsca reads. send_nsca expects a line on stdin of the form (perl here):

print "$hostname\t$check\t$state\t$status_info\n";

$hostname is obvious, $check in this case is 'raidcheck', $state is the nagios service state (0 = OK, 1 = warning, 2 = critical, 3 = unknown, 4 = dependent), and $status_info is an optional message to send as the status info.
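If the reporting script is not perl, the same line is easy to build elsewhere. Here is a minimal Python sketch; the helper name nsca_line, the state names in STATES, and the host web01 are assumptions for illustration, and only the four common states are covered.

```python
# Service states accepted in the third field of a send_nsca line.
STATES = {"ok": 0, "warning": 1, "critical": 2, "unknown": 3}

def nsca_line(hostname, check, state, status_info=""):
    """Build one tab-separated passive check result for send_nsca's stdin."""
    if state not in STATES:
        raise ValueError("unknown state: %r" % state)
    # Tabs and newlines delimit fields and records, so keep them out of the text.
    clean_info = status_info.replace("\t", " ").replace("\n", " ")
    return "%s\t%s\t%d\t%s\n" % (hostname, check, STATES[state], clean_info)

print(repr(nsca_line("web01", "raidcheck", "ok", "all arrays optimal")))
```

The hostname and check name have to match the host_name and service_description nagios knows about, otherwise the result is dropped on the server side.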

Now we can test the check on the command line of the client:

echo -e "$HOSTNAME\traidcheck\t2\tUh oh, raid degraded (just kidding..)" | /usr/sbin/send_nsca -H nagios

This gives us a nagios passive check that expects to be updated every freshness_threshold seconds. If the check isn't updated, check_command (check_failed in this case) is run. The example above is for a nagios 2.X install, but will likely work (maybe with minor modification) for nagios 3.X.
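The freshness rule itself boils down to a timestamp comparison. The sketch below is a simplified model of it in Python, not nagios's actual scheduler code, which applies additional rules that are ignored here; the function name is_stale and the sample timestamps are made up.

```python
import time

def is_stale(last_result_time, freshness_threshold, now=None):
    """True when no passive result arrived within freshness_threshold seconds."""
    if now is None:
        now = time.time()
    return (now - last_result_time) > freshness_threshold

# raidcheck reports hourly and is allowed 7500 seconds (125 minutes).
last_seen = 1_000_000
print(is_stale(last_seen, 7500, now=last_seen + 3600))   # next report just due
print(is_stale(last_seen, 7500, now=last_seen + 7501))   # two reports missed
```

With an hourly cron job, a threshold of 7500 seconds means one missed report is tolerated and the second one triggers check_failed.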