Nagios/Icinga: Don't show CRITICAL for DRBD partitions on standby node

I would advise not monitoring this on the host directly. In our environment we use Pacemaker to automate failovers. One of the things Pacemaker does for us is move an IP address on failover, which ensures our clients always point at the primary and helps make failovers transparent from the client side.
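
As a rough sketch, the floating IP can be declared as a Pacemaker resource; the address, netmask and interface below are placeholders for illustration, not our actual values:

    # floating IP that always follows the primary node
    pcs resource create vip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 nic=eth0 \
        op monitor interval=30s

Clients connect to 192.0.2.10, so whichever node currently holds the resource is the one they reach.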

For Nagios we monitor a slew of services on each host to keep an eye on things, but we also configure an additional "host" for the virtual/floating IP address to monitor the DRBD devices and the services that run only on the primary.
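
A minimal sketch of that extra "host" on the Nagios side; the host name, address and check command are illustrative assumptions:

    define host {
        use        linux-server
        host_name  drbd-vip
        alias      DRBD primary (floating IP)
        address    192.0.2.10
    }

    define service {
        use                  generic-service
        host_name            drbd-vip
        service_description  DRBD status
        check_command        check_nrpe!check_drbd
    }

Because the address follows the primary, these checks only ever run against the active node, so the standby node's Secondary state never raises a false CRITICAL.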


In my environment, we manage multiple services running on top of drbd devices (traditional, lxc containers, docker containers, databases, ...). We use the opensvc stack (https://www.opensvc.com), which is free and open source and provides automatic failover features. Below is a test service with drbd and a redis application (disabled in the example).
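
The full configuration file is linked at the end of this answer; as an abbreviated sketch (section names and keywords quoted from memory, so treat them as assumptions rather than a verified reference), such a service looks roughly like:

    [DEFAULT]
    nodes = node-1-1 node-1-2

    [disk#1]
    # drbd resource backing the service; must stay up on the standby node too
    type = drbd
    res = servdrbd
    standby = true

    [fs#1]
    type = ext4
    dev = /dev/drbd0
    mnt = /srv/servdrbd

    [app#1]
    # redis launcher, disabled in this example
    script = /usr/local/bin/redis-launcher
    disable = true

The standby = true flag is what makes a resource stay up on the secondary node, which is how the "standby up" states described below are produced.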

First, at the cluster level, we can see in the svcmon output that:

  • it is a 2-node opensvc cluster (node-1-1 and node-1-2)
  • service servdrbd is up (uppercase green O) on node-1-1, and standby (lowercase green o) on node-1-2
  • node-1-1 is the preferred master node for this service (the circumflex accent next to the uppercase O)

At the service level (svcmgr -s servdrbd print status), we can see:

  • on the primary node (on the left): all resources are up (or standby up, meaning they must remain up even when the service runs on the other node), and the drbd device is reported as Primary
  • on the secondary node (on the right): only the standby resources are up, and the drbd device is in the Secondary state

To simulate an issue, I disconnected the drbd device on the secondary node, which produced warnings in the service status.

It is important to see that the service availability status is still up, but the overall service status is degraded to warn, meaning "production is still running fine, but something is wrong; have a look".
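
For reference, the disconnection can be reproduced with DRBD's own CLI (using the resource name assumed in the sketch above):

    # on the secondary node: cut the replication link
    drbdadm disconnect servdrbd
    # and re-establish it once the test is done
    drbdadm connect servdrbd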

Since all opensvc commands accept the JSON output selector (nodemgr daemon status --format json or svcmgr -s servdrbd print status --format json), it is easy to plug them into an NRPE script and simply monitor the service states, as sketched below. And as you saw, any issue on the primary or the secondary is caught.
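
A minimal sketch of such an NRPE plugin, assuming the status JSON exposes the avail and overall fields discussed above (the key names are an assumption) and that jq is installed:

    #!/bin/sh
    # check_opensvc_svc: map an opensvc service status to Nagios exit codes
    SVC="${1:-servdrbd}"
    STATUS="$(svcmgr -s "$SVC" print status --format json)" || exit 3

    AVAIL="$(printf '%s' "$STATUS" | jq -r '.avail')"
    OVERALL="$(printf '%s' "$STATUS" | jq -r '.overall')"

    if [ "$AVAIL" != "up" ]; then
        echo "CRITICAL - $SVC avail=$AVAIL overall=$OVERALL"
        exit 2
    elif [ "$OVERALL" = "warn" ]; then
        echo "WARNING - $SVC avail=$AVAIL overall=$OVERALL"
        exit 1
    fi
    echo "OK - $SVC avail=$AVAIL overall=$OVERALL"
    exit 0

Registered as a command in nrpe.cfg, it gives Nagios a single per-service check that catches problems on either node.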

The nodemgr daemon status command is better because its output is the same on all cluster nodes, and the information for all opensvc services is displayed in a single command call.

If you are interested in the service configuration file for this setup, I posted it on pastebin here.