How do you monitor a monitoring server?

So we run Groundworks (with Nagios) on CentOS to monitor our various servers and processes. I have it set up to automatically send emails and SMS texts when things reach a WARNING or CRITICAL state. Normally this works perfectly. However, twice we've had problems on that server where Postfix stopped sending email. The most recent outage lasted 4 days because none of us noticed.

That leads me to an important question: how am I supposed to monitor my monitoring server?


Solution 1:

With a second monitoring server, of course. The second one can be much simpler, since all it needs to do is monitor the first. In turn, it should be monitored by the main monitoring system.

If your group is part of a larger organization with separate IT infrastructures, you may be able to make arrangements for another group's monitoring service to watch yours.

You could also make sure the server sends an "it's okay" message every day, and get in the habit of looking for it. (That's only effective if you're not already overwhelmed with routine messages, of course.)
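The weak point of a daily "it's okay" message is that a person has to notice when it is missing. As a rough sketch, a second machine could do that noticing instead. The function name, the one-day interval, and the two-hour grace period below are illustrative assumptions, not part of any existing tool:

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_seen, now,
                      expected_interval=timedelta(days=1),
                      grace=timedelta(hours=2)):
    """Return True if no heartbeat arrived within interval + grace.

    last_seen: when the last "it's okay" message was received.
    now:       the current time (passed in so the check is easy to test).
    The interval and grace defaults are assumptions; tune them to
    however often your server actually sends its message.
    """
    return now - last_seen > expected_interval + grace
```

The watcher would record the arrival time of each message and raise an alert through some channel other than the monitored server's own mail whenever this returns True.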

Solution 2:

Obviously your Postfix should be monitored too, but that's another topic ;)
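One possible way to watch Postfix is a Nagios-style check on the mail queue, on the assumption that a Postfix that has stopped sending will show a growing queue. The sketch below parses the summary that `postqueue -p` prints; the function names and the thresholds (20 and 100 messages) are made up for illustration:

```python
import re

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def queued_messages(postqueue_output):
    """Return the queued message count from `postqueue -p` output.

    Returns None if the summary line cannot be found.
    """
    if "Mail queue is empty" in postqueue_output:
        return 0
    # Summary line looks like: "-- 12 Kbytes in 3 Requests."
    match = re.search(r"in (\d+) Requests?\.", postqueue_output)
    return int(match.group(1)) if match else None


def queue_status(count, warn=20, crit=100):
    """Map a queue size to a Nagios status (thresholds are assumptions)."""
    if count is None:
        return UNKNOWN
    if count >= crit:
        return CRITICAL
    if count >= warn:
        return WARNING
    return OK
```

A wrapper script would run `postqueue -p`, pass its output through these two functions, and exit with the resulting code. Since the alert cannot rely on the very mail system it is checking, this is most useful when the notification goes out some other way, or when the check is run from a second host.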

I use the Nagios Checker plugin for Firefox; it is always running in the status bar on any computer I use regularly.

In addition, I have a custom script on an outside host that pings the Nagios host and sends an SMS if it's not responding to pings.
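A script like that might look roughly like the sketch below. It is not the actual script: the function names are invented, the `ping` flags assume a Linux host, and `alert` stands in for whatever sends the SMS in your setup.

```python
import subprocess


def ping_ok(host, timeout_s=2):
    """Return True if a single ICMP echo to host is answered.

    Uses the Linux ping flags -c (count) and -W (timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_once(host, failures, threshold, probe=ping_ok, alert=print):
    """Probe host once and return the new consecutive-failure count.

    alert(message) is called only when the count first reaches the
    threshold, so a long outage produces one message, not a flood.
    """
    if probe(host):
        return 0
    failures += 1
    if failures == threshold:
        alert(f"{host} has not answered {threshold} consecutive pings")
    return failures
```

Run `check_once` from cron or a loop, carrying the failure count from one run to the next. Keep in mind that a ping only shows the host is up; it would not have caught a stalled Postfix on its own.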

So far (5+ years) it has worked OK (knock on wood).

Solution 3:

For monitoring the monitoring server (Nagios in our case), the free or basic plan of Pingdom or AlertFox works great.

Solution 4:

First: let it send "I am alive" messages once or twice a day. Second: I run an old machine just for this purpose, which has another GSM modem, a small UPS, etc., and a dedicated (direct) connection to the primary monitoring server. That machine also helps with the third point: make sure you check the status of your monitoring systems regularly. The small auxiliary monitoring system displays the status page of the primary system in my office all the time.