Geographically distributed, fault-tolerant and "intelligent" application/host monitoring systems

Solution 1:

not an answer really, but some pointers:

  • definitely take a look at the presentation about nagios @ goldman sachs. they faced the problems you mention - redundancy, scalability to thousands of hosts, and automated configuration generation.

  • i had a redundant nagios setup, but at a much smaller scale - 80 servers, ~1k services in total. one dedicated master server, and one slave server pulling configuration from the master a few times a day. both servers monitored the same machines, and they cross-checked each other's health. i used nagios mostly as a framework for invoking custom, product-specific checks [ a bunch of cron jobs executing scripts doing 'artificial flow controls'; results were logged to sql, and nrpe plugins were checking for successful / failed executions of those in the last x minutes ]. it all worked very nicely.

  • your quorum logic sounds good - a bit similar to my 'artificial flows' - so basically go on and implement it yourself ;-]. then have nrpe just check some kind of flag [ or an sql db with timestamp + status ] to see how things are doing.

  • you'll probably want to build some hierarchy to scale - you'll have some nodes that gather an overview of other nodes; do look at the presentation from the first point. nagios' default of forking for every single check is too much overhead with a high number of monitored services.
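
to make the 'flag / sql db with timestamp + status' idea a bit more concrete, here is a minimal sketch of such a check. it is not my original plugin: the table name flow_results, its columns and the function name check_flow are made up for illustration, and sqlite3 is used only so the example is self-contained. the exit codes follow the usual nagios plugin convention [ 0 ok, 1 warning, 2 critical, 3 unknown ].

```python
import sqlite3
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_flow(conn, flow_name, max_age_seconds, now=None):
    """Return (exit_code, message) for the most recent run of one flow.

    Assumes a hypothetical table:
        flow_results(flow TEXT, status TEXT, finished_at REAL)
    where finished_at is a Unix timestamp written by the cron job.
    """
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT status, finished_at FROM flow_results "
        "WHERE flow = ? ORDER BY finished_at DESC LIMIT 1",
        (flow_name,),
    ).fetchone()

    if row is None:
        return UNKNOWN, f"UNKNOWN - no results recorded for {flow_name}"

    status, finished_at = row
    age = now - finished_at
    if age > max_age_seconds:
        # The cron job itself may be dead: a stale result is a failure too.
        return CRITICAL, f"CRITICAL - last {flow_name} result is {age:.0f}s old"
    if status != "ok":
        return CRITICAL, f"CRITICAL - last {flow_name} run reported {status}"
    return OK, f"OK - {flow_name} succeeded {age:.0f}s ago"
```

a real plugin would print the message and call sys.exit() with the code, and nrpe would simply run that script. treating a stale timestamp as critical matters here, because otherwise a dead cron job looks like a healthy one.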

to answer some questions:

  • in my case the monitored environment was a typical master-slave setup [ primary sql or app server + hot standby ], no master-master.
  • my setup involved a 'human filtering factor' - a resolver group that was a 'backup' for sms notifications. there was already a paid group of technicians who, for other reasons, worked 24/5 shifts; they got 'checking nagios mails' as an additional task that did not put too much load on them. and they were in charge of making sure that db-admins / it-ops / app-admins were actually getting up and fixing problems ;-]
  • i've heard lots of good things about zabbix - for alerting and plotting trends - but never used it. for me munin does the trick; i hacked together a simple nagios plugin that checks whether there is 'any red' [ critical ] color on the munin list of servers - just an additional check. you can also read values from munin's rrd files to reduce the number of queries you send to the monitored machines.
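
the 'any red' check can be sketched roughly like this. again, this is not the plugin i actually ran: the css class name "crit" is an assumption about how your munin overview page marks critical items, so check your own generated html and pass whatever marker it uses. the names ClassCounter and check_overview are made up for the example.

```python
from html.parser import HTMLParser

# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value and self.css_class in value.split():
                self.hits += 1


def check_overview(html_text, critical_class="crit"):
    """Return (exit_code, message) for an already fetched overview page.

    critical_class is an assumed marker, not a documented munin guarantee.
    """
    parser = ClassCounter(critical_class)
    parser.feed(html_text)
    if parser.hits:
        return CRITICAL, f"CRITICAL - {parser.hits} element(s) marked '{critical_class}'"
    return OK, "OK - nothing marked critical on the overview page"
```

fetching the page [ from disk or over http ] is left out on purpose; the function only needs the html as a string.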

Solution 2:

What you are asking for sounds a lot like what Shinken has done for Nagios.

Shinken is a Nagios rewrite.

  • Modern language (Python)
  • Modern distributed programming framework (Pyro)
  • Monitoring Realms (multi-tenancy), HA, spares
  • Livestatus API
  • Nagios plugin compatible
  • Native NRPE execution
  • Business criticality of objects
  • Business rules can be applied to the state of objects (managing cluster or pool availability)
  • Graphing can use Graphite or RRDtool-based PNP4Nagios
  • Stable and being deployed in large environments
  • Big deployments can consider pairing it with Splunk for reporting or look into Graphite where RRDtool is not a good fit.
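
To illustrate what a business rule over the state of objects means for a cluster or pool, here is a minimal Python sketch. It is not Shinken's rule syntax or its implementation, only the underlying idea: the pool counts as available while at least a minimum number of members are OK. The function name `pool_state` and the choice to report WARNING for a degraded but still available pool are assumptions made for this example.

```python
# Standard Nagios plugin states.
OK, WARNING, CRITICAL = 0, 1, 2


def pool_state(member_states, min_ok):
    """Aggregate member check states into one state for the whole pool.

    member_states: iterable of Nagios states, one per pool member.
    min_ok: how many members must be OK for the pool to count as available.
    """
    states = list(member_states)
    ok_count = sum(1 for state in states if state == OK)

    if ok_count < min_ok:
        # Too few healthy members: the service as a whole is down.
        return CRITICAL
    if ok_count < len(states):
        # Still available, but running without full redundancy.
        return WARNING
    return OK
```

With three web servers and `min_ok=2`, one failed member yields WARNING and two failed members yield CRITICAL, so a single broken node does not have to wake anyone up.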

This should be food for thought.

Cheers