Choosing a monitoring system for a dynamically scaling environment: Nagios v. Zabbix [closed]

When operating in the cloud and scaling boxes automatically, there are certain monitoring issues that one experiences. Sometimes we might be monitoring 10 boxes and sometimes 100. The machines will scale up and down based on demand.

Right now, I think the best approach is to choose a monitoring solution that allows instantiation of targets via calls to an API. But is this really the best? I like the idea of dynamic discovery, but that is also a problem in the cloud, seeing that the targets are not all in the same subnet.

What monitoring solutions allow for a scaling environment like this? Zabbix currently has a draft API, but I have been unable to find a similar API for Nagios. Is there one?

Does anyone have any alternative suggestions besides Nagios and Zabbix?


Solution 1:

Farmville, which claims to be adding hundreds of servers a week, uses Puppet, Nagios, and Munin to handle their scalable monitoring system. They probably use the Puppet facts to populate Nagios config files or to set up NRPE. With that many servers, a config management tool like Puppet is practically a requirement.

A couple of examples found by searching for "puppet nagios":

http://blog.gurski.org/index.php/2010/01/28/automatic-monitoring-with-puppet-and-nagios/

http://projects.puppetlabs.com/projects/puppet/wiki/Nagios_Patterns

https://github.com/DavidS/puppet-nagios

Solution 2:

Use Zabbix. Their upcoming 2.0 release has a lot of new features for things like this. The current version, 1.8, has auto-registration.

The New Features doc talks about this feature:

4.2.2 Auto registration for active agents

Completely new in Zabbix 1.8, it is possible to allow active Zabbix agent auto-registration, after which server can start monitoring them. This allows to add new hosts for monitoring without any manual server configuration for each individual host.

The feature might be very handy for automatic monitoring of new Cloud nodes. As soon as you have a new node in the Cloud Zabbix will automatically start collection of performance and availability data of the host.

Solution 3:

No suggestions, but your logic is sound: in dynamic environments like the one you describe, when a host comes up it needs to register with anything that needs to know about its existence (e.g. the monitoring system), and when it gets shut down it needs to un-register with things that need to know it's going away.
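A minimal sketch of that register/un-register step, written as a periodic reconciliation so that hosts which die without un-registering still get cleaned up. It assumes you can get a list of running instances from your cloud provider and a list of monitored hosts from your monitoring system; `register` and `unregister` are hypothetical callbacks you would supply, not part of any real API.

```python
def plan_changes(running, monitored):
    """Compare running hosts with monitored hosts.

    Returns (to_register, to_unregister) as sorted lists.
    """
    running = set(running)
    monitored = set(monitored)
    return sorted(running - monitored), sorted(monitored - running)


def reconcile(running, monitored, register, unregister):
    """Call the supplied callbacks for every host that needs a change.

    `register` and `unregister` are hypothetical hooks: they might write
    a config file, call a monitoring API, or do anything else you need.
    """
    to_add, to_drop = plan_changes(running, monitored)
    for host in to_add:
        register(host)
    for host in to_drop:
        unregister(host)
    return to_add, to_drop


if __name__ == "__main__":
    # Illustrative host names only.
    reconcile(
        running=["web01", "web02", "web03"],
        monitored=["web02", "web04"],
        register=lambda h: print("register", h),
        unregister=lambda h: print("un-register", h),
    )
```

Run from cron or a boot/shutdown hook, this registers web01 and web03 and un-registers web04 in the example above.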

The question I would ask is: do you need to monitor your "workhorse" servers? If they're compute nodes or similar, and you know their configuration is stable and will "just work" when they get spun up, then monitoring the cloud itself (how many instances are running) may be just as good as tracking the individual machines, assuming your cloud provider lets you access such statistics easily.

Solution 4:

If you set up Nagios to load directories of configuration files using "cfg_dir", you can simply add or remove a cfg file when a node is added or removed, and restart Nagios. There is no real need for an API; it can be set up with a few small shell scripts and SSH with key files.
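Here is a rough sketch of the add/remove step, in Python rather than shell. It assumes one cfg file per node inside a directory that your nagios.cfg already lists under "cfg_dir", and a host template called `generic-host` (the name used in the Nagios sample configs; yours may differ). The function names are made up for illustration.

```python
import os
import re

# Assumption: a host template with this name exists in your Nagios config.
HOST_TEMPLATE = "generic-host"

# Only allow names that are safe to use as a file name.
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def render_host_cfg(name, address):
    """Return a Nagios host object definition for one node."""
    return (
        "define host {\n"
        f"    use        {HOST_TEMPLATE}\n"
        f"    host_name  {name}\n"
        f"    alias      {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def cfg_path(cfg_dir, name):
    """Build the path of the per-node cfg file, rejecting unsafe names."""
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError(f"unsafe host name: {name!r}")
    return os.path.join(cfg_dir, name + ".cfg")


def add_host(cfg_dir, name, address):
    """Write <name>.cfg into cfg_dir and return its path."""
    path = cfg_path(cfg_dir, name)
    tmp = path + ".tmp"  # not a .cfg file, so Nagios ignores it
    with open(tmp, "w") as fh:
        fh.write(render_host_cfg(name, address))
    os.replace(tmp, path)  # rename into place once fully written
    return path


def remove_host(cfg_dir, name):
    """Delete <name>.cfg; return False if it was already gone."""
    try:
        os.remove(cfg_path(cfg_dir, name))
        return True
    except FileNotFoundError:
        return False
```

After calling `add_host` or `remove_host`, you would still check the configuration (for example with `nagios -v` against your main config file) and then restart or reload Nagios, as described above.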

I have no experience with Zabbix, but I can recommend Nagios since it is pretty easy to configure, run, and customize.

Solution 5:

For the Zabbix API, there's a command-line tool, zabcon (http://trac.red-tux.net/wiki/zbx_api/interactive). It's not fully functional yet, but it should support some basic host and item operations; maybe you can work from that.