Nagios graphing solutions vs Munin/Cacti/Ganglia

I've got a nagios server set up for monitoring about 30 Windows servers. I want to add some trending charts. I've read that the nagios graphing plugins are simple and that many people use separate, standalone charting/trending tools.

What are the restrictions of the nagios graphing plugins vs standalone products like ganglia/munin/cacti?

I'm interested in specific features and advantages that standalone packages offer and nagios graphing plugins don't.


Solution 1:

given that you already have a nagios installation, consider nagiosgraph or pnp4nagios.

nagiosgraph and pnp4nagios do a pretty nice job of plotting nagios performance data. nagiosgraph has a parameter-based approach to configuration; pnp4nagios has a template-based approach.

  • both automatically detect new hosts/services whenever the nagios configuration changes
  • both do graph zooming
  • both provide graphs when you mouseover specific hosts/services
  • both provide many ways to slice and dice your data
  • both detect and graph the critical and warning levels you have already defined in nagios
  • both can be embedded directly into the nagios frame for seamless, uncluttered navigation from current status to history and back
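
to make the "performance data" part concrete: both tools graph whatever the nagios plugins emit after the '|' in their output, in the form label=value[UOM];warn;crit;min;max. here is a minimal python sketch of reading that format. it is not how nagiosgraph or pnp4nagios do it internally; the function name and the sample string are made up for illustration, and thresholds are kept as strings because real plugins may use range syntax.

```python
import re

# One metric looks like: label=value[UOM];[warn];[crit];[min];[max]
# Labels containing spaces are wrapped in single quotes.
_PERF_RE = re.compile(
    r"(?:'(?P<qlabel>[^']+)'|(?P<label>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s\d]*)"
    r"(?:;(?P<warn>[^;\s]*))?(?:;(?P<crit>[^;\s]*))?"
    r"(?:;(?P<min>[^;\s]*))?(?:;(?P<max>[^;\s]*))?"
)


def parse_perfdata(text):
    """Return one dict per metric found in a perfdata string (hypothetical helper)."""
    metrics = []
    for m in _PERF_RE.finditer(text):
        metrics.append({
            "label": m.group("qlabel") or m.group("label"),
            "value": float(m.group("value")),
            "uom": m.group("uom"),
            # Thresholds stay as strings (or None when absent or empty).
            "warn": m.group("warn") or None,
            "crit": m.group("crit") or None,
        })
    return metrics


if __name__ == "__main__":
    sample = "load1=0.42;5.0;10.0;0; 'C: used'=71%;80;90;0;100"
    for metric in parse_perfdata(sample):
        print(metric)
```

the warn and crit fields are what let the graphing tools draw your existing thresholds on the chart without any extra configuration.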

slicing and dicing the data are pretty important, imho. for example, you can view all services on a single host, or view all hosts with a specific service, or view arbitrary collections of graphs for arbitrary hosts and services.

installation is not trivial, but not difficult. a lot depends on how much you want to customize things. for example, nagiosgraph is 'install.pl' or 'rpm -i nagiosgraph.rpm' or 'dpkg -i nagiosgraph.deb'. pnp4nagios is './configure; make; make install'.

n2rrd can do some of these things as well, but it is not as polished and requires more work to configure.

rrdtool has quirks wrt data storage, and any system will have sampling issues. rrdtool does some data smoothing by default, but you can capture (and graph) maximums and/or minimums in addition to averages if necessary.
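
a small illustration of why the smoothing matters. rrdtool archives consolidate several raw samples into one stored point using a function such as AVERAGE, MIN, MAX or LAST. the sketch below is plain python, not rrdtool's actual algorithm (which also interpolates samples onto step boundaries), and the sample numbers are invented; it only shows how an average hides a spike that a maximum keeps.

```python
def consolidate(samples, steps, func):
    """Collapse every `steps` consecutive samples into one value using `func`."""
    return [func(samples[i:i + steps]) for i in range(0, len(samples), steps)]


def average(chunk):
    return sum(chunk) / len(chunk)


# Invented CPU readings with one short spike to 95.
cpu = [10, 12, 11, 95, 10, 12, 9, 11, 10, 13, 12, 11]

avg = consolidate(cpu, 6, average)
peak = consolidate(cpu, 6, max)

if __name__ == "__main__":
    print(avg)   # [25.0, 11.0]  the spike is flattened to 25
    print(peak)  # [95, 13]      the spike survives
```

if short spikes matter to you, make sure the graphing tool creates MAX (and maybe MIN) archives alongside the averages.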

every rrdtool-based approach suffers from data/graph staleness, since the schema in each rrd file is static and most systems use the rrd filename to identify the data. data are typically not lost when a hostname or service name changes; the old rrd files still exist on disk. some user interfaces provide ways to see 'stale' rrd files, while others require manual housekeeping from the command line. on many installations this is only an issue when initially configuring the system, but in dynamic environments (e.g. monitoring virtual machines whose lifetime is only a few months) it can become tedious.
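
for the manual housekeeping case, a rough sketch of finding stale files is below. it assumes the rrd files live under a single directory tree and that a file which has not been modified for a while belongs to a host or service that no longer reports; the function name, the 30-day default and the example path are placeholders, not anything nagiosgraph or pnp4nagios define.

```python
import time
from pathlib import Path


def find_stale_rrds(root, max_age_days=30, now=None):
    """List .rrd files under `root` not modified within `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(root).rglob("*.rrd") if p.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Placeholder path: point this at wherever your graphing tool keeps its rrd files.
    for path in find_stale_rrds("/path/to/rrd", max_age_days=30):
        print(path)
```

it only lists candidates; review the list before archiving or deleting anything, since a service that is checked rarely can look stale too.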

one final note. there are actually two parts to trending: data collection and data display. if you go with a standalone graphing system rather than extending your existing nagios installation, then you might have to install additional components on your windows machines in order to collect the data.

Solution 2:

I concur with lynxman. NAGIOS is for immediate qualitative data (is X OK or not?); munin is for historical quantitative data (how full is X now, and how full has it been this year?). All my NAGIOS installations, some of which monitor several hundred services, are linked to munin systems to do the quantitative monitoring.

Note also that munin has specific hooks for feeding data into NAGIOS. It understands the concept of WARNING and CRITICAL thresholds, and where notification (and a view on the NAGIOS "big board") is required it's very very easy to have a single munin variable inform the state of a single NAGIOS service.
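
To illustrate the idea of one numeric variable driving one NAGIOS service state, here is a minimal Python sketch. It is not munin's own code or configuration: the function names are hypothetical, and it assumes simple "higher is worse" thresholds, whereas real threshold definitions can be ranges. The only fixed convention used is the standard NAGIOS state numbering (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Standard NAGIOS service states.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def nagios_state(value, warning, critical):
    """Map a numeric reading to a state, assuming higher values are worse."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def status_line(name, value, warning, critical):
    """Build a one-line status message for a single variable (hypothetical format)."""
    state = nagios_state(value, warning, critical)
    return state, f"{STATE_NAMES[state]} - {name} is {value}"


if __name__ == "__main__":
    print(status_line("disk_used_percent", 85, 80, 90))
```

The quantitative tool keeps the full history, and only the resulting state crosses over into NAGIOS.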

The usual workflow is that no one looks at the munin graphs until NAGIOS alerts that a threshold has been breached, but then the munin graphs become invaluable for finding out whether something has been slowly ramping up over time, or this is an out-of-the-blue increase, or we have a weekly up-and-down cycle which is slowly increasing in amplitude, or what.

As lynxman says, the UNIX way is "one task, one tool". Making a toolchain of munin and NAGIOS works very well for me to provide quantitative and qualitative monitoring as well as notifications. It also has the distinct advantage of keeping the interfaces clean: when you look at NAGIOS, you see a simple view of how well things are working right now, with no historical data cluttering up the view; when you look at munin, you see historical information pertinent to the issue ready for your analysis, without "host is down" or "sshd won't talk to me" errors cluttering the view.

Solution 3:

Nagios graphing plugins are, as you say, very restricted. They offer a very basic rrdtool interface, and the UI design is a bit counterintuitive. It's basically a hack on top of nagios. I tried using one just for fun, but it broke several times without warning.

Going for a standalone product (especially munin or ganglia) offers you a big range of features that nagios can't provide. As the unix mantra goes, it's better to be good at just one thing than to try to be good at many: nagios is amazing for monitoring, and munin/ganglia/cacti are amazing at graphing.