Better/alternate front-end/graphs for munin? [closed]
I have a munin setup running and I'd like to leave my munin-node setup untouched while getting a longer and more detailed view of the logged data. I want to keep all logged data indefinitely. An ideal solution would use something like the Annotated Time Line widget so that I could zoom in to any point in the history.
Edit: I've already found out that munin uses a lossy database so I'm expecting I'll need something that replaces it; i.e. unless I'm mistaken, any answer that doesn't replace Munin is most likely not useful to me.
What I'm hoping for is a drop-in replacement for munin that can read the appropriate sections of the munin config files (e.g. the addresses of all the munin-nodes) and won't require any modification at all to the munin-node installs.
Munin, like every tool of its type that I'm aware of, uses round robin database, or RRD, files to store its data. Here is an explanation of the basics of RRD. An RRD file is made up of Round Robin Archives, or RRAs. An RRA is "lossy" in two senses of the word: it combines multiple data points into one, and it overwrites data after a certain amount is collected. You get to specify how this is done. For example, let's say I created an RRD file with the command
rrdtool create example.rrd \
[skip some necessary options] \
--step 300 \
RRA:LAST:0.5:1:288 \
RRA:AVERAGE:0.5:12:168 \
RRA:AVERAGE:0.5:288:28
The step of 300 says we are collecting metrics, which rrdtool refers to as primary data points or PDPs, every 5 minutes. Each RRA line specifies four things in the form CF:xff:steps:rows.
1) The CF, or consolidation function. This determines how RRD combines multiple primary data points into consolidated data points, or CDPs. It can AVERAGE all the values, use the MINimum value, use the MAXimum value, or just use the LAST value.
2) The xff, or "x files factor", which is the ratio of the data that must be missing before the CF will return a value of UNKNOWN rather than operating on the non-missing data.
3) The steps, which is how many primary data points are used to calculate the consolidated data point.
4) The rows, which is how many consolidated data points to keep.
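To make the consolidation function and the x files factor concrete, here is a minimal Python sketch of how one group of primary data points could be turned into a single consolidated data point. It is an illustration only, not rrdtool's actual implementation: the function name consolidate is made up, None stands in for an unknown sample, the boundary case (exactly xff unknown) is assumed to still count as known, and LAST is simplified to "last known value".

```python
def consolidate(pdps, cf, xff):
    """Combine one group of primary data points (PDPs) into one CDP.

    pdps: list of floats, with None marking an unknown (missing) sample.
    cf:   "AVERAGE", "MIN", "MAX" or "LAST".
    xff:  fraction of the group that may be unknown before the result
          itself is reported as unknown (None).
    """
    known = [v for v in pdps if v is not None]
    unknown_ratio = (len(pdps) - len(known)) / len(pdps)

    # Too much missing data: give up and report unknown.
    # (Assumption: exactly xff unknown is still acceptable.)
    if not known or unknown_ratio > xff:
        return None

    if cf == "AVERAGE":
        return sum(known) / len(known)
    if cf == "MIN":
        return min(known)
    if cf == "MAX":
        return max(known)
    if cf == "LAST":
        return known[-1]  # simplification: last *known* value
    raise ValueError(f"unsupported consolidation function: {cf}")


# One of four samples missing (25%), xff of 0.5: still consolidated.
print(consolidate([1.0, 2.0, 3.0, None], "AVERAGE", 0.5))  # 2.0
# Three of four missing (75%): the result is unknown.
print(consolidate([1.0, None, None, None], "MAX", 0.5))    # None
```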
In our example, the first RRA would keep your primary data points for one day, the second would average your primary data points every hour and keep the hourly averages for one week, and the third would average your primary data points every day and keep the daily averages for four weeks.
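The arithmetic behind those figures is just step × steps for the resolution of each row and resolution × rows for the total retention. A small Python sketch (the helper name rra_coverage is made up for illustration) that checks the three RRAs from the example:

```python
def rra_coverage(step_seconds, steps, rows):
    """Return (resolution, retention) in seconds for one RRA."""
    resolution = step_seconds * steps   # time span covered by one row
    retention = resolution * rows       # time span covered by the whole RRA
    return resolution, retention


# (steps, rows) for the three RRA lines in the example, with --step 300.
for steps, rows in [(1, 288), (12, 168), (288, 28)]:
    resolution, retention = rra_coverage(300, steps, rows)
    print(f"{resolution // 60} min per row, kept for {retention / 86400:g} days")

# 5 min per row, kept for 1 days
# 60 min per row, kept for 7 days
# 1440 min per row, kept for 28 days
```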
If you want Munin to retain longer and more detailed data, use RRD files that have RRAs with lower steps and higher rows. This is controlled by the graph_data_size option. Munin has a human-readable syntax to make this easy to configure. The options in our earlier example would translate to
graph_data_size custom 5m for 1d, 1h for 1w, 1d for 4w
If you want to keep your primary data points for two years, you can take a shortcut and set graph_data_size to huge.
After changing this option, you have to delete your existing RRD files so Munin will create new ones with your new retention settings.
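If you would rather work out the steps and rows yourself, the calculation runs in the other direction: pick a resolution and a retention period and divide. The Python sketch below does that and also gives a rough idea of the disk cost. The helper names rra_for and approx_bytes are made up, and the size estimate assumes each stored value takes 8 bytes and ignores file headers, so treat it as a ballpark only.

```python
DAY = 86400  # seconds


def rra_for(step_seconds, resolution_seconds, keep_seconds,
            cf="AVERAGE", xff=0.5):
    """Build an RRA definition for the wanted resolution and retention."""
    steps = resolution_seconds // step_seconds   # PDPs per stored row
    rows = keep_seconds // resolution_seconds    # rows needed to cover the period
    return f"RRA:{cf}:{xff}:{steps}:{rows}"


def approx_bytes(rows, data_sources=1, bytes_per_value=8):
    """Rough data size of one RRA (assumption: 8 bytes per value, no header)."""
    return rows * data_sources * bytes_per_value


# Keep every 5-minute sample for two years.
print(rra_for(300, 300, 2 * 365 * DAY))   # RRA:AVERAGE:0.5:1:210240
print(approx_bytes(210240))               # 1681920, i.e. about 1.6 MiB per data source
```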
I recently evaluated a bunch of trending / alerting tools.
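In terms of how the agent and collector talk to each other, there seem to be two different models: the "nagios / request" model, where the collector actively polls each agent, and the "syslog / reporting" model, where each agent pushes its data to the collector.
So in the active (request) model you've got: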
Nagios: mostly for alerts but with some graphing functionality grafted on.
Zabbix: trending / alerting combined. Stores data in a back end SQL database (so data isn't lost / rounded as with RRD databases).
Munin: trending, with plugins to send data to nagios (i.e. you collect the data with munin, then run a nagios program that looks at the local data, so you don't need both a munin and a nagios agent on the remote system).
The "syslog" model uses either multicast or unicast UDP, where the monitored system sends a UDP packet to the collector at every interval. The traffic is unsolicited; the reporting system just sends it every interval regardless of whether the monitoring system is up or not.
collectd and ganglia both follow this model. I've never used ganglia but collectd has a little plugin that can report up / warn / critical status to nagios (and it also reports if it hasn't seen data from the host in 3 intervals of time so you see if a system crashed because it doesn't phone home).
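The "hasn't phoned home in 3 intervals" check is easy to reason about in code. This is only a Python sketch of the idea, not collectd's implementation; the function name is_stale and its parameters are made up for illustration.

```python
def is_stale(last_seen, now, interval, missed_intervals=3):
    """True if a host has been silent for more than `missed_intervals`
    reporting intervals. All times are in seconds."""
    return (now - last_seen) > interval * missed_intervals


# Host reports every 10 seconds and was last heard from at t=1000.
print(is_stale(1000, 1029, 10))  # False: still within 3 intervals
print(is_stale(1000, 1031, 10))  # True: more than 3 intervals of silence
```

In a push model this check has to live on the collector side, since a crashed host cannot report its own failure.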
Collectd has dreadful graphing / reporting tools out of the box but it outputs either / both RRD and CSV text files (name, time_t, value) so you can roll your own dashboard pretty easily.
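As a starting point for a home-made dashboard, here is a hedged Python sketch that reads CSV samples and averages them. It assumes each data line holds a Unix timestamp followed by a single value and that there may be a header line; check the files your collectd version actually writes, since the exact columns depend on the plugin. The function name read_samples and the sample data are made up.

```python
import csv
import io


def read_samples(text):
    """Parse CSV text into a list of (timestamp, value) tuples.

    Assumption: the first column is a Unix timestamp and the second a
    numeric value. Lines that don't parse (e.g. a header) are skipped.
    """
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue
        try:
            samples.append((float(row[0]), float(row[1])))
        except ValueError:
            continue  # header or malformed line
    return samples


# Made-up example data in the assumed layout.
example = "epoch,value\n1300000000,0.25\n1300000010,0.75\n"
samples = read_samples(example)
average = sum(value for _, value in samples) / len(samples)
print(len(samples), average)  # 2 0.5
```

Because the raw samples are kept as plain text, nothing is consolidated away, so you can aggregate at whatever resolution you like when drawing the graph.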
I didn't play with ganglia too much.