I need to replace munin with something more scalable [closed]

It sounds like you may have two problems:

On your monitoring server, recording the metrics for lots of servers requires more random i/o than your storage can provide. Even if all your metrics are being written to disk, the server may be too overloaded to actually generate graphs from them.
On your clients being monitored, the plugins which collect the metrics are too CPU and memory intensive and don't finish gathering data in time when the clients are experiencing heavy load.
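To get a feel for the first problem, here is a rough back-of-the-envelope sketch. It assumes each metric lives in its own RRD file and that each update is roughly one small random write; the fleet size and files-per-host figures are made up for illustration, not measured.

```python
def rrd_updates_per_second(hosts: int, rrd_files_per_host: int,
                           interval_seconds: float) -> float:
    """Average number of RRD file updates the monitoring server performs per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hosts * rrd_files_per_host / interval_seconds


# Hypothetical fleet: 500 hosts with 50 RRD files each.
print(rrd_updates_per_second(500, 50, 300))  # polling every 5 minutes
print(rrd_updates_per_second(500, 50, 10))   # polling every 10 seconds
```

If the result is well above what your storage can sustain in random writes, the updates need to be batched, cached, or spread out before they reach the disk.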

I've used Munin in the past, but I am currently using collectd. The authors of collectd have put a lot of thought and effort into solving these problems. They have a well-designed system for writing the data to RRD files that ensures you don't lose data and can generate up-to-date graphs. There's also support for RRDCacheD. The daemon and the official plugins are written in C, so they use little memory or CPU time. On my client systems it's using less than 2MB of RAM and about a quarter of a second of CPU time every minute. On my monitoring server it is using 20MB of RAM and two-thirds of a second of CPU time every minute. Keep in mind that all my metrics are being gathered and sent to my monitoring server every ten seconds, rather than at intervals of minutes as with Munin.

Although they are great tools, Munin and other RRDTool frontends (such as Cacti or Ganglia) have known i/o issues and are difficult to scale when you monitor hundreds of nodes.

There are some techniques to deal with this i/o bottleneck though. One of these techniques is to spread writes across a large number of disks to reduce the i/o on each disk. Alternatively, many sysadmins keep the RRD files on a tmpfs filesystem. RRDCached is also a recent and good option for this, and I'd recommend you take a look at these slides.
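As a minimal sketch of the "spread writes across disks" idea, assuming you control where each host's RRD files are created: hash the hostname to pick one of several mount points, so each disk receives only a share of the updates. The mount paths and the `rrd_path` helper are hypothetical and not part of Munin, Cacti, or RRDTool.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Sequence

# Hypothetical mount points, one per physical disk.
MOUNTS = ["/data/rrd0", "/data/rrd1", "/data/rrd2", "/data/rrd3"]


def rrd_path(host: str, metric: str, mounts: Sequence[str] = MOUNTS) -> PurePosixPath:
    """Pick a mount for a host by hashing its name, then build the RRD file path."""
    if not mounts:
        raise ValueError("at least one mount point is required")
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(mounts)
    # All files for one host stay together on the same disk.
    return PurePosixPath(mounts[index]) / host / f"{metric}.rrd"


print(rrd_path("web01.example.com", "load"))
```

The mapping is stable for a given list of mounts, but adding or removing a disk changes where most hosts land, so existing files would have to be moved.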

I'm not that familiar with Munin, but Cacti has a Boost plugin. This plugin caches data in memory and writes it to disk in batches or on demand, instead of performing individual writes, thus reducing i/o. I'm pretty sure that Munin also has something like this.
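The general idea of caching updates in memory and writing them in batches can be sketched as follows. This is not the Boost plugin's or RRDCached's actual code; the `BatchedWriter` class and its `flush_fn` callback are hypothetical names, and a real `flush_fn` would write all queued samples for a file in one operation (for example a single `rrdtool update` call with several `timestamp:value` arguments).

```python
from collections import defaultdict


class BatchedWriter:
    """Queue samples per file in memory and hand them to flush_fn in batches."""

    def __init__(self, flush_fn, max_pending=1000):
        self._flush_fn = flush_fn          # called as flush_fn(path, samples)
        self._max_pending = max_pending    # flush everything at this many samples
        self._pending = defaultdict(list)
        self._count = 0

    def update(self, path, timestamp, value):
        self._pending[path].append((timestamp, value))
        self._count += 1
        if self._count >= self._max_pending:
            self.flush()

    def flush(self, path=None):
        """Flush one file on demand (e.g. before graphing it), or all files."""
        if path is not None:
            paths = [path] if path in self._pending else []
        else:
            paths = list(self._pending)
        for p in paths:
            samples = self._pending.pop(p)
            self._count -= len(samples)
            self._flush_fn(p, samples)


# Demo with a stand-in flush function that just records what would be written.
writes = []
writer = BatchedWriter(lambda path, samples: writes.append((path, samples)),
                       max_pending=3)
writer.update("web01/load.rrd", 10, 0.5)
writer.update("web01/load.rrd", 20, 0.7)
writer.update("db01/load.rrd", 10, 1.2)   # third sample triggers a full flush
print(writes)
```

The trade-off is that anything still queued in memory is lost if the process dies before a flush, so the batch size is a balance between i/o savings and how much data you are willing to risk.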

If you can afford them, SSDs are also a good option.

Last but not least, you can also take a look at Reconnoiter. Reconnoiter is a brand new fault detection and graphing/trending tool. Unlike most trending tools, Reconnoiter is not RRDTool based and tries to solve this specific issue. I'm not using Reconnoiter in production, but I've run some tests, and despite still being a little "green", it looks really promising, especially regarding its scalability.

Hope this helps!

Check out Zabbix. It is one of the best Open Source performance monitoring tools out there. It scales well and has been used in environments with thousands of computers.
