Monitoring system that scales to 1,000 hosts and 100,000 variables

Suppose I wanted to monitor 1,000 hosts. For each host, there are 100 or more variables I want to monitor: ping, disk I/O latency, free RAM, swap usage, and so on. That's 100,000 data points every 5-10 minutes, stored for 5 years.

What system scales this large?

What if I had 10x the number of hosts? What would you select then?


You'll need to answer a few more questions before we can really give you a suggestion. For starters, are you wanting to store raw data for 5 years? Or is rolled-up data good enough? This matters more than you might think, and this feature alone may determine what your options are.

When you're talking about a 5 year time span, you're almost always talking about trending information that's going to be rolled up, so you'll lose precision over time. If you don't roll up the data, you're dealing with a monstrous volume of data, and very few systems (both software and hardware) will be able to handle it.
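To see why, here's a rough back-of-envelope calculation. The 16 bytes per raw sample (an 8-byte timestamp plus an 8-byte value) is an assumption; indexes and per-row overhead make real systems considerably larger.

```python
# Rough estimate of raw (un-rolled-up) storage for the scenario above:
# 100,000 metrics sampled every 5 minutes, kept for 5 years.
metrics = 100_000
interval_s = 5 * 60
years = 5

samples = metrics * (years * 365 * 24 * 3600 // interval_s)
raw_bytes = samples * 16  # assumed: 8-byte timestamp + 8-byte value

print(f"{samples:,} samples, ~{raw_bytes / 1e12:.2f} TB raw")
# → 52,560,000,000 samples, ~0.84 TB raw (before any index overhead)
```

Even under these optimistic assumptions you're at tens of billions of rows, which is why roll-up is usually non-negotiable at this scale.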

Luckily, that's why RRDtool and Round Robin Databases (RRDs) were invented. If you don't recognize the name, that's okay: if you're looking at open source tools, you'll see practically everything built on top of it. Almost any open source program that trends data over time and gives you pretty graphs is probably using RRDtool under the hood. RRDtool creates fixed-size databases that automatically roll up data, storing it at fixed precision for specified periods. For example, you might have it store 30 days' worth of data at 5 minute precision, 90 days at 30 minute precision, 180 days at 1 hour precision, 365 days at 1 day precision, 3 years at 1 week precision, and 10 years at 1 month precision. It's all configurable, and every time you add a new data point, it recalculates the rolled-up data.
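The consolidation mechanics are simple enough to sketch. This is a toy illustration of the idea, not RRDtool's actual API; the class name and archive sizes are made up for the example (1 step of 8640 rows ≈ 30 days at 5-minute precision, 6 steps of 4320 rows ≈ 90 days at 30-minute precision):

```python
from collections import deque

class RoundRobinArchive:
    """Fixed-size archive: keeps `rows` consolidated points, each the
    average of `steps` primary data points (like an RRDtool RRA)."""
    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = deque(maxlen=rows)  # oldest rows fall off automatically
        self._pending = []

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending = []

# Each 5-minute primary point feeds every archive at once.
archives = [RoundRobinArchive(1, 8640), RoundRobinArchive(6, 4320)]
for i in range(12):                     # one hour of 5-minute samples
    for a in archives:
        a.update(float(i))

print(len(archives[0].rows), list(archives[1].rows))
# → 12 [2.5, 8.5]  (fine archive keeps all 12; coarse one averaged 6 at a time)
```

Because every archive has a fixed maximum size, the database never grows, no matter how many years you feed it.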

Now, once you figure out for sure what your data retention requirements are, you need to figure out how you're planning to monitor the systems. If there's a wide variety of devices, especially a lot of network devices, SNMP is the standard. There are also plenty of devices that can't be monitored by anything other than SNMP (UPSes, generators, printers, etc.), so at least some level of SNMP support is important. If you have a lot of servers, you may want to go with an agent-based system, where you install a monitoring agent on each device to be monitored. This will often give you more detailed information, but it significantly increases the management overhead.

Next, you need to know what your projected growth is beyond "what handles X and what handles 10 times X". Even at the listed numbers, 1k hosts is a hugely different beast than 10k hosts. Lots of systems will handle 1k, but as you approach 10k, you'll often need a distributed system to share the load. Also, you mention 100 variables per system that you want to monitor... are you sure about that? There aren't all that many monitoring systems that support monitoring that many variables. That's a lot of information to be pulling from each device.
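A quick calculation makes the 1k-versus-10k gap concrete in terms of sustained write load on the collector:

```python
# Data points per second the collector must ingest, at 1k vs 10k hosts
# with 100 variables each on a 5-minute polling cycle.
for hosts in (1_000, 10_000):
    points_per_cycle = hosts * 100
    rate = points_per_cycle / (5 * 60)
    print(f"{hosts:>6} hosts: {points_per_cycle:,} points per cycle, "
          f"~{rate:,.0f} writes/sec sustained")
# →  1,000 hosts: 100,000 points per cycle, ~333 writes/sec sustained
# → 10,000 hosts: 1,000,000 points per cycle, ~3,333 writes/sec sustained
```

A single collector can often keep up with the first figure; the second is where distributed pollers and sharded storage start to become mandatory.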

Finally, you need to consider much more than the monitoring system itself when you start approaching large scales. Pulling back 100 variables' worth of data from 1k (or 10k) devices at 5 minute resolution is going to require some pretty serious bandwidth. Be prepared for that, or you may find that your monitoring system is negatively impacting your network. This is particularly important if your systems are spread across multiple sites and you're crossing WAN links.
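A rough bandwidth estimate, to put numbers on that warning. The 200 bytes per variable is an assumption covering request/response overhead; the real figure depends on the protocol and how well you batch OIDs per packet.

```python
# Very rough average polling bandwidth: 100 variables per host,
# polled every 5 minutes, at an assumed 200 bytes of traffic per variable.
bytes_per_var = 200
for hosts in (1_000, 10_000):
    bits_per_cycle = hosts * 100 * bytes_per_var * 8
    mbps = bits_per_cycle / (5 * 60) / 1e6
    print(f"{hosts:>6} hosts: ~{mbps:.2f} Mbit/s average just for polling")
# →  1,000 hosts: ~0.53 Mbit/s average just for polling
# → 10,000 hosts: ~5.33 Mbit/s average just for polling
```

Averages understate the pain: polling traffic tends to arrive in bursts at the start of each cycle, so the peaks on a shared WAN link will be much higher.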

There are a few open source systems that can credibly claim to compete at this network monitoring scale, but not many. Nagios has been around for a long time and has been known to monitor 1k+ systems. Zenoss offers both an open source core product and a commercially supported product, and is attempting to challenge some of the "big hitters". Zabbix is fully open source, with the company behind it offering support.

When it comes to the large companies with thousands of devices/systems that need monitoring, though, the biggest players are CA's Spectrum/eHealth/Unicenter, IBM's Tivoli suite, and HP's OpenView. Each of these can handle huge scale, but each also comes with a huge price tag.

Note: my day job is the implementation and maintenance of network monitoring tools; we monitor over 5k network devices and 8k servers. Finding tools that work well at these scales is hard.


Nagios seems to be the default answer to these types of questions, and there are some installations at this scale using it.

On top of scaling well, it's flexible and easy to customize.