What server monitoring tools will scale to 10K-100K nodes? [closed]

I've encountered many distributed system monitoring tools that scale to thousands of nodes, but none seem to demonstrate, or even claim, the ability to handle tens or hundreds of thousands of nodes. In theory this should be possible with a hierarchical, clustered network architecture. Has anyone encountered a monitoring system that makes such a claim, or a white paper or academic paper that discusses a theoretical implementation?
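To make the hierarchical idea concrete, here is a rough sizing sketch. It assumes each collector can handle at most `fan_out` children (hosts or lower-tier collectors) and counts how many collectors each tier would need. The function name and the fan-out of 1,000 are illustrative assumptions, not figures from any particular tool.

```python
def collectors_per_tier(nodes, fan_out):
    """Return the number of collectors needed at each tier, bottom tier first.

    Assumption: every collector handles at most `fan_out` children, and
    tiers are added until a single top-level collector remains.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    counts = []
    remaining = nodes
    while remaining > 1:
        remaining = -(-remaining // fan_out)  # ceiling division
        counts.append(remaining)
    return counts


if __name__ == "__main__":
    # Hypothetical case: 100,000 nodes, each collector handling 1,000 children.
    print(collectors_per_tier(100_000, 1_000))
```

Under these assumptions, 100,000 nodes need only two tiers: 100 collectors at the bottom and one aggregator on top. The hard part in practice is what each tier forwards upward, which this sketch does not model.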


Assuming that this isn't made up, http://users.nagios.org/directory/Yahoo,-Inc/details says that Yahoo uses Nagios for 100,000 machines, but with 2,000 instances deployed (about 50 machines per instance). I assume that DNX would be suitable for "management" of those instances.

Also, I just found Merlin, which seems to be able to monitor/check 153,000 hosts in ~6s rather than 1hr.


I've worked with two tools in the past.

  • Zabbix is free and open source software. Their website claims that it has been tested with 10,000 nodes.
  • NetIQ Security Manager (or NetIQ Application Manager) is closed-source and expensive. It is very easy to scale up, but you will need several servers to do so (mainly for the database and collectors).

How many hosts you can monitor from a single monitoring server depends heavily on the kind of checks you are running, how long each check takes, and whether the checks in the queue can run concurrently.
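As a back-of-the-envelope illustration of that dependency, the sketch below estimates how long one full pass over all checks takes and how many hosts fit into a given check interval. All names and numbers are hypothetical, and it assumes checks are spread evenly over the worker slots with no scheduling overhead, so treat the results as optimistic bounds.

```python
import math


def cycle_seconds(hosts, checks_per_host, avg_check_seconds, concurrency):
    """Best-case time for one full pass over every check on every host."""
    total_checks = hosts * checks_per_host
    return total_checks * avg_check_seconds / concurrency


def max_hosts(interval_seconds, checks_per_host, avg_check_seconds, concurrency):
    """Most hosts whose checks all fit within one check interval (best case)."""
    seconds_per_host = checks_per_host * avg_check_seconds
    return math.floor(interval_seconds * concurrency / seconds_per_host)


if __name__ == "__main__":
    # Hypothetical setup: 15 checks per host, 2 s per check, 5-minute interval.
    print(max_hosts(300, 15, 2, concurrency=1))   # checks run one at a time
    print(max_hosts(300, 15, 2, concurrency=50))  # 50 checks in flight at once
    print(cycle_seconds(500, 15, 2, concurrency=50))
```

With these made-up numbers, a strictly sequential queue covers only 10 hosts per interval, while 50 concurrent checks cover 500. Slow checks such as lagging SNMP polls raise the average check time and shrink that figure proportionally.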

I've seen Smokeping run against huge numbers of hosts, and the same goes for Nagios in simple setups. The guys at my sister company have Nagios running against a few hundred machines with 10-20 checks per host, another few hundred routers with a series of SNMP checks, and some other "network" equipment with a mix of SNMP and custom script monitoring. All in all, that is over 10k checks on the one machine. The only time there are issues is when the SNMP checks start lagging.

Also take a look at Zenoss. There are a few versions, and it does scale.