What am I looking for in a Monitoring Solution?

This is a Canonical Question about Monitoring Software.

Also Related: What tool do you use to monitor your servers?

I need to monitor my servers; what do I need to consider when deciding on a monitoring solution?

There are a lot of monitoring solutions out there. Everyone has their preference and each business has its own needs, so there is no correct answer. However, I can help you figure out what you might want to look for in choosing a monitoring solution.

What are Monitoring Systems For?

In general, monitoring systems serve two primary purposes. The first is to collect and store data over time. For example, you might want to collect CPU utilization and graph it over time. The second purpose is to alert when things are either not responding or are not within certain thresholds. For example, you might want alerts if a certain server can't be reached by pings or if CPU utilization is above a certain percentage. There are also log monitoring systems such as Splunk, but I am treating those as a separate category here.
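The two purposes can be sketched in a few lines of Python. This is only an illustration, not how any particular product works; the `Metric` class, its field names, and the 90% threshold are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Metric:
    """A named series of (timestamp, value) samples plus an alert threshold."""
    name: str
    threshold: float  # hypothetical: alert when a sample exceeds this value
    samples: List[Tuple[float, float]] = field(default_factory=list)

    def record(self, timestamp: float, value: float) -> Optional[str]:
        # Purpose 1: keep the sample so it can be graphed later.
        self.samples.append((timestamp, value))
        # Purpose 2: alert when the value is outside the threshold.
        if value > self.threshold:
            return f"ALERT {self.name}={value} exceeds {self.threshold}"
        return None


cpu = Metric(name="cpu_percent", threshold=90.0)
print(cpu.record(0.0, 42.0))   # below the threshold, so no alert message
print(cpu.record(60.0, 97.5))  # above the threshold, so an alert message
```

In a real deployment the storage and the alerting are often handled by separate products, as described next.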

These two primary roles sometimes come in a single product. More commonly, there is a separate product dedicated to each purpose.

What are the primary Components and Features in Monitoring Systems?

Pollers:
All monitoring systems need some sort of poller to collect the data. Not all data is collected in the same way. You should look at your environment and decide what data you need and how it might be collected. Then make sure the monitoring system you choose supports what you need. Some common methods include:

SNMP (Simple Network Management Protocol)
WMI (Windows Management Instrumentation)
Running Scripts (for example, running a script on the machine that is being monitored, or running a script from the monitoring box itself which uses its own polling method). These can include Bash scripts, Perl scripts, executables, and PowerShell scripts.
Agent Based Monitoring. With these, a process runs on each client and collects the data. This data is either pushed to the monitoring server or the monitoring server polls the agent. Some admins are okay with agents; others don't like them because they can leave a larger footprint on the server being monitored.
Focused APIs (e.g. the VMware API or the ability to run SQL queries)

If you have mostly one OS in your environment, or a primary OS, certain systems might have more options than others.
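As an illustration of the "running scripts" method, here is a minimal sketch of a disk check written in Python. It follows the exit code convention used by Nagios plugins (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function names and the 80%/90% thresholds are assumptions chosen for the example.

```python
import shutil

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a disk usage percentage to a plugin exit code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> int:
    """Print a one-line status for `path` and return the matching exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    if usage.total == 0:
        print(f"DISK UNKNOWN - {path} reports zero capacity")
        return UNKNOWN
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent)
    label = ["OK", "WARNING", "CRITICAL"][status]
    print(f"DISK {label} - {used_percent:.1f}% used on {path}")
    return status


# Check the filesystem holding the current directory.
exit_code = check_disk(".")
```

To use this as an actual check script, the returned code would be passed to `sys.exit()` so the monitoring system can read it as the process exit status.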

Configuration:
In monitoring systems there tends to be a lot of object reuse. For example, you want to monitor a certain application such as Apache or IIS on a bunch of servers. Or you want certain thresholds to apply to groups of servers. You might also have certain groups of people who are "on call". Therefore a good templating system is vital to a monitoring system.

The configuration is generally done through a user interface or text files. The user interface option will generally be easier, but text files tend to be better for reuse and variables. So depending on your IT staff you might prefer simplicity over power.
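To make the idea of templates concrete, here is a small sketch of template inheritance using plain Python dictionaries. The template names, the `use` key, and the settings are all hypothetical; real systems have their own configuration syntax.

```python
from typing import Any, Dict

# Hypothetical templates: each may name a parent via "use" and override its settings.
TEMPLATES: Dict[str, Dict[str, Any]] = {
    "generic-host": {"check_interval": 5, "contact_group": "ops", "cpu_warn": 80},
    "web-server": {"use": "generic-host", "checks": ["http"], "cpu_warn": 70},
}


def resolve(name: str, templates: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a template with its ancestors; child settings win."""
    template = templates[name]
    parent = template.get("use")
    merged = resolve(parent, templates) if parent else {}
    merged.update({k: v for k, v in template.items() if k != "use"})
    return merged


def host_config(hostname: str, template: str) -> Dict[str, Any]:
    """Build the effective settings for one host from a named template."""
    config = resolve(template, TEMPLATES)
    config["host"] = hostname
    return config


# The same template is reused for every web server.
for host in ("web01", "web02"):
    print(host_config(host, "web-server"))
```

Changing a threshold in one template then changes it for every host that uses it. The sketch does not guard against templates that refer to each other in a loop.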

User Interface:
The most common interface for monitoring systems these days is a web interface. Some things to evaluate in regards to the web interface are:

Good overviews
Good detail pages
Speed (when you need to find information in crisis mode, a slow interface can be very frustrating)
General feeling. You will spend a lot of time in the interface. If it feels clunky, your IT staff will be reluctant to use it
Customization. Every organization has certain things that are important, and other things that are not. It is important to be able to customize it to your needs

Alerting Engine:
The alerting engine has to be flexible and reliable. There are lots of different ways to be notified including:

SMS
Email
Phone
Other things like IM/Jabber

Other features to look for are:

Escalations (Notify someone if the other person has not acknowledged or fixed the alert)
Rotations and Shifts
Groups (Certain groups need to be notified of certain things)
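
Escalations in particular can be sketched as a lookup against how long an alert has been open. This is an assumption-laden illustration: the contact names and the 15 and 30 minute delays are made up, and real alerting engines offer far more options.

```python
from typing import List, Optional, Tuple

# Hypothetical escalation chain: (minutes since the alert fired, who to notify).
ESCALATION: List[Tuple[int, str]] = [
    (0, "on-call engineer"),
    (15, "backup engineer"),
    (30, "team lead"),
]


def who_to_notify(minutes_open: int, acknowledged: bool) -> Optional[str]:
    """Return the highest escalation level reached, or None once acknowledged."""
    if acknowledged:
        return None
    target = None
    for delay, contact in ESCALATION:
        if minutes_open >= delay:
            target = contact
    return target


print(who_to_notify(5, acknowledged=False))
print(who_to_notify(20, acknowledged=False))
print(who_to_notify(45, acknowledged=True))
```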

It is important to trust that when something goes wrong you will get the alert. This comes down to two things:

A reliable system
A caveat free configuration. In monitoring systems it is not uncommon to think you should get an alert, but because of some detail in configuration the alert was never triggered.

Data Store:
If the system collects and stores data (i.e. systems that include graphs), then it needs a data store. A very common implementation for both the store and the graphing is RRD, for example.

Some features to look for from the data store are:

Raw access to the data. This can be valuable for developing against or creating custom graphs with something like Excel.
Scalability. Depending on how much data you collect, it can add up fast. If you are going to collect a lot, you want to make sure the data store will scale.

Graphing Library:
Graphs can be useful to quickly identify trends and give context to the current state of something based on its history. Some include trending, which can help predict things before they happen (e.g. running out of disk space). Make sure that the graphs will give you the information you think you are going to need in a clear way.
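Trending of this kind can be approximated with a straight-line fit. The sketch below is one simple way to do it and not the method of any specific product; the function name and the sample data are assumptions. It fits a least-squares slope to the samples and extrapolates from the most recent sample at that growth rate.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate hours until usage reaches capacity.

    samples are (hour, used) pairs in the same unit as capacity.
    Returns None if there is too little data or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope: growth in usage per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)


# Hypothetical disk: 100 GB capacity, growing 10 GB per hour.
print(hours_until_full([(0, 50.0), (1, 60.0), (2, 70.0)], capacity=100.0))
```

A linear fit is a rough guide only, since real disk usage is often bursty.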

Access Controls:
If you have a large organization you might need access controls because certain admins should only be able to adjust certain things. You might also want public facing dashboards. If this is important you should make sure the monitoring system has the controls you need.

Other Features

Reporting:
A system that provides good reports can help you identify what needs to be improved over long periods of time. For example, it can give a good answer to questions like "what systems go down the most?". This can be important when you are trying to convince management to spend money on certain things; businesses like hard evidence.

Specialized Features:
Some monitoring systems are targeted at specific products, or support some products better than others. For example, if the main thing you need to monitor is SQL Server, or if you make heavy use of VMware products, you should see how well these are supported.

Predefined Monitoring Templates:
A system that comes with a lot of predefined templates (or has a user base that has created many templates) can be a huge time saver.

Discovery:
If you have a large or changing environment, discovery can save a lot of manual work. Some systems provide the ability to add new systems via an API, or to run scans to find new servers or components.

Distributed Monitoring:
If you have multiple locations to monitor, it can be helpful to have monitoring pollers in each location instead of monitoring a lot of independent systems over the WAN.

Some Popular Monitoring Systems

There are a lot of monitoring systems out there. We have a list with a summary on this old question. For quick reference some that I hear the most about are:

Nagios
Cacti
OpenNMS
SolarWinds
Zabbix
Various cloud based Monitoring systems
Microsoft System Center
This one isn't popular yet, but Stack Exchange has open sourced its monitoring system http://bosun.org

How to Decide based on the above

The reason I can't tell you what to use is that every organization has its own needs. If you want to make the right choice, you should think through all the above components and figure out what features are important to your organization. Then find a system or systems that claim to provide what you need and try them out. Some of these are free, some cost a little, and some cost a lot. Taking all of that into account, you can then make your choice. From what I have used, they are all far from perfect, but at least you can try to get something that fits.

It's helpful to distinguish between monitoring and alerting. Monitoring means collecting data and making graphs. Alerting means send me an SMS when a server goes down in the middle of the night.

Nagios is for alerting. Cacti and Munin are for monitoring. Other products combine the two functions. Zenoss and Zabbix are examples.

I'd start by answering some questions:

Do you need to monitor servers, network devices, applications, or all three?

Are there limitations on what methods you can use to monitor? Can you install monitoring clients like NRPE on the servers, or will you use SNMP, or maybe both?

Who will use the graphs, and who will use the alerts? What would you like the end result to look like? Does the look and feel of the interface matter (will business people be using this, or only tech staff?)

What are your resources, both in terms of time, skills and hardware? Do you have at least modest scripting ability? Do you need an out-of-the-box solution?

In my opinion, the first rule of both alerting and monitoring should be Keep it Simple! An organization can live or die on how it alerts and gathers data, and most of the time it will get complicated on its own anyway. Start with the basics and build from there.

tl;dr

Think about the services that your software provides, and send alerts when these services fail or when the risk of them failing increases.

Service Level Agreements

The theory behind monitoring strategies is to tie monitoring and alerts to some sort of service level agreement. After all, you want to be alerted to the fact that you're losing money, not necessarily that there's a spike in the number of TCP connections to nji0019.myserver.com. There are various tools that will give you tons of alerts, define dependencies between alerts, but many of these checks aren't directly relevant to the service you provide to someone.

Breach of service

Identify the important services that you provide, such as the ability to serve a web site, and the ability to modify that web site (e.g. a CMS of some sort). Those should be checked (e.g. by monitoring that you can get the web page, and that you can modify it). The failure of these two Services (used here with a capital S) should trigger an alert to notify you.

If it's important that the site responds within a reasonable amount of time, that too should trigger alerts. Sort of a "breach of SLA" if you will.
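A check for both conditions (the page can be fetched, and it comes back fast enough) might look like the following sketch. The function names, the state labels, and the limits are assumptions for illustration. The fetch function is passed in as a parameter so the logic can be tried without a network.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float) -> int:
    """Fetch the URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_page(url: str, max_seconds: float,
               fetch: Callable[[str, float], int] = http_status,
               timeout: float = 10.0) -> str:
    """Return "OK", "SLOW" (too slow for the SLA) or "DOWN" (breach of service)."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses from urllib.
        return "DOWN"
    elapsed = time.monotonic() - start
    if status != 200:
        return "DOWN"
    if elapsed > max_seconds:
        return "SLOW"
    return "OK"


# Demonstration with a stand-in fetch function instead of a real request.
print(check_page("http://www.example.com/", 2.0, fetch=lambda url, timeout: 200))
```

Calling `check_page("http://www.example.com/", 2.0)` without the `fetch` argument would perform a real HTTP request.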

Increased risk

Usually there's an inherent risk of a Service failing, and often enough that risk is mitigated by the fact that you introduce redundancy, e.g. a second server, or a slave database, or extra network cards...

When that redundancy is lost, the Service is still fine, but the risk of the Service failing just went up.

This is the second major reason to trigger alerts: the redundancy is gone (e.g. the second server died), or there is an imminent danger that the risk will increase (e.g. the disk only has 500 MB left, or the disk trend indicates that the disk will be full in about 5 hours).

What about all those indicators?

But check_mk gives me 50-60 checks per host, are these all worthless?

No. All this doesn't mean you want to ditch the plethora of automatic checks you get with e.g. check_mk, but it means you should try to categorize each of the checks into what Service(s) might be affected if something does fail.

What Service would be affected if the /var/ partition fills up? What Service would be affected if the eth0 interface is down? ... if outbound TCP connections are blocked by some firewall? ... if the number of threads exceeds 800? ... if the database goes down?

Example

You have 2 web servers, and a database server serving a site behind a load balancer you don't own (e.g. the ISP). The Service you provide is port 80 on the two servers, and they have enormous caches that can survive e.g. database downtime (database on a third server).

In this scenario, the complete failure of a web server would not result in the site being down. What has happened is that the redundancy is gone so that the risk of failure just went up. That should trigger an alert.

The complete failure of the database might not affect the ability to serve the site at all, because of the well tuned caches in place. It then doesn't affect the Service of serving the web site, but it might affect a different Service, namely updating the web site, or accepting orders...

Each Service would have its own level of service that designates how important it is to restore service or to avoid outages.
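
The distinction between "breach of service" and "increased risk" in this example can be sketched as follows. The state names and the server names are made up for illustration.

```python
from typing import Dict


def web_service_state(healthy: Dict[str, bool]) -> str:
    """Classify the web Service from per-server health.

    "OK": all servers are up.
    "AT_RISK": the site is still served, but redundancy is lost.
    "DOWN": no server is left to serve the site.
    """
    up = sum(1 for ok in healthy.values() if ok)
    if up == 0:
        return "DOWN"
    if up < len(healthy):
        return "AT_RISK"
    return "OK"


print(web_service_state({"web01": True, "web02": True}))
print(web_service_state({"web01": True, "web02": False}))
print(web_service_state({"web01": False, "web02": False}))
```

Both "AT_RISK" and "DOWN" would trigger an alert, but with different urgency.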

Be agile

Every time you get an alert, you should do one of the following:

- Change the system being monitored to fix the problem that caused the alert (e.g. replace the drive or reconfigure logrotate or something).
- Change the monitoring system to avoid the alert being sent out the next time that situation arises (e.g. change the levels for "disk free" so that the disk can fill up to 90% instead of just 80%).

My own experience

I'm mostly familiar with Nagios and its verbose configuration, and have since been hooked on check_mk's multisite. I recently learned that check_mk has this concept of Business Intelligence (since 1.11) which seems to match this thinking well. You can define that checks in Nagios are part of a larger service, and have rules that define the state of the "Service" as a function of the state of many checks, aggregating to the worst or best state.
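
The "worst or best state" aggregation can be illustrated in a few lines of Python. This is only a sketch of the idea and not check_mk's actual rule syntax or API; the three state names and their ordering are assumptions.

```python
from typing import Iterable

# Hypothetical ordering of check states, from least to most severe.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def worst(states: Iterable[str]) -> str:
    """Aggregate to the most severe state (every part is required)."""
    return max(states, key=SEVERITY.__getitem__)


def best(states: Iterable[str]) -> str:
    """Aggregate to the least severe state (any one part is enough)."""
    return min(states, key=SEVERITY.__getitem__)


# Redundant web servers: the tier is fine as long as one of them is.
web_tier = best(["CRITICAL", "OK"])
# The site needs both the web tier and the database.
site = worst([web_tier, "WARNING"])
print(web_tier, site)
```

Both functions raise `ValueError` on an empty list, which a real rule engine would need to handle.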

One of the most critical points companies forget when choosing a monitoring solution is that it's not all about solving immediate operational issues; it's about tomorrow's unforeseen issues! Of course solving immediate issues is important, but trust me, in a lot of cases this short-sighted strategy will not guarantee a company's survival.

There are dozens of great monitoring solutions on the market. Shortlisting a small set of solutions that satisfy your requirements is a long and difficult task, and finding one that fits your budget is even more difficult. The interesting part is finding one that's aligned with your present and your future. There is no evaluation process to detect that. It is a matter of experience, intuition, and one very important factor: trust, which is not an easy thing to hack.

As a rule of thumb, search and dig for success stories about your shortlisted monitoring solutions, especially ones involving a company from your sector. Ask the vendor for their success stories, and even ask them for permission to speak with one of their customers. Companies that are not afraid of this show that they have real relationships with their customers and don't hide them, which is an extremely rare thing to find nowadays.

Zabbix, Icinga, Pandora FMS, op5, Datadog, New Relic... they all have their ups and downs, but the real issue is finding which one adapts better to your future.