What metrics should I monitor on my Linux server?
Solution 1:
The usual metrics that indicate problems include CPU utilization, memory utilization, load average, and disk utilization. For mail servers, the size of the mail queue is an important indicator. For web servers, the number of busy server processes is an important measure. Excessive network throughput also leads to problems. If you have processes that depend on accurate timestamps, NTP is an important tool for keeping clocks in sync, and the NTP offset is worth monitoring.
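As a rough illustration, here is a minimal sketch (standard-library Python, Linux only) of how a couple of these core metrics can be sampled by reading /proc directly. Treating iowait as idle time is an assumption you may want to change:

```python
import time

def cpu_utilization(interval=1.0, exclude_nice=True):
    """Overall CPU utilization sampled from /proc/stat over `interval` seconds."""
    def sample():
        # First line of /proc/stat: cpu user nice system idle iowait irq softirq ...
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:8]]
    before = sample()
    time.sleep(interval)
    after = sample()
    delta = [b - a for a, b in zip(before, after)]
    busy = sum(delta) - delta[3] - delta[4]  # treat idle and iowait as not busy
    if exclude_nice:
        busy -= delta[1]  # exclude time spent in niced processes
    return busy / sum(delta)

def memory_utilization():
    """Fraction of RAM in use, from /proc/meminfo (MemAvailable needs kernel 3.14+)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # values are reported in kB
    return 1 - info["MemAvailable"] / info["MemTotal"]

print(f"cpu:    {cpu_utilization():.0%}")
print(f"memory: {memory_utilization():.0%}")
```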
Standard alert levels I have used are listed below as (warning, critical) pairs. You may want to adjust the values based on a number of factors: higher values reduce the number of alerts, while lower values give you more time to react to developing problems. This might be a suitable starting point for a template; a sketch of how such thresholds might be applied follows the list.
- Sustained CPU utilization (80%, 100%). Exclude time for niced processes.
- Load average per CPU (2, 5).
- Disk utilization per partition (80%, 90%).
- Mail queue (10, 50). Use lower values on non-mail servers.
- Busy web servers (10, 25).
- Network throughput (80%, 100%). Network backups and other such processes may exceed these values; I would use throttling settings if they are available.
- NTP offset in seconds (0.2, 1).
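To make the template concrete, here is a small sketch of how the (warning, critical) pairs above might be encoded and evaluated. The metric names are invented for illustration, and only the load check is actually sampled:

```python
import os

# (warning, critical) pairs from the template above; tune them per server role.
THRESHOLDS = {
    "cpu_pct":      (80, 100),
    "load_per_cpu": (2, 5),
    "disk_pct":     (80, 90),
    "mail_queue":   (10, 50),
    "busy_servers": (10, 25),
    "net_pct":      (80, 100),
    "ntp_offset_s": (0.2, 1),
}

def classify(metric, value):
    """Map a metric value onto 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"

# Example: evaluate the current 1-minute load average per CPU.
load_per_cpu = os.getloadavg()[0] / os.cpu_count()
print(f"load/cpu {load_per_cpu:.2f}: {classify('load_per_cpu', load_per_cpu)}")
```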
Munin does a good job of gathering these statistics and others. It can also trigger alarms when thresholds are passed, though its alerting capabilities are not as good as those of Nagios. Its gathering and display of historical data make it a good choice for reviewing whether current values differ significantly from past values. It is easy to set up and can be run without generating warnings. The main problems are the volume of data captured and its fixed gathering frequency; you may want to generate graphs on demand. Munin provides many of the statistics I would check using sar when a system was in trouble. Its overview page is useful for identifying possible problems.
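If Munin does not already gather a statistic you care about, writing a plugin is straightforward. Below is a minimal sketch following the Munin plugin protocol: the plugin prints its configuration (labels and thresholds) when run with the argument `config`, and the current value otherwise. The field name and thresholds here are illustrative:

```python
#!/usr/bin/env python3
"""Minimal Munin plugin sketch: 1-minute load average per CPU."""
import os
import sys

if len(sys.argv) > 1 and sys.argv[1] == "config":
    print("graph_title Load average per CPU")
    print("graph_vlabel load / cpu")
    print("load.label load per cpu")
    print("load.warning 2")   # (warning, critical) values from the template above
    print("load.critical 5")
else:
    print(f"load.value {os.getloadavg()[0] / os.cpu_count():.2f}")
```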
Nagios is very good at alerting, but historically it has not gathered historical data in a form suitable for comparison with current values. This appears to be changing, and the newer releases are much better at gathering this data. It is a good choice for generating warnings when there are problems and for scheduling downtime during which alerts are suppressed. Nagios is very good at alerting when services go down, which makes it especially suitable for critical servers and services.
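Custom Nagios checks are also simple to write: a plugin prints a one-line status and reports its state through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Here is a minimal sketch of a disk-utilization check using the thresholds from the template above; the partition and values are assumptions:

```python
#!/usr/bin/env python3
"""Minimal Nagios plugin sketch: disk utilization on one partition."""
import shutil
import sys

WARNING, CRITICAL = 80, 90  # percent used; assumed values, adjust per host
PATH = "/"                  # partition to check; assumed

try:
    usage = shutil.disk_usage(PATH)
except OSError as exc:
    print(f"DISK UNKNOWN - {exc}")
    sys.exit(3)

pct = 100 * usage.used / usage.total
if pct >= CRITICAL:
    print(f"DISK CRITICAL - {PATH} is {pct:.0f}% full")
    sys.exit(2)
if pct >= WARNING:
    print(f"DISK WARNING - {PATH} is {pct:.0f}% full")
    sys.exit(1)
print(f"DISK OK - {PATH} is {pct:.0f}% full")
sys.exit(0)
```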
Solution 2:
I would use Nagios if I were you, for a number of reasons (here are two of them):
- You can use "templates" and set up server groups, and monitor different groups with different metrics. For example, put all of your web servers in one group, all of your database servers in another group, and so on.
- It's very easy to automate alerts to go out by email and other channels (and to create an alert escalation in case the first on-call responder doesn't respond to the alert within a certain amount of time).
A third reason is that Nagios already comes with a default monitoring schema that takes care of most things you'd want to monitor across the board, so you wouldn't have to set up your own monitoring metrics to begin with.
But if I were setting up my own metrics, I would monitor things like server load, free disk space, free memory, and swap space usage across all servers, and I would also do some external monitoring with ICMP pings and the like.
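For the external side, even a trivial reachability check goes a long way. Here is a sketch that shells out to the system `ping`; the host names are hypothetical, and a real setup would send an alert instead of printing:

```python
#!/usr/bin/env python3
"""Sketch of an external reachability check using the system ping command."""
import subprocess

HOSTS = ["web1.example.com", "db1.example.com"]  # hypothetical hosts

for host in HOSTS:
    # One ICMP echo request, waiting up to 2 seconds for a reply (Linux iputils flags).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    print(f"{host}: {'up' if result.returncode == 0 else 'DOWN'}")
```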
Solution 3:
You can first monitor system resources such as CPU and memory.
Then, you can monitor service-specific resources. For example, for a web server you can monitor the response time and the number of active connections; a sketch of both checks follows.
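As an illustration, here is a minimal sketch of both checks using only the Python standard library. The URL is a placeholder, and counting ESTABLISHED entries in /proc/net/tcp covers IPv4 only (add /proc/net/tcp6 for IPv6):

```python
import time
import urllib.request

def response_time(url):
    """Seconds to fetch `url`, including connection setup and reading the body."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()
    return time.monotonic() - start

def established_connections():
    """Count ESTABLISHED IPv4 TCP connections (state 01 in /proc/net/tcp)."""
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        return sum(1 for line in f if line.split()[3] == "01")

print(f"response time: {response_time('http://localhost/'):.3f}s")  # placeholder URL
print(f"active connections: {established_connections()}")
```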
As for default threshold values, I think they should be related to the expected usage pattern and how busy you expect the server to be.