What metrics should I watch when I monitor a server?

Solution 1:

The metrics that matter are those which:

  • Indicate a problem with the correct and proper operation of the services you provide; or
  • Indicate the root cause of a problem

What metrics matter to you depends on what you judge, in your professional opinion, to be the metrics that best fulfil those two criteria. If you don't have the expertise to be able to accurately judge that in advance, well... yeah. Collecting more data that you may never need is better than not collecting some data which you turn out to need later. (The caveat there is that if your monitoring is starting to interfere with the efficient operation of the service, you might need to turn it down a bit, or optimise the statistics collection).

If you're looking for a short-cut answer, I'm afraid I don't have one -- you're on a steep learning curve that speaks to the very heart of what it means to be a sysadmin. If you're in a situation where some downtime doesn't matter, great! You've got yourself an excellent learning opportunity. If you're going to end up getting sued or going out of business should this service not run perfectly, you might want to find someone with more experience to give you one-on-one guidance and mentoring.

Solution 2:

I just wrote and published a guide on exactly this subject:

  • Zen and the Art of System Monitoring

Allow me to summarize here: There are 3 main goals to think about when monitoring any sort of production system:

  1. Identify as many problems as possible;
  2. Identify those problems as early as possible; and
  3. Generate as few false alarms as possible (that means setting proper alerts)

And you want to do this by picking your metrics under the following framework:

  1. Monitor Potential Bad Things (things that could go wrong - this is often in the form of things that fill up / run out -- e.g. memory, disk, bandwidth; see the sketch after this list)
  2. Monitor Actual Bad Things (things that do go wrong despite your best efforts)
  3. Monitor Good Things (or the lack thereof - pay attention to things you want to happen and set an alert when they happen less frequently than they should)
  4. Tune and Improve (otherwise you risk "alert fatigue", aka the DevOps equivalent of "crying wolf")
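
To make point 1 concrete, here's a minimal sketch in Python of a "fills up / runs out" check; the mount point, the 90% threshold and the three-strikes rule (which also speaks to point 4 by suppressing one-off blips) are all placeholder values you'd tune for your own setup:

    import shutil
    import time

    MOUNT_POINT = "/"     # assumption: the filesystem you care about
    THRESHOLD_PCT = 90.0  # assumption: alert when usage exceeds 90%
    CONSECUTIVE = 3       # assumption: 3 breaches in a row before alerting, to cut false alarms
    INTERVAL_SEC = 300    # assumption: check every 5 minutes

    def disk_usage_pct(path):
        """Return how full the filesystem at `path` is, as a percentage."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total * 100

    breaches = 0
    while True:
        pct = disk_usage_pct(MOUNT_POINT)
        if pct > THRESHOLD_PCT:
            breaches += 1
            if breaches >= CONSECUTIVE:
                # In a real setup you'd email or page here instead of printing.
                print("ALERT: %s is %.1f%% full (%d consecutive checks over threshold)"
                      % (MOUNT_POINT, pct, breaches))
        else:
            breaches = 0
        time.sleep(INTERVAL_SEC)

In practice your monitoring tool does this threshold-and-debounce logic for you; the sketch is only there to show the shape of the idea.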

Every deployment is going to be a bit different so YMMV, but this is the framework that lots of seasoned pros use to think about things (whether they make it explicit or not).

[Edit for disclosure: I'm affiliated with Scalyr, a company that is involved in this space, and the link above is published on their site]

Solution 3:

The most basic approach is to keep an eye on CPU load, free memory & swap, disk space, disk I/O, and network/bandwidth I/O. This can be done using tools like munin or collectd. Some people like to monitor a lot of things, but if you keep it simple you can at least get the overall picture. I also recommend that you configure the monitoring tools to send you email alerts when things start to go wrong (i.e. using "thresholds" or similar).
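
If you're curious what a "threshold plus email alert" boils down to underneath, here's a minimal sketch in Python, assuming a Linux box with a local SMTP server that will relay mail for you; the load threshold and the addresses are placeholders:

    import os
    import smtplib
    from email.message import EmailMessage

    LOAD_THRESHOLD = 4.0               # assumption: tune this to your CPU count
    MAIL_FROM = "monitor@example.com"  # placeholder address
    MAIL_TO = "you@example.com"        # placeholder address
    SMTP_HOST = "localhost"            # assumption: local mail server relays for you

    def free_memory_kb():
        """Read MemAvailable from /proc/meminfo (Linux-specific)."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
        return None

    def send_alert(subject, body):
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = MAIL_FROM
        msg["To"] = MAIL_TO
        msg.set_content(body)
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

    load1, load5, load15 = os.getloadavg()
    if load5 > LOAD_THRESHOLD:
        send_alert("High load on server",
                   "5-minute load average is %.2f (MemAvailable: %s kB)"
                   % (load5, free_memory_kb()))

Run something like this from cron every few minutes; munin and collectd do the same kind of thing with graphs and state handling on top.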

Another very useful thing is to keep an eye on the most important server logs for anything unusual, i.e. error messages or perhaps even warnings. Such messages can be very common, depending on how the various pieces of software are configured to log. Usually, daemons have a config file where you can change the "LogLevel" from error (= only log when something is broken) to debug (= log everything). Check which daemons are running on your server, and set their log levels to error or warning. Then you can install a log file analysis tool such as OSSEC and train it to stay silent on messages that are acceptable and to alert when things are actually broken. These alerts can be sent to you via email.
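
A tool like OSSEC does the heavy lifting here, but the core idea is simple enough to sketch in Python; the log path and the "acceptable" pattern below are assumptions you'd replace with whatever is normal noise on your own server:

    import re
    import time

    LOG_FILE = "/var/log/nginx/error.log"  # assumption: the log you care about
    IGNORE_PATTERNS = [                    # assumption: noise you've decided is acceptable
        re.compile(r"client closed connection"),
    ]

    def follow(path):
        """Yield new lines appended to the file, like `tail -f`."""
        with open(path) as f:
            f.seek(0, 2)  # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    for line in follow(LOG_FILE):
        if any(p.search(line) for p in IGNORE_PATTERNS):
            continue  # stay silent on known-acceptable messages
        if "error" in line.lower() or "warn" in line.lower():
            # In a real setup you'd send this via email instead of printing it.
            print("LOG ALERT:", line.rstrip())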

For your specific services Nginx and MySQL, I recommend that you monitor their response time. This is good for two reasons: if you don't get a response at all, something is broken. And if you do get a response but the response time is unusually high - especially if it's not just a temporary spike but sustained over a period of, say, a few minutes or hours - then the service is struggling.
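
Here's a minimal sketch of such a response-time check in Python, assuming Nginx answers on http://localhost/ and that the mysqladmin client can connect locally without extra credentials (otherwise add the usual -u/-p options); the 2-second threshold is a made-up number:

    import subprocess
    import time
    import urllib.request

    URL = "http://localhost/"  # assumption: where Nginx is listening
    SLOW_SECONDS = 2.0         # assumption: what counts as "struggling"

    # Time an HTTP request to Nginx.
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            resp.read()
        elapsed = time.monotonic() - start
        if elapsed > SLOW_SECONDS:
            print("WARNING: %s answered in %.2fs" % (URL, elapsed))
    except Exception as exc:
        print("ALERT: no response from %s: %s" % (URL, exc))

    # Check that MySQL answers at all ("mysqladmin ping" prints "mysqld is alive").
    start = time.monotonic()
    result = subprocess.run(["mysqladmin", "ping"], capture_output=True, text=True)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        print("ALERT: MySQL did not respond: %s" % result.stderr.strip())
    elif elapsed > SLOW_SECONDS:
        print("WARNING: MySQL ping took %.2fs" % elapsed)

Sustained slowness is the interesting signal, so in practice you'd feed these timings into your monitoring tool and alert on the trend over a few minutes rather than on a single slow check.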