Good introduction to server monitoring?

Solution 1:

Server monitoring depends on which metrics matter to the server's purpose. For a web application there are quite a few areas to cover. You can think of endless metrics, but you'll usually want these bare minimums:

  • Availability of server and services
  • Disk space & usage
  • Network usage
  • Memory usage
  • CPU usage
  • Log files

The other part of monitoring, besides viewing the present, is keeping a record of the past. This gives you the ability to:

  • Plan for the future
  • Identify reasons when issues pop up

Will you run out of disk space in the next two months with the same growth? Are you seeing increases in CPU usage aligning with new feature deployments? Why are users having to wait four seconds to view a page?
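The disk space question can be answered roughly from the history you keep. Here's a minimal sketch, assuming straight-line growth between the oldest and newest sample; the function name and the numbers are made up for illustration:

```python
from datetime import date


def days_until_full(samples, capacity_bytes):
    """Estimate days until the disk is full from dated usage samples.

    samples: list of (date, used_bytes) pairs, oldest first.
    Returns None when usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        raise ValueError("need samples from at least two different days")
    growth_per_day = (last_used - first_used) / elapsed_days
    if growth_per_day <= 0:
        return None
    return (capacity_bytes - last_used) / growth_per_day


# Made-up numbers: a 100 GB disk that grew from 60 GB to 70 GB in 30 days.
GB = 10**9
history = [(date(2024, 1, 1), 60 * GB), (date(2024, 1, 31), 70 * GB)]
print(days_until_full(history, 100 * GB))
```

With those sample numbers the estimate comes out at roughly 90 days. Real growth is rarely this linear, so treat the result as an early warning rather than a deadline.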

I'll touch on each of the above metrics:

Availability

The simplest availability monitoring is the ping command, but the fact that a server answers pings doesn't mean services like the web server are available; the service itself may have crashed. More complex monitoring would run a test transaction on the website every hour to ensure that users can buy products.
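To illustrate the gap between "the server answers" and "the service works", here's a rough sketch with two checks: one that only confirms a TCP port accepts connections, and one that requires a 2xx answer from the web server. The function names and timeouts are my own choices, not part of any monitoring tool:

```python
import http.client
import socket
import urllib.error
import urllib.request


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_is_healthy(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

A call such as http_is_healthy("https://shop.example.com/health") (a hypothetical URL) tells you more than a ping does, but it still isn't a full test transaction; that would need to step through the actual purchase flow.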

Disk Space and Usage

The space metric is obvious: you'll want to know ahead of time, before your app stops working. The usage part is a bit more complex. It covers metrics like bytes read/written, input/output operations per second, etc. These can be important because if you see an increase in site latency correlated with a drop in disk performance, you may have a failing disk that needs multiple seeks or reads to satisfy a request. Don't forget to measure inode usage too; that's a metric I've forgotten about a couple of times within OpenVZ.
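Space and inode usage are both easy to read from a script. A minimal sketch for a Unix-like system (os.statvfs isn't available on Windows); the function name and the returned dict keys are just illustrative:

```python
import os
import shutil


def disk_report(path="/"):
    """Return percent of bytes and inodes used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    bytes_pct = 100.0 * usage.used / usage.total if usage.total else 0.0

    stats = os.statvfs(path)  # Unix only
    # Some filesystems report 0 total inodes; treat that as "not applicable".
    if stats.f_files:
        inodes_pct = 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files
    else:
        inodes_pct = None
    return {"bytes_pct": bytes_pct, "inodes_pct": inodes_pct}


print(disk_report("/"))
```

A filesystem can be out of inodes while still showing plenty of free bytes, which is why the two percentages are reported separately.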

Network Usage

Hitting your network bandwidth limit? Are you seeing the same numbers your ISP is seeing?

Memory Usage

When the system starts running out of memory it will start swapping to disk, which is far slower than RAM. This will hurt performance.
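On Linux, swap usage can be read from /proc/meminfo. Here's a small sketch that parses that layout; the sample text is made up, and on a real box you'd pass in the contents of /proc/meminfo instead:

```python
def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of {field: integer value}.

    Most fields are in kB; a few (such as HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def swap_used_pct(meminfo):
    """Percent of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - meminfo.get("SwapFree", 0)) / total


# Made-up sample in the /proc/meminfo layout.
sample = """MemTotal:        8000000 kB
MemAvailable:    1000000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""
info = parse_meminfo(sample)
print(swap_used_pct(info))
```

A swap percentage that keeps climbing over time is usually a better alarm signal than any single reading.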

CPU Usage

Is the CPU pegged at 100% during peak times? Maybe you can improve the user's experience by upgrading the server to a faster CPU or more CPUs. Does performance die because the CPU has to handle so many network controller interrupts? Maybe it's time to invest in a TCP offload card.

Log Files

  • The MySQL slow query log: Queries are running slower than your threshold. Review this file and improve the queries as needed. If you can't improve them and the query times correspond with heavy system load, then maybe it's time to upgrade.

  • The application's log files: What were users doing that caused all the heavy system load? Were most of them viewing a specific page? Why did only half of the user uploads work today?

  • The Apache log files: Knowing the numbers is useful for site design effectiveness, usability, advertising campaign measurements, broken pages or images, etc.

  • The system's log files: Hack attempts, hardware errors, various daemon messages.

It's usually best to ship system logs off to another server so an intruder can't cover their tracks.

Beyond these there are lots of things that can be monitored: transactions per second, server temperature, hard drive temperature & SMART, RAID status, backup reports, batch job statuses, etc.
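As one example of pulling numbers out of the Apache logs, here's a rough sketch that counts 404s per path to find broken pages or images. It assumes the access log uses the common or combined log format; the regex and the sample lines are mine, not output from a real server:

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def broken_paths(lines):
    """Count requests per path that ended in a 404."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts


# Made-up log lines for illustration.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2023:13:55:40 +0000] "GET /missing.png HTTP/1.1" 404 209',
    '192.0.2.3 - - [10/Oct/2023:13:56:02 +0000] "GET /missing.png HTTP/1.1" 404 209',
]
for path, hits in broken_paths(sample_log).most_common():
    print(path, hits)
```

For anything beyond a quick look, a dedicated log analyzer will handle edge cases (odd request lines, custom log formats) that this sketch ignores.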

The Tools

There are quite a few tools to accomplish some of the above. Other, more specific metrics will need to be self-coded if they aren't already available (showing the qmail queue size via SNMP is one such metric I've put together, because sometimes qmail would half-break: it would still accept new emails but not send any out).

Some of the tools I use that you can easily start with:

  • Nagios or Icinga - Nagios is one of the most popular *nix monitoring tools, and Icinga is a fork of it. There are quite a few monitoring plugins, like MySQL slave monitoring. I generally use this specifically for availability monitoring of all services. Set it up to send an email to your phone's email-to-text address for alerts. Browse through the "commands" and see which ones you can use.
  • Munin or collectd - These give you the graphs. A breeze to set up on CentOS. Set up the MySQL monitoring plugin for database insights like buffer usage.
  • WebSitePulse - Be aware that availability monitoring works best when done remotely. I use their POP3 monitoring to verify that Nagios is still running via a script I made.
  • AWStats - Processes the Apache log files into reports.
  • Google Analytics - More client details that aren't in the common Apache log like screen resolution and color depth.
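For the self-coded metrics mentioned earlier, a custom check can follow the Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here's a minimal sketch; the "queue size" label, the thresholds and the measure callable are hypothetical placeholders for whatever you actually measure:

```python
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(label, value, warn, crit):
    """Map a measured value to an (exit_code, status_line) pair."""
    if value is None:
        return UNKNOWN, f"UNKNOWN - {label} could not be measured"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label} is {value}"
    if value >= warn:
        return WARNING, f"WARNING - {label} is {value}"
    return OK, f"OK - {label} is {value}"


def main(measure):
    """Run a measurement callable and exit the way a Nagios plugin would."""
    # Hypothetical thresholds; tune them to your own system.
    code, line = evaluate("queue size", measure(), warn=100, crit=500)
    print(line)
    sys.exit(code)
```

Calling main with a function that returns the current value (or None when it can't be read) is enough for Nagios or Icinga to pick up the state from the exit code.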

Solution 2:

Probably the first stop would be vmstat. It will tell you various bits of information about virtual memory (not a virtual machine as in VMware or VirtualPC, etc., but the kernel's virtual memory subsystem), along with process, I/O and CPU activity. You can run vmstat with an update period, such as vmstat 1, which will print a new status line every second.
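If you want to feed vmstat numbers into a script, the output is easy to parse. A rough sketch, assuming the usual Linux (procps) layout where the column-name row starts with "r"; the sample output below is made up, and column sets vary between versions, which is why the names are read from the header rather than hard-coded:

```python
def parse_vmstat(output):
    """Map vmstat column names to the integers on the last data row."""
    rows = [line.split() for line in output.strip().splitlines()]
    # The row starting with "r" holds column names; the banner above it is skipped.
    header = next(row for row in rows if row and row[0] == "r")
    last_row = rows[-1]
    return dict(zip(header, (int(value) for value in last_row)))


# Made-up sample in the vmstat layout.
sample_output = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  10240 204800  51200 819200    0   12     5    40  300  600 20  5 70  5  0
"""
stats = parse_vmstat(sample_output)
print(stats["so"], stats["id"])
```

Keep in mind that the first data row vmstat prints is an average since boot, so when sampling with an interval it's the later rows that reflect current activity.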