Monitoring production server [closed]

We have three dedicated servers, split into several VPSes using OpenVZ. We use Munin to monitor the VPSes that host the production sites, and Monit on some of the VPSes to restart services when they fail.

We need a much better way to monitor all of our servers. With up to 14 VPSes, we'd like a central hub where we can see not only the data collected by Munin, but also extra stats on the network and the performance of our services.

Some of our requirements:
- SMS notification on failure (with the ability to set up custom checks).
- Log analyzer for the Apache error_log and some other logs.
- Must be central (meaning one server, with several nodes collecting the data).
- Doesn't need to be easy to install, but must be easy to maintain.
- Needs to be free.

I've been pointed to Nagios and Splunk. What do you think? Thanks.


Solution 1:

I have a similar setup, except with Xen in place. I have been very happy with a combination of:

  • Nagios for alerting (using PNP for some light graphing, and NagVis for a service state dashboard)
  • Ganglia for historical graphing of systems
  • OSSEC as a HIDS and, equally importantly, as a collector for centralized logging
    • Side note: there is a Splunk plugin for OSSEC that integrates the two tools very well. I am waiting for it to be ported to Splunk v4, though.
  • Splunk, lastly: once a few of the Splunk plugins are migrated over, we plan on using Splunk with some pre-filtering of logs (to keep from going over the free edition's cap)
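To give an idea of what I mean by pre-filtering, here is a minimal Python sketch, not our actual setup. It assumes the standard Apache error_log line layout (`[timestamp] [level] ...` in 2.2, `[timestamp] [module:level] ...` in 2.4). The names `keep_line` and `prefilter` are made up for this example, and nothing here is Splunk-specific: it only decides which lines are worth forwarding.

```python
import re

# Apache LogLevel names, most severe first (trace1..trace8 exist in 2.4 only).
LEVELS = ["emerg", "alert", "crit", "error", "warn", "notice", "info", "debug"]
LEVELS += [f"trace{i}" for i in range(1, 9)]

# Matches "[timestamp] [error]" (2.2 style) and "[timestamp] [core:error]" (2.4 style).
LEVEL_RE = re.compile(r"^\[[^\]]+\] \[(?:[\w.-]*:)?(\w+)\]")


def keep_line(line, min_level="warn"):
    """Return True if the line is at least as severe as min_level.

    Lines that do not match the assumed layout are kept, so that
    nothing unexpected is silently dropped.
    """
    match = LEVEL_RE.match(line)
    if match is None or match.group(1) not in LEVELS:
        return True
    return LEVELS.index(match.group(1)) <= LEVELS.index(min_level)


def prefilter(lines, min_level="warn"):
    """Return only the lines worth forwarding to the indexer."""
    return [line for line in lines if keep_line(line, min_level)]
```

Dropping everything below `warn` is just one possible threshold. The right cut-off depends on how chatty your logs are compared with the cap.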

I hope that sharing our monitoring setup will help you out :-)

Here are some useful links:

http://www.ibm.com/developerworks/linux/library/l-ganglia-nagios-1/index.html

https://www.ibm.com/developerworks/linux/library/l-ganglia-nagios-2/

http://www.ossec.net/main/splunk-ossec-integration

update:

I forgot to mention that we also use Matt Simmons' Nagios config layout, found here: http://www.standalone-sysadmin.com/blog/2009/07/nagios-config/

This layout made our Nagios configuration sane and much easier to maintain (thanks, Matt!)
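On the custom checks you asked about: a Nagios plugin is just an executable that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a rough Python sketch of a plain TCP connect check. The script and function names (`check_tcp`, `main`) are invented for this example, and it is only meant to show the convention, not to replace the stock plugins.

```python
import socket

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a plain TCP connect test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:  # refused, timed out, name resolution failed, ...
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """Parse HOST PORT, print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("TCP UNKNOWN - usage: check_tcp_sketch.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code

# When saved as a plugin script, end the file with:
#     import sys; sys.exit(main(sys.argv))
```

Nagios decides whether to notify based on the exit code alone, so the same check can feed e-mail or SMS through whatever notification command you configure.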

Solution 2:

I've had great success with Zabbix. It satisfies all of your points in one package.

The hardest part will be the Apache log monitoring, but Zabbix is extensible, so you can use LogWatch or a Perl script to grab that data for you.
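As an illustration of the kind of script I mean (written in Python here rather than Perl), this sketch counts Apache error_log entries per log level. It assumes the usual `[timestamp] [level]` or `[timestamp] [module:level]` prefix, and `count_levels` is just a name I picked. The resulting number could then be handed to Zabbix, for instance through an agent UserParameter, with the key name and script path being up to you.

```python
import re
from collections import Counter

# Matches "[timestamp] [error]" as well as "[timestamp] [core:error]".
LEVEL_RE = re.compile(r"^\[[^\]]+\] \[(?:[\w.-]*:)?(\w+)\]")


def count_levels(lines):
    """Count log entries per level; lines without a level prefix are skipped."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_file(path):
    """Convenience wrapper: count levels in the log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_levels(handle)
```

A monitoring item would typically report a single value, such as `count_file(path)["error"]`, and the trigger threshold would live in Zabbix itself.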

Solution 3:

I like OpManager, and it's free up to a certain number of nodes. It does all of the above and is pretty easy to install and maintain.