Best tool for monitoring backups, etc. and trending statistics from that data [closed]

Rather than writing your own monitoring solution, I strongly recommend that you use an existing tool so that all the basic monitoring and alerting functionality is already implemented. If you pick Nagios, you'll get the basic monitoring of server and network resources for free, and the following plugins should give you most of the rest of what you need:

check_file_ages_in_dirs will tell you whether the backup files exist; here's a blog post I wrote with some basic examples.

check_file can monitor file size and contents (using regexes), so you can output your backup statistics to a file and monitor them.
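As a rough illustration of the "write stats to a file, let Nagios watch it" idea, here is a minimal configuration sketch. It uses the stock check_file_age plugin (shipped with the standard nagios-plugins package) rather than the plugins named above, and the path, thresholds, and host name are made-up examples:

```
# Hypothetical command: WARNING if the stamp file is older than 25h
# (90000s), CRITICAL after 50h (180000s).
define command {
    command_name  check_backup_stamp
    command_line  $USER1$/check_file_age -w 90000 -c 180000 -f /var/backups/latest.stamp
}

define service {
    use                  generic-service
    host_name            backupserver
    service_description  Backup freshness
    check_command        check_backup_stamp
}
```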

The one thing you won't get from Nagios is trending and graphing; I recommend looking at Munin for that, as it's simple to set up and, like Nagios, has stacks of contributed plugins.


this should be pretty easy to set up with zabbix.

setting custom (and very powerful) thresholds is easy - you can write any expression you like, so something like "notify me if more than 3 of these 5 servers did not have a successful backup" is possible. you can also use 6 different severity levels and escalations to achieve flexible notification and alerting.
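as a sketch only (the host group, item key, and host names are hypothetical, and the exact syntax depends on your zabbix version), the "3 of 5 servers" rule could be built from an aggregate item plus a trigger on it:

```
# aggregate item defined on the zabbix server, with key:
#   grpsum["Backup clients","backup.ok",last,0]
# each monitored host reports 1/0 into its own backup.ok item.
#
# trigger expression on that aggregate item - fire when fewer than
# 3 hosts reported a successful backup:
{zabbix-server:grpsum["Backup clients","backup.ok",last,0].last(0)}<3
```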

zabbix has bundled data storage and visualisation capabilities - all data is stored in a database, and graphing a single metric needs no configuration - you get a graph for it "for free". for long-term storage & trending, one-hour averages are computed.

as for getting your backup data into zabbix, there are multiple possibilities: you can read it from files, you can launch custom commands, you can push it from the monitored machine using the command-line utility zabbix_sender... and there might be a few more possible approaches.
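the zabbix_sender route can be sketched like this - a tiny helper that prints a "<host> <key> <value>" line in the format accepted by zabbix_sender's --input-file option. the item key backup.dump.size is a made-up example and must exist server-side as a "Zabbix trapper" item:

```shell
#!/bin/sh
# Sketch: report a dump file's size in zabbix_sender input format.
# Pipe the output into: zabbix_sender -z zabbix.example.com -i -
report_dump_size() {  # usage: report_dump_size <dumpfile>
    printf '%s backup.dump.size %s\n' "$(hostname)" "$(stat -c %s "$1")"
}
```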

extending is easy - any custom command that returns data can be used to gather, store and visualise that data.

of course, general monitoring of operating systems, applications, snmp and ipmi devices and so on is possible.


execution

backups get orchestrated by backupninja. i use it just as a wrapper for my bash scripts - to have a single backup log. each script starts with

 # report the failure so it ends up in the backup log
 function handle {
         echo Error
         error "problem occurred"
 }
 set -e          # abort on the first failing command
 trap handle ERR # run handle before exiting on an error

so i get an error in the logs whenever any of the commands [e.g. mysqldump or rsync] fails.

all backups end up in an rdiff-backup repository, so i have n days of increments.

all backups are transmitted with rsync to a central storage server.

on the storage server all backups are verified daily; after successful verification of the data on the local disk, they get copied to an external usb drive.

verification

backupninja.log on all servers is monitored by nagios. i check that it contains only DEBUG and INFO messages; anything else triggers an alert.

every backup 'touches' a test file, whose presence and freshness are monitored with nagios on the central backup repository server.
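the freshness test can be sketched as a small standalone check returning nagios-style OK (0) / CRITICAL (2) statuses. the 26-hour threshold below is an arbitrary example; in practice the stock nagios check_file_age plugin does the same job:

```shell
#!/bin/sh
# Sketch of the stamp-file freshness check.
check_stamp() {  # usage: check_stamp <stamp-file> <max-age-seconds>
    mtime=$(stat -c %Y "$1" 2>/dev/null) || { echo "CRITICAL: $1 missing"; return 2; }
    age=$(( $(date +%s) - mtime ))
    if [ "$age" -gt "$2" ]; then
        echo "CRITICAL: $1 is ${age}s old"
        return 2
    fi
    echo "OK: $1 is ${age}s old"
    return 0
}
# example: check_stamp /srv/backups/web01/.backup-stamp 93600   # 26h
```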

additionally, the more critical sql dumps get checked for size [not just freshness] and completeness - e.g. at the end of a mysql dump i expect a fresh timestamp in the final line:

-- Dump completed on 2010-04-22 23:21:02
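
that check can be sketched as a small shell function - verify the dump exceeds a minimum size and ends with mysqldump's completion marker. the 1 KB minimum is an arbitrary example; pick a threshold sensible for the database in question:

```shell
#!/bin/sh
# Sketch: size + completeness check for a mysqldump output file.
check_dump() {  # usage: check_dump <dumpfile> <min-bytes>
    size=$(stat -c %s "$1" 2>/dev/null) || { echo "CRITICAL: $1 missing"; return 2; }
    [ "$size" -ge "$2" ] || { echo "CRITICAL: $1 only ${size} bytes"; return 2; }
    if tail -n 1 "$1" | grep -q '^-- Dump completed'; then
        echo "OK: $1 looks complete (${size} bytes)"
        return 0
    fi
    echo "CRITICAL: $1 has no completion marker"
    return 2
}
```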

all rdiff archives are verified daily before data gets synced to the USB drive, and again after the sync. so even if a nightly transfer is interrupted, i will still have a consistent repository on the USB disk, if nowhere else. the result of each check is logged to a file whose content and freshness are checked by nagios.
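the verification step itself can be as simple as a cron job using rdiff-backup's built-in verify; the time, repository path, and log path below are made-up examples:

```
# crontab on the storage server - verify the repository before the USB sync
30 5 * * * rdiff-backup --verify /srv/backups/web01 >> /var/log/rdiff-verify.log 2>&1
```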

usb disks get rotated weekly and stored offline, just in case. this might be overkill for larger amounts of data, but it works fine for ~300GB of slowly changing files/dumps.

trends

i use a simple custom munin plugin to plot the size of diffs/data for each rdiff repository.
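such a plugin can be sketched as below - this is an illustration, not my actual plugin, and the REPO path is a made-up example. munin calls a plugin with "config" once for graph metadata, then with no argument on every poll to fetch the values:

```shell
#!/bin/sh
# Sketch of a Munin plugin graphing an rdiff-backup repository's size,
# split into mirror data and increments (the rdiff-backup-data dir).
rdiff_sizes() {  # usage: rdiff_sizes <repo-dir> [config]
    if [ "$2" = "config" ]; then
        echo 'graph_title rdiff-backup repository size'
        echo 'graph_vlabel bytes'
        echo 'graph_category backup'
        echo 'mirror.label mirror data'
        echo 'increments.label increments'
        return 0
    fi
    echo "mirror.value $(du -sb --exclude=rdiff-backup-data "$1" | cut -f1)"
    echo "increments.value $(du -sb "$1/rdiff-backup-data" | cut -f1)"
}

REPO=${REPO:-/srv/backups/web01}   # hypothetical example path
if [ -d "$REPO" ]; then
    rdiff_sizes "$REPO" "$1"
fi
```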

the time backups take to execute can be checked in the backupninja logs, but for now i don't bother with it.