Monitor MySQL Replication
What is the best practice for monitoring a slave to make sure that it is
a) Still running b) Not too far being the master
I would like to alert by email if it is behind, happy to write a script or two to hook into command line applications.
Solution 1:
1
you can use maatkit's mk-heartbeat
2
you can look at result of
show slave status;
run on sql slave but Seconds_Behind_Master is disturbingly inaccurate at times.
3
you can hack your own solution, similar to mine - i use it both for nagios monitoring and for feeding munin charts showing 'seconds behind master'.
on master server i have simple cron job:
* * * * * root /usr/local/bin/repltest
where repltest is:
#!/bin/bash
start=`date +%s`
d=0
while [ $d -lt 60 ] ; do
echo "update repl_test set t= NOW(); " |mysql --defaults-file=/etc/mysql/debian.cnf repl_test
sleep 3
d=$(( `date +%s` - $start ))
done
on the slave i monitor value returned by:
echo "select UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(t) from repl_test" | mysql --defaults-file=/etc/mysql/debian.cnf -N repl_test
local time on all servers is synchronized via ntp.
repl_test db contains:
CREATE TABLE IF NOT EXISTS `repl_test` (`t` datetime NOT NULL) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `repl_test` (`t`) VALUES(NOW());
if you run replication - i suggest you also set up mk-table-checksum to compare content of your sql servers from time to time.
Solution 2:
pQd has it, checking 'show slave status' is the easiest way. Regarding Seconds_behind_master being inaccurate, I wanted to mention that the value is the difference in the timestamp for the statement being read out of the relay log by the slave SQL thread; it's not related to an estimate of how long it will take to catch up. For instance, a single long-running update that takes, say, an hour to run will cause the slave to appear up to an hour behind its master but once the statement completes it could very well only have 1 second of work left to do to catch up.
Also, you will want to grant 'REPLICATION CLIENT' to the user that you will be monitoring from in order to retrieve the slave status;
Solution 3:
The obvious answer as others have said is to use some variation on SHOW SLAVE STATUS. I use the checker built into Nagios personally, but that is because I do all kinds of other monitoring through nagios already. There is a catch though, it is possible for SHOW SLAVE STATUS to show both processes running and yet for the slave to be hung. From what we can tell (because we had the problem and looked into it), the problem occurs when there are network burps of some duration that is too short to kill the slave outright but too long for it to recover properly. We came up with a work around where we look at the time stamp of the latest entry in a table that changes routinely and compare it between the master and slave then throw an alert if it is "too far" behind. Not perfect and it would only work in certain circumstances, but consider yourself warned.