Diagnosing MySQL Replication Issues

We have a MySQL replication slave running on our backup server. Since a power failure last week it has stopped replicating. Before that it had been running uninterrupted for several months.

I've tried restarting both the master and the slave but this hasn't helped. I can access the master server from the slave, so the network isn't the issue.

Is there anything else I can do to try to diagnose what the problem is?

mysql> show slave status\G;
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: master
                Master_User: username
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000060
        Read_Master_Log_Pos: 46277494
             Relay_Log_File: mysqld-relay-bin.000348
              Relay_Log_Pos: 98
      Relay_Master_Log_File: mysql-bin.000060
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 46277494
            Relay_Log_Space: 98
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

ERROR:
No query specified


mysql> show master status\G;
*************************** 1. row ***************************
            File: mysql-bin.000069
        Position: 851796
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

ERROR:
No query specified

Update: The errors were going into daemon.log, not mysql.err, which would explain why I couldn't find them. The problem seems to be that the master is saying the log is unavailable, which doesn't make much sense, because that log (and the previous one) are still available on the master.

090710  9:17:35 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000060' at position 46277494, relay log './mysqld-relay-bin.000350' position: 98
090710  9:17:35 [Note] Slave I/O thread: connected to master 'username@master:3306',  replication started in log 'mysql-bin.000060' at position 46277494
090710  9:17:35 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position ( server_errno=1236)
090710  9:17:35 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
090710  9:17:35 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000060', position 46277494

Solution 1:

Welcome to the wonderful world of MySQL replication. I haven't hit your particular issue myself, but I've hit a lot of other weird problems and the proximate solution is to just resync from the master as though it's a brand new slave and be done with it.
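A resync along those lines might look like the following sketch. The hostname, user, and password are placeholders taken from the question; the log file and position come from the SHOW MASTER STATUS output above, but in practice you would use the coordinates recorded in the dump itself. --single-transaction only gives a consistent snapshot for InnoDB tables.

```sql
-- On the slave: stop replication before reloading the data.
STOP SLAVE;

-- On the master (shell), take a dump that records the binlog coordinates,
-- then load it on the slave:
--   mysqldump --all-databases --single-transaction --master-data=2 \
--       -u root -p > master-dump.sql
--   mysql -u root -p < master-dump.sql    (run on the slave)

-- On the slave: point replication at the coordinates from the dump's
-- "CHANGE MASTER TO" comment, then restart it.
CHANGE MASTER TO
    MASTER_HOST='master',
    MASTER_USER='username',
    MASTER_PASSWORD='password',
    MASTER_LOG_FILE='mysql-bin.000069',
    MASTER_LOG_POS=851796;
START SLAVE;
SHOW SLAVE STATUS\G
```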

Solution 2:

You should examine the slave's error log - it's usually quite explicit about what the problem is.

You should have the MySQL error logs tied into your monitoring system; otherwise your slaves are potentially worthless.

Moreover, you should have a monitor which checks the slave status.

And in order to be of any use at all, you'll also want to check that the slaves are actually in sync from time to time, perhaps with something like mk-table-checksum; ideally, tie the results of that into your monitoring system as well.
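A minimal slave-status check might parse the SHOW SLAVE STATUS output and alert unless both replication threads report Yes. As a sketch, the parsing is shown here against the status text captured in the question; in a real monitor you would instead populate it with something like status=$(mysql -h slave -u monitor -p... -e 'SHOW SLAVE STATUS\G').

```shell
# Status text as captured in the question; in production, fetch it live
# from the slave with the mysql client instead.
status='Slave_IO_Running: No
Slave_SQL_Running: Yes
Seconds_Behind_Master: NULL'

# Pull out the two thread-state fields.
io=$(printf '%s\n' "$status" | awk '/Slave_IO_Running:/  {print $2}')
sqlt=$(printf '%s\n' "$status" | awk '/Slave_SQL_Running:/ {print $2}')

# Replication is healthy only if both threads are running.
if [ "$io" = "Yes" ] && [ "$sqlt" = "Yes" ]; then
    echo "replication OK"
else
    echo "replication BROKEN (IO=$io, SQL=$sqlt)"
fi
```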

Solution 3:

Many people set skip-slave-start so that, if a slave stops replicating, they can make sure everything is okay before starting it up again. Try running START SLAVE and see if anything changes or if something gets logged. It's also strange that the slave SQL thread is running while the slave I/O thread is not. It's possible the local relay logs on the slave have been corrupted, though that should be reported in the logs. You might try bringing MySQL down and then deleting the relay logs.
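Rather than deleting relay-log files by hand on disk, one way to discard them (a sketch, using the Relay_Master_Log_File and Exec_Master_Log_Pos values from the SHOW SLAVE STATUS output above) is CHANGE MASTER TO, which purges and recreates the existing relay logs as a side effect:

```sql
STOP SLAVE;
-- Re-pointing the slave at its last executed master position makes the
-- server throw away the old relay logs and start fresh ones.
CHANGE MASTER TO
    MASTER_LOG_FILE='mysql-bin.000060',
    MASTER_LOG_POS=46277494;
START SLAVE;
```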

Solution 4:

As womble has mentioned, forget about troubleshooting replication errors. What worries me most about that approach is that you might succeed in getting replication to restart and think everything is fine, but what if some parts of your database are still out of sync?

The best option is to nuke the slave database and restart replication from a snapshot of the master. It shouldn't be as disruptive as you might think:

http://www.neotitans.com/resources/mysql/quick-replication-error-recovery-via-snapshots.html

Solution 5:

From the report above, the problem is visible: the Slave_IO_Running field should read Yes, but it shows Slave_IO_Running: No.

That is what's causing the issue. If this variable reads No, the I/O thread has stopped, so there is no replication any more. Check Last_SQL_Errno and Last_SQL_Error for more information on the cause: an error number of 0 and an empty message mean "no error". The Last_SQL_Error text also appears in the slave's error log.

To fix this issue, first stop the slave, then set:

mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;

This tells the slave to skip one query (the invalid one that caused replication to stop). If you'd like to skip two queries, you'd use SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 2; instead, and so on.

Then restart the slave and check the logs; hopefully this will fix the issue.
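Put together, the whole sequence above, run on the slave, looks like this:

```sql
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  -- skip the one offending event
START SLAVE;
SHOW SLAVE STATUS\G                     -- confirm both threads say Yes
```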