MySQL Slave stuck on a single bin log + bin log position for 17+ hours

I was on the right track here with the large query transactions from yesterday.

After I migrated the data, I executed a DELETE statement on the original table, to get rid of the rows that I had migrated away.

These tables are just full of tracking data, and therefore, don't have any primary or unique keys on them.

Due to how ROW based replication works, the slave, instead of executing the identical DELETE statement that was executed on master, it executes a DELETE statement for each row, which ends up looking something like this:

DELETE FROM table WHERE colA=foo AND colB=bar AND colC=baz....etc

And, since there is not an index that matches that query, the single threaded replication SQL thread executed 40 million + delete statements (or...was trying to) which took a long time to run because of all the scanning that had to be done to identify each row (the table was about 80 million rows in size at the time).

In the end, I dealt with this by stopping the slave thread (STOP SLAVE) skipping a single slave transaction (SET GLOBAL sql_slave_skip_counter = 1;) and re-starting the slave thread (START SLAVE).

This resulted in my Master and Slave being out of sync on the table in question here -- but I was able to leverage the nature of the row based replication to bring it back in sync by executing the following on the Master:

mysql> CREATE TABLE table_tmp; -- with the same schema as 'table' (SHOW CREATE TABLE table;)
mysql> RENAME TABLE table TO table_bak, table_tmp TO table;
mysql> INSERT INTO table ( SELECT * FROM table_bak );
mysql> DROP TABLE table_bak;

Since the DELETE was executed on the Master, the INSERT here only inserted the records I wanted to keep (the deleted ones are gone). And, since row based replication inserts each row individually instead of executing the same INSERT INTO...SELECT statement, the Slave table was only populated with the desired data. Then, the subsequent DROP TABLE statement drops the table on the slave without having to address each row individually.

The caveat here is that because the Master version of the table was still 30-40 million rows...the INSERT and consequential replication ends up locking up your slave for a little while (duplicating the problem above), but it's a much shorter stall (ended up being about 20 minutes) due to mysql not having to scan the database for rows to delete.

I hope this can be of help to someone in the future. Sorry it's winded, hope it was informative and helpful.

MySQL Slave stuck on a single bin log + bin log position for 17+ hours

Related

Recent Posts