Improving MySql replication speed

i'm looking for pointers on how to improve the replication speed of a mysql slave. It is fast enough for normal production but needs a long time to catch up if it falls behind for some reason. (If the server was turned off for or replication stopped for some hours etc.)

Some data:

  • The database is around 2TB and the binlogs written daily are around 500GB.
  • For normal production use the slave lag stays at 0 or 1 seconds
  • While trying to catch up it come closer to master at around 1 second per second. (So catching up 10 hours takes 10 hours)
  • The Slave is ONLY replaying the data as a failover machine, there are no selects running against it and it should be able to replace the master if he dies (everything we be slow, but thats ok)
  • The servers are in the same physical network

What can be done to improve that speed ?

(Yes, improving the hardware is an option but we want to make sure we checked other options before spending lots of money)

Edit:

Some data in response to @3molo question:

  • Bandwith usage while catching up: ~2MB/s to 10MB/s, no problem there i guess.
  • CPU usage is low, io wait is high (slave only has 2 cpus, arround 70 to 85% percent io wait while catching up. So this means the bottleneck is in the disks ?
  • Ram usage is 1,5GB used of 4GB (the rest is 'cached'). So i don't see that much potential there ? (Since it's only write load anyways (?))

If the bottleneck is the disks (6 disk raid 10 setup) are there any options besides increasing the raid or do we have to go with that ?


Solution 1:

This would be interesting to write down while catching up:

  • how much bandwidth is used

  • how much cpu utilization (especially i/o wait)

  • ram usage

In response to the new information about i/o wait being high; for innodb a lot can be done, take a look at. I've learned many things from mysqlperformanceblog. Here's some hints:

innodb_flush_method=O_DIRECT

"Avoid double buffering and reduce swap pressure, in most cases this setting improves performance. Though be careful if you do not have battery backed up RAID cache as when write IO may suffer."

innodb_flush_log_at_trx_commit=2

"If you’re not concern about ACID and can loose transactions for last second or two in case of full OS crash than set this value. It can dramatic effect especially on a lot of short write transactions."

Those did WONDERS for us, but the drawback is that you might loose a second of written data. This is because instead of write (flush to disk) every record, you flush every second.

You can read more on: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/

Solution 2:

MySQL replication is fast, really fast. It's primary limitation is the link layer and then the rest of the hardware.

In your case, resuming replication will probably show the binary logs or slave IO catching up fairly quickly. If not, improve your link first. Otherwise, if it is the SQL, you're going to have the physical limitations of the server, which is going to vary between disk IO, RAM, and CPU depending upon the type of load.