Improving rsync backup performance

What are the best techniques to improve rsync over ssh mirroring between unix boxes, assuming that one system will always have the master copy and the other system will always have a recent copy (less than 48 hours old)?

Also, what would one have to do to scale that approach to handle dozens of machines getting a push of those changes?


If:

  • The modification times of your files are correct
  • The files are not really big
  • No push can be missed (or there is some kind of backlog processing)

You can use find with -ctime or -cnewer to build a list of files changed since the last execution, and copy over only the modified files (just a glorified differential push).

This translates quite nicely to multiple hosts: just build a differential tar on the source and untar it on all the hosts.

It gives you something like this:

# List files changed since the previous archive was written, then pack only those
find . -type f -cnewer /tmp/files_to_send.tar.gz > /tmp/files_to_send.txt
tar zcf /tmp/files_to_send.tar.gz --files-from /tmp/files_to_send.txt
for HOST in host1 host2 host3 ...
do
    # note the z flag: the archive is gzipped, so it must be decompressed on extract
    cat /tmp/files_to_send.tar.gz | ssh "$HOST" "tar zxpf -"
done

The script has to be refined, but you get the idea.
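
One possible refinement, as a rough sketch: use a dedicated timestamp marker instead of the archive itself, and stamp it before scanning, so files modified while a push is running are picked up on the next run. The marker path and the remote target directory below are placeholders.

MARKER=/tmp/last_push.stamp                  # hypothetical marker file
[ -e "$MARKER" ] || touch -d @0 "$MARKER"    # first run: send everything (GNU touch)
touch "$MARKER.new"                          # stamp *before* scanning
find . -type f -cnewer "$MARKER" > /tmp/files_to_send.txt
tar zcf /tmp/files_to_send.tar.gz --files-from /tmp/files_to_send.txt
for HOST in host1 host2 host3
do
    cat /tmp/files_to_send.tar.gz | ssh "$HOST" "cd /target/dir && tar zxpf -"
done
mv "$MARKER.new" "$MARKER"                   # advance the marker only after the push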


Presuming that the data you're rsyncing isn't already compressed, turning on compression (-z) will likely help transfer speed, at the cost of some CPU on either end.
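
For example, a minimal sketch (the source path and host are placeholders):

# -z compresses the data stream in transit; skip it if the files are already
# compressed (tarballs, images, video), since you'd just burn CPU for nothing
rsync -avz /master/data/ user@mirror:/local/mirror/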


rsync has a way of doing disconnected copies. In other words, rsync can (conceptually) diff a directory tree and produce a patch file which you can later apply to any number of copies that are identical to the original destination.

It requires that you invoke rsync against the master and one mirror with --write-batch; this produces a batch file. You then transfer this file to any number of other targets and apply the batch to each of those targets using --read-batch.

If you keep a local copy of the last rsynced state (i.e. a copy of what the mirrors look like right now) on the same machine as the master, you can generate this "patch" on the master without even contacting any mirror:

On the master:

rsync --write-batch=my-batch.rsync /master/data /current/mirror

Add whatever other options you want. This will do two things:

  1. It will make /current/mirror change to reflect /master/data
  2. It will create a binary patch file (or batch file) called my-batch.rsync for later use.

Transfer the my-batch.rsync file from the master to all of your mirrors, and then on the mirrors, apply the patch so to speak:

rsync --read-batch=my-batch.rsync /local/mirror

Benefits of this approach:

  • master is not swamped
  • no need to coordinate/have access to the master / mirror(s) at the same time
  • different people with different privileges can do the work on the master and mirror(s).
  • no need to have a TCP channel (ssh, netcat, whatever; the file can be sent via e-mail ;-) )
  • offline mirrors can be synced later (just bring them on-line and apply the patch)
  • all mirrors guaranteed to be identical (since they apply the same "patch")
  • all mirrors can be updated simultaneously (since the --read-batch is only cpu/io intensive on the mirror itself)
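
Tying it together for dozens of machines, a minimal sketch (hostnames and paths are placeholders; use whatever options you normally would, but keep them identical for --write-batch and --read-batch):

#!/bin/sh
# On the master: generate one batch against the local "what the mirrors
# currently look like" copy, and update that copy at the same time.
rsync -a --write-batch=my-batch.rsync /master/data /current/mirror

# Push and apply the same batch on every mirror, in parallel.
for HOST in mirror1 mirror2 mirror3
do
    (
        scp my-batch.rsync "$HOST":/tmp/ &&
        ssh "$HOST" "rsync -a --read-batch=/tmp/my-batch.rsync /local/mirror"
    ) &
done
wait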

If you're transferring very large files with lots of changes, use the --inplace and --whole-file options. I use these for my 2 GB VM images and they helped a lot (mainly because the delta-transfer algorithm wasn't gaining much on these files). I don't recommend these options for most cases, though.

Use --stats to see how well your files are being transferred by rsync's delta-transfer algorithm.
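
For example, a rough sketch (paths and host are placeholders):

# --inplace updates the destination file directly (it is inconsistent while the
# transfer runs), --whole-file skips the delta algorithm, --stats reports how
# much data actually went over the wire
rsync -av --inplace --whole-file --stats /vm/images/ user@mirror:/vm/images/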


Another strategy is to make ssh and rsync themselves faster. If you are going over a trusted network (read: private), then encrypting the actual payload is not strictly necessary. You can use HPN-SSH, a patched build of OpenSSH whose optional "None" cipher encrypts only the authentication exchange and leaves the data stream unencrypted. Also, rsync version 3 starts transferring files while it is still building the file list, which is a huge time savings over rsync version 2. I don't know if that's what you were looking for, but I hope it helps. Also, rsync does support multicasting in some way, though I will not pretend to understand how.
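
As a hedged sketch, assuming both ends run an HPN-SSH build with the None cipher allowed (NoneEnabled/NoneSwitch are HPN-SSH-specific options, not part of stock OpenSSH, and must also be enabled on the server; paths and host are placeholders):

# Authentication is still encrypted; the bulk data stream is not.
# Only use this on a trusted, private network.
rsync -av -e "ssh -o NoneEnabled=yes -o NoneSwitch=yes" \
    /master/data/ user@mirror:/local/mirror/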