improving rsync backup performance
What are the best techniques to improve rsync-over-ssh mirroring between Unix boxes, assuming that one system will always have the master copy and the other system will always have a recent copy (less than 48 hours old)?
Also, what would one have to do to scale that approach to handle dozens of machines getting a push of those changes?
If:
- The modification times of your files are right
- The files are not really big
- No push can be missed (or there is some kind of backlog processing)
you can use find -ctime or find -cnewer to make a list of the files changed since the last execution, and copy over only the modified files (just a glorified differential push).
This translates quite nicely to multiple hosts: just do a differential tar on the source, and untar it on all the hosts.
It gives you something like this:
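# List files whose status changed since the previous archive was written
# (the archive must already exist from an earlier run).
find . -type f -cnewer /tmp/files_to_send.tar.gz > /tmp/files_to_send.txt
tar zcf /tmp/files_to_send.tar.gz --files-from /tmp/files_to_send.txt
for HOST in host1 host2 host3 ...
do
    # Pass z explicitly: GNU tar does not auto-detect gzip on piped input.
    ssh "$HOST" "tar xzpf -" < /tmp/files_to_send.tar.gz
done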
The script has to be refined, but you get the idea.
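One possible refinement, as a rough sketch rather than a finished tool: keep a separate stamp file instead of comparing against the previous tarball, and record the start time before scanning so that files changed during the run are picked up next time. The function name make_delta and its three arguments are made up for illustration; the stamp and output paths are assumed to be absolute, and GNU find and tar are assumed.

```shell
# make_delta SRC_DIR STAMP_FILE OUT_TARBALL   (hypothetical helper)
# STAMP_FILE and OUT_TARBALL are assumed to be absolute paths.
make_delta() {
    src=$1
    stamp=$2
    out=$3
    # Record the start time first, so changes made while we scan are not lost.
    touch "$stamp.new"
    if [ -e "$stamp" ]; then
        # Files whose status changed after the last successful run.
        (cd "$src" && find . -type f -cnewer "$stamp" -print0) > "$out.list"
    else
        # First run: no stamp yet, so send everything.
        (cd "$src" && find . -type f -print0) > "$out.list"
    fi
    # --null lets tar read the NUL-separated list (safe for odd file names).
    tar -czf "$out" -C "$src" --null -T "$out.list" || return 1
    # Only advance the stamp once the archive was written successfully.
    mv "$stamp.new" "$stamp"
}
```

The resulting tarball can then be pushed to each host with the same ssh loop as above.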
Presuming that the data you're rsyncing isn't already compressed, turning on compression (-z) will likely help transfer speed, at the cost of some CPU on either end.
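If you are not sure whether your data is already compressed, a quick way to get a feel for it is to gzip a sample file and compare sizes. This is only a rough sketch; the helper name gzip_percent is made up, and gzip is only a stand-in for whatever compression rsync ends up using.

```shell
# gzip_percent FILE   (hypothetical helper)
# Prints the gzip-compressed size as a percentage of the original size.
gzip_percent() {
    # $(( ... )) also strips the padding some wc implementations print.
    orig=$(( $(wc -c < "$1") ))
    if [ "$orig" -eq 0 ]; then
        echo 100
        return
    fi
    comp=$(( $(gzip -c "$1" | wc -c) ))
    echo $(( comp * 100 / orig ))
}
```

If the number comes out well under 100 for typical files, -z is probably worth it, e.g. rsync -az /master/data/ mirror:/mirror/data/ (paths and host name invented for the example). If it is close to 100, compression mostly just burns CPU.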
rsync has a way of doing disconnected copies. In other words, rsync can (conceptually) diff a directory tree and produce a patch file, which you can later apply to any number of targets that are identical to the original destination.
It requires that you invoke rsync on the master and mirror with --write-batch; this produces a file. You then transfer this file to any number of other targets and apply the batch to each of them using --read-batch.
If you keep a local copy of the last rsynced state (i.e. a copy of what the mirrors look like right now) on the same machine as the master, you can generate this "patch" on the master without even contacting any mirror:
On the master:
rsync --write-batch=my-batch.rsync /master/data /current/mirror
Add whatever other options you want (in practice at least -a or -r, since without recursion rsync skips directories). This will do two things:
- It will make /current/mirror change to reflect /master/data
- It will create a binary patch file (or batch file) called my-batch.rsync for later use.
Transfer the my-batch.rsync file from the master to all of your mirrors, and then on the mirrors, apply the patch, so to speak:
rsync --read-batch=my-batch.rsync /local/mirror
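To try the two steps without touching real machines, here is a small local dry run, offered as a sketch only. The function name batch_demo and the directory names master, staging and mirror are invented; staging plays the role of /current/mirror and mirror plays the role of a remote /local/mirror. I am assuming -a is passed to both invocations.

```shell
# batch_demo WORKDIR   (hypothetical helper; everything lives under WORKDIR)
batch_demo() {
    work=$1
    mkdir -p "$work/master" "$work/staging" "$work/mirror"
    echo "version 1" > "$work/master/file.txt"
    # Step 1: update the staging copy and record the changes in a batch file.
    rsync -a --write-batch="$work/my-batch.rsync" \
        "$work/master/" "$work/staging/" || return 1
    # Step 2: replay the batch on a tree that was identical to staging
    # before step 1 (here both started out empty).
    rsync -a --read-batch="$work/my-batch.rsync" "$work/mirror/" || return 1
    # No output and exit status 0 means master and mirror now match.
    diff -r "$work/master" "$work/mirror"
}
```

Usage would be something like batch_demo "$(mktemp -d)". The important precondition is the one in step 2: the batch only makes sense on a target that looked like the staging copy did before the batch was written.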
Benefits of this approach:
- master is not swamped
- no need to coordinate/have access to the master / mirror(s) at the same time
- different people with different privileges can do the work on the master and mirror(s).
- no need to have a TCP channel (ssh, netcat, whatever; the file can be sent via e-mail ;-) )
- offline mirrors can be synced later (just bring them on-line and apply the patch)
- all mirrors guaranteed to be identical (since they apply the same "patch")
- all mirrors can be updated simultaneously (since the --read-batch is only CPU/IO intensive on the mirror itself)
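For dozens of mirrors, the transfer and the apply step can be combined and run in parallel, since rsync accepts --read-batch=- to read the batch from standard input. A minimal sketch, assuming key-based ssh logins, the same destination path on every mirror, and that -a was also used when the batch was written; the function name apply_batch_everywhere is made up.

```shell
# apply_batch_everywhere BATCH_FILE DEST_DIR HOST...   (hypothetical helper)
apply_batch_everywhere() {
    batch=$1
    dest=$2
    shift 2
    pids=""
    for host in "$@"; do
        # Stream the batch over ssh; each mirror applies it on its own CPU/disk.
        ssh "$host" "rsync -a --read-batch=- '$dest'" < "$batch" &
        pids="$pids $!"
    done
    # Wait for every mirror and remember whether any of them failed.
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}
```

Something like apply_batch_everywhere my-batch.rsync /local/mirror host1 host2 host3 would then update all three at once. A mirror that fails (or is offline) can simply get the same batch again later.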
If you're transferring very large files with lots of changes, use the --inplace and --whole-file options. I use these for my 2 GB VM images and it helped a lot (mainly because the rsync protocol wasn't saving much by sending incremental data for these files). I don't recommend these options for most cases, though.
Use --stats to see how well your files are being transferred using the rsync incremental protocol.
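The two lines of --stats output that matter here are "Literal data" (bytes that had to be sent) and "Matched data" (bytes reused from the copy already on the receiver). A small sketch that turns them into one number; the function name matched_percent is made up, and the parsing assumes the usual English "Literal data: N bytes" / "Matched data: N bytes" lines.

```shell
# matched_percent   (hypothetical helper)
# Reads `rsync --stats` output on stdin and prints the percentage of file
# data that was matched on the receiver instead of being resent.
matched_percent() {
    awk -F': *' '
        # Strip thousands separators and the trailing " bytes".
        /^Literal data:/ { gsub(/[^0-9]/, "", $2); lit = $2 }
        /^Matched data:/ { gsub(/[^0-9]/, "", $2); mat = $2 }
        END {
            if (lit + mat > 0) printf "%d\n", mat * 100 / (lit + mat)
            else print 0
        }
    '
}
```

For example: rsync -a --stats /master/data/ mirror:/mirror/data/ | matched_percent (paths and host invented). A value near 0 on big, heavily changed files suggests the delta algorithm is not buying much for them.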
Another strategy is to make ssh and rsync faster. If you are going over a trusted network (read: private), then encrypting the actual payload is not necessary. You can use HPN-SSH, a patched OpenSSH that can encrypt only the authentication and send the payload in the clear. Also, rsync version 3 starts transferring files while it is still building the file list. This of course is a huge time savings over rsync version 2. I don't know if that's what you were looking for, but I hope it helps. Also, rsync's batch mode can be combined with multicast: the man page notes that a batch file can be distributed to many hosts over a multicast transport, though I will not pretend to understand the details.
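Since the file-list speedup depends on the rsync version, it may be worth checking what each box actually runs. A tiny sketch that pulls the major version out of the first line of rsync --version; the helper name rsync_major is made up, and as far as I know both ends need to be version 3 or later to get the incremental file list.

```shell
# rsync_major   (hypothetical helper)
# Reads `rsync --version` output on stdin and prints the major version number.
rsync_major() {
    # Some 3.2.x builds print "version v3.2.x", so allow an optional "v".
    sed -n 's/^rsync  *version v\{0,1\}\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}
```

Usage: rsync --version | rsync_major locally, or ssh host1 rsync --version | rsync_major for a mirror (host name invented).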