Using rsync to quickly upload a file that is similar to another file

I'm putting together a deployment script which tars up a directory of my code, names the tar file after the current date and time, pushes that up to the server, untars it in a directory of the same name and then swaps a "current" symlink to point at the new directory. This means my older deployments stay around in timestamped directories (at least until I delete them).
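
In case it helps, the script does roughly this (hostname and paths are simplified stand-ins for the real ones):

STAMP=$(date +%Y-%m-%d-%H%M%S)

# tar up the code directory, named after the current date and time
tar czf "$STAMP.tar.gz" -C mycode .

# push it up, unpack into a directory of the same name, swap the symlink
scp "$STAMP.tar.gz" user@host:/deploy/
ssh user@host "mkdir /deploy/$STAMP &&
               tar xzf /deploy/$STAMP.tar.gz -C /deploy/$STAMP &&
               ln -sfn /deploy/$STAMP /deploy/current"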

The tar file is around 5MB and it takes nearly a minute to transfer. I'd like to speed this up.

I assume each new tarball is pretty similar in structure to the previous tarball (since I'm often only changing a few lines of source code in between deployments). Is there a way to take advantage of this fact to speed up my uploads using rsync?

Ideally I'd like to say "hey rsync, upload this local file called 2009-10-28-222403.tar.gz to my server, but it's only a tiny bit different from the file 2009-10-27-101155.tar.gz which is already up there, so try to just send over the differences". Is this possible, or is there another tool I should be looking at?


Solution 1:

I'm putting together a deployment script which tars up a directory of my code, names the tar file after the current date and time, pushes that up to the server, untars it in a directory of the same name and then swaps a "current" symlink to point at the new directory.

Personally, I think you should skip tar altogether and instead look at the --link-dest or --copy-dest feature of rsync. The --link-dest option is pretty cool: it tells rsync to look at the previous sync of the directory, and if a file is identical it hard-links it to the existing copy, skipping the need to retransfer (or store) that file again.

mkdir -p /srv/codebackup/2009-10-12 \
         /srv/codebackup/2009-10-13

# first backup on 10-12
rsync -a sourcehost:/sourcepath/ \
         /srv/codebackup/2009-10-12/

# second backup made on 10-13
rsync -a --link-dest=/srv/codebackup/2009-10-12/ \
         sourcehost:/sourcepath/ \
         /srv/codebackup/2009-10-13/

Your second run of rsync will only transfer changed files. Identical files will be hard linked together. You can delete the older tree and the new backup will still be 100% complete. You will save a lot of storage space since you will not be keeping multiple copies of identical files.
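
To tie this into your "current" symlink scheme, you could repoint the link at the newest dated tree once each sync finishes; for example (the dates and paths just continue the illustration above):

# next day's sync, hard-linking anything unchanged against the previous tree
rsync -a --link-dest=/srv/codebackup/2009-10-13/ \
         sourcehost:/sourcepath/ \
         /srv/codebackup/2009-10-14/

# repoint "current" at the freshly synced directory
ln -sfn /srv/codebackup/2009-10-14 /srv/codebackup/current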

Solution 2:

rsync AFAIK can't do this directly, but you can structure your tarballs to make them transfer faster, taking advantage of the fact that they're similar.

Check out gzip's --rsyncable flag. From the manual:

While compressing, synchronize the output occasionally based on the input. This increases size by less than 1 percent in most cases, but means that the rsync(1) program can much more efficiently synchronize files compressed with this flag. gunzip cannot tell the difference between a compressed file created with this option, and one created without it.

This will make your similar tarballs genuinely similar at the byte level, so that rsync's delta algorithm can recognize the parts they have in common.

You'd probably have to modify your deployment scripts a little to reduce the amount of transfer, because I don't think rsync can be told to "look at another file". What I'd do is always rsync something called current.tar.gz (compressed with gzip and the above flag), and then rename it for archival purposes on the server. Alternatively, rename an old tarball on the server to the name of the tarball that is about to be uploaded, so that rsync has something to diff against.
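
As a rough sketch of that first approach (the host and paths here are just placeholders):

# build the tarball with rsync-friendly compression
tar cf - -C mycode . | gzip --rsyncable > current.tar.gz

# rsync only sends the parts that differ from the current.tar.gz already on the server
rsync -av current.tar.gz user@host:/deploy/current.tar.gz

# keep a dated copy on the server for the archive
ssh user@host 'cp /deploy/current.tar.gz /deploy/2009-10-28-222403.tar.gz'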

Solution 3:

I think using tar here is the wrong answer. What I would do for this particular case is cp -rp your "current" code on the server to a dated directory, then rsync your local code checkout against "current". So basically this:

  1. ssh user@host cp -rp /path/to/current /path/to/2009-10-28/

  2. rsync -a /local/copy/ user@host:/path/to/current/

This gives you the backup copy you want, syncs your changes, and will be much faster than tar+scp+untar.
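
Wrapped up in a couple of lines of script, that might look something like this (host and paths are placeholders):

STAMP=$(date +%Y-%m-%d)

# 1. snapshot the current deployment into a dated directory on the server
ssh user@host "cp -rp /path/to/current /path/to/$STAMP"

# 2. push only the files that actually changed
rsync -a /local/copy/ user@host:/path/to/current/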

Hope that helps!

Solution 4:

Ok, I haven't tried this, but it'd be interesting to see how it works in your case.

You'll want to minimise the changes between each invocation of tar. Would it help to make sure that the files are always in the same order in each archive? You can then compress with the --rsyncable option.

Can you order the files by last modified date? That way the files that don't change are always in the same order, and at the beginning, and the files that change are at the end, so when they change length they don't break the blocking algorithm.

# oldest files first, so unchanged content stays at the same offsets
find . -type f -printf '%T@ %p\n' | sort -n | cut -d' ' -f2- | tar cvf - -T - | gzip -9 --rsyncable

Another thing to consider is that tar supports blocking and will pad each file out with nulls to a block boundary. Check out the block-size options. You could set this to the rsync block size (ah, that depends on the size of the file; erm, how about 8k?), which will help the algorithm when a single file is reordered. Now, drop the gzip on each end (or gzip the last-but-one archive on the server if you're worried about disk space), and I think you might get the speed-up you want.
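
If you want to play with that, GNU tar's -b (--blocking-factor) option sets the record size in 512-byte units, so -b 16 gives the 8k mentioned above; whether that actually lines up usefully with rsync's block matching is speculation on my part:

# 16 x 512-byte blocks = 8k records; rsync the uncompressed .tar afterwards
tar -b 16 -cf 2009-10-28-222403.tar -C mycode .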

I'm not that impressed with the --rsyncable option. I'm using it on daily postgres dumps, and find that, although only a small amount of the dump changes each day, rsync uses about half the bandwidth of just copying the .gz around. I might ask a question about this actually.

I think you'll be best off with the efficient rsync of individual files suggested in the other answers, and then generating the .tar.gz from the resulting directory on the server (or on the client, if that's where you want to keep your archive). Also, what's wrong with your version control system as a record of what you deployed when? You're not deploying uncommitted code, are you?