Synchronizing very large folder structures

If you can trust the filesystem's change timestamps, you can speed things up by combining Rsync with the UNIX/Linux 'find' utility. 'find' can assemble a list of every file that has changed within the past day, and then hand ONLY that shortened list of files/directories to Rsync (via 'xargs'). This is much faster than having Rsync compare the metadata of every single file on the sender against the remote server.

In short, the following command runs Rsync ONLY on the files and directories that have changed in the last 24 hours; Rsync will not even look at anything else. (The 'cd' plus Rsync's '-R'/'--relative' flag make sure each file keeps its position in the directory tree on the remote side.)

cd /local/data/path/ && find . -mindepth 1 -ctime -1 -print0 | xargs -0 -I {} -- rsync -aR {} remote.host:/remote/data/path/
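
If you have GNU find and a reasonably recent rsync, a variation worth considering (just a sketch, using the same placeholder paths) is to hand the whole list to a single rsync invocation with '--files-from', instead of spawning one rsync process, and one connection, per changed item:

find /local/data/path/ -mindepth 1 -ctime -1 -printf '%P\0' | rsync -a --from0 --files-from=- /local/data/path/ remote.host:/remote/data/path/

Here "-printf '%P\0'" emits each path relative to the starting directory, NUL-terminated, and '--from0' tells rsync to expect NUL-separated names on its standard input.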

In case you're not familiar with the 'find' command, it recurses through a specific directory subtree, looking for files and/or directories that meet whatever criteria you specify. For example, this command:

find . -name '.svn' -type d -ctime -1 -print

will start in the current directory (".") and recurse through all sub-directories, looking for:

  • any directories ("-type d"),
  • named ".svn" ("-name '.svn'"),
  • with metadata modified in the last 24 hours ("-ctime -1").

It prints the path name ("-print") of anything matching those criteria to standard output. The options '-name', '-type', and '-ctime' are called "tests", and the option '-print' is called an "action". The man page for 'find' has a complete list of tests and actions.

If you want to be really clever, you can use the 'find' command's '-cnewer' test instead of '-ctime', to make this process more fault-tolerant and flexible. '-cnewer' tests whether each file/directory in the tree has had its metadata modified more recently than some reference file. Use 'touch' to create the NEXT run's reference file at the beginning of each run, right before the 'find... | rsync...' command executes. Here's a basic implementation:

#!/bin/bash
# Locate the reference file left behind by the previous run, if any.
curr_ref_file=$(ls /var/run/last_rsync_run.* 2>/dev/null | head -n 1)
# Create the NEXT run's reference file before the transfer starts, so
# anything modified while rsync is running gets picked up next time.
next_ref_file="/var/run/last_rsync_run.$RANDOM"
touch "$next_ref_file"
# On the very first run there is no reference file, so transfer everything.
test -n "$curr_ref_file" && newer_test="-cnewer $curr_ref_file"
cd /local/data/path/ || exit 1
find . -mindepth 1 $newer_test -print0 | xargs -0 -I {} -- rsync -aR {} remote.host:/remote/data/path/
test -n "$curr_ref_file" && rm -f "$curr_ref_file"

This script automatically knows when it was last run, and it only transfers files modified since then. While this is more complicated, it protects you in situations where the job failed to run for more than 24 hours, due to downtime or some other error.
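
For instance, assuming you save the script as /usr/local/bin/rsync_incremental.sh (the name is just an example) and make it executable, a crontab entry like this would run it hourly and keep a log of each run:

0 * * * * /usr/local/bin/rsync_incremental.sh >> /var/log/rsync_incremental.log 2>&1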


Try Unison. It was specifically designed to solve this problem: it keeps the change lists (the "building file list" step) locally on each server, which speeds up calculating the delta and reduces the amount of data sent across the wire afterwards.
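
A minimal invocation looks something like this (the paths and host are the same placeholders as above, and Unison must be installed on both machines; '-batch' skips the interactive prompts):

unison -batch /local/data/path ssh://remote.host//remote/data/path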


http://oss.linbit.com/csync2/ is designed for this sort of thing; I'd give it a try.
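
Setup is driven by a config file; a minimal /etc/csync2.cfg might look something like the sketch below (the group name, hostnames, key path, and included directory are all placeholders, so check the csync2 documentation before relying on it):

group mycluster
{
    host local.host remote.host;
    key  /etc/csync2.key_mycluster;
    include /local/data/path;
}

Once the config file and the pre-shared key (generated with 'csync2 -k') are distributed to both hosts, running 'csync2 -xv' checks for local changes and pushes them out.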


If you're using the -z switch with rsync, try running without it. For some reason I've seen this speed up even the initial enumeration of files.
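
In other words, benchmark something like these two runs against each other (same placeholder paths as above):

rsync -a -z /local/data/path/ remote.host:/remote/data/path/   # with compression
rsync -a    /local/data/path/ remote.host:/remote/data/path/   # without compression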