rsync takes too long to run

I have a load-balanced setup with two servers that mirror each other. The main use of the balancer is serving static files. Let's call them Server A and Server B.

Server A retrieves files from a remote host on a different network. The files are media files for a community website, so rsync needs to run every 30 minutes to keep them in sync; otherwise users will see broken images and so on. Server A also serves the files over HTTP, peaking at 400MB/s.

Server B rsyncs its files from Server A. To keep the two consistent, that rsync also runs every 30 minutes. Server B also serves the files over HTTP, peaking at 400MB/s.

The load on A and B has been very high: load average 8.00, 8.10, 7.68 and more.

How can I improve my setup to reduce server load and improve rsync efficiency?

Thank you.


Solution 1:

It depends on what is causing the high processor utilization. If it is caused by rsync generating file checksums, there are some things you can do.

You may not need checksums at all. By default, rsync decides a file is different based on modification time and file size. If you add the "-c" option, it will decide a file is different by comparing checksums. Omit the option if you don't need checksums.
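As a minimal sketch of the difference, the two wrappers below differ only in the "-c" flag. The host name and paths are hypothetical placeholders, not taken from the question.

```shell
# Hypothetical source and destination; adjust to your layout.
SRC="serverA:/var/www/media/"
DEST="/var/www/media/"

# Default quick check: a file is re-sent only if its size or
# modification time differs.
sync_quick() {
    rsync -a "$SRC" "$DEST"
}

# Checksum mode: every file is read in full on both ends to compute
# a checksum, which costs far more CPU and disk I/O on a large tree.
sync_checksum() {
    rsync -a -c "$SRC" "$DEST"
}
```

If the cron job currently calls something like the second form, switching to the first is the cheapest change to try.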

If you do need checksums, there are some circumstances where checksum caching may work. If the files you are syncing do not change often, you can generate the checksums once per day in a cron job, and rsync will use the generated checksums. Rsync will still generate checksums for any new files or for any files that have a different modification time or size from when the checksum was created.

This info is based on rsync 3.0.5 but should work the same in 3.0.6. You'll need to recompile rsync; the checksum caching is a patch. Here's what I used to compile rsync:

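# The source and patches tarballs are expected in $scriptroot/rsync-source.
rsync_version="3.0.5"
scriptroot="Set this to your working directory."
mkdir -p "$scriptroot/rsync-source/rsync-working"
cd "$scriptroot/rsync-source/rsync-working"
# Unpack the rsync source and the matching patches tarball.
tar xvzf "../rsync-${rsync_version}.tar.gz"
tar xvzf "../rsync-patches-${rsync_version}.tar.gz"
cd "$scriptroot/rsync-source/rsync-working/rsync-${rsync_version}"
# Apply the checksum-caching patch, then build.
patch -p1 < patches/checksum-reading.diff
./configure
make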

Then use rsyncsums to generate the checksums. When invoking rsync, use the "--sumfiles=lax" option.
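A rough sketch of how the two pieces might fit together is below. The paths and host name are hypothetical, and the rsyncsums flags shown (-r to recurse, -m to pick the mode) are assumptions you should confirm against `rsyncsums --help` on your build. The cached sums only matter when rsync is comparing by checksum, so "-c" is included.

```shell
# Hypothetical paths; adjust to your layout.
MEDIA_DIR="/var/www/media"
PEER_DEST="serverB:/var/www/media/"

# Run once a day from cron: refresh the cached checksum files.
# The -r and -m flags are assumptions; check `rsyncsums --help`.
refresh_sums() {
    rsyncsums -r -m lax "$MEDIA_DIR"
}

# Run every 30 minutes: checksum-based sync that reads the cached sums.
# --sumfiles is only available in an rsync built with the
# checksum-reading patch.
sync_with_sums() {
    rsync -a -c --sumfiles=lax "$MEDIA_DIR/" "$PEER_DEST"
}
```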

Solution 2:

You don't state which version you're using. If you're on RHEL/CentOS, you're likely stuck on 2.x. The problem with 2.x is that it scans the entire directory tree and sends the complete file list BEFORE it transfers anything. This is bad because, if the tree is big enough, the directory data may already have been pushed out of the cache by the time the transfer actually starts, which results in twice the disk activity. Additionally, if the connection is flaky, you may never transfer anything because the connection drops during the scan.

Starting with version 3.0, however, the directory structure is scanned incrementally as the transfer proceeds. To upgrade to 3.x on RHEL/CentOS, I downloaded a Fedora SRPM (Fedora 10 or earlier, because the package format changed afterwards and is slightly incompatible with RHEL's) from http://koji.fedoraproject.org and ran:

rpmbuild --rebuild rsync.xxxx.src.rpm

You need to install the new package on both machines.
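To see which major version a machine is running, something like the helper below should work. It assumes the first line of `rsync --version` has the usual "rsync  version X.Y.Z  protocol version N" shape; the function name is made up for illustration.

```shell
# Print the major version number from the first line of `rsync --version`.
# Prints nothing if the line does not have the expected shape.
rsync_major() {
    rsync --version |
        sed -n '1s/^rsync  *version  *v\{0,1\}\([0-9][0-9]*\)\..*/\1/p'
}
```

Run it on both servers; if either side prints 2, that transfer still builds the full file list up front.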

Solution 3:

A lot of sites suggest -avzuh for archiving. After some testing, I found that -z (compression) was what made it take forever for me (backing up my 500 GB portable HD from work to home), even when no changes had been made.

With -z it took about 1 hour (no changes); without it, it takes about 30 seconds.
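To repeat that comparison on your own data, a small timing wrapper along these lines may help. The function name and the paths in the usage comment are hypothetical.

```shell
# Run rsync with the given arguments and report elapsed wall-clock seconds.
timed_sync() {
    start=$(date +%s)
    rsync "$@"
    rc=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit $rc)"
    return "$rc"
}

# Usage (hypothetical paths):
#   timed_sync -avuh  /mnt/portable/ /home/me/backup/
#   timed_sync -avzuh /mnt/portable/ /home/me/backup/
```

Run each variant twice and compare the second runs, so that both measurements cover the "no changes" case.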

Solution 4:

For both load balancing and fail-over / disaster recovery, I'm starting to experiment with DRBD. It's like RAID-1 over a network.

Sticking with rsync: if you're primarily mirroring a static set of files, pass rsync a file list (the --files-from option). That way rsync won't spend time up front polling your local filesystem to build the list, which saves a lot of time. File lists are quite cool: if you include a directory in the list and recursion (-r) is enabled, rsync will dynamically scan and send that directory (useful if said directory is prone to change often).
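A minimal sketch of the file-list approach is below. The function name, list file, and paths are hypothetical. Note that -r is given explicitly, because -a does not imply recursion when --files-from is used.

```shell
# Sync only the entries named in a list file.
#   $1 = list file, one path per line, relative to the source directory
#   $2 = source directory
#   $3 = destination
sync_from_list() {
    rsync -a -r --files-from="$1" "$2" "$3"
}

# Usage (hypothetical list and paths):
#   printf '%s\n' 'css/site.css' 'img/logo.png' 'uploads/' > /tmp/list.txt
#   sync_from_list /tmp/list.txt /var/www/media/ serverB:/var/www/media/
```

In that usage example the two files are sent as named, while `uploads/` is scanned at run time because it is a directory.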

You are using a secondary NIC for the mirroring, right?

Solution 5:

Depending on the frequency of file changes and the number of files, it might be better to wait for modification notifications and sync only when something has changed. This works much better when modifications are infrequent and the total number of files is high, because in that case a periodic rsync hits the disk to stat() every file just to see whether anything changed.

http://inotify-tools.sourceforge.net/ has a simple example (see example 1) of how to connect Linux's inotify (the file-modification monitor) with rsync in a crude way.
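A crude loop in that spirit might look like the sketch below. It is not the example from that page; the function name and paths are hypothetical, and it assumes inotifywait from inotify-tools is installed.

```shell
# Block until something changes under the source tree, then sync, and repeat.
#   $1 = source directory, $2 = destination
watch_and_sync() {
    # inotifywait exits 0 once one of the listed events has occurred;
    # any other exit status (error, watch removed) ends the loop.
    while inotifywait -q -r -e close_write,create,delete,move "$1"; do
        rsync -a "$1" "$2"
    done
}

# Usage (hypothetical paths):
#   watch_and_sync /var/www/media/ serverB:/var/www/media/
```

Each wake-up still runs rsync over the whole tree, so this only saves work while nothing is changing. Changes that land while rsync is running are picked up on the next event, so an occasional scheduled full sync is still a sensible safety net.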

Ideally this would be integrated into rsync itself. (I think there is an experimental version somewhere that did that, but I cannot find it now.)