How to make rsync of ~2M files from remote server performant for regular backups
We have a large number of files on a remote server that I'd like to back up regularly to a local system for extra redundancy. Some details:
- Remote system is not in my control. I only have SSH/rsync or FTP access
- Remote system runs rsync 2.6.6 and cannot be upgraded
- Remote system allows a max of 25 concurrent connections and 5 are reserved for production needs (so, 20 available)
- Remote system contains 2M files - the majority of which are 100-200K in size
- Files are stored in a hierarchy similar to:
0123456789/
    0123456
        abc/
            1.fff
            2.fff
            3.fff
        xyz/
            9.fff
            8.fff
            7.fff
9877656578/
    5674563
        abc/
            1.fff
            2.fff
            3.fff
        xyz/
            9.fff
            8.fff
            7.fff
with tens of thousands of those root folders, each containing just a few of the internal folder/file structures, but all root folder names are numeric (0-9) only.
I ran this with a straight `rsync -aP` the first time and it took 3196m20.040s. This is partly because the remote server is on rsync 2.6.6, so I can't use the incremental file-list (incremental recursion) feature found in 3.x.x. It takes almost 12 hours just to compile the file list, running at about 500 files per 10 seconds. I don't anticipate subsequent runs will take as long, because the initial run had to download everything; however, even 12 hours just for the file listing is too long.
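For reference, that first full run was essentially just the following (host and paths are placeholders, and wrapping it in `time` is my assumption of how the 3196m20.040s figure was captured):

time rsync -aP USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION/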
The top-level folder naming breaks down as follows:
$ ls | grep "^[^67]" | wc -l
295
$ ls | grep "^6" | wc -l
14167
$ ls | grep "^7" | wc -l
14414
I've tested running `rsync -aWP --delete-during` and breaking it up using `--include="/0*/" --exclude="/*/"`, where I run 8 of these concurrently for `0* 1* 2* 3* 4* 5* 8* 9*`. For 6 and 7 I use `60*`-`69*` and `70*`-`79*` (so ten jobs each), because the brunt of the folders in the hierarchy begin with 6 or 7 (roughly 1400 per `6?*` or `7?*`).
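One of those slices looks roughly like this (a sketch only - the host and paths are placeholders; the `--include`/`--exclude` pair limits the run to top-level folders starting with the given digit):

rsync -aWP --delete-during \
    --include="/0*/" --exclude="/*/" \
    USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION/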
Everything that's not a 6 or 7 takes about 5 minutes total. The 6/7 directories (each broken down into tenths) take 15 minutes apiece.
This is quite performant, except that to run this job I have to run 28 concurrent `rsync` processes, and this saturates the available connection count - not to mention potentially saturating the network. Does anyone have a recommendation for another variant of `rsync`, or some additional options I could add, to keep this from using so many connections concurrently without having to stage it sequentially, within the bounds of rsync 2.6.6 on one end?
Edit #1: We do pay for bandwidth to/from this external provider, so ideally we would only send things over the wire that actually need to be sent, and nothing more.
After an initial sync time of 40 hours to download and sync all of the data, a subsequent scan and sync of the same data (just to pull in updates) took only 6.5 hours. The command used to run the `rsync` was:
rsync -a --quiet USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION
I think my large initial download time was down to two things:
- The initial dataset is 270GB and ~2M files, which is a lot to scan and download over the internet (in our case we have a 100mbit synchronous connection and this was connecting to a large CDN provider).
- I had the `-P` and `-v` options enabled on the initial sync, which caused a lot of local console chatter displaying every file being synced plus progress information.
So, the answer here: just use `rsync` with fewer verbosity options (ideally with `--quiet`) and it's quite efficient, even on huge datasets.
Here's what I would personally do - there are two variations to the solution.
Variation 1 - the simple, brute-force option:
2M * 200KB is roughly 400GB, so a full snapshot every time may not be possible. If it is possible, the simple solution would be to perform:
ssh <remote host> 'tar -c /directory/to/backup | <gzip/xz/lz4>' > backup.tar.<gz/xz/lz4>
This works by turning all of those files into a single stream that is pushed across the pipe, rather than having rsync/SFTP enumerate the millions of files individually.
From there, I would use Borg to deduplicate the tarball so you can efficiently store multiple versions. This is a common trick for moving tons of small files very quickly. The downside is that you don't get the per-file delta transfer that rsync performs.
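A minimal sketch of that pipeline, assuming a Borg repository has already been created with `borg init` at a hypothetical /backups/repo and that you're on Borg 1.1+ (which lets `borg create` read a stream from stdin via `-`); host and path are placeholders:

# Stream the remote tree as one tar over a compressed SSH session (-C),
# then let Borg chunk, deduplicate and compress it locally so repeated
# runs only store the chunks that actually changed.
ssh -C USER@REMOTE_SERVER 'tar -c /directory/to/backup' \
    | borg create --compression lz4 /backups/repo::backup-{now} -

Leaving compression to Borg (and to the SSH link) rather than gzipping inside the tar pipe keeps the stream deduplication-friendly across runs.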
If the 400GB per interval is too large, I'd consider the following:
Variation 2 - the clever option:
You could do the same as above, except you would create a tarball for each top-level directory and compare its hash to the corresponding file on the backup server. If it's different, transfer it; otherwise, do nothing.
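A rough sketch of that approach, with hypothetical host, paths, and hash-file naming (the hash is taken over the uncompressed tar stream on the remote side, so only a few bytes cross the wire when nothing has changed):

#!/bin/bash
# Sketch of Variation 2: one tarball per top-level directory, fetched only
# when its content hash differs from the one recorded on the previous run.
REMOTE=USER@REMOTE_SERVER
REMOTE_ROOT=ROOT/FOLDER/PATH
DEST=/LOCAL/DESTINATION

for dir in $(ssh "$REMOTE" "ls $REMOTE_ROOT"); do
    # Hash the directory's tar stream remotely; nothing is downloaded yet.
    remote_hash=$(ssh "$REMOTE" "cd $REMOTE_ROOT && tar -c $dir | sha256sum" | awk '{print $1}')
    local_hash=$(cat "$DEST/$dir.sha256" 2>/dev/null)

    if [ "$remote_hash" != "$local_hash" ]; then
        # Only changed directories are compressed and sent over the wire.
        ssh "$REMOTE" "cd $REMOTE_ROOT && tar -c $dir | gzip" > "$DEST/$dir.tar.gz"
        echo "$remote_hash" > "$DEST/$dir.sha256"
    fi
done

With tens of thousands of top-level directories you'd want SSH connection multiplexing (ControlMaster/ControlPersist) so the per-directory ssh calls reuse one connection, and keep in mind that tar embeds mtimes, so a touched-but-unchanged file still triggers a re-transfer of its whole top-level tarball.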