Full or incremental backup of a large number of files

I have a large set of files, both in number and in total size (we're talking a few terabytes). I would like to sync these files/folders to an external backup system once, then run a daily task that re-syncs the backup based on the daily changes. The changes aren't that frequent, but some days we might have around a 300GB differential (for about 1.5K files).

I've been considering rsync, rdiff-backup, and rsnapshot as possible tools, but I wanted to run some tests with rsync first. I've had one major issue with rsync:

Checking existing files for changes takes way too long. We're talking over 20 hours, which makes a daily backup pointless. This is using rsync -rvhzP or -rvhP. It seems to simply scan all files and take hours on end even if no file was added/changed/deleted.

Am I doing something wrong? Will either of the other systems I mentioned (rdiff-backup or rsnapshot) perform any better? I was under the impression they were based on rsync anyway.

Thanks in advance.

Update with extra information: We have about 2600 directories and 100k files totalling around 3.5TB. I ran the tests using rsync version 3.0.9, protocol version 30. As far as daily changes go, there are generally 10 or fewer file changes a day, but they can peak at around 1.5K file changes/additions/deletions and about 300GB in volume (though these peaks aren't that frequent, and are generally spread apart).


Assuming that the modification timestamps on your source files are legitimate (and are being updated when the files are modified), I think it makes sense for you to add the -t argument to synchronize times. Quoth the rsync man page:

-t, --times
This tells rsync to transfer modification times along with the files and update them on the remote system. Note that if this option is not used, the optimization that excludes files that have not been modified cannot be effective; in other words, a missing -t or -a will cause the next transfer to behave as if it used -I, causing all files to be updated (though rsync's delta-transfer algorithm will make the update fairly efficient if the files haven't actually changed, you're much better off using -t).

Basically, you're losing the optimization whereby rsync can use the file's modification timestamp as a sentinel to indicate whether the file has been modified. If the modification timestamps disagree between the sender and receiver, the delta-transfer algorithm is used and the file contents are scanned. With a corpus as large as you're talking about, that's going to be a lengthy scanning process, as you're seeing.

If your files' modification timestamps aren't being updated when the files are changed (for some bizarre reason) then this won't be effective and you'll have to do full file scans. If you need the remote files' modification timestamps to reflect when they were synchronized, rather than the source files' modification timestamp, then this also won't be a workable solution.

I suspect this option will radically speed up your synchronizations, though.
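
For example, keeping the flags from your question and just adding -t (the source and destination paths here are placeholders for your actual data and backup location):

    # -t preserves modification times, so unchanged files can be skipped by the quick check
    rsync -rvhzPt /data/ backup-host:/backup/data/

If you're also fine with preserving permissions, ownership, symlinks and so on, -a is the usual shorthand, since it implies -rlptgoD:

    rsync -avhzP /data/ backup-host:/backup/data/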


You may want to go one layer down, using LVM snapshots and lvmsync.

In this solution the snapshot knows what has changed, so no scanning is needed. The downside is that it doesn't understand files; it just transfers blocks.
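
A rough sketch of that workflow, assuming a volume group named vg0 and a logical volume named data whose initial full copy already exists on the backup host (check the lvmsync documentation for the exact invocation it expects):

    # take a snapshot; size it to absorb the day's changes (~300GB at peak here)
    lvcreate --snapshot --size 350G --name data-snap /dev/vg0/data

    # after the day's changes, send only the blocks that differ since the snapshot
    # was taken to the remote copy of the volume, then drop the snapshot
    lvmsync /dev/vg0/data-snap backup-host:/dev/vg0/data
    lvremove -f /dev/vg0/data-snap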

Another solution would be a daemon that uses inotify and keeps track of which files have changed; then you rsync only the files on that list, along the lines of the sketch below. Lsyncd looks like the software you are looking for.
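
Conceptually, the inotify approach boils down to something like this (a hand-rolled sketch of the bookkeeping Lsyncd automates; /data, backup-host and the list files are placeholders):

    # record changed paths (relative to /data) as they happen
    inotifywait -m -r -e modify -e create -e delete -e move \
        --format '%w%f' /data > /var/tmp/changed-files &

    # then, from the daily job, sync only the recorded paths and reset the list
    sed 's|^/data/||' /var/tmp/changed-files | sort -u > /var/tmp/changed-today
    rsync -avh --files-from=/var/tmp/changed-today /data/ backup-host:/backup/data/
    : > /var/tmp/changed-files

Note that this naive version won't propagate deletions (rsync just reports the vanished files), which is the kind of bookkeeping Lsyncd handles for you.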