How does rsync do incremental backups?

How does rsync know which files are changed and which are not? Does it log its data anywhere in the file?

I ask because I want to do incremental backups, and I know the first run will transfer all files.

So my main question is: if I upload the initial files via FTP rather than rsync, will rsync still skip those existing files, or will it upload everything on its first run?


Solution 1:

Rsync has a number of flags that control what it looks at and what it copies to the destination. The most commonly used is "-a", the "archive" flag, which is probably what you want. Run rsync with the "-av" flags for a first run against the data you want backed up. On later runs it skips files whose size and modification time already match, sends only the modified parts of existing files that differ (found with block checksums), and copies new files in full. Files that are no longer in the source are removed from the destination only if you also pass "--delete". Check the "-a" options section on:

http://linux.die.net/man/1/rsync

The first run will be bandwidth intensive. The following runs will most likely be processor intensive but use little bandwidth compared to the initial run, unless you have a lot of churn in your data set.

Rsync doesn't care how the files got into the source or the destination directories. It only copies the differences between the two, unless you add flags to do something different.

If you want to log what was changed you can use the "--log-file" option. All in all something like this sounds like what you want:

rsync -av --log-file=/var/log/rsync.log -e "ssh -l backup-user" backup-user@source-machine::module /nas01/backups
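
Before pointing that at real data, a dry run is a cheap way to see what rsync would do. The sketch below uses throwaway local directories (all the file names are made up) with one file already placed in the destination by other means. With "-n" rsync only reports; it copies nothing.

```shell
#!/bin/sh
# Throwaway directories; every name here is invented for the demo.
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"
echo "already there" > "$work/src/same.txt"
echo "not copied yet" > "$work/src/new.txt"

# Seed the destination without rsync; -p keeps the modification time.
cp -p "$work/src/same.txt" "$work/dst/same.txt"

# -n (--dry-run) lists what would be sent without touching the destination.
rsync -avn "$work/src/" "$work/dst/" | tee "$work/plan.txt"
```

Only new.txt should show up in the list, since same.txt already matches in size and modification time.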

Solution 2:

rsync doesn't do 'incremental'; it's more like 'differential'. It doesn't transfer changes (which would assume some knowledge of a prior run); it transfers differences (by comparing the source with the target files).

A simplification of the process:

  • first it checks the file size and modification time. If both are identical, it skips the file.
  • if there's no file with that name on the target, it simply copies the whole file.
  • if there is a file on the target, the receiver calculates checksums for each block of that file and sends them to the sender. The block size is normally chosen from the file size; "--block-size" overrides it.
  • the sender compares the source file against those checksums and transfers any data not already on the target, together with references to the matched blocks. With that, the target can reconstruct the whole file from pieces of the old target file and the new data.
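
You can watch the last two steps happen with "--stats". This is only a local demo on scratch files with made-up names. rsync copies whole files between local paths by default, so "--no-whole-file" is there to force the delta algorithm.

```shell
#!/bin/sh
# Scratch directories; the names are invented for the demo.
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"

# 1 MiB of random data, copied once in full.
dd if=/dev/urandom of="$work/src/big.bin" bs=1024 count=1024 2>/dev/null
rsync -a "$work/src/" "$work/dst/"

# Append a few bytes: the size changes, so the quick check no longer passes.
printf 'a few new bytes\n' >> "$work/src/big.bin"

# Second run: blocks already on the target are referenced, not re-sent.
rsync -a --no-whole-file --stats "$work/src/" "$work/dst/" > "$work/stats.txt"
grep -E 'Literal data|Matched data' "$work/stats.txt"
```

"Matched data" is what the target rebuilt from its old copy, and "Literal data" is what actually had to be sent.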

Solution 3:

Maybe I'm being pedantic, but incremental backups mean you have a full backup first. Then you have a backup of the files changed since that backup. Then you have another backup of the files changed since the previous one, etc. So to restore you need the full backup and all the incremental backups since that one.

So just using archive mode is not an incremental backup. I think the difference is important because it means you can't go back in time to get files as they were before they changed.

If you want to do a true incremental backup, you use options such as --backup-dir. There is an example here.
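
A rough sketch of that idea on scratch directories (the layout and names are my own assumptions, not taken from the linked example): the destination always holds the current copy, and anything a run would overwrite or delete is moved into a dated directory first.

```shell
#!/bin/sh
# Scratch layout, invented for the demo: src/ is the live data,
# current/ is the backup, changed/<date>/ keeps replaced versions.
work=$(mktemp -d)
mkdir "$work/src" "$work/current" "$work/changed"
echo "version 1" > "$work/src/report.txt"
rsync -a "$work/src/" "$work/current/"            # first, full run

echo "version 2, edited" > "$work/src/report.txt" # size differs, so rsync notices
stamp=$(date +%Y-%m-%d)

# Files this run would overwrite or delete in current/ are moved
# to changed/$stamp instead of being lost.
rsync -a --delete --backup --backup-dir="$work/changed/$stamp" \
      "$work/src/" "$work/current/"

cat "$work/current/report.txt"          # the new version
cat "$work/changed/$stamp/report.txt"   # the version it replaced
```

Note that this keeps the old versions next to a full current copy, so going back in time means picking files out of the dated directories.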

Solution 4:

rsync does not log any data. It checks file size and modification timestamps, and then content. If you upload by FTP first it'll be fine: rsync will not re-transmit all the data, but it will probably go through all the content and fix the timestamps. There will be no huge transfer again.
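
Here is a small local stand-in for that situation, assuming the FTP upload kept the bytes but not the modification times. The paths are made up, and "--no-whole-file" is only needed because both sides are local.

```shell
#!/bin/sh
# Scratch directories; the names are invented for the demo.
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"
dd if=/dev/urandom of="$work/src/data.bin" bs=1024 count=256 2>/dev/null
touch -t 202001010000 "$work/src/data.bin"   # give the source an old mtime

# Stand-in for the FTP upload: same bytes, but a fresh modification time.
cp "$work/src/data.bin" "$work/dst/data.bin"

# First rsync run: the timestamps differ, so rsync checks the content.
rsync -a --no-whole-file --stats "$work/src/" "$work/dst/" > "$work/first.txt"
grep -E 'Literal data|Matched data' "$work/first.txt"

# Second run: size and mtime now agree, so the file is skipped.
rsync -ai "$work/src/" "$work/dst/" | tee "$work/second.txt"
```

The first run should report nearly everything as matched data, and the second run should not list data.bin at all.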

Solution 5:

If the real question is "I want to do incremental backups over rsync", there are a few options available. I use Dirvish:

http://www.dirvish.org/

Restoring is easy because it gives you snapshots: it uses hardlinks to give you complete snapshots while saving space where a file is identical. Internally, it uses rsync's --link-dest option:

--link-dest=DIR         hardlink to files in DIR when unchanged

Since it uses rsync it also saves network bandwidth (and hence time) where the changes are very small. If you have lots of files and a slow link, it also works to tar and sneakernet a locally created dirvish image first.
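
If you'd rather see the hardlink trick without installing anything, this is a bare-bones sketch using plain rsync on scratch directories. The day1/day2 layout is my own assumption for the demo, not how Dirvish names its images.

```shell
#!/bin/sh
# Scratch directories; the names are invented for the demo.
work=$(mktemp -d)
mkdir "$work/src" "$work/snapshots"
echo "never changes" > "$work/src/static.txt"
echo "day one" > "$work/src/notes.txt"

# First snapshot: a plain full copy.
rsync -a "$work/src/" "$work/snapshots/day1/"

echo "day two, with edits" > "$work/src/notes.txt"

# Second snapshot: unchanged files become hard links into day1,
# changed files are stored as new copies.
rsync -a --link-dest="$work/snapshots/day1" "$work/src/" "$work/snapshots/day2/"

# The first column is the inode number: a shared number means shared storage.
ls -i "$work"/snapshots/day1/* "$work"/snapshots/day2/*
```

Each snapshot directory looks like a complete copy, but static.txt takes up space only once.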