Rsync a directory so all changes appear atomically

I do some nightly and weekly mirrors of frequently used repositories for the local network. On a few occasions, someone has tried to run an update while the rsync was in progress and failed because the expected files weren't all there yet.

Is it possible to do an rsync such that all changed files only appear with the correct names at completion? I know rsync uses temporary .hidden files while each transfer is in progress, but can I postpone the renames until it is finished somehow?

Alternatively, it seems I could use the --backup option to collect all the changes in one directory and move them over atomically afterwards, but I'd need the feature to work in reverse of what it does now: it stashes the old versions of changed files, whereas I'd want it to stage the new ones.

I'm on Linux, for what it's worth.


Solution 1:

You can use the --link-dest= option. Basically, you transfer into a new directory while pointing --link-dest at the current copy: unchanged files are hard-linked into the new directory, and only changed files are actually transferred. When everything is done, you just swap the directory names and remove the old one.

It is impossible to do this 100% atomically on Linux, since there is no generic kernel/VFS support for swapping two paths (newer kernels do have renameat2(2) with the RENAME_EXCHANGE flag, but no standard command-line tool exposes it). However, swapping the names is only two rename() syscalls, so the window should be well under a second. A truly atomic swap is possible only on Darwin (macOS) with the exchangedata system call on HFS filesystems.
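A minimal sketch of that flow, assuming the live mirror is served from /srv/mirror and everything sits on one filesystem (the paths and the remote:/repo/ source are made up):

rsync -a --link-dest=/srv/mirror/ remote:/repo/ /srv/mirror.new/
mv /srv/mirror /srv/mirror.old        # first rename()
mv /srv/mirror.new /srv/mirror        # second rename()
rm -rf /srv/mirror.old                # discard the superseded copy

Clients hitting /srv/mirror between the two mv calls will see it missing for an instant; that brief gap is the non-atomic window mentioned above.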

Solution 2:

I do something similar with rsync backups [to disk] and I've encountered the same problem due to a daemon updating files while the backup is running.

Unlike many programs, rsync has many distinct exit codes [see the bottom of the man page]. Two of them are of interest here:

23 -- partial transfer due to error
24 -- partial transfer due to vanished source files

When rsync encounters one of these situations during a transfer, it doesn't just stop immediately. It skips the problem files and continues with the ones it can transfer. At the end, it reports the corresponding exit code.

So, if you get error 23/24, just rerun the rsync. The subsequent runs will go much faster, usually just transferring the missing files from the previous run. Eventually, you'll get [or should get] a clean run.
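Condensed to its essence, the retry logic is just a short shell loop (a sketch with placeholder src/ and dest/ paths; the fuller tcsh script below adds the tmp-dir and delta handling):

until rsync -a src/ dest/; do
    rc=$?
    # anything other than "partial transfer" (23/24) is fatal
    if [ "$rc" -ne 23 ] && [ "$rc" -ne 24 ]; then
        echo "fatal rsync error $rc" >&2
        exit "$rc"
    fi
done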

As to being atomic, I use a "tmp" dir during the transfer. Then, when the rsync run is clean, I rename it [atomically] to <date>

I also use the --link-dest option, but I use that to keep delta backups (e.g. --link-dest=yesterday for daily)

Although I've not used it myself, the --partial-dir=DIR option may keep the hidden files from cluttering up the backup directory. Note that DIR is interpreted on the receiving side; be sure it is on the same filesystem as your backup directory so the renames stay atomic.

While I do this in perl, I've written a script that summarizes what I've been saying with a bit more detail/precision for your particular situation. It's in tcsh-like syntax [untested and a bit rough], but treat it as pseudo-code for writing your own bash, perl, or python version as you choose. Note that it has no limit on retries, but you can add one easily enough, according to your wishes.

#!/bin/tcsh -f
# repo_backup -- backup repos even if they change
#
# configuration knobs -- set each to 1 or 0 as desired:
# use_tmp -- use temporary destination directory
# use_partial -- use partial directory
# use_delta -- make delta backup
set use_tmp=1
set use_partial=1
set use_delta=1

# set remote server name ...
set remote_server="..."

# directory on server for backups
set backup_top="/path_to_backup_top"
set backup_backups="$backup_top/backups"

# set your rsync options ...
set rsync_opts=(...)

# keep partial files from cluttering backup
# NOTE: --partial-dir is interpreted on the receiving side, so no host: prefix;
# keep it on the same filesystem as the backups so renames stay atomic
set server_partial=$backup_top/partial
if ($use_partial) then
    set rsync_opts=($rsync_opts --partial-dir=$server_partial)
endif

# do delta backups against the most recent prior backup
if ($use_delta) then
    # get the name of the latest backup on the server
    set latest=(`ssh ${remote_server} ls $backup_backups | tail -1`)

    if ($#latest > 0) then
        # NOTE: --link-dest is also a receiving-side path -- no host: prefix
        set rsync_opts=($rsync_opts --link-dest=$backup_backups/$latest)
    endif
endif

while (1)
    # get list of everything to backup
    # set this to whatever you need
    cd /local_top_directory
    set transfer_list=(.)

    # use whatever format you'd like
    set date=`date +%Y%m%d_%H%M%S`

    set server_tmp=${remote_server}:$backup_top/tmp
    set server_final=${remote_server}:$backup_backups/$date

    if ($use_tmp) then
        set server_transfer=$server_tmp
    else
        set server_transfer=$server_final
    endif

    # do the transfer
    rsync $rsync_opts $transfer_list $server_transfer
    set code=$status

    # run was clean
    if ($code == 0) then
        # atomically install backup
        if ($use_tmp) then
            ssh ${remote_server} mv $backup_top/tmp $backup_backups/$date
        endif
        break
    endif

    # partial -- some error
    if ($code == 23) then
        continue
    endif

    # partial -- some files disappeared
    if ($code == 24) then
        continue
    endif

    echo "fatal error ..."
    exit(1)
end

Solution 3:

Not sure if this is going to help you, but...

If you don't mind copying the whole data set each time, and if clients can refer to the target directory through a symlink, then you should be able to rsync everything into a temporary directory and then atomically rename() a new symlink over the old one, like so:

% mkdir old_data new_data
% ln -s old_data current
% ln -s new_data new
% strace mv -T new current

which runs

rename("new", "current") = 0

and gives

current -> new_data

Note that for this to work, any clients reading from this setup should cd into the directory referenced by the symlink before attempting any reads; cd pins them to the resolved directory, whereas repeatedly resolving paths through the symlink across the swap risks loading some parts of the code/data from the old copy and some from the new one.
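A full sync cycle under this scheme might look like the following sketch (the /srv/mirror layout and the remote:/repo/ source are made up; the final mv -T is the only step clients can observe):

new=data.$(date +%Y%m%d_%H%M%S)
rsync -a remote:/repo/ /srv/mirror/$new/            # full copy into a fresh tree
ln -s $new /srv/mirror/current.tmp                  # build the replacement symlink
mv -T /srv/mirror/current.tmp /srv/mirror/current   # atomic rename() over the old one

Clients always go through /srv/mirror/current; the superseded data.* tree can be deleted once nothing is using it any more.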

Solution 4:

Are the mirror syncs automatic (a cron task or the like)? If so, you probably use a dedicated OS user for them, am I right? In that case, instead of simply copying, the solution could be:

  1. Set the destination directory's permissions so that only the rsync user can access it.
  2. Proceed with the sync.
  3. Change the target's permissions back (unconditionally, even if the sync failed) so others can access it again.

The downside is that for the duration of the sync (not sure how long yours takes) the target directory won't be accessible at all. You need to decide for yourself whether that's OK here.
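A rough sketch of that wrapper in plain sh (the /srv/mirror path, the remote:/repo/ source, and the permission bits are assumptions; adjust them for your own layout and users):

#!/bin/sh
# 1. lock everyone but the owning (rsync) user out for the duration
chmod 700 /srv/mirror
# 2. do the sync
rsync -a remote:/repo/ /srv/mirror/
status=$?
# 3. re-open unconditionally, even if rsync failed
chmod 755 /srv/mirror
exit $status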