rsync directory so all changes appear atomically
I do some nightly and weekly mirrors of frequently used repositories for the local network. On a few occasions, someone has tried to do an update while the rsync is happening and failed because the expected files aren't all there yet.
Is it possible to do an rsync such that all changed files only appear with the correct names at completion? I know rsync uses temporary .hidden files while each transfer is in progress, but can I postpone the renames until it is finished somehow?
Alternatively, it seems I could use the --backup option to move all changes into one directory and then move them into place atomically afterwards, but I'd want that feature to work in reverse of how it does now.
I'm on Linux, for what it's worth.
Solution 1:
You can use the --link-dest= option. Basically, you create a new folder and rsync hard-links every unchanged file from the old tree into it, transferring only the files that actually changed. When everything is done, you just swap the folder names and remove the old one.
It is impossible to make this 100% atomic on Linux, since there is no portable kernel/VFS support for exchanging two names in a single operation (newer kernels do provide renameat2() with the RENAME_EXCHANGE flag for exactly this). However, swapping the names is only 2 rename() syscalls away, so it should take far less than a second to complete. A truly atomic exchange is possible on Darwin (Mac OS X) with the exchangedata system call on HFS filesystems.
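A minimal sketch of that approach (the /srv/mirror paths and the upstream::repo module name are hypothetical, and this is untested):

# build the new tree, hard-linking unchanged files from the live copy
rsync -a --delete --link-dest=/srv/mirror/current \
    upstream::repo/ /srv/mirror/new/
# swap: two quick renames -- a sub-second, though not truly atomic, window
mv /srv/mirror/current /srv/mirror/old
mv /srv/mirror/new /srv/mirror/current
rm -rf /srv/mirror/old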
Solution 2:
I do something similar with rsync backups [to disk] and I've encountered the same problem due to a daemon updating files while the backup is running.
Unlike many programs, rsync has many distinct error codes [see the bottom of the man page]. Of interest are two:
23 -- partial transfer due to error
24 -- partial transfer due to vanished source files
When rsync is doing a transfer and encounters one of these situations, it doesn't just stop immediately. It skips the problem files and continues with the files it can transfer. At the end, it reports the result in its return code.
So, if you get error 23/24, just rerun the rsync. The subsequent runs will go much faster, usually just transferring the missing files from the previous run. Eventually, you'll get [or should get] a clean run.
As to being atomic, I use a "tmp" dir during the transfer. Then, when the rsync run is clean, I rename it [atomically] to <date>.
I also use the --link-dest option, but I use it to keep delta backups (e.g. --link-dest=yesterday for daily backups).
Although I've not used it myself, the --partial-dir=DIR option may keep the hidden temporary files from cluttering up the backup directory. Be sure that DIR is on the same filesystem as your backup directory so renames will be atomic.
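For instance (a hypothetical invocation; the .rsync-partial name is my choice, and a relative DIR lives inside the destination, so it is guaranteed to be on the same filesystem):

rsync -a --partial-dir=.rsync-partial /source/ server:/path_to_backup_top/tmp/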
While I do this in perl, I've written a script below that summarizes what I've been saying, with a bit more detail/precision for your particular situation. It's in tcsh-like syntax [untested and a bit rough], but treat it as pseudo-code for writing your own bash, perl, or python script as you choose. Note that it has no limit on retries, but you can add one easily enough, according to your wishes.
#!/bin/tcsh -f
# repo_backup -- back up repos even if they change
#
# use_tmp -- use temporary destination directory
# use_partial -- use partial directory
# use_delta -- make delta backup
set use_tmp=1
set use_partial=1
set use_delta=1

# set remote server name ...
set remote_server="..."

# directory on server for backups
set backup_top="/path_to_backup_top"
set backup_backups="$backup_top/backups"

# set your rsync options ...
set rsync_opts=(...)

# keep partial files from cluttering the backup
# (--partial-dir is interpreted on the receiving side -- no host: prefix)
if ($use_partial) then
    set rsync_opts=($rsync_opts --partial-dir=$backup_top/partial)
endif

# do delta backups
if ($use_delta) then
    # get latest existing backup
    set latest=(`ssh ${remote_server} ls $backup_backups | tail -1`)
    if ($#latest > 0) then
        # (--link-dest is also a receiver-side path -- no host: prefix)
        set delta_dir="$backup_backups/$latest"
        set rsync_opts=($rsync_opts --link-dest=$delta_dir)
    endif
endif

while (1)
    # get list of everything to back up
    # set this to whatever you need
    cd /local_top_directory
    set transfer_list=(.)

    # use whatever format you'd like
    set date=`date +%Y%m%d_%H%M%S`

    set server_tmp=${remote_server}:$backup_top/tmp
    set server_final=${remote_server}:$backup_backups/$date

    if ($use_tmp) then
        set server_transfer=$server_tmp
    else
        set server_transfer=$server_final
    endif

    # do the transfer
    rsync $rsync_opts $transfer_list $server_transfer
    set code=$status

    # run was clean
    if ($code == 0) then
        # atomically install backup
        if ($use_tmp) then
            ssh ${remote_server} mv $backup_top/tmp $backup_backups/$date
        endif
        break
    endif

    # partial -- some error: retry
    if ($code == 23) then
        continue
    endif

    # partial -- some files disappeared: retry
    if ($code == 24) then
        continue
    endif

    echo "fatal error: rsync exited $code"
    exit 1
end
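For comparison, here's how the core retry loop might look in bash (an untested sketch using the same hypothetical paths, with "server" standing in for your remote host):

#!/bin/bash
# rerun rsync until it completes cleanly, then install atomically
while true; do
    rsync -a /local_top_directory/ server:/path_to_backup_top/tmp/
    rc=$?
    [ "$rc" -eq 0 ] && break        # clean run
    [ "$rc" -eq 23 ] && continue    # partial -- some error
    [ "$rc" -eq 24 ] && continue    # partial -- vanished source files
    echo "fatal error: rsync exited $rc" >&2
    exit "$rc"
done
# atomically install the finished tree
ssh server mv /path_to_backup_top/tmp \
    "/path_to_backup_top/backups/$(date +%Y%m%d_%H%M%S)"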
Solution 3:
Not sure if this is going to help you, but...
If you don't mind copying the whole data set each time and if you can use symlinks to refer to the target directory, then you should be able to rsync everything into a temporary directory and then swap (rename()) the old and new symlinks atomically, like so:
% mkdir old_data new_data
% ln -s old_data current
% ln -s new_data new
% strace mv -T new current
which runs
rename("new", "current")
= 0
and gives
current -> new_data
Even then, for this to work, any clients trying to read from this setup should cd into the directory referenced by the symlink before attempting any reads; otherwise they risk loading some parts of code/data from the old copy and some from the new one.
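For example, a client could pin itself to a single snapshot like this (readlink -f resolves the symlink once, so a later swap won't affect reads already in progress):

% cd "$(readlink -f current)"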
Solution 4:
Are the mirror syncs automatic (a cron task or the like)? If so, you probably use a dedicated OS user for this, am I right? In that case the solution could be, instead of simply copying:
- Set the destination directory's permissions so that only the rsync user can access it.
- Proceed with syncing.
- Change the target's permissions back (unconditionally) so that others can access it again.
The downside is that during the sync process (not sure how long it takes) the target directory won't be accessible. You'll need to decide for yourself whether that's OK here.
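A rough sketch of those three steps (hypothetical paths and module name; this assumes the sync user owns the directory, and the final chmod runs even if rsync fails):

chmod 700 /srv/mirror                              # lock everyone else out
rsync -a --delete upstream::repo/ /srv/mirror/     # sync
chmod 755 /srv/mirror                              # reopen unconditionally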