Behavior of rsync with file that's still being written?
If Apache is in the middle of writing a large file and an rsync cron job runs on that file, does rsync attempt to copy the file?
Example:
- Apache-1: Has a large file being written to /var/www.
- Apache-2: Clone of Apache-1. Every five minutes, cron runs rsync to keep its /var/www synced.
If Apache is writing a file and has not finished writing it when rsync kicks in, rsync will copy whatever is sitting there.
Meaning if Apache is dealing with a 5MB file, only 2MB has been written, and rsync kicks in, the partial 2MB file will be copied. So that file would seem “corrupted” on the destination server.
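To make the partial-copy scenario concrete, here is a small local simulation. It is only a sketch: cp stands in for rsync, a temporary directory stands in for /var/www, and the file name large.bin and the sizes are made up for illustration.

```shell
#!/usr/bin/env bash
# Simulate copying a file while its writer is only part-way through.
# cp stands in for rsync here; the paths and sizes are illustrative only.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/source" "$workdir/dest"
src="$workdir/source/large.bin"
dst="$workdir/dest/large.bin"

# The "writer" has produced 2 MB of a 5 MB file so far.
head -c $((2 * 1024 * 1024)) /dev/zero > "$src"

# The copy job kicks in now and takes whatever is on disk.
cp "$src" "$dst"

# The writer then finishes the remaining 3 MB.
head -c $((3 * 1024 * 1024)) /dev/zero >> "$src"

src_size=$(wc -c < "$src" | tr -d ' ')
dst_size=$(wc -c < "$dst" | tr -d ' ')
echo "source: $src_size bytes, destination: $dst_size bytes"

rm -rf "$workdir"
```

The destination ends up with only the bytes that existed at the moment of the copy, and it stays that way until the next sync run picks up the finished file.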
Depending on the size of the files you are dealing with, you can use the --inplace option in rsync, which does the following:
This option changes how rsync transfers a file when the file's data needs to be updated: instead of the default method of creating a new copy of the file and moving it into place when it is complete, rsync instead writes the updated data directly to the destination file.
The benefit of this is if a 5MB file only has 2MB copied on the first run, the next run will pick up at 2MB and continue to copy the file until the full 5MB is in place.
The negative is that someone could access the web server while a file is being copied and see a partial file. In my opinion rsync works best with its default behavior of caching an “invisible” file and then moving it into place right away. But --inplace is good for scenarios where large files and bandwidth constraints might stand in the way of a large file being easily copied from square one.
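The difference between the two write strategies can be sketched locally without rsync. This is only an illustration of the idea, not rsync's actual implementation; the function names, the temporary file naming, and the paths are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: two ways to replace a destination file.
set -eu

# Default-style update: build a hidden temporary copy next to the target,
# then rename it over the target once the copy is complete.
update_via_temp() {
    local src=$1 dst=$2
    local tmp
    tmp="$(dirname "$dst")/.$(basename "$dst").tmp.$$"
    cp "$src" "$tmp"
    mv -f "$tmp" "$dst"   # rename within one directory: readers see old or new
}

# --inplace-style update: write straight into the existing target, so a
# reader can observe the file while it is only partly rewritten.
update_in_place() {
    local src=$1 dst=$2
    cat "$src" > "$dst"
}

workdir=$(mktemp -d)
printf 'new content\n' > "$workdir/src.txt"
printf 'old content\n' > "$workdir/a.txt"
printf 'old content\n' > "$workdir/b.txt"

update_via_temp "$workdir/src.txt" "$workdir/a.txt"
update_in_place "$workdir/src.txt" "$workdir/b.txt"

result_temp=$(cat "$workdir/a.txt")
result_inplace=$(cat "$workdir/b.txt")
leftover_tmp=$(find "$workdir" -name '.*.tmp.*' | wc -l | tr -d ' ')
rm -rf "$workdir"
```

Both functions end with the same final content; what differs is what a visitor could see while the update is still in progress.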
That said you do state this; emphasis is mine:
Every five minutes has cron run rsync…
So I assume you have some bash script in place to manage this cron job? Well, the thing is rsync is smart enough to only copy the files that need to be copied. And if you have a script that runs every 5 minutes, it appears you are trying to avoid having rsync processes step on each other, which could happen if you ran it more often. Meaning, if you ran it every minute, there is a risk that one or more of the rsync processes would still be running due to file size or network speed and the next process would just be in competition with it; a race condition.
One way to avoid this is to wrap your whole rsync command in a bash script that checks for a file lock; below is a boilerplate bash script framework I use for cases like this.
Note that some people will recommend using flock, but since flock is not installed on some systems I use (I jump between Ubuntu, which has it, and Mac OS X, which does not, a lot), I use this simple framework without any real issue:
LOCK_NAME="MY_GREAT_BASH_SCRIPT"
LOCK_DIR="/tmp/${LOCK_NAME}.lock"
PID_FILE="${LOCK_DIR}/${LOCK_NAME}.pid"
if mkdir "${LOCK_DIR}" 2>/dev/null; then
  # If the ${LOCK_DIR} doesn't exist, then start working & store the ${PID_FILE}
  echo $$ > "${PID_FILE}"
  echo "Hello world!"
  rm -rf "${LOCK_DIR}"
  exit
else
  if [ -f "${PID_FILE}" ] && kill -0 "$(cat "${PID_FILE}")" 2>/dev/null; then
    # Confirm that the process file exists & a process
    # with that PID is truly running.
    echo "Running [PID $(cat "${PID_FILE}")]" >&2
    exit
  else
    # If the process is not running, yet there is a PID file--like in the case
    # of a crash or sudden reboot--then get rid of the ${LOCK_DIR}
    rm -rf "${LOCK_DIR}"
    exit
  fi
fi
The idea is that the general core, where I have echo "Hello world!", is where the heart of your script is. The rest of it is basically a locking mechanism/logic based on mkdir. A good explanation of the concept is in this answer:
mkdir creates a directory if it doesn't exist yet, and if it does, it sets an exit code. More importantly, it does all this in a single atomic action making it perfect for this scenario.
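The behavior that quote relies on is easy to check in a shell. A minimal sketch, using a throwaway lock directory (demo.lock inside a temporary directory) that is not part of the script above:

```shell
#!/usr/bin/env bash
# mkdir succeeds only for the first caller; later callers get a non-zero exit.
workdir=$(mktemp -d)
lock_dir="$workdir/demo.lock"

if mkdir "$lock_dir" 2>/dev/null; then first="acquired"; else first="busy"; fi
if mkdir "$lock_dir" 2>/dev/null; then second="acquired"; else second="busy"; fi
echo "first attempt: $first, second attempt: $second"

# Releasing the lock is just removing the directory again.
rmdir "$lock_dir"
if mkdir "$lock_dir" 2>/dev/null; then third="acquired"; else third="busy"; fi
echo "after release: $third"

rm -rf "$workdir"
```

Because creating the directory and testing for its existence happen in one step, two scripts starting at the same moment cannot both believe they hold the lock.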
So in the case of your rsync process, I would recommend using this script by just changing the echo command to your rsync command. Also, change the LOCK_NAME to something like RSYNC_PROCESS and then you are good to go.
Now with your rsync wrapped in this script, you can set the cron job to run every minute without any risk of a race condition where two or more rsync processes are fighting to do the same thing. This will allow you to increase the frequency of rsync updates, which will not eliminate the issue of partial files being transferred, but it will help speed up the overall process so the full file can be properly copied over at some point.
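Putting those suggestions together, the wrapper might look roughly like the following. This is a sketch rather than a drop-in script: the function name run_locked, the non-zero return codes for skipped runs, the apache-1 host name in the usage comment, and the rsync options are assumptions for illustration. The command to run is passed in as arguments so the locking logic stays separate from the rsync call.

```shell
#!/usr/bin/env bash
# Sketch of the mkdir-based lock wrapped around an arbitrary command.
LOCK_NAME="RSYNC_PROCESS"
LOCK_DIR="/tmp/${LOCK_NAME}.lock"
PID_FILE="${LOCK_DIR}/${LOCK_NAME}.pid"

run_locked() {
    if mkdir "${LOCK_DIR}" 2>/dev/null; then
        # We own the lock: record our PID, run the command, release the lock.
        echo $$ > "${PID_FILE}"
        "$@"
        local status=$?
        rm -rf "${LOCK_DIR}"
        return "${status}"
    elif [ -f "${PID_FILE}" ] && kill -0 "$(cat "${PID_FILE}")" 2>/dev/null; then
        # Another run is still alive: skip this round.
        echo "Running [PID $(cat "${PID_FILE}")]" >&2
        return 1
    else
        # Stale lock (crash or reboot): clear it so the next run can proceed.
        rm -rf "${LOCK_DIR}"
        return 1
    fi
}

# Hypothetical usage on Apache-2, pulling from a host assumed to be "apache-1":
#   run_locked rsync -a apache-1:/var/www/ /var/www/
```

With the usage line uncommented and the script saved somewhere cron can reach it, a crontab schedule of "* * * * *" would attempt a sync every minute, and any attempt that finds a live lock simply skips its turn.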
Yes - and the file might be corrupted if rsync is reading the file at the same time the file is being written to.
You can try this: https://unix.stackexchange.com/a/2558
You can also script it with lsof:
lsof /path/to/file
An exit code of 0 means that the file is in use, and an exit code of 1 means there's no activity on that file.
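As a rough sketch of how that exit code could be used in a script, the helper below prints only the files that no process currently has open. The function names are hypothetical. Note that lsof can also exit non-zero on errors (a missing file, or lsof not being installed), and this sketch treats those cases the same as "not in use", so it is a best-effort check rather than a guarantee.

```shell
#!/usr/bin/env bash
# Return 0 if some process has the file open, non-zero otherwise.
file_in_use() {
    lsof -- "$1" >/dev/null 2>&1
}

# Print each given file that no process currently has open.
list_idle_files() {
    local f
    for f in "$@"; do
        if ! file_in_use "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Hypothetical usage:
#   list_idle_files /var/www/*
```

A file could still be reopened for writing between the check and the copy, so this narrows the window for partial transfers without closing it.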