ZFS Sync over unreliable, slow WAN. ZFS replication, or rsync?
I've been tasked with making an off-site backup work over the WAN. Both storage boxes are FreeBSD-based NAS boxes running ZFS.
Once or twice a week, 15-60 gigs of photography data gets dumped to the office NAS. My job is to figure out how to get this data off-site as reliably as possible using the VERY SLOW DSL connection (~700Kb/s upload). The receiving box is in much better shape, at 30Mb/s down, 5Mb/s up.
I know, carrying a hard drive off-site would move data much more quickly, but it's not an option in this case.
My options seem to be either:
- ZFS incremental send over ssh
- Rsync
rsync is a time-honored solution, and has the all-important ability to resume a transfer if something gets interrupted. Its disadvantages are that it has to iterate over many individual files and that it knows nothing about dedup.
ZFS snapshot sending might transfer a bit less data (it knows a lot more about the file system, can do dedup, can package up the metadata changes more efficiently than rsync) and has the advantage of properly duplicating the filesystem state, rather than simply copying files individually (which is more disk intensive).
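Roughly, the two approaches I'm picturing look something like this (pool, dataset, and snapshot names are just placeholders):

    # Option 1: ZFS incremental send over ssh
    zfs snapshot tank/photos@2010-10-12
    zfs send -i tank/photos@2010-10-05 tank/photos@2010-10-12 | \
        ssh backup-host zfs receive -F tank/photos

    # Option 2: rsync; --partial lets an interrupted transfer pick up where it left off
    rsync -az --partial --progress /tank/photos/ backup-host:/tank/photos/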
I'm concerned about ZFS replication performance[1] (though that article is a year old). I'm also concerned about being able to restart the transfer if something goes down - the snapshot send/receive capability doesn't seem to include that. The whole system needs to be completely hands-off.
[1] http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html
Using either option, I should be able to de-prioritize the traffic by routing it through a specified port and then using QoS on the routers. I need to avoid a major negative impact on users at both sites during each transfer, since it will take several days.
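Concretely, I'm picturing something along these lines - a dedicated ssh port the routers can match for QoS, plus a rate cap on rsync as a belt-and-braces measure (the port number and limit are arbitrary):

    # Run the replication over a non-standard ssh port so the routers can de-prioritize it
    zfs send -i tank/photos@old tank/photos@new | ssh -p 2222 backup-host zfs receive tank/photos

    # For rsync, additionally cap the transfer rate (--bwlimit is in KBytes/sec)
    rsync -az --partial --bwlimit=80 -e "ssh -p 2222" /tank/photos/ backup-host:/tank/photos/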
So... that's my thinking on the issue. Have I missed any good options? Has anyone else set something similar up?
Solution 1:
If you can transfer a maximum of roughly 6 GB per day (once you allow for protocol overhead, and assuming zero competing traffic) and you need to move "15-60 gigs" at a frequency of "once or twice per week," that works out to 15-120 GB per week, or anywhere from 2-17 GB per day. Because you have to plan for peak demand, and 17 GB is far in excess of even your theoretical maximum of 6 GB, you likely have a very serious bandwidth problem. What will it take to upgrade the connection? If upgrading the connection is impossible, consider mailing physical media on a scheduled basis (e.g. weekly).
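For reference, the rough arithmetic behind that ceiling, using the ~700 Kb/s upload figure from the question:

    700 Kb/s / 8 bits-per-byte  ≈  87 KB/s
    87 KB/s x 86,400 s/day      ≈  7.5 GB/day with the link saturated around the clock and zero overhead

Real-world throughput (TCP/ssh overhead, line errors, competing traffic) brings the practical ceiling down to roughly 6 GB per day.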
Assuming that you can get the bandwidth math to make a little bit more sense, rsync is likely to be the best option. Deduplication awareness would be hugely valuable when replicating highly redundant data (e.g. virtual machine images), but it should have little or no benefit when it comes to unique digital content (audio, video, photos)... unless, of course, users are inadvertently storing duplicate copies of identical files.
Solution 2:
After doing some research I believe you are right about sending snapshots. The ZFS SEND and RECEIVE commands can be piped through bzip2, and the resulting file can then be rsync-ed to the other machine.
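A minimal sketch of that approach, with hypothetical pool, snapshot, and file names (the intermediate file is what gives rsync something it can resume):

    # On the office NAS: dump an incremental snapshot stream to a compressed file
    zfs send -i tank/photos@last-week tank/photos@today | bzip2 > /tmp/photos-incr.zfs.bz2

    # Push the file over the WAN; --partial lets an interrupted transfer resume
    rsync -av --partial --progress /tmp/photos-incr.zfs.bz2 backup-host:/tmp/

    # On the off-site box: unpack and apply the stream
    bunzip2 -c /tmp/photos-incr.zfs.bz2 | zfs receive tank/photos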
Here are some sources I used:
The Oracle Solaris ZFS Administrator Guide, page 211 (or the web version here), begins talking about this.
I also found a blog post that gave a simple example of this. This blog also showed piping the bit stream through bzip2 and sending it.
I haven't found any posts with replication scripts, but I did find someone who posted their backup script. That said, I didn't understand it, so it may be junk.
Many of the websites talked about setting up a cron job to do this frequently. If you go that route, you can replicate/back up with less impact on bandwidth and users, and it makes for a good disaster recovery setup because the off-site data stays more up to date. (That is, after the initial chunk of data has been transferred.)
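For example, a crontab entry for a nightly run might look like this (the script path and schedule are just an illustration):

    # /etc/crontab - run the replication script every night at 01:00
    0  1  *  *  *  root  /root/bin/zfs-offsite-backup.sh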
Again, I think you had the right idea sending snapshots; there seem to be a lot of advantages to using SEND / RECEIVE.
EDIT: Just watched video1 and video2, which may help support the use of SEND / RECEIVE and also talk about rsync (starting at 3m49s). Ben Rockwood was the speaker, and here is a link to his blog.
Solution 3:
What is the purpose of the backups and how will they need to be accessed?
If your backups are mainly for disaster recovery then ZFS snapshots might be preferable as you'll be able to get a filesystem back to the exact state it was in at the time of the last incremental.
However, if your backups are also supposed to provide users access to files that might have been accidentally deleted, corrupted, etc. then rsync could be a better option. End users may not understand the concept of snapshots or perhaps your NAS doesn't provide end users access to previous snapshots. In either case you can use rsync to provide a backup that is easily accessible to the user via the filesystem.
With rsync you can use the --backup flag to preserve backups of files that have been changed, and with the --suffix flag you can control how old versions of files are renamed. This makes it easy to create a backup where you might have dated old versions of files like
file_1.jpg
file_1.jpg.20101012
file_1.jpg.20101008
etc.
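A sketch of an invocation that produces dated copies like the ones above (paths are placeholders):

    # Keep a dated copy of any destination file that would otherwise be overwritten
    rsync -av --backup --suffix=".$(date +%Y%m%d)" /tank/photos/ backup-host:/tank/photos/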
You can easily combine this with a cronjob containing a find command to purge any old files as needed.
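For instance, on the receiving side, assuming the dated-suffix convention above and an arbitrary 90-day retention window:

    # Purge dated backup copies older than ~90 days
    find /tank/photos -type f -name '*.20[0-9][0-9][0-9][0-9][0-9][0-9]' -mtime +90 -delete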
Both solutions should be able to preserve enough metadata about files to work as a backup (rsync provides --perms, --owner, etc. flags). I use rsync to back up large amounts of data between datacenters and am very happy with the setup.
Solution 4:
ZFS should gain the 'resumable send' feature, which will allow an interrupted replication to be continued, sometime around March of this year. The feature has been completed by Matt Ahrens and some other people, and should be upstreamed soon.
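For what it's worth, as the feature is shaping up in OpenZFS, resuming an interrupted replication is expected to look roughly like this (the flag and property names are from the in-progress work, so treat them as provisional):

    # Receive with -s so a partially received stream is saved instead of discarded
    zfs send -i tank/photos@old tank/photos@new | ssh backup-host zfs receive -s tank/photos

    # After an interruption, read the resume token from the receiving side...
    ssh backup-host zfs get -H -o value receive_resume_token tank/photos

    # ...and restart the send from where it left off
    zfs send -t <token-from-previous-step> | ssh backup-host zfs receive -s tank/photos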