Reducing ZFS stream size for offsite backup

Solution 1:

I know this is a really old question, but I've seen it in a few different places. There has always been some confusion about the values shown by zfs list as they pertain to zfs send|recv. The problem is that the value in the USED column is an estimate of the amount of space that would be released if that single snapshot were deleted, bearing in mind that earlier and later snapshots may reference the same data blocks.

Example:

zfs list -t snapshot -r montreve/cev-prod | grep 02-21
NAME                                      USED  AVAIL  REFER  MOUNTPOINT
montreve/cev-prod@2018-02-21_00-00-01     878K      -   514G  -
montreve/cev-prod@2018-02-21_sc-daily     907K      -   514G  -
montreve/cev-prod@2018-02-21_01-00-01    96.3M      -   514G  -
montreve/cev-prod@2018-02-21_02-00-01    78.5M      -   514G  -
montreve/cev-prod@2018-02-21_03-00-01    80.3M      -   514G  -
montreve/cev-prod@2018-02-21_04-00-01    84.0M      -   514G  -
montreve/cev-prod@2018-02-21_05-00-01    84.2M      -   514G  -
montreve/cev-prod@2018-02-21_06-00-01    86.7M      -   514G  -
montreve/cev-prod@2018-02-21_07-00-01    94.3M      -   514G  -
montreve/cev-prod@2018-02-21_08-00-01     101M      -   514G  -
montreve/cev-prod@2018-02-21_09-00-01     124M      -   514G  -
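
As an aside, if you want to know how much space a whole range of snapshots would free together (rather than one at a time), zfs destroy has its own dry-run mode. A sketch using the snapshot range syntax; with -n nothing is actually destroyed, and -v lists the snapshots involved and the total space that would be reclaimed:

zfs destroy -nv montreve/cev-prod@2018-02-21_00-00-01%2018-02-21_09-00-01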

To find out how much data will actually need to be transferred to reconstitute a snapshot via zfs send|recv, you need to use the dry-run feature (-n). Taking the above-listed snapshots, try:

zfs send -nv -I montreve/cev-prod@2018-02-21_00-00-01 montreve/cev-prod@2018-02-21_09-00-01
send from @2018-02-21_00-00-01 to montreve/cev-prod@2018-02-21_sc-daily estimated size is 1.99M
send from @2018-02-21_sc-daily to montreve/cev-prod@2018-02-21_01-00-01 estimated size is 624M
send from @2018-02-21_01-00-01 to montreve/cev-prod@2018-02-21_02-00-01 estimated size is 662M
send from @2018-02-21_02-00-01 to montreve/cev-prod@2018-02-21_03-00-01 estimated size is 860M
send from @2018-02-21_03-00-01 to montreve/cev-prod@2018-02-21_04-00-01 estimated size is 615M
send from @2018-02-21_04-00-01 to montreve/cev-prod@2018-02-21_05-00-01 estimated size is 821M
send from @2018-02-21_05-00-01 to montreve/cev-prod@2018-02-21_06-00-01 estimated size is 515M
send from @2018-02-21_06-00-01 to montreve/cev-prod@2018-02-21_07-00-01 estimated size is 755M
send from @2018-02-21_07-00-01 to montreve/cev-prod@2018-02-21_08-00-01 estimated size is 567M
send from @2018-02-21_08-00-01 to montreve/cev-prod@2018-02-21_09-00-01 estimated size is 687M
total estimated size is 5.96G
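
If you want to watch the real transfer against that estimate, pv can take the figure as its expected size. A sketch, assuming pv is installed and writing to a hypothetical file under /montreve/temp (the same pipe works just as well in front of ssh):

zfs send -I montreve/cev-prod@2018-02-21_00-00-01 montreve/cev-prod@2018-02-21_09-00-01 | pv -s 6G > /montreve/temp/full-day.snap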

Yikes! That 5.96G total is a whole heck of a lot more than the USED values. However, if you don't need all of the intermediate snapshots at the destination, you can use a consolidated incremental (-i rather than -I), which calculates the differential needed between any two snapshots even if there are others in between.

zfs send -nv -i montreve/cev-prod@2018-02-21_00-00-01 montreve/cev-prod@2018-02-21_09-00-01
send from @2018-02-21_00-00-01 to montreve/cev-prod@2018-02-21_09-00-01 estimated size is 3.29G
total estimated size is 3.29G

This isolates the blocks that were rewritten between the two snapshots, so only their final state is sent.
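
To actually push that consolidated increment offsite, the same -i pair goes into the real (non-dry-run) send. A sketch using a hypothetical backup host and pool name; the destination dataset must already hold the base snapshot (@2018-02-21_00-00-01) or zfs recv will reject the incremental stream:

zfs send -i montreve/cev-prod@2018-02-21_00-00-01 montreve/cev-prod@2018-02-21_09-00-01 | ssh backuphost zfs recv -Fu backup/cev-prod

Here -u keeps the received filesystem unmounted and -F rolls the destination back to its latest snapshot if it has drifted. Only the endpoint snapshot @2018-02-21_09-00-01 is created on the destination; the hourly snapshots in between simply won't exist there.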

But that's not the whole story! zfs send extracts the logical data from the source, so if you have compression activated on the source filesystem, the estimates are based on the uncompressed data that will need to be sent. For example, taking one incremental snapshot and writing it to disk, you get something close to the estimated value from the dry-run command:

zfs send -i montreve/cev-prod@2018-02-21_08-00-01 montreve/cev-prod@2018-02-21_09-00-01 > /montreve/temp/cp08-09.snap
-rw-r--r--  1 root root    682M Feb 22 10:07 cp08-09.snap

But if you pass it through gzip, you can see that the data compresses significantly:

zfs send -i montreve/cev-prod@2018-02-21_08-00-01 montreve/cev-prod@2018-02-21_09-00-01 | gzip > /montreve/temp/cp08-09.gz
-rw-r--r--  1 root root    201M Feb 22 10:08 cp08-09.gz
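
Since the goal here is offsite backup, the same trick applies on the wire: compress the stream in flight and decompress just before zfs recv on the far side. A sketch with the same hypothetical host and pool as above; gzip is used only because it's everywhere, and ssh -C is a lower-effort alternative:

zfs send -i montreve/cev-prod@2018-02-21_08-00-01 montreve/cev-prod@2018-02-21_09-00-01 | gzip | ssh backuphost "gunzip | zfs recv -Fu backup/cev-prod"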

Side note: this is based on ZFS on Linux (ZFS: Loaded module v0.6.5.6-0ubuntu16).

You will find some references to optimisations that can be applied to the send stream (-D for a deduplicated stream, -e for a more compact stream), but with this version I haven't observed any impact on the size of the streams generated from my datasets.
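
If you want to test whether those flags help on your own data, the simplest check I know of is to count the bytes of the stream with and without them; a sketch (nothing is written to any pool, but the full stream does get generated and read):

zfs send -i montreve/cev-prod@2018-02-21_08-00-01 montreve/cev-prod@2018-02-21_09-00-01 | wc -c
zfs send -D -e -i montreve/cev-prod@2018-02-21_08-00-01 montreve/cev-prod@2018-02-21_09-00-01 | wc -c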

Solution 2:

What type of email system is this, and what type of "store" technology does it use? If the mail store is already compressed in any way, then each incremental may effectively be a full send, because small logical changes can rewrite large portions of the compressed data.

Also, is dedup in play on either system? There's a chance it is enabled on the source system, and that could account for the size difference.
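
Both questions are quick to answer from the shell; a sketch using placeholder pool/dataset names (substitute your own):

zfs get compression,compressratio,dedup pool/mailstore
zpool get dedupratio pool

Compression enabled with a high compressratio, or a dedupratio above 1.00x, on the source would help explain why the destination sees more data than zfs list suggests.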