strange ZFS disk space usage report for a ZVOL

We have a 100G ZVOL on a FreeBSD 10.0-CURRENT host which claims to use 176G of disk space:

root@storage01:~ # zfs get all zroot/DATA/vtest
NAME              PROPERTY              VALUE                  SOURCE
zroot/DATA/vtest  type                  volume                 -
zroot/DATA/vtest  creation              Fri May 24 20:44 2013  -
zroot/DATA/vtest  used                  176G                   -
zroot/DATA/vtest  available             10.4T                  -
zroot/DATA/vtest  referenced            176G                   -
zroot/DATA/vtest  compressratio         1.00x                  -
zroot/DATA/vtest  reservation           none                   default
zroot/DATA/vtest  volsize               100G                   local
zroot/DATA/vtest  volblocksize          8K                     -
zroot/DATA/vtest  checksum              fletcher4              inherited from zroot
zroot/DATA/vtest  compression           off                    default
zroot/DATA/vtest  readonly              off                    default
zroot/DATA/vtest  copies                1                      default
zroot/DATA/vtest  refreservation        none                   local
zroot/DATA/vtest  primarycache          all                    default
zroot/DATA/vtest  secondarycache        all                    default
zroot/DATA/vtest  usedbysnapshots       0                      -
zroot/DATA/vtest  usedbydataset         176G                   -
zroot/DATA/vtest  usedbychildren        0                      -
zroot/DATA/vtest  usedbyrefreservation  0                      -
zroot/DATA/vtest  logbias               latency                default
zroot/DATA/vtest  dedup                 off                    default
zroot/DATA/vtest  mlslabel                                     -
zroot/DATA/vtest  sync                  standard               default
zroot/DATA/vtest  refcompressratio      1.00x                  -
zroot/DATA/vtest  written               176G                   -
zroot/DATA/vtest  logicalused           87.2G                  -
zroot/DATA/vtest  logicalreferenced     87.2G                  -
root@storage01:~ # 

This looks like a bug, how can it consume more than its volsize if it does not have snapshots, reservations and children? Or maybe we are missing something?

Upd:

Results of zpool status -v:

root@storage01:~ # zpool status -v
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Thu May 30 05:45:11 2013
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk2  ONLINE       0     0     0
            gpt/disk3  ONLINE       0     0     0
            gpt/disk4  ONLINE       0     0     0
            gpt/disk5  ONLINE       0     0     0
        cache
          ada0s2       ONLINE       0     0     0

errors: No known data errors
root@storage01:~ # 

Results of zpool list:

root@storage01:~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zroot  16.2T   288G  16.0T     1%  1.05x  ONLINE  -
root@storage01:~ # 

Results of zfs list:

root@storage01:~ # zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
zroot                           237G  10.4T   288K  /
zroot/DATA                      227G  10.4T   352K  /DATA
zroot/DATA/NFS                  288K  10.4T   288K  /DATA/NFS
zroot/DATA/hv                  10.3G  10.4T   288K  /DATA/hv
zroot/DATA/hv/hv001            10.3G  10.4T   144K  -
zroot/DATA/test                 288K  10.4T   288K  /DATA/test
zroot/DATA/vimage              41.3G  10.4T   288K  /DATA/vimage
zroot/DATA/vimage/vimage_001   41.3G  10.5T  6.47G  -
zroot/DATA/vtest                176G  10.4T   176G  -
zroot/SYS                      9.78G  10.4T   288K  /SYS
zroot/SYS/ROOT                  854M  10.4T   854M  /
zroot/SYS/home                 3.67G  10.4T  3.67G  /home
zroot/SYS/tmp                   352K  10.4T   352K  /tmp
zroot/SYS/usr                  4.78G  10.4T   427M  /usr
zroot/SYS/usr/local             288K  10.4T   288K  /usr/local
zroot/SYS/usr/obj              3.50G  10.4T  3.50G  /usr/obj
zroot/SYS/usr/ports             895K  10.4T   320K  /usr/ports
zroot/SYS/usr/ports/distfiles   288K  10.4T   288K  /usr/ports/distfiles
zroot/SYS/usr/ports/packages    288K  10.4T   288K  /usr/ports/packages
zroot/SYS/usr/src               887M  10.4T   887M  /usr/src
zroot/SYS/var                   511M  10.4T  1.78M  /var
zroot/SYS/var/crash             505M  10.4T   505M  /var/crash
zroot/SYS/var/db               1.71M  10.4T  1.43M  /var/db
zroot/SYS/var/db/pkg            288K  10.4T   288K  /var/db/pkg
zroot/SYS/var/empty             288K  10.4T   288K  /var/empty
zroot/SYS/var/log               647K  10.4T   647K  /var/log
zroot/SYS/var/mail              296K  10.4T   296K  /var/mail
zroot/SYS/var/run               448K  10.4T   448K  /var/run
zroot/SYS/var/tmp               304K  10.4T   304K  /var/tmp
root@storage01:~ # 

Upd 2:

We created a number of ZVOLs with different parameters and used dd to move the content. We noticed another odd thing, disk usage was normal for ZVOLs with 16k and 128k volblocksize and it remained abnormal for a ZVOL with 8k volblocksize even after dd (so this is not a fragmentation issue):

root@storage01:~ # zfs get all zroot/DATA/vtest-3
NAME                PROPERTY              VALUE                  SOURCE
zroot/DATA/vtest-3  type                  volume                 -
zroot/DATA/vtest-3  creation              Fri May 31  7:35 2013  -
zroot/DATA/vtest-3  used                  201G                   -
zroot/DATA/vtest-3  available             10.2T                  -
zroot/DATA/vtest-3  referenced            201G                   -
zroot/DATA/vtest-3  compressratio         1.00x                  -
zroot/DATA/vtest-3  reservation           none                   default
zroot/DATA/vtest-3  volsize               100G                   local
zroot/DATA/vtest-3  volblocksize          8K                     -
zroot/DATA/vtest-3  checksum              fletcher4              inherited from zroot
zroot/DATA/vtest-3  compression           off                    default
zroot/DATA/vtest-3  readonly              off                    default
zroot/DATA/vtest-3  copies                1                      default
zroot/DATA/vtest-3  refreservation        103G                   local
zroot/DATA/vtest-3  primarycache          all                    default
zroot/DATA/vtest-3  secondarycache        all                    default
zroot/DATA/vtest-3  usedbysnapshots       0                      -
zroot/DATA/vtest-3  usedbydataset         201G                   -
zroot/DATA/vtest-3  usedbychildren        0                      -
zroot/DATA/vtest-3  usedbyrefreservation  0                      -
zroot/DATA/vtest-3  logbias               latency                default
zroot/DATA/vtest-3  dedup                 off                    default
zroot/DATA/vtest-3  mlslabel                                     -
zroot/DATA/vtest-3  sync                  standard               default
zroot/DATA/vtest-3  refcompressratio      1.00x                  -
zroot/DATA/vtest-3  written               201G                   -
zroot/DATA/vtest-3  logicalused           100G                   -
zroot/DATA/vtest-3  logicalreferenced     100G                   -
root@storage01:~ # 

and

root@storage01:~ # zfs get all zroot/DATA/vtest-16
NAME                 PROPERTY              VALUE                  SOURCE
zroot/DATA/vtest-16  type                  volume                 -
zroot/DATA/vtest-16  creation              Fri May 31  8:03 2013  -
zroot/DATA/vtest-16  used                  102G                   -
zroot/DATA/vtest-16  available             10.2T                  -
zroot/DATA/vtest-16  referenced            101G                   -
zroot/DATA/vtest-16  compressratio         1.00x                  -
zroot/DATA/vtest-16  reservation           none                   default
zroot/DATA/vtest-16  volsize               100G                   local
zroot/DATA/vtest-16  volblocksize          16K                    -
zroot/DATA/vtest-16  checksum              fletcher4              inherited from zroot
zroot/DATA/vtest-16  compression           off                    default
zroot/DATA/vtest-16  readonly              off                    default
zroot/DATA/vtest-16  copies                1                      default
zroot/DATA/vtest-16  refreservation        102G                   local
zroot/DATA/vtest-16  primarycache          all                    default
zroot/DATA/vtest-16  secondarycache        all                    default
zroot/DATA/vtest-16  usedbysnapshots       0                      -
zroot/DATA/vtest-16  usedbydataset         101G                   -
zroot/DATA/vtest-16  usedbychildren        0                      -
zroot/DATA/vtest-16  usedbyrefreservation  886M                   -
zroot/DATA/vtest-16  logbias               latency                default
zroot/DATA/vtest-16  dedup                 off                    default
zroot/DATA/vtest-16  mlslabel                                     -
zroot/DATA/vtest-16  sync                  standard               default
zroot/DATA/vtest-16  refcompressratio      1.00x                  -
zroot/DATA/vtest-16  written               101G                   -
zroot/DATA/vtest-16  logicalused           100G                   -
zroot/DATA/vtest-16  logicalreferenced     100G                   -
root@storage01:~ # 

Solution 1:

VOLSIZE represents that size of the volume as it will be seen by the clients, not the size of the volume as stored on the pool.

This difference may come from multiple sources:

  • space required for metadata
  • space required for storing multiple copies (the "copies" parameters)
  • "wasted space" due to padding while aligning blocks of "volblocksize" size to vdev structure; by vdev structure I mean two parameters: number of disks in raidz-N, and physical block size of the devices.

When creating a volume, zfs will estimate how much space it would need to use in order to be able to present a volume of "volsize" to its clients. You can see that difference in the vtest-16 and vtest-3 volumes (where refreservation is 102GB and volsize is 100GB). The calculation can be found in libzfs_dataset.c (zvol_volsize_to_reservation(uint64_t volsize, nvlist_t *props))

What is not taken into account by that calculation is the third source. The third source has little impact on vdevs that are created with disks that have 512 byte sectors. From my experiments (I have tested that by filling an entire zvol to check that out) it does make quite a difference when the vdev is created over newer 4K sector disks.

Another thing I found in my experiments is that having mirrors does not show differences between calculated refreservation and what ends up to be used.

These are my results when using 4K drives with volumes that have the default volblocksize (8K). The first column represents the number of disks in a vdev:

    raidz1  raidz2
3   135%    101%
4   148%    148%
5   162%    181%
6   162%    203%
7   171%    203%
8   171%    217%
9   181%    232%
10  181%    232%
11  181%    232%
12  181%    232%

These are my results when using 512 bytes sector drives and default volblocksize of 8K. The first column represents the number of disks in a vdev:

    raidz1  raidz2
3   101%    101%
4   104%    104%
5   101%    113%
6   105%    101%
7   108%    108%
8   110%    114%
9   101%    118%
10  102%    106%
11  103%    108%
12  104%    110%

My conclusions are the following:

  • Do not use 4K drives
  • If you really need to use them, create the volume using volblocksize greater than or equal to 32K; There is negligible performance impact, and negligible space usage overhead (larger block sizes require less padding to properly align).
  • Prefer stripes of mirrors for your pools; This layout has both performance benefits and less space related surprises like this one.
  • The estimation is clearly wrong for the cases outlined above, and this is a bug in zfs.