MongoDB and ZFS bad performance: disk always busy with reads while doing only writes

This may sound a bit crazy, but I support another application that benefits from ZFS volume management attributes, but does not perform well on the native ZFS filesystem.

My solution?!?

XFS on top of ZFS zvols.

Why?!?

Because XFS performs well and eliminates the application-specific issues I was facing with native ZFS. ZFS zvols allow me to thin-provision volumes, add compression, enable snapshots and make efficient use of the storage pool. More important for my app, the ARC caching of the zvol reduced the I/O load on the disks.

See if you can follow this output:

# zpool status
  pool: vol0
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Sun Mar  2 12:09:15 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243223  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243264  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243226  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243185  ONLINE       0     0     0

ZFS zvol, created with: zfs create -o volblocksize=128K -s -V 800G vol0/pprovol (note that auto-snapshots are enabled)

# zfs get all vol0/pprovol
NAME          PROPERTY               VALUE                  SOURCE
vol0/pprovol  type                   volume                 -
vol0/pprovol  creation               Wed Feb 12 14:40 2014  -
vol0/pprovol  used                   273G                   -
vol0/pprovol  available              155G                   -
vol0/pprovol  referenced             146G                   -
vol0/pprovol  compressratio          3.68x                  -
vol0/pprovol  reservation            none                   default
vol0/pprovol  volsize                900G                   local
vol0/pprovol  volblocksize           128K                   -
vol0/pprovol  checksum               on                     default
vol0/pprovol  compression            lz4                    inherited from vol0
vol0/pprovol  readonly               off                    default
vol0/pprovol  copies                 1                      default
vol0/pprovol  refreservation         none                   default
vol0/pprovol  primarycache           all                    default
vol0/pprovol  secondarycache         all                    default
vol0/pprovol  usedbysnapshots        127G                   -
vol0/pprovol  usedbydataset          146G                   -
vol0/pprovol  usedbychildren         0                      -
vol0/pprovol  usedbyrefreservation   0                      -
vol0/pprovol  logbias                latency                default
vol0/pprovol  dedup                  off                    default
vol0/pprovol  mlslabel               none                   default
vol0/pprovol  sync                   standard               default
vol0/pprovol  refcompressratio       4.20x                  -
vol0/pprovol  written                219M                   -
vol0/pprovol  snapdev                hidden                 default
vol0/pprovol  com.sun:auto-snapshot  true                   local

Properties of ZFS zvol block device. 900GB volume (143GB actual size on disk):

# fdisk -l /dev/zd0

Disk /dev/zd0: 966.4 GB, 966367641600 bytes
3 heads, 18 sectors/track, 34952533 cylinders
Units = cylinders of 54 * 512 = 27648 bytes
Sector size (logical/physical): 512 bytes / 131072 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x48811e83

    Device Boot      Start         End      Blocks   Id  System
/dev/zd0p1              38    34952534   943717376   83  Linux

XFS information on ZFS block device:

# xfs_info /dev/zd0p1
meta-data=/dev/zd0p1             isize=256    agcount=32, agsize=7372768 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=235928576, imaxpct=25
         =                       sunit=32     swidth=32 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=65536, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

XFS mount options:

# mount
/dev/zd0p1 on /ppro type xfs (rw,noatime,logbufs=8,logbsize=256k,nobarrier)

Note: I also do this on top of HP Smart Array hardware RAID in some cases.

The pool creation looks like:

zpool create -o ashift=12 -f vol1 wwn-0x600508b1001ce908732af63b45a75a6b

With the result looking like:

# zpool status  -v
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h14m with 0 errors on Wed Feb 26 05:53:51 2014
config:

        NAME                                      STATE     READ WRITE CKSUM
        vol1                                      ONLINE       0     0     0
          wwn-0x600508b1001ce908732af63b45a75a6b  ONLINE       0     0     0

First off, it's worth stating that ZFS is not a supported filesystem for MongoDB on Linux - the recommended filesystems are ext4 or XFS. Because ZFS is not even checked for on Linux (see SERVER-13223 for example) it will not use sparse files, instead attempting to pre-allocate (fill with zeroes), and that will mean horrendous performance on a COW filesystem. Until that is fixed adding new data files will be a massive performance hit on ZFS (which you will be trying to do frequently with your writes). While you are not doing that performance should improve, but if you are adding data fast enough you may never recover between allocation hits.

Additionally, ZFS does not support Direct IO, so you will be copying data multiple times into memory (mmap, ARC, etc.) - I suspect that this is the source of your reads, but I would have to test to be sure. The last time I saw any testing with MongoDB/ZFS on Linux the performance was poor, even with the ARC on an SSD - ext4 and XFS were massively faster. ZFS might be viable for MongoDB production usage on Linux in the future, but it's not ready right now.


We were looking into running Mongo on ZFS and saw that this post raised major concerns about the performance available. Two years on we wanted to see how new releases of Mongo that use WiredTiger over mmap, performed on the now officially supported ZFS that comes with the latest Ubuntu Xenial release.

In summary it was clear that ZFS doesn't perform quite as well as EXT4 or XFS however the performance gap isn't that significant, especially when you consider the extra features that ZFS offers.

I've made a blog post about our findings and methodology. I hope you find it useful!