ZFS: good read but poor write speeds

I'm in charge of downloading and processing large amounts of financial data. Each trading day we have to add around 100 GB.

To handle this amount of data, we rent a virtual server (3 cores, 12 GB RAM) and a 30 TB block device from our university's data center.

On the virtual machine I installed Ubuntu 16.04 and ZFS on Linux, then created a ZFS pool on the 30 TB block device. The main reason for using ZFS is its compression feature, as the data is nicely compressible (~10%). Please don't be too hard on me for not following the golden rule that ZFS wants to see bare metal; I am forced to use the infrastructure as it is.
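Whether compression actually pays off can be checked later via the compressratio property (a quick check; dataset names as used below):

zfs get compression,compressratio tank tank/test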

I am posting because I am facing poor write speeds. The server reads from the block device at about 50 MB/s, but writing is painfully slow at about 2-4 MB/s.
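For reference, the numbers come from simple sequential tests; something along these lines reproduces them (paths and sizes are examples, not my exact commands):

# sequential write through the dataset; fdatasync makes dd wait until the data is on disk
dd if=/dev/zero of=/tank/test/ddfile bs=1M count=4096 conv=fdatasync
# drop caches first so the read is less likely to be served from memory
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/tank/test/ddfile of=/dev/null bs=1M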

Here is some information on the pool and the dataset:

zdb

tank:
version: 5000
name: 'tank'
state: 0
txg: 872307
pool_guid: 8319810251081423408
errata: 0
hostname: 'TAQ-Server'
vdev_children: 1
vdev_tree:
    type: 'root'
    id: 0
    guid: 8319810251081423408
    children[0]:
        type: 'disk'
        id: 0
        guid: 13934768780705769781
        path: '/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d-part1'
        phys_path: '/iscsi/[email protected]%3Asn.606f4c46fd740001,0:a'
        whole_disk: 1
        metaslab_array: 30
        metaslab_shift: 38
        ashift: 9
        asize: 34909494181888
        is_log: 0
        DTL: 126
        create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data

zpool get all

NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        31,8T                       -
tank  capacity                    33%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        8319810251081423408         default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  off                         default
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        21,1T                       -
tank  allocated                   10,6T                       -
tank  readonly                    off                         -
tank  ashift                      0                           default
tank  comment                     -                           default
tank  expandsize                  255G                        -
tank  freeing                     0                           default
tank  fragmentation               12%                         -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  active                      local
tank  feature@enabled_txg         active                      local
tank  feature@hole_birth          active                      local
tank  feature@extensible_dataset  enabled                     local
tank  feature@embedded_data       active                      local
tank  feature@bookmarks           enabled                     local
tank  feature@filesystem_limits   enabled                     local
tank  feature@large_blocks        enabled                     local

zfs get all tank/test

NAME       PROPERTY               VALUE                  SOURCE
tank/test  type                   filesystem             -
tank/test  creation               Do Jul 21 10:04 2016   -
tank/test  used                   19K                    -
tank/test  available              17,0T                  -
tank/test  referenced             19K                    -
tank/test  compressratio          1.00x                  -
tank/test  mounted                yes                    -
tank/test  quota                  none                   default
tank/test  reservation            none                   default
tank/test  recordsize             128K                   default
tank/test  mountpoint             /tank/test             inherited from tank
tank/test  sharenfs               off                    default
tank/test  checksum               on                     default
tank/test  compression            off                    default
tank/test  atime                  off                    local
tank/test  devices                on                     default
tank/test  exec                   on                     default
tank/test  setuid                 on                     default
tank/test  readonly               off                    default
tank/test  zoned                  off                    default
tank/test  snapdir                hidden                 default
tank/test  aclinherit             restricted             default
tank/test  canmount               on                     default
tank/test  xattr                  on                     default
tank/test  copies                 1                      default
tank/test  version                5                      -
tank/test  utf8only               off                    -
tank/test  normalization          none                   -
tank/test  casesensitivity        mixed                  -
tank/test  vscan                  off                    default
tank/test  nbmand                 off                    default
tank/test  sharesmb               off                    default
tank/test  refquota               none                   default
tank/test  refreservation         none                   default
tank/test  primarycache           all                    default
tank/test  secondarycache         all                    default
tank/test  usedbysnapshots        0                      -
tank/test  usedbydataset          19K                    -
tank/test  usedbychildren         0                      -
tank/test  usedbyrefreservation   0                      -
tank/test  logbias                latency                default
tank/test  dedup                  off                    default
tank/test  mlslabel               none                   default
tank/test  sync                   disabled               local
tank/test  refcompressratio       1.00x                  -
tank/test  written                19K                    -
tank/test  logicalused            9,50K                  -
tank/test  logicalreferenced      9,50K                  -
tank/test  filesystem_limit       none                   default
tank/test  snapshot_limit         none                   default
tank/test  filesystem_count       none                   default
tank/test  snapshot_count         none                   default
tank/test  snapdev                hidden                 default
tank/test  acltype                off                    default
tank/test  context                none                   default
tank/test  fscontext              none                   default
tank/test  defcontext             none                   default
tank/test  rootcontext            none                   default
tank/test  relatime               off                    default
tank/test  redundant_metadata     all                    default
tank/test  overlay                off                    default
tank/test  com.sun:auto-snapshot  true                   inherited from tank

Can you give me a hint as to what I could do to improve the write speed?

Update 1

After your comments about the storage system, I went to the IT department. The person there told me that the logical block size the exported block device reports is actually 512 B.

This is the output of dmesg:

[    8.948835] sd 3:0:0:0: [sdb] 68717412272 512-byte logical blocks: (35.2 TB/32.0 TiB)
[    8.948839] sd 3:0:0:0: [sdb] 4096-byte physical blocks
[    8.950145] sd 3:0:0:0: [sdb] Write Protect is off
[    8.950149] sd 3:0:0:0: [sdb] Mode Sense: 43 00 10 08
[    8.950731] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    8.985168]  sdb: sdb1 sdb9
[    8.987957] sd 3:0:0:0: [sdb] Attached SCSI disk

So 512 B logical blocks but 4096 B physical blocks?!
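The same block-size information can be read without digging through dmesg, e.g. via sysfs or blockdev (device name taken from the dmesg output above):

cat /sys/block/sdb/queue/logical_block_size    # -> 512
cat /sys/block/sdb/queue/physical_block_size   # -> 4096
blockdev --getss --getpbsz /dev/sdb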

They provided me with a temporary file system to which I can back up the data. I will then first test the speed of the raw device before setting up the pool from scratch. I will send an update.
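The copy itself will be something simple along these lines (the target mount point is an example):

rsync -aH --info=progress2 /tank/ /mnt/backup/tank/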

Update 2

I destroyed the original pool. Then I ran some speed tests using dd; the results are OK, around 80 MB/s in both directions.
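The tests were plain sequential dd runs against the raw device, roughly as follows (sizes are examples; note that writing to the raw device destroys whatever is on it):

# raw sequential write, bypassing the page cache
dd if=/dev/zero of=/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d bs=1M count=4096 oflag=direct
# raw sequential read
dd if=/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d of=/dev/null bs=1M count=4096 iflag=direct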

As a further check, I created an ext4 partition on the device. I copied a large zip file to this ext4 partition and got an average write speed of around 40 MB/s. Not great, but sufficient for my purposes.
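For completeness, the ext4 check was along these lines (partition layout and mount point are examples):

parted -s /dev/sdb mklabel gpt mkpart primary ext4 0% 100%
mkfs.ext4 /dev/sdb1
mount /dev/sdb1 /mnt/ext4test
cp large.zip /mnt/ext4test/    # throughput watched with iostat -mx 1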

I continued by creating a new storage pool with the following commands:

zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
zfs set compression=on tank
zfs set atime=off tank
zfs create tank/test

Then I again copied a zip file to the newly created test file system. The write speed is still poor, just around 2-5 MB/s.
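It is worth double-checking at this point that the new pool really ended up with ashift=12; zdb reads it from the vdev label, and the pool property should now show 12 as well:

zdb | grep ashift
zpool get ashift tank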

Any ideas?

Update 3

txg_sync is blocked when I copy the files. I opened an issue on the GitHub repository of ZFS on Linux.
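For reference, the blocked task shows up as a hung-task warning in the kernel log, and the per-pool txg kstats show whether sync is making progress (ZoL specifics; the txgs node only fills up once the zfs_txg_history module parameter is set):

dmesg | grep -i txg_sync
echo 100 > /sys/module/zfs/parameters/zfs_txg_history
cat /proc/spl/kstat/zfs/tank/txgs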


Solution 1:

Your pool ended up with ashift=9: the zpool property shows ashift=0 (i.e. auto-detect), and since the device reports 512-byte logical blocks, ZFS picked ashift: 9 for the vdev (visible in your zdb output). That causes slow write speeds on drives that use 4096-byte physical sectors: ZFS does not align writes to the 4096-byte boundaries, so the disk has to read-modify-write a whole 4096-byte physical sector whenever ZFS writes a 512-byte block.

Use ashift=12 to make ZFS align writes to 4096-byte sectors.

You also need to check that your partition is correctly aligned with respect to the physical sectors of the actual disk in use.
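A quick way to check the alignment, assuming GNU parted is available (partition 1 being the data partition ZFS created):

parted /dev/sdb align-check optimal 1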