ZFS: good read but poor write speeds
I'm in charge of downloading and processing large amounts of financial data. Each trading day we have to add around 100 GB.
To handle this amount of data, we rent a virtual server (3 cores, 12 GB RAM) and a 30 TB block device from our university's data center.
On the virtual machine I installed Ubuntu 16.04 and ZFS on Linux, then created a ZFS pool on the 30 TB block device. The main reason for using ZFS is its compression feature, as the data is nicely compressible (~10%). Please don't be too hard on me for not following the golden rule that ZFS wants to see bare metal; I am forced to use the infrastructure as it is.
The reason for posting is that I am facing poor write speeds. The server can read data from the block device at about 50 MB/s, but writing is painfully slow at about 2-4 MB/s.
Here is some information on the pool and the dataset:
zdb
tank:
    version: 5000
    name: 'tank'
    state: 0
    txg: 872307
    pool_guid: 8319810251081423408
    errata: 0
    hostname: 'TAQ-Server'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 8319810251081423408
        children[0]:
            type: 'disk'
            id: 0
            guid: 13934768780705769781
            path: '/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d-part1'
            phys_path: '/iscsi/[email protected]%3Asn.606f4c46fd740001,0:a'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 38
            ashift: 9
            asize: 34909494181888
            is_log: 0
            DTL: 126
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
zpool get all
NAME PROPERTY VALUE SOURCE
tank size 31,8T -
tank capacity 33% -
tank altroot - default
tank health ONLINE -
tank guid 8319810251081423408 default
tank version - default
tank bootfs - default
tank delegation on default
tank autoreplace off default
tank cachefile - default
tank failmode wait default
tank listsnapshots off default
tank autoexpand off default
tank dedupditto 0 default
tank dedupratio 1.00x -
tank free 21,1T -
tank allocated 10,6T -
tank readonly off -
tank ashift 0 default
tank comment - default
tank expandsize 255G -
tank freeing 0 default
tank fragmentation 12% -
tank leaked 0 default
tank feature@async_destroy enabled local
tank feature@empty_bpobj active local
tank feature@lz4_compress active local
tank feature@spacemap_histogram active local
tank feature@enabled_txg active local
tank feature@hole_birth active local
tank feature@extensible_dataset enabled local
tank feature@embedded_data active local
tank feature@bookmarks enabled local
tank feature@filesystem_limits enabled local
tank feature@large_blocks enabled local
zfs get all tank/test
NAME PROPERTY VALUE SOURCE
tank/test type filesystem -
tank/test creation Do Jul 21 10:04 2016 -
tank/test used 19K -
tank/test available 17,0T -
tank/test referenced 19K -
tank/test compressratio 1.00x -
tank/test mounted yes -
tank/test quota none default
tank/test reservation none default
tank/test recordsize 128K default
tank/test mountpoint /tank/test inherited from tank
tank/test sharenfs off default
tank/test checksum on default
tank/test compression off default
tank/test atime off local
tank/test devices on default
tank/test exec on default
tank/test setuid on default
tank/test readonly off default
tank/test zoned off default
tank/test snapdir hidden default
tank/test aclinherit restricted default
tank/test canmount on default
tank/test xattr on default
tank/test copies 1 default
tank/test version 5 -
tank/test utf8only off -
tank/test normalization none -
tank/test casesensitivity mixed -
tank/test vscan off default
tank/test nbmand off default
tank/test sharesmb off default
tank/test refquota none default
tank/test refreservation none default
tank/test primarycache all default
tank/test secondarycache all default
tank/test usedbysnapshots 0 -
tank/test usedbydataset 19K -
tank/test usedbychildren 0 -
tank/test usedbyrefreservation 0 -
tank/test logbias latency default
tank/test dedup off default
tank/test mlslabel none default
tank/test sync disabled local
tank/test refcompressratio 1.00x -
tank/test written 19K -
tank/test logicalused 9,50K -
tank/test logicalreferenced 9,50K -
tank/test filesystem_limit none default
tank/test snapshot_limit none default
tank/test filesystem_count none default
tank/test snapshot_count none default
tank/test snapdev hidden default
tank/test acltype off default
tank/test context none default
tank/test fscontext none default
tank/test defcontext none default
tank/test rootcontext none default
tank/test relatime off default
tank/test redundant_metadata all default
tank/test overlay off default
tank/test com.sun:auto-snapshot true inherited from tank
Can you give me a hint about what I could do to improve the write speeds?
Update 1
After your comments about the storage system I went to the IT department. The guy there told me that the logical block size which the vdev's underlying device exports is actually 512 B.
This is the output of dmesg:
[ 8.948835] sd 3:0:0:0: [sdb] 68717412272 512-byte logical blocks: (35.2 TB/32.0 TiB)
[ 8.948839] sd 3:0:0:0: [sdb] 4096-byte physical blocks
[ 8.950145] sd 3:0:0:0: [sdb] Write Protect is off
[ 8.950149] sd 3:0:0:0: [sdb] Mode Sense: 43 00 10 08
[ 8.950731] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 8.985168] sdb: sdb1 sdb9
[ 8.987957] sd 3:0:0:0: [sdb] Attached SCSI disk
So 512 B logical blocks, but 4096 B physical blocks?!
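For reference, the sector sizes the kernel sees can also be queried directly; a quick check along these lines (a sketch, assuming the device is /dev/sdb as in the dmesg output above) should show the same 512 B logical / 4096 B physical split:
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdb          # logical vs. physical sector size per device
cat /sys/block/sdb/queue/logical_block_size     # expected: 512
cat /sys/block/sdb/queue/physical_block_size    # expected: 4096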
They are providing me with a temporary file system to which I can back up the data. Then I will first test the speed on the raw device before setting up the pool from scratch. I will post an update.
Update 2
I destroyed the original pool.
Then I ran some speed tests using dd; the results are OK, around 80 MB/s in both directions.
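A sequential dd test of this kind might look as follows (a sketch; the device name and sizes are placeholders, direct I/O bypasses the page cache, and the write run overwrites whatever is on the device):
dd if=/dev/zero of=/dev/sdb bs=1M count=4096 oflag=direct   # sequential write test, destroys data on the device
dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct   # sequential read test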
As a further check I created an ext4 partition on the device. I copied a large zip file to this ext4 partition, and the average write speed was around 40 MB/s. Not great, but enough for my purposes.
I continued by creating a new storage pool with the following commands:
zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
zfs set compression=on tank
zfs set atime=off tank
zfs create tank/test
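As a sanity check, the ashift actually used by the new vdev can be verified afterwards, for example:
zdb | grep ashift        # the vdev label should now show ashift: 12
zpool get ashift tank    # the pool-wide property; 0 only means auto-detect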
Then, I again copied a zip file to the newly created test file system.
The write speed is poor, just around 2-5 MB/s.
Any ideas?
Update 3
txg_sync is blocked when I copy the files. I opened a ticket on the GitHub repository of ZoL.
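For reference, this is roughly how a blocked txg_sync thread can be spotted (a sketch; PID is a placeholder to be taken from the ps output):
dmesg | grep -i "blocked for more than"   # kernel hung-task warnings name the stuck thread
ps -eo pid,stat,comm | grep txg_sync      # find the txg_sync kernel thread
cat /proc/PID/stack                       # kernel stack of the blocked thread (run as root)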
Solution 1:
You have ashift=0 on the pool (i.e. auto-detect), and the vdev ended up with ashift: 9 because the device advertises 512-byte logical sectors. That causes slow write speeds on drives that actually use 4096-byte physical sectors: ZFS aligns its writes to 512-byte boundaries, so the disks have to read-modify-write their 4096-byte sectors whenever ZFS writes 512-byte blocks.
Use ashift=12 to make ZFS align its writes to 4096-byte sectors.
You also need to check that your partition is correctly aligned with respect to the physical sectors of the actual disk in use.
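A quick way to check the partition alignment is to print the layout in sectors; with 512-byte logical sectors, each partition's start sector should be a multiple of 8 so that it falls on a 4096-byte boundary (a sketch, assuming the device from the question):
parted /dev/sdb unit s print   # start sectors should be divisible by 8
fdisk -l /dev/sdb              # alternative: shows start sectors and reported sector sizes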