How to limit ZFS writes on NVME SSD in RAID1 to avoid rapid disk wear?

Currently I'm running Proxmox 5.3-7 on ZFS with a few idling Debian virtual machines. I'm using two SSDPE2MX450G7 NVMe drives in RAID 1. After 245 days of running this setup, the S.M.A.R.T. values are terrible.

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    98%
Available Spare Threshold:          10%
Percentage Used:                    21%
Data Units Read:                    29,834,793 [15.2 TB]
Data Units Written:                 765,829,644 [392 TB]
Host Read Commands:                 341,748,298
Host Write Commands:                8,048,478,631
Controller Busy Time:               1
Power Cycles:                       27
Power On Hours:                     5,890
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

I've been trying to debug what is issuing so many write commands, but I'm failing. iotop shows 400 kB/s average writes with 4 MB/s spikes.
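For reference, I was watching it with something along these lines (-a accumulates per-process totals, -o shows only processes that are actually doing I/O):

iotop -a -o -d 5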

I've also run zpool iostat, and it doesn't look bad either.

zpool iostat rpool 60
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool        342G  74.3G      0     91  10.0K  1.95M
rpool        342G  74.3G      0     90  7.80K  1.95M
rpool        342G  74.3G      0    107  7.60K  2.91M
rpool        342G  74.3G      0     85  22.1K  2.15M
rpool        342G  74.3G      0     92  8.47K  2.16M
rpool        342G  74.3G      0     90  6.67K  1.71M

I decided to take a closer look at the individual writes by echoing 1 into /proc/sys/vm/block_dump and watching /var/log/syslog.
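The debugging session looked roughly like this (block_dump is a stock kernel sysctl, so nothing here is Proxmox-specific; it is very noisy, so it should be switched back off afterwards):

echo 1 > /proc/sys/vm/block_dump       # log every block-layer write to the kernel log
tail -f /var/log/syslog | grep WRITE   # watch the writes as they arrive
echo 0 > /proc/sys/vm/block_dump       # turn the logging off again when done

Here's the result: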

Jan 25 16:56:19 proxmox kernel: [505463.283056] z_wr_int_2(438): WRITE block 310505368 on nvme0n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283058] z_wr_int_0(436): WRITE block 575539312 on nvme1n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283075] z_wr_int_1(437): WRITE block 315902632 on nvme0n1p2 (32 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283096] z_wr_int_4(562): WRITE block 460141312 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283108] z_wr_int_4(562): WRITE block 460141328 on nvme0n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283271] z_null_iss(418): WRITE block 440 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283315] z_null_iss(418): WRITE block 952 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283348] z_null_iss(418): WRITE block 878030264 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283378] z_null_iss(418): WRITE block 878030776 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283409] z_null_iss(418): WRITE block 440 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283442] z_null_iss(418): WRITE block 952 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283472] z_null_iss(418): WRITE block 878030264 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283502] z_null_iss(418): WRITE block 878030776 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.289562] z_wr_iss(434): WRITE block 460808488 on nvme1n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.289572] z_wr_iss(434): WRITE block 460808488 on nvme0n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.457366] z_wr_iss(430): WRITE block 460808744 on nvme1n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.457382] z_wr_iss(430): WRITE block 460808744 on nvme0n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.459003] z_wr_iss(431): WRITE block 460809000 on nvme1n1p2 (24 sectors)

and so on. Is there any way to limit the number of writes? As you can see, the data units written are outrageous, and I'm stuck because I'm out of ideas on how to reduce them.


There are several reasons why your real writes are so inflated. Let's establish some basic points first:

  • first, let's set a baseline: from your zpool iostat output we can infer a continuous ~1.5 MB/s write stream to each mirror leg, so over 245 days that adds up to 1.5 * 86400 * 245 ≈ 32 TB written (the arithmetic is spelled out below the list);

  • the number above already takes into account both ZFS recordsize write amplification and the double data write caused by first writing to the ZIL and then again at txg_commit (for writes smaller than zfs_immediate_write_sz).
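For clarity, the arithmetic behind that baseline, with units spelled out (1.5 MB/s is read off the write bandwidth column of the zpool iostat output above):

1.5 MB/s × 86,400 s/day × 245 days = 31,752,000 MB ≈ 31.8 TB ≈ 32 TB per mirror leg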

Given the above, to reduce ZFS-induced write amplification you should do the following (example commands after the list):

  • set a small recordsize (e.g. 16K);

  • set logbias=throughput;

  • set compression=lz4 (as suggested by @poige).
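A minimal sketch of how those properties could be applied to the pool from the question (rpool is the pool name visible in the zpool iostat output; if the virtual machines live on a child dataset, set the properties there instead):

zfs set recordsize=16K rpool
zfs set logbias=throughput rpool
zfs set compression=lz4 rpool

Keep in mind that these properties only affect newly written blocks; existing data keeps its current layout until it is rewritten.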

EDIT: to estimate the write amplification more precisely, please show the output of nvme intel smart-log-add /dev/nvme0
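If I recall the Intel plugin's field names correctly, the interesting counters in that output are nand_bytes_written and host_bytes_written, e.g.:

nvme intel smart-log-add /dev/nvme0 | grep -i bytes_written

The ratio between the two gives the drive-internal write amplification, while comparing host_bytes_written with what ZFS itself reports writing shows the filesystem-level amplification.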


In addition to the already given advice to reduce recordsize, there's no reason not to use LZ4 compression (zfs set compression=lz4 …) by default as well, thus reducing the written size even more (and sometimes very significantly).
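To see how much it actually buys you, the achieved ratio can be checked afterwards, for example:

zfs get compression,compressratio rpool

Note that compression only applies to blocks written after it is enabled, so the reported ratio will start near 1.00x and grow as existing data gets rewritten.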