ZFS Heavy Write Amplification due to Free Space Fragmentation
Without deeper debugging, it is difficult to give you a definitive answer. Still, some things to note:
ZFS allocates blocks via spacemaps. When a spacemap is >= 96% full (80% for older builds), ZFS switches from the first-fit to the best-fit allocator. Note that this is a per-spacemap decision: an 80% full pool can have some spacemaps well over that value, perhaps already above 96%. When writing to such spacemaps, ZFS uses the slower best-fit allocator.
A fragmented spacemap uses much more memory than a non-fragmented one. This added memory pressure can lead to spacemap thrashing. You can avoid that by setting metaslab_debug_load=1; if it does not work, try re-importing your pool and/or setting metaslab_debug_unload=1. Note that persistently locking all spacemaps in memory will inevitably consume more RAM.
You could be burned by gang blocks but, again, it is difficult to tell whether that is the case without further debugging. Surely a 128K recordsize, with such a good compressratio, is doing you no favors with regard to fragmentation.
You can read some more information here and here.
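To make the per-spacemap allocator switch described above more concrete, here is a toy Python sketch. It is not ZFS code: the function names, the list-of-tuples free list, and the full_threshold_pct parameter are all hypothetical, and the 96% default is simply the figure quoted in this answer. It only shows why an almost-full spacemap can pick a different (and more expensive to find) segment than a roomy one.

```python
def first_fit(free_segments, size):
    """Index of the first segment, in offset order, that is large enough."""
    for i, (_offset, length) in enumerate(free_segments):
        if length >= size:
            return i
    return None


def best_fit(free_segments, size):
    """Index of the smallest segment that is still large enough.

    Unlike first_fit, this always has to look at every segment.
    """
    best = None
    for i, (_offset, length) in enumerate(free_segments):
        if length >= size and (best is None or length < free_segments[best][1]):
            best = i
    return best


def allocate(free_segments, capacity, size, full_threshold_pct=96):
    """Allocate `size` units from a toy spacemap.

    free_segments is a list of (offset, length) tuples sorted by offset.
    Returns (offset, policy_name), or None if no segment fits.
    The threshold is an assumption taken from the text, not read from ZFS.
    """
    free = sum(length for _offset, length in free_segments)
    used_pct = 100 * (capacity - free) / capacity
    if used_pct >= full_threshold_pct:
        policy, pick = "best-fit", best_fit
    else:
        policy, pick = "first-fit", first_fit

    i = pick(free_segments, size)
    if i is None:
        return None
    offset, length = free_segments[i]
    if length == size:
        del free_segments[i]                      # segment fully consumed
    else:
        free_segments[i] = (offset + size, length - size)
    return offset, policy


roomy = [(0, 300), (500, 8), (900, 100)]      # 59.2% used
crowded = [(100, 20), (400, 8), (700, 10)]    # 96.2% used
print(allocate(roomy, 1000, 8))    # (0, 'first-fit')
print(allocate(crowded, 1000, 8))  # (400, 'best-fit')
```

In the crowded case a first-fit search would have returned offset 100 and split the 20-unit segment; best-fit instead scans the whole list to find the exact 8-unit hole, which is the extra work the answer refers to.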
Side note: I see your pool has ashift=9. I think that pure 512B devices are quite rare nowadays, especially in cloud environments. To increase performance, you could re-create your pool with ashift=12.
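As a rough illustration of what the ashift value changes, the Python sketch below rounds a compressed record up to whole sectors of 2**ashift bytes. It is a simplification under stated assumptions: it ignores RAIDZ parity and padding, gang blocks, and any minimum-savings rule for compression, and the helper names are hypothetical. Under those assumptions, a compressed 128K record can land on many more distinct on-disk sizes with ashift=9 than with ashift=12, which is one plausible way the two settings interact with free-space fragmentation.

```python
def allocated_size(compressed_bytes, ashift):
    """Round a compressed block up to a whole number of 2**ashift-byte sectors."""
    sector = 1 << ashift
    sectors = -(-compressed_bytes // sector)   # ceiling division
    return sectors * sector


def distinct_sizes(recordsize, ashift):
    """Upper bound on how many different on-disk sizes one record can take."""
    return recordsize >> ashift


RECORDSIZE = 128 * 1024  # 128K, as in the question

for ashift in (9, 12):
    print(
        ashift,
        1 << ashift,                       # sector size in bytes
        allocated_size(50_000, ashift),    # on-disk size of a 50,000-byte block
        distinct_sizes(RECORDSIZE, ashift),
    )
# 9 512 50176 256
# 12 4096 53248 32
```

The trade-off is visible in the same numbers: with ashift=12 the 50,000-byte block occupies 53,248 bytes instead of 50,176, so coarser sectors cost some space efficiency in exchange for fewer, more uniform allocation sizes.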