ZFS Heavy Write Amplification due to Free Space Fragmentation

Without deeper debugging, it is difficult to give you a definitive answer. That said, a few things are worth noting:

  • ZFS allocates blocks via spacemaps. When a spacemap is >= 96% full (80% on older builds), ZFS switches from the first-fit to the best-fit allocator. Note that this is a per-spacemap decision: an 80% full pool can have some spacemaps well above that value, possibly already over 96%. When writing to such spacemaps, ZFS uses the slower best-fit allocator.

  • a fragmented spacemap uses much more memory than a non-fragmented one. This added memory pressure can lead to spacemap thrashing. You can avoid that by setting metaslab_debug_load=1; if that does not work, try re-importing your pool and/or setting metaslab_debug_unload=1. Note that persistently keeping all spacemaps in memory will inevitably consume more RAM.

  • you could be getting burned by gang blocks but, again, it is difficult to tell whether that is the case without further debugging. A 128K recordsize, with such a good compressratio, is certainly doing you no favors with regard to fragmentation. You can read some more information here and here.
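
To make the first-fit versus best-fit distinction from the first point concrete, here is a toy sketch in Python. It is not ZFS code: the free-segment list, the sizes and the function names are all made up for illustration, and it only shows which segment each policy would pick, not how ZFS implements the search.

```python
# Toy model of the free segments inside one metaslab, as (offset, length) pairs.
# Illustrative only: this is NOT how ZFS stores or searches its free space.

def first_fit(free_segments, size):
    """Return the lowest-offset segment that can hold `size`, or None."""
    for offset, length in sorted(free_segments):
        if length >= size:
            return (offset, length)
    return None


def best_fit(free_segments, size):
    """Return the smallest segment that can still hold `size`, or None."""
    candidates = [(length, offset)
                  for offset, length in free_segments
                  if length >= size]
    if not candidates:
        return None
    length, offset = min(candidates)
    return (offset, length)


# Hypothetical fragmented free list.
free_segments = [(0, 64), (100, 16), (200, 8), (300, 12)]

print(first_fit(free_segments, 8))    # (0, 64): first hole that is big enough
print(best_fit(free_segments, 8))     # (200, 8): tightest hole that is big enough
print(best_fit(free_segments, 128))   # None: no contiguous hole of that size
```

The last call is the interesting one for a fragmented pool: when no single hole is large enough for the requested block, a real allocator has to look elsewhere, and in ZFS a failure to find contiguous space can end with the block being split into a gang block.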

Side note: I see your pool has ashift=9. I think that pure 512B devices are quite rare nowadays, especially in cloud environments. To increase performance, you could re-create your pool with ashift=12.
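
For reference, ashift is the base-2 logarithm of the sector size ZFS assumes, so ashift=9 means 512-byte sectors and ashift=12 means 4096-byte sectors. The short Python sketch below shows how a compressed block would be rounded up to a whole number of sectors under each setting. The 5000-byte block size and the helper name are hypothetical, and the model assumes a plain single-disk or mirror vdev, ignoring raidz padding and metadata.

```python
def allocated_size(physical_bytes, ashift):
    """Round a block's on-disk size up to a multiple of the sector size."""
    sector = 1 << ashift                           # 2**ashift bytes per sector
    return -(-physical_bytes // sector) * sector   # ceiling division, then scale


# Hypothetical record that compressed down to 5000 bytes.
compressed = 5000
for ashift in (9, 12):
    print(ashift, 1 << ashift, allocated_size(compressed, ashift))
# 9 512 5120
# 12 4096 8192
```

Keep the trade-off in mind: with ashift=12 small or highly compressed blocks round up to 4K multiples, so the space savings reported by compressratio may shrink somewhat after re-creating the pool.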