Why do large deletes, copies, and moves on my ZFS NAS block all other I/O?

Option #2 is most likely the reason. Dedup performs best when the dedup table (DDT) fits entirely in memory. If it doesn't, it spills over onto disk, and DDT lookups that have to go to disk are very slow, which produces the blocking behavior you're seeing.

I would think that 30 GB of RAM is plenty, but the size of the DDT depends directly on how much data is being deduplicated and how well dedup works on that data. The dedup property is set at the dataset level, but lookups are done across the entire pool, so there is just one pool-wide DDT.

See this zfs-discuss thread on calculating DDT size. Essentially there is one DDT entry per unique block in the pool, so a large amount of data with a low dedup ratio means more unique blocks and therefore a larger DDT. The system tries to keep the DDT in RAM, but parts of it may be evicted if that memory is needed by applications. Adding L2ARC cache devices helps keep DDT lookups off the main pool disks, because entries evicted from main memory (ARC) land in L2ARC instead.
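As a rough back-of-the-envelope check (a sketch, not an authoritative figure: the ~320 bytes per DDT entry is a commonly cited ballpark, and the data size, average block size, and dedup ratio below are made-up inputs you should replace with real numbers, e.g. from `zdb -DD <pool>`), you can estimate the DDT's RAM footprint like this:

```python
# Rough DDT RAM estimate -- a back-of-the-envelope sketch, not authoritative.
# Assumptions (replace with real numbers, e.g. from `zdb -DD <pool>`):
#   ~320 bytes of ARC per DDT entry, and one entry per *unique* block.

def estimate_ddt_ram_gib(logical_data_tib: float,
                         avg_block_kib: float = 64.0,
                         dedup_ratio: float = 1.5,
                         bytes_per_entry: int = 320) -> float:
    """Estimated RAM (GiB) needed to hold the whole DDT."""
    total_blocks = (logical_data_tib * 2**40) / (avg_block_kib * 2**10)
    unique_blocks = total_blocks / dedup_ratio   # low ratio -> more unique blocks
    return unique_blocks * bytes_per_entry / 2**30

# Example: 10 TiB of data at a 1.5x dedup ratio
print(f"{estimate_ddt_ram_gib(10):.0f} GiB")     # roughly 33 GiB of DDT
```

Even at a modest dedup ratio, a mid-sized pool can produce a DDT that crowds out most of a 30 GB ARC, which is why the actual per-pool numbers from zdb are worth checking.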


One thing to keep in mind with ZFS and snapshots is that nothing is free. When you remove large amounts of data while expecting snapshots to keep preserving it, the snapshots have to be updated as you delete to reflect the changes to the filesystem, and the more snapshots you have, the more work that is. I am assuming you have 6 VDEVs, basically 6 mirrors in the pool, which means these changes have to be performed against all of the disks, since data is spread fairly evenly across each VDEV.

With dedup on, the situation gets much more complicated, especially if the dedup ratio is good (and if the ratio is poor, do not use dedup at all). If the ratio is good or great, you have a large number of references, all of which are metadata, and all of them need to be updated, along with the snapshots and snapshot-related metadata. If your filesystems use small block sizes, the situation gets even more complex, because the number of references is far greater for a 4K-blocksize dataset than for a 128K one.
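To make the block-size point concrete, here is a minimal sketch (the 1 TiB figure is hypothetical, and real pools mix block sizes) of how many block pointers, and with dedup on how many DDT references, a large delete has to walk at 4 KiB versus 128 KiB record sizes:

```python
# Hypothetical illustration of why block size matters for big deletes:
# how many blocks (and thus block pointers / DDT references) does freeing
# 1 TiB of data involve? Figures are illustrative, not measured.

TIB = 2**40

def blocks_to_free(data_bytes: int, recordsize_bytes: int) -> int:
    return data_bytes // recordsize_bytes

for recordsize_kib in (4, 128):
    n = blocks_to_free(TIB, recordsize_kib * 2**10)
    print(f"recordsize={recordsize_kib:>3} KiB -> {n:>11,} blocks to dereference")

# recordsize=  4 KiB -> 268,435,456 blocks to dereference
# recordsize=128 KiB ->   8,388,608 blocks to dereference
```

That is a 32x difference in metadata updates for the same amount of data freed, and every one of those dereferences also has to be reconciled with any snapshots that still hold the data.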

Basically, there are only a few things you can do: 1) add high-performance SSDs as cache devices and tune the filesystem to use those cache devices for nothing but metadata, 2) reduce large delete/destroy operations, 3) reconsider your use of deduplication. But you cannot simply disable deduplication on a pool or a filesystem and be done with it: if it was enabled on the whole pool, you have to re-create the pool; if it was set on an individual filesystem, destroying and re-creating that filesystem will address the issue.
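To illustrate that last point, here is a toy model (plain Python, not ZFS code; the dictionary-based table and the write/free helpers are invented for illustration) of why flipping dedup off is not enough on its own: existing entries stay in the DDT until every block that references them is destroyed or rewritten.

```python
# Toy model of a dedup table -- purely illustrative, not how ZFS implements it.
from collections import defaultdict
import hashlib

ddt = defaultdict(int)              # checksum -> reference count
dedup_enabled = True

def write_block(data: bytes) -> str:
    """Write a block; only deduped writes create or bump a DDT entry."""
    key = hashlib.sha256(data).hexdigest()
    if dedup_enabled:
        ddt[key] += 1               # new entry, or one more reference
    return key

def free_block(key: str) -> None:
    """A DDT entry only disappears when its last reference is freed."""
    if key in ddt:
        ddt[key] -= 1
        if ddt[key] == 0:
            del ddt[key]

keys = [write_block(b"same payload") for _ in range(3)]   # dedup on: 1 entry, refcount 3

dedup_enabled = False               # analogue of `zfs set dedup=off`
write_block(b"same payload")        # new writes bypass the table...
print(sum(ddt.values()))            # ...but the old entry still holds 3 references

for k in keys:                      # only destroying or rewriting the old data
    free_block(k)                   # drains the table
print(len(ddt))                     # 0
```

In practice, that means the remedies above: destroy and re-create the affected filesystem (or the whole pool, if dedup was pool-wide) so the data ends up rewritten without dedup.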

At Nexenta, we are very careful about deduplication when we recommend it to customers. There are plenty of cases where it is a brilliant solution and the customer could not live without it, and in those cases we often have customers running 96 GB of RAM or more, precisely to keep more of the metadata, including the DDT, in RAM. As soon as DDT metadata gets pushed out to spinning media, everything comes to a screeching halt. Hope this helps.