What is the performance impact of disabling NCQ?
NCQ (Native Command Queuing) is a technology that lets drives change the order in which queued read and write requests are serviced.
SSDs spare you the seek times that plague hard drives, but actually reading from or writing to a single NAND die is not especially fast. SSDs get around this by reading from and writing to multiple NAND dies in parallel.
To achieve this, SSDs rely on three kinds of strategies. For large IO requests, they split the request across multiple dies, writing parts of the data to separate dies in parallel (a simplified sketch of this striping follows below). On reads, the data is, hopefully, also spread across dies and can be read back in parallel.
For small IO write loads, SSDs usually cache a bunch of writes in on-board memory and then write the whole slew of them out to different NAND dies in parallel. This is why SSDs can have such high random write performance.
For small IO reads, or mixed workloads, the SSD will service the requests in the command queue out of order, trying to keep as many NAND dies working in parallel as possible. The SSD can only do this if NCQ is enabled. This can make a huge difference in IO-heavy workloads. For AHCI I've seen up to a 10x difference, and for NVMe over 100x.
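To make the striping strategy concrete, here is a minimal sketch of how a controller might map one large host request onto parallel per-die operations. The die count, page size, and round-robin mapping are all simplifying assumptions made up for the example; real firmware uses a far more complex flash translation layer with wear leveling and the like.

    from collections import Counter

    # Assumed geometry, for illustration only.
    N_DIES = 8    # NAND dies the controller can drive in parallel
    PAGE = 4096   # bytes per NAND page

    def split_across_dies(offset, length):
        """Split one large host request into per-die page operations."""
        ops = []  # (die, page_within_die) pairs that can execute in parallel
        first = offset // PAGE
        last = (offset + length - 1) // PAGE
        for page in range(first, last + 1):
            ops.append((page % N_DIES, page // N_DIES))  # round-robin striping
        return ops

    # A 64 KiB request touches 16 pages, so all 8 dies get 2 pages each
    # and can be read or programmed concurrently:
    print(Counter(die for die, _ in split_across_dies(0, 64 * 1024)))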
If you've ever seen benchmarks of an SSD from a benchmarking application like CrystalDiskMark or similar, you may have noticed that they usually provide 4k random read results at both low and high queue depths. If NCQ is disabled, the difference between these two numbers is small; with NCQ enabled, it is huge. For example, this Bit-tech review puts the QD1 4k random read result for the Samsung 950 PRO 512GB (an NVMe drive) at 60 MB/s, but the QD32 4k random read result for the same drive at 1261 MB/s.
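You can reproduce this kind of QD1-vs-QD32 gap yourself. Below is a rough Python sketch, assuming Linux and a readable device or large file at a made-up path; it approximates queue depth with a pool of threads, each keeping one 4k random read in flight, and uses O_DIRECT to bypass the page cache. A purpose-built tool like fio will give cleaner numbers, but the trend should be the same.

    import mmap
    import os
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/dev/nvme0n1"   # hypothetical path: use your device or any large file
    BLOCK = 4096            # 4k reads, aligned as O_DIRECT requires
    OPS = 4096              # reads issued by each worker

    def worker(fd, size):
        # An anonymous mmap gives a page-aligned buffer, which O_DIRECT needs.
        buf = mmap.mmap(-1, BLOCK)
        for _ in range(OPS):
            off = random.randrange(size // BLOCK) * BLOCK
            os.preadv(fd, [buf], off)
        return OPS * BLOCK

    def bench(depth):
        fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)  # O_DIRECT is Linux-specific
        size = os.lseek(fd, 0, os.SEEK_END)
        t0 = time.perf_counter()
        # Each thread keeps one read in flight, so thread count ~ queue depth.
        with ThreadPoolExecutor(max_workers=depth) as ex:
            total = sum(ex.map(lambda _: worker(fd, size), range(depth)))
        elapsed = time.perf_counter() - t0
        os.close(fd)
        print(f"QD{depth:>2}: {total / elapsed / 1e6:7.1f} MB/s")

    if __name__ == "__main__":
        for depth in (1, 32):
            bench(depth)

Reading a raw device typically needs root; pointing PATH at a multi-gigabyte file avoids that at the cost of some filesystem overhead.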
NCQ is there to rearrange queued read/write operations in order to minimize head seeks, maximizing performance. A good benchmark with mechanical disks and NCQ can be found here.
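As a toy illustration of why that reordering pays off on a mechanical disk, compare the total head travel for a queue of requests serviced in arrival order versus a greedy nearest-first order. The LBA values and the policy are invented for the example; real firmware also accounts for rotational position.

    def total_travel(start, lbas):
        """Sum the head movement needed to visit LBAs in the given order."""
        pos, travel = start, 0
        for lba in lbas:
            travel += abs(lba - pos)
            pos = lba
        return travel

    def nearest_first(start, queue):
        """Reorder the queue so each step goes to the closest pending LBA."""
        queue, pos, order = list(queue), start, []
        while queue:
            nxt = min(queue, key=lambda lba: abs(lba - pos))
            queue.remove(nxt)
            order.append(nxt)
            pos = nxt
        return order

    queue = [900, 10, 850, 40, 700, 5]
    print("FIFO travel:     ", total_travel(0, queue))                    # 4795
    print("Reordered travel:", total_travel(0, nearest_first(0, queue)))  # 900

The reordered schedule services the same six requests with a fraction of the head movement, which is exactly the win NCQ gives the drive.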
However, contrary to common belief, NCQ is even more important for SSDs. The reason is that while they have no heads, their very low latency makes command queuing critical to extracting maximum performance.
Think about it: where AHCI has only a single queue with 32 entries, NVMe has up to 64K queues with 64K entries each.