Why does 512 behave worse than 4096 when NVMe is configured with a 512-byte sector size?
With SSDs, the block size presented to upper layers is nowhere near the NAND erase block size, but 4096 bytes IS closer to the NAND page size (the unit the flash actually reads and programs) than 512 bytes. Further, if you send data down in "clumps" of 4096 bytes rather than 512 bytes, every layer has fewer requests to process for the same total I/O, and the I/O will more often be aligned to the page size. In fact you will probably find things are faster again when using a 64k block size - the minimum block size is not the same as the optimal block size! See http://codecapsule.com/2014/02/12/coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ (especially the section about NAND-flash pages and blocks) and http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/ for details.
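To put rough numbers on the "less work" and alignment points, here is a minimal sketch. It does no real I/O; it only counts requests for a sequential 1 MiB transfer starting at offset 0. The 4096-byte page size is an assumption for illustration (real NAND page sizes vary by device and are usually not exposed), and `request_count` and `aligned_fraction` are hypothetical helper names.

```python
# Assumed NAND page size, for illustration only. Real devices differ
# and typically do not report this value to the host.
ASSUMED_PAGE_SIZE = 4096


def request_count(total_bytes, block_size):
    """Number of requests needed to move total_bytes in block_size chunks."""
    return -(-total_bytes // block_size)  # ceiling division


def aligned_fraction(total_bytes, block_size, page_size=ASSUMED_PAGE_SIZE):
    """Fraction of sequential requests (from offset 0) that start on a page boundary."""
    offsets = range(0, total_bytes, block_size)
    aligned = sum(1 for off in offsets if off % page_size == 0)
    return aligned / len(offsets)


if __name__ == "__main__":
    total = 1024 * 1024  # 1 MiB
    for bs in (512, 4096, 65536):
        print(bs, request_count(total, bs), aligned_fraction(total, bs))
```

Under these assumptions, 512-byte requests need 2048 submissions and only 1 in 8 starts on a page boundary, 4096-byte requests need 256, and 64k requests need just 16. It says nothing about what the drive's flash translation layer does internally, so treat it as arithmetic, not a benchmark.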