Linux/ext4/NVMe crashes during high I/O
Solution 1:
I can't tell you exactly where the problem is, as this is just a "generic failure" somewhere in the NVMe subsystem. But I can suggest some things to try to pinpoint the problem.
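- Try adding the kernel boot option nvme_core.default_ps_max_latency_us=5500. It caps the power-state latency the driver will accept, which keeps the drive out of its deepest power-saving states.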
- Install the nvme-cli package (or, even better, build the most recent version from source) and check the various logs with it, such as smart-log and error-log. That might help to diagnose the error further.
- Try booting some other distros (live) and stress testing under them to see whether this is related to the kernel version or the distro. SystemRescueCd might be a good starting point.
- If that doesn't help, try updating your motherboard firmware (the "BIOS", which on UEFI systems is not really a BIOS anymore) to the most recent version. This may not seem like an obvious fix, and the release notes may not mention anything directly related to the NVMe/PCI-E subsystems, but in practice it sometimes helps.
- Update your NVMe drive firmware. Look for the vendor-supplied tools and manual for this.
- If none of the above helps or gives any clues, you may have hit a not-yet-known bug or a hardware failure.
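After adding a boot option it is worth confirming that the running kernel actually picked it up. The following is only a minimal sketch in Python, assuming a Linux system where /proc/cmdline and the nvme_core sysfs parameter file are readable; the helper names parse_cmdline and read_text are made up for this example and are not part of any tool.

```python
from pathlib import Path

# Standard Linux locations; treated as assumptions about the system under test.
CMDLINE_PATH = "/proc/cmdline"
NVME_PARAM_PATH = "/sys/module/nvme_core/parameters/default_ps_max_latency_us"


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''.

    Simplification: values containing quoted spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def read_text(path: str):
    """Return the stripped file content, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    params = parse_cmdline(read_text(CMDLINE_PATH) or "")
    wanted = "nvme_core.default_ps_max_latency_us"
    print("on kernel command line:", params.get(wanted, "(not set)"))
    print("value in effect (sysfs):", read_text(NVME_PARAM_PATH) or "(unavailable)")
```

If the sysfs value does not match what you put on the command line, the bootloader configuration was probably not regenerated or the wrong boot entry was used.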
Solution 2:
The line kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
means that the NVMe disk controller was not responding and was reset by the NVMe driver to recover communication with the device.
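To make the two register values in that line less opaque, here is a small Python sketch that pulls them out of the message and decodes them. The function name decode_reset_line is made up for this example. The interpretation in the comments is my reading, not something the driver prints: a CSTS of all ones usually means the register read never reached the controller, and 0x10 in PCI_STATUS is just the "capabilities list" bit with no error bits set.

```python
import re

# PCI status register error bits: 8 (master data parity error),
# 11 (signaled target abort), 12 (received target abort),
# 13 (received master abort), 14 (signaled system error),
# 15 (detected parity error).
PCI_STATUS_ERROR_MASK = 0xF900
PCI_STATUS_CAP_LIST = 0x0010  # bit 4: capabilities list present


def decode_reset_line(line: str):
    """Extract CSTS and PCI_STATUS from a 'controller is down' message.

    Returns None if the line does not contain both values.
    """
    m = re.search(r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)", line)
    if m is None:
        return None
    csts = int(m.group(1), 16)
    pci_status = int(m.group(2), 16)
    return {
        "csts": csts,
        "pci_status": pci_status,
        # All ones: the read most likely did not reach the device at all,
        # so the individual CSTS bits carry no meaning in that case.
        "csts_all_ones": csts == 0xFFFFFFFF,
        "pci_capabilities_list": bool(pci_status & PCI_STATUS_CAP_LIST),
        "pci_error_bits": pci_status & PCI_STATUS_ERROR_MASK,
    }


if __name__ == "__main__":
    line = ("kernel: [158430.895045] nvme nvme1: controller is down; "
            "will reset: CSTS=0xffffffff, PCI_STATUS=0x10")
    print(decode_reset_line(line))
```

For the line quoted above this reports an all-ones CSTS and no PCI error bits, which fits a controller that stopped answering register reads rather than one that reported a fault by itself.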
Such issues can be caused by:
- malfunctioning hardware
- unstable power (i.e., a bad PSU)
- too aggressive PCIe Active State Power Management (ASPM)
Putting aside bad hardware, you can try disabling ASPM by adding pcie_aspm=off to the kernel boot command line.
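Before and after changing this, you can look at which ASPM policy the kernel reports. A rough Python sketch follows, assuming the kernel exposes /sys/module/pcie_aspm/parameters/policy with the active policy shown in square brackets; whether that file is present depends on how the kernel was built and booted, and active_aspm_policy is a name made up for this example.

```python
import re
from pathlib import Path

# Assumed location of the ASPM policy file; it may be absent on some kernels.
ASPM_POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(policy_text: str):
    """Return the bracketed (active) policy name, or None if none is marked.

    Example input: '[default] performance powersave powersupersave'
    """
    m = re.search(r"\[(\w+)\]", policy_text)
    return m.group(1) if m else None


if __name__ == "__main__":
    try:
        text = Path(ASPM_POLICY_PATH).read_text()
    except OSError:
        print("ASPM policy file not available on this system")
    else:
        print("active ASPM policy:", active_aspm_policy(text))
```

This only shows the global policy. Per-device link settings can differ, so treat the output as a quick sanity check rather than proof that ASPM is off for the NVMe drive.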