Linux/ext4/NVMe crashes during high I/O

Solution 1:

I can't tell you exactly where the problem is, as this is just a "generic failure" somewhere in the NVMe subsystem. But I can suggest a few things to try in order to pinpoint it.

  1. Try adding the nvme_core.default_ps_max_latency_us=5500 kernel boot option. It limits which power-saving states the drive may enter on its own, based on their latency.
  2. Install the nvme-cli package (or, even better, build the most recent version from source) and check the various logs with it, such as smart-log and error-log. That might help diagnose the error further.
  3. Try booting other distros (live) and stress testing under them to see whether this is kernel version or distro related. SystemRescueCd might be a good starting point.
  4. If that doesn't help, you can try updating your motherboard firmware ("BIOS", which in fact is no longer a BIOS now that UEFI is used) to the most recent version. While this doesn't sound obvious, and the release notes might not even mention anything directly related to the NVMe/PCIe subsystems, it sometimes helps (practical knowledge).
  5. Update your NVMe drive firmware. Look for vendor-supplied tools and a manual for this.
  6. If nothing above helps or gives any clues, you might be facing a yet unknown bug or a hardware failure.
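
Steps 1 and 2 can be sketched roughly as follows. This is only a sketch: cmdline_value and collect_nvme_logs are hypothetical helper names, and /dev/nvme0 is an assumed device name that you should replace with whatever nvme list reports on your machine.

```shell
# Print the value of a kernel command-line option, or nothing if it is absent.
# $1 = option name, $2 = command-line string (e.g. the contents of /proc/cmdline)
cmdline_value() {
    for word in $2; do
        case "$word" in
            "$1"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}

# Dump the drive list plus the SMART and error logs of one controller.
# Requires nvme-cli and root; the default device name is an assumption.
collect_nvme_logs() {
    dev="${1:-/dev/nvme0}"
    nvme list
    nvme smart-log "$dev"
    nvme error-log "$dev"
}

# Usage (as root, after rebooting with the new option):
#   cmdline_value nvme_core.default_ps_max_latency_us "$(cat /proc/cmdline)"
#   collect_nvme_logs /dev/nvme0
```

On GRUB-based systems the boot option is typically appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, after which the GRUB config has to be regenerated (update-grub on Debian/Ubuntu). Both the file and the command depend on your distro, so treat them as assumptions.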

Solution 2:

The line

  kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

means that the NVMe disk controller was not responding and was reset by the NVMe driver to recover communication with the device.
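
To make the CSTS value in that message a little less opaque, here is a small sketch that decodes it. decode_csts is a hypothetical helper, and the bit positions (RDY, CFS, SHST) are taken from the NVMe base specification's Controller Status register. An all-ones value is what a failed PCIe register read typically returns, so 0xffffffff most likely means the register could not be read at all rather than that every status bit is set.

```shell
# Decode an NVMe CSTS (Controller Status) value as printed by the driver.
# $1 = register value, decimal or 0x-prefixed hex
decode_csts() {
    csts=$(( $1 ))
    if [ "$csts" -eq $(( 0xffffffff )) ]; then
        # All bits set: the read itself did not return real register data.
        echo "all-ones: controller registers unreadable"
        return 0
    fi
    # Bit 0 = RDY (ready), bit 1 = CFS (controller fatal status),
    # bits 3:2 = SHST (shutdown status).
    echo "RDY=$(( csts & 1 )) CFS=$(( (csts >> 1) & 1 )) SHST=$(( (csts >> 2) & 3 ))"
}

# Usage:
#   decode_csts 0xffffffff
#   decode_csts 0x1
```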

Such issues can be caused by:

  • malfunctioning hardware
  • unstable power (e.g. a bad PSU)
  • overly aggressive PCIe Active State Power Management (ASPM)

Setting aside bad hardware, you can try disabling ASPM with the kernel boot option pcie_aspm=off.
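
Before and after changing the option, it can help to check what ASPM is actually doing. The sketch below assumes the policy is exposed at /sys/module/pcie_aspm/parameters/policy, which is the usual location but may be absent on some kernels. active_aspm_policy is a hypothetical helper that extracts the bracketed (active) entry from that file's contents.

```shell
# Print the active ASPM policy from the contents of the sysfs policy file.
# $1 = file contents, e.g. "[default] performance powersave powersupersave"
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage:
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
#   lspci -vv | grep -i aspm                  # per-device ASPM state; run as root for full output
#   grep -o 'pcie_aspm=[^ ]*' /proc/cmdline   # confirm the boot option is in effect
```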