Interpreting SMART Logs of Newly Installed NVMe RAID0 Crashing Every Day

An Ubuntu 20.04 system was stable for a year until a 2nd and 3rd NVMe drive were installed on the motherboard to form a 2x1TB RAID0 array. Since then, the RAID0 array has been under heavy I/O load 24/7 and the system crashes about once a day.

nvme smart-log /dev/nvme1n1 and nvme smart-log /dev/nvme2n1 contain a few entries that are non-zero, particularly num_err_log_entries, Thermal Management T1 Trans Count and Thermal Management T1 Total Time.

These 3 entries are all 0 on the machine's existing first NVMe drive.

What do these 3 entries mean? How can we check the error logs that num_err_log_entries is tracking?

Is this a concern?

$ sudo nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 60 C
available_spare                     : 100%
available_spare_threshold           : 5%
percentage_used                     : 3%
data_units_read                     : 100,951,144
data_units_written                  : 107,072,517
host_read_commands                  : 152,100,781
host_write_commands                 : 179,955,901
controller_busy_time                : 1,376
power_cycles                        : 6
power_on_hours                      : 115
unsafe_shutdowns                    : 5
media_errors                        : 0
num_err_log_entries                 : 18
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 2
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 131395
Thermal Management T2 Total Time    : 0
$ sudo nvme smart-log /dev/nvme2n1
Smart Log for NVME device:nvme2n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 64 C
available_spare                     : 100%
available_spare_threshold           : 5%
percentage_used                     : 3%
data_units_read                     : 100,952,564
data_units_written                  : 107,069,314
host_read_commands                  : 152,056,852
host_write_commands                 : 179,238,524
controller_busy_time                : 1,885
power_cycles                        : 6
power_on_hours                      : 120
unsafe_shutdowns                    : 5
media_errors                        : 0
num_err_log_entries                 : 18
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 5
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 169552
Thermal Management T2 Total Time    : 0

Solution 1:

My quick guess is that the computer case has too little ventilation and the new drives are exceeding their operational temperature range under the sustained load. The Thermal Management T1 counters point the same way: Trans Count is how many times the drive entered its first thermal-throttling state, and Total Time is how many seconds it has spent there. That throttling is the device's thermal protection kicking in, and if its firmware is buggy it could be what causes the crashes.
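As a rough sanity check on the thermal theory, the counters in the question already quantify the throttling: T1 Total Time is reported in seconds, so for nvme2n1 the drive has spent a large fraction of its powered-on life throttled. A quick back-of-the-envelope calculation:

```shell
# nvme2n1 reports 169552 s in thermal state T1 over 120 power-on hours.
awk 'BEGIN {
  t1_hours = 169552 / 3600                # T1 Total Time, seconds -> hours
  printf "%.1f h throttled (%.0f%% of %d h powered on)\n",
         t1_hours, 100 * t1_hours / 120, 120
}'
# -> 47.1 h throttled (39% of 120 h powered on)
```

Roughly a third of the drive's uptime spent in thermal throttling, at 60-64 C composite temperature, strongly suggests a cooling problem rather than a coincidence.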

Another possible source of the crashes is the operating system's disk drivers failing to handle thermal management events properly.
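As for inspecting the errors behind num_err_log_entries: assuming nvme-cli is installed (it produced the smart-log output above), the Error Information log can be dumped directly, and the thermal thresholds that drive the T1/T2 transitions can be read as feature 0x10 (Host Controlled Thermal Management). A sketch, using the device paths from the question:

```shell
# Dump the Error Information log entries that num_err_log_entries counts.
sudo nvme error-log /dev/nvme1n1

# Show the thermal management thresholds (feature 0x10, HCTM) in
# human-readable form; these govern when the drive enters T1/T2 throttling.
sudo nvme get-feature /dev/nvme1 -f 0x10 -H
```

Note that error-log entries are not necessarily fatal: some drives log entries for unsupported admin commands or unsafe shutdowns (this system shows 5), so it is worth checking whether the logged entries correlate with the crash times.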