watchdog: BUG: soft lockup - CPU#51 stuck
Solution 1:
For everyone facing a similar issue:
The Adaptec 72405 RAID card had outdated firmware. We flashed it, but this did not solve the issue (mentioning it because it may be part of the solution). Then, after working only with the Intel 3606 (no other disk), we had the hard lock again.
We also added some throttling points to our routines, increased vm.min_free_kbytes, and changed the scheduler to noop, but all of this seemed only to delay the hard lock.
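For anyone who wants to check what their machine is currently using before changing these two settings, here is a minimal Python sketch. It only reads the standard Linux paths /proc/sys/vm/min_free_kbytes and /sys/block/<device>/queue/scheduler. The function names and the device name "nvme0n1" are my own assumptions, not something from our setup. Note that on newer kernels that use only the multi-queue block layer, the equivalent of noop is listed as "none".

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Return the scheduler marked with [brackets] in a sysfs scheduler line.

    Example input: "noop deadline [cfq]".
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {text!r}")


def report(device: str = "nvme0n1") -> None:
    """Print the current min_free_kbytes and I/O scheduler for one device.

    "nvme0n1" is a hypothetical device name; adjust it to your system.
    """
    min_free = Path("/proc/sys/vm/min_free_kbytes").read_text().strip()
    sched = Path(f"/sys/block/{device}/queue/scheduler").read_text()
    print(f"vm.min_free_kbytes = {min_free}")
    print(f"{device} scheduler = {active_scheduler(sched)}")
```

Calling report() on the affected machine prints both values. Changing them still needs root (for example through sysctl or by writing to the same sysfs file).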
Since we have 10 other machines with the same Intel NVME 3605 with no issue until now (those have 30% less RAM and 1/4 of the CPU power of the ones that show the issue), we did not expect the problem to be related to the Intel NVME hardware. But as it was the only direction left, after some searching, we found a known and fixed bug about "rigorous write to NVME".
After installing Ubuntu 20, we did not face the hard lockup again in 60h of testing. We must disclose that we did not do proper isolation by downgrading the adapter and removing the throttling points from our routines, but we really believe that more RAM and more CPU made possible a throughput that can take the NVME to its limit and culminate in the hard lockup described in bug 1810998. We can confirm that changing the scheduler did not make any difference after Ubuntu 20.
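If you run a long test like this yourself, it may help to scan the saved kernel log for the watchdog message from the title instead of reading it by hand. Below is a small sketch of that idea. The regular expression is based only on the message text shown in the title, the function name is hypothetical, and how you collect the log lines (for example from dmesg or journalctl -k output saved to a file) is an assumption about your setup.

```python
import re

# Matches the watchdog message from the title, e.g.
# "watchdog: BUG: soft lockup - CPU#51 stuck"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck")


def lockup_cpus(log_lines):
    """Return the CPU numbers reported in soft lockup messages, in order."""
    cpus = []
    for line in log_lines:
        match = LOCKUP_RE.search(line)
        if match:
            cpus.append(int(match.group(1)))
    return cpus
```

An empty list after the test run means no soft lockup message was found in the lines you passed in. It does not prove the machine never stalled, since a full hard lock may leave nothing in the log.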