At my wits' end. What could cause my server to randomly hard reset? (Seems to be related to ZFS)

I have a server that I built years ago and that has worked like a champ, but within the past few months it has become seriously unstable with no discernible pattern. I have been debugging it and swapping out parts to no avail. I have replaced almost everything in the system that I can think of that might be the cause, save the drives used for storage.

Note that the system is running CentOS 7.5.

The symptoms are that the machine will spontaneously perform a hard reset, as if the power supply were cycling or there were a sudden loss of power. It can happen once every few days or sometimes twice a day. The system can be idle or under load. There is no pattern.

I have removed everything but the bare essentials. Note that I have replaced:

The motherboard, CPU, RAM, and PSU.

If any of the RAM sticks were defective, I would expect to see logs of corrected/uncorrectable ECC errors, which I do not. If it were the CPU, I would expect something a bit more random, with some logging from a possible kernel panic. I suspected a fault in the power supply and replaced it; the problem persisted, so I tried replacing the motherboard. No change.
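
For anyone wanting to check the same thing, corrected/uncorrected ECC counts can be read via the EDAC subsystem on CentOS 7 (a sketch; it assumes the EDAC driver has registered the memory controllers as mc0, mc1, ...):

    # Corrected (ce) and uncorrected (ue) ECC error counts per controller
    grep . /sys/devices/system/edac/mc/mc*/ce_count \
           /sys/devices/system/edac/mc/mc*/ue_count

    # Or, with the edac-utils package installed:
    edac-util --report=full

    # Machine-check and EDAC events also land in the kernel log:
    journalctl -k | grep -iE 'mce|edac'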

The system was configured with two processors and 16 sticks of identical memory, so I removed one CPU and half the RAM, waited to see if it crashed, then swapped in the other set. No change in symptoms.

I started removing extra components and have arrived at the bare minimum with no change in symptoms.

  • There is never anything suggesting a hardware failure in the logs; they simply end at the point of reset.
  • There is nothing in the IPMI logs.
  • There is nothing in the UPS logs (removing the UPS did not help either).
  • The processors are not overheating; lm_sensors logs show no abnormalities.
  • I monitored system temperature, CPU and memory Vcore, fan RPM, and PSU voltages via ipmitool logs.
  • All SMART tests report PASSED.
  • I swapped the primary OS disk (root, boot, swap) onto another SSD by mirroring it with mdadm and installing GRUB.
  • Both RAID arrays (see specs below) are ZFS and report no faults; scrubs for bit rot or corruption find no issues. (A sketch of the health-check commands follows this list.)
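
For completeness, these are the kinds of commands behind the checks above (a sketch; the device name is an example, repeat per drive):

    # IPMI event log and sensor readings
    ipmitool sel list
    ipmitool sensor

    # SMART health summary for a drive
    smartctl -H /dev/sda

    # ZFS pool health, errors, and last scrub results
    zpool status -v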

I am now at a complete and utter loss. With the exception of the few remaining drives, I've run out of things to try replacing, save for the case itself.

What could possibly be causing my server to reset itself? What else can I test for? Could the fault really be coming from one of the drives?

Currently the system is specced as follows:

Base components:

  • SuperMicro H8DG6-F (Motherboard)
  • 1x AMD Opteron Processor 6328 (CPU)
  • 16GB x 8 Hynix DDR3 ECC HMT42GR7BMR4C-G7 (Memory)

Storage:

  • 1x Samsung SSD 850 PRO 128GB XFS (/ root, boot, swap)
  • 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)
  • 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)

The Western Digital RED drives sit in the case backplane, which is connected to the onboard SAS controller. All of the SSDs are in a ToughArmor MB998SP-B backplane mounted in a 5.25" bay at the front of the case and are connected to the motherboard SATA controller.

Cooling:

  • NH-U12DO A3 (CPU)
  • Fans added to chipset heatsinks (they get very hot)
  • Small heatsink added to Intel Gigabit chip
  • Thermal paste on ALL heatsinks has been replaced with Noctua NT-H1, with the exception of the small heatsinks around the CPUs, which have thermal pads

Case:

  • Supermicro SC743TQ-865B-SQ

Power Supply:

  • SuperMicro PWS-865-PQ 865W

UPS:

  • APC Back UPS Pro BX1500M 1500VA 900W

UPDATE:

I have been able to trace the stability issue to an unlikely source: software. This seemed unlikely and was not previously entertained during differential diagnosis, since a software issue (even in a kernel module) should at worst trigger a kernel panic, not a hard reset.

The source has been identified as the ZFS arrays (ZFS on Linux). I can replicate the crash by removing all disks except the OS disk and one ZFS array, then performing a scrub on that array while there are simultaneous reads on any ZFS array (the same one or another) on the system. A reproduction sketch follows.
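
In other words, the trigger is simply a scrub plus concurrent reads (a sketch; pool names data and backup match my mount points, and the dd read is just a stand-in for any read-heavy workload such as the Minecraft servers):

    # Terminal 1: start a scrub on one pool
    zpool scrub backup

    # Terminal 2: generate sustained reads on the same or another pool
    # (the path is an example; any large file on a ZFS dataset will do)
    dd if=/data/large-test-file of=/dev/null bs=1M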

Basic testing setup:

  • 1 CPU
  • 16GB x 8 Memory
  • 128GB SSD for CentOS 7.5 (Boot/Swap/Root)
  • SuperMicro H8DG6-F Motherboard
  • PWS-865-PQ 865W PSU
  • Onboard Matrox G200 Video

All disks are connected to the motherboard. No PCIe slots are populated.

Elimination of other sources:

  • CPU (swapped with a second CPU)
  • Memory (swapped with a second set of memory)
  • Motherboard (swapped with another identical board; BIOS is updated)
  • OS Hard Disk (swapped between Crucial and Samsung 128GB SSDs)
  • PSU is certified for use with this motherboard (tested against two of these)

ZFS activity:

  • Scrub on a single array
  • Access read/write on the same array OR another (exclusive)

Test 1: !! CRASH !!

  • Basic setup (as described above)
  • 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)
  • 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)

ZFS scrub on /backup. Several Minecraft servers run on /data.

Server reboots suddenly shortly thereafter.

This is similar to what the system is normally configured as but stripped down to a minimal set of components for testing and analysis.

Test 2: !! STABLE !!

  • Basic setup (as described above)
  • 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)

ZFS scrub on /backup. NO Minecraft servers active and no access to any ZFS disk.

Server is stable for over 24h and scrub completes.

At this point I suspect the /data array is at fault.

Test 3: !! CRASH !!

  • Basic setup (as described above)
  • 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)

ZFS scrub on /backup. Several Minecraft servers run on /backup.

Server reboots suddenly shortly thereafter.

At this point I suspect the /backup array may be the real culprit, as the /data array is no longer present and the system crashed identically to how it always has.

Test 4: !! CRASH !!

  • Basic setup (as described above)
  • 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)

ZFS scrub on /data. Several Minecraft servers run on /data.

Server reboots suddenly shortly thereafter.

Stability seems to be related to ZFS?

Test 5: !! STABLE !!

  • Basic setup (as described above)
  • 1x Samsung SSD 850 PRO 512GB XFS (/data-testing)

Several Minecraft servers run on /data-testing.

Server has been stable for weeks.

I am now confident that the source of the instability is related to the ZFS arrays. This is very strange, as I've run ZFS on this system for years without issue until now. It's also very strange that a fault would cause the entire system to reboot without a kernel panic or log.

I welcome any additional insight that anyone might be able to provide.


Solution 1:

Since I've been at my wits' end in a similar situation, I thought I'd post what ultimately helped me. It might not be exactly related to your situation, but maybe some other poor soul can stumble upon this and find solace.

I have a ZFS backup server that runs rsnapshot (rsync with rotation) across my company's server fleet. Every 2-3 weeks the server would just reset itself.

As @tjikkun pointed out, you should try to get some panic information. In my case it was a "Panic String: double fault" error that I found in the crash dump, along with something related to a stack overflow in a recursive ZFS routine.

There is some information around related to this, but it only seems to apply to 32-bit processors; I run 64-bit, and for that I could not find any information.

Still, the 32-bit reports hinted at the kern.kstack_pages kernel tunable, which needs to be increased in certain situations. In my case this is what did the trick: I added kern.kstack_pages=16 to /boot/loader.conf, rebooted the server, and I haven't had a crash since (six months now). It makes sense that this setting helped, because the crash happened when ZFS hit a stack limit.
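
For reference, the change amounts to the following (kern.kstack_pages and /boot/loader.conf are FreeBSD mechanisms; 16 is simply the value that worked for me):

    # Append the tunable and reboot so the loader picks it up
    echo 'kern.kstack_pages=16' >> /boot/loader.conf
    reboot

    # After the reboot, verify the new value took effect
    sysctl kern.kstack_pages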

Again, not necessarily relevant to your specific case, but I had a very hard time finding this information and I hope someone else will find it useful.

Solution 2:

Here are some steps you can take to narrow this down:

Reboot on panic

If automatic reboot on panic is turned on, you might want to turn it off for testing. Running sysctl kernel.panic shows the current value: if it is 0, automatic reboot is off; any other value is the number of seconds the kernel waits before rebooting after a panic. With sysctl -w kernel.panic=0 you can turn it off, if it is not already off. If it is set to 0 and your server still reboots itself, I would really think this is a hardware issue. If this stops the automatic rebooting, then we know the reboots are being triggered by kernel panics.
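
Concretely (a sketch for CentOS 7; the drop-in file name is just an example):

    # Show the current value: 0 = stay halted on panic,
    # N > 0 = reboot N seconds after a panic
    sysctl kernel.panic

    # Turn automatic reboot off for the running kernel
    sysctl -w kernel.panic=0

    # Persist the setting across reboots
    echo 'kernel.panic = 0' > /etc/sysctl.d/90-panic-noreboot.conf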

Reading kernel panic

If this stops the rebooting and you are lucky, the screen will show some panic information. If that is the case and you want the full details of the crash, you need to set up serial logging or netconsole.
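
A minimal netconsole setup might look like this (a sketch; the IP addresses, interface name, and MAC address are placeholders for your own sender and log host):

    # On the crashing server: stream kernel messages over UDP
    # format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
    modprobe netconsole \
        netconsole=6665@192.168.1.20/eth0,6666@192.168.1.10/00:11:22:33:44:55

    # On the log host (192.168.1.10): capture the stream (ncat syntax)
    nc -u -l 6666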

Nothing on the screen

If you are not so lucky, you might want to configure kdump and see if that gives you any information.
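
On CentOS 7, a basic kdump setup looks roughly like this (a sketch; it assumes the default crashkernel=auto memory reservation is acceptable):

    # Install the tooling and enable the kdump service
    yum install -y kexec-tools
    systemctl enable kdump
    systemctl start kdump

    # Ensure crashkernel=auto is on the kernel command line
    # (GRUB_CMDLINE_LINUX in /etc/default/grub), then:
    grub2-mkconfig -o /boot/grub2/grub.cfg

    # After the next panic, look for a vmcore under /var/crash/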

Other things to try

You might want to go back to a very early 0.7.x version of ZFS to see if you can reproduce the problem there. Another option is to try 0.8.0-rc2, but be careful with prereleases if you value your data. I don't expect data loss, but better safe than sorry.
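
Before switching versions, it's worth recording which ZFS module you're running now, e.g.:

    # Version of the currently loaded ZFS on Linux module
    cat /sys/module/zfs/version

    # Or via modinfo
    modinfo zfs | grep -iw version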