Diagnosing Hardware Problem in Linux Server that's Kernel Panicking

We have a server that has been occasionally kernel panicking for a while now that we believe has a hardware problem. How would you go about troubleshooting hardware that you don't have physical access to? Are there any tools that I can use within the OS itself to diagnose different pieces of the system to try to figure out what's causing all of this panicking?

Barring anything revealing in the system's logs or vendor-supplied test tools (front panel display, Dell Diagnostics, etc.), most diagnostic procedures will require physical access to the system.

My suggestion would be to have memtest86 or memtest86+ run on the system: Most panics/random crashes are caused by bad RAM and this will usually catch it.

You're going to have a really hard time diagnosing hardware problems without access to the hardware; if it's not obvious in the logs or from smoke and crackly noises followed by neat sparkles of light then a lot of hardware troubleshooting comes down to switching parts until the issue goes away.

Thing with hardware is that when you use software to troubleshoot it, it can only tell you what is the problem, not what might be the problem. I.e., memtest86 finds a definite memory problem, you have a definite memory problem, but if memtest86 says there isn't a memory problem, you actually might still have a memory problem (I've had systems test fine but only stopped crashing after swapping the module).

It's like asking your brain to diagnose yourself. You can't trust the conclusions. :-)

Diagnosing Hardware Problem in Linux Server that's Kernel Panicking

Related

Recent Posts