Why would a server lockup knock other servers off the network?

We have a couple dozen Proxmox servers (Proxmox runs on Debian), and about once a month one of them will have a kernel panic and lock up. The worst part about these lockups is that when the crashed server is on a different switch from the cluster master, all other Proxmox servers on that switch stop responding until we can find the server that has actually crashed and reboot it.

When we reported this issue on the Proxmox forum, we were advised to upgrade to Proxmox 3.1 and we've been in the process of doing that for the past several months. Unfortunately, one of the servers that we migrated to Proxmox 3.1 locked up with a kernel panic on Friday, and again all Proxmox servers that were on that same switch were unreachable over the network until we could locate the crashed server and reboot it.

Well, almost all of them... I found it interesting that the servers on that same switch that were still running Proxmox 1.9 were unaffected.

Here is a screen shot of the console of the crashed server:

[screenshot of the kernel panic on the crashed server's console]

When the server locked up, the rest of the servers on the same switch that were also running Proxmox 3.1 became unreachable and were spewing the following:

e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
...etc...

uname -a output of locked server:

Linux ------ 2.6.32-23-pve #1 SMP Tue Aug 6 07:04:06 CEST 2013 x86_64 GNU/Linux

pveversion -v output (abbreviated):

proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-23-pve: 2.6.32-109

Two questions:

  1. Any clues what would be causing the kernel panic (see image above)?

  2. Why would other servers on the same switch and version of Proxmox be knocked off the network until the locked server is rebooted? (Note: There were other servers on the same switch that were running the older 1.9 version of Proxmox that were unaffected. Also, no other Proxmox servers in the same 3.1 cluster were affected that were not on that same switch.)

Thanks in advance for any advice.


I'm almost certain your problem is not caused by a single factor but rather by a combination of factors. What those individual factors are is not certain, but most likely one of them is either the network interface or its driver, and another is found in the switch itself. Hence it is quite likely the problem can only be reproduced with this particular brand of switch combined with this particular brand of network interface.

You seem to assume the trigger for the problem is something happening on one individual server, which then has a kernel panic whose effects somehow manage to propagate across the switch. That sounds plausible, but I'd say it is about as likely that the trigger lies somewhere else.

It could be that something is happening on the switch or network interface which simultaneously causes the kernel panic and link issues on the switch. In other words, even if the server had not panicked, the trigger may very well have brought down connectivity on the switch.

One has to ask what could possibly happen on an individual server that could have this effect on the other servers. It shouldn't be possible, so the explanation has to involve a flaw somewhere in the system.

If it were just the link between the crashed server and the switch that went down or became unstable, that should have no effect on the link state to the other servers; if it does, that would count as a flaw in the switch. And traffic-wise, the other servers should see slightly less traffic once the crashed server loses connectivity, which cannot explain the problem they are seeing.

This leads me to believe a design flaw on the switch is likely.

However, a link problem is not the first explanation one would look for when trying to explain how an issue on one server could cause problems for other servers on the same switch. A broadcast storm would be a more obvious explanation. But could there be a connection between a server having a kernel panic and a broadcast storm?

Multicast packets and packets destined for unknown MAC addresses are more or less treated the same as broadcasts, so a storm of such packets would count as well. Could the panicked server be trying to send a crash dump across the network to a MAC address not recognized by the switch?
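
If you want to test that theory, a first step is simply to check whether any network crash-dump mechanism (netconsole or kdump) is configured on these nodes at all. A quick sanity check, assuming a standard Debian-based Proxmox install:

# Is netconsole loaded or configured anywhere?
lsmod | grep netconsole
grep -r netconsole /etc/modules /etc/modprobe.d/ 2>/dev/null

# Is kdump installed, and is a crash kernel reserved?
dpkg -l | grep -i kdump
grep -o 'crashkernel=[^ ]*' /proc/cmdline

If none of these turn up anything, the crash-dump-over-the-network idea becomes much less likely.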

If that is the trigger, then something is also going wrong on the other servers, because a packet storm should not cause this kind of error on the network interface. Reset adapter unexpectedly does not sound like a packet storm (which should just cause a drop in performance, not errors as such), and it does not sound like a link problem (which should have resulted in messages about the link going down, not the error you are seeing).

So it is likely there is some flaw in the network interface hardware or driver, which is triggered by the switch.
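
If you can get a shell on one of the affected (but not crashed) servers while this is happening, the NIC's own counters and the kernel log may help distinguish a flood from a driver/hardware fault. Something along these lines, with eth0 taken from your log:

# Per-driver statistics; look for error/drop counters that keep rising
ethtool -S eth0 | grep -iE 'err|drop|miss'

# Generic RX/TX counters and error totals
ip -s link show eth0

# Recent driver messages around the resets
dmesg | grep -i e1000e | tail -n 30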

A few suggestions that can give additional clues:

  1. Can you hook up some other equipment to the switch and look at what traffic appears when the problem shows up? (I predict it either goes quiet or you see a flood.) See the capture sketch after this list.
  2. Would it be possible to replace the network interface on one of the servers with a different brand using a different driver, to see whether the behaviour changes?
  3. Is it possible to replace one of the switches with a different brand? I expect replacing the switch will ensure the problem no longer affects multiple servers; what is more interesting to know is whether it also stops the kernel panics from happening.
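
For suggestion 1, a laptop or spare box on a mirror/SPAN port of that switch (or, failing that, just plugged into a free port) plus a filtered capture would show whether a storm is involved. A rough sketch, with eth0 standing in for whatever interface the capture box uses:

# Watch broadcast and multicast frames; during a storm this scrolls
# uncontrollably, otherwise it stays fairly quiet
tcpdump -i eth0 -e -n -c 500 'ether broadcast or ether multicast'

# Or save one minute of everything to a file for later inspection
tcpdump -i eth0 -w storm-sample.pcap -G 60 -W 1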

It sounds to me like a bug in the ethernet driver or the hardware/firmware, this being a red flag:

e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly
e1000e 0000:00:19.0: eth0: Reset adapter unexpectedly

I have seen these before, and they can knock a server offline. I don't remember exactly whether it was on Intel ethernet cards, but I believe so. It could even be related to a bug in the ethernet cards themselves; I remember reading something about particular Intel ethernet cards having such issues, but I lost the link to the article.

I would imagine that the trigger for this depends partly on the driver (version) being used; the fact that an older version of the software works fine seems to confirm that. Since the vendor ships its own custom kernel, try updating the ethernet driver module used for your particular ethernet hardware, either with one from your vendor or one built from the official kernel source tree.
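
To see what is currently in use before deciding whether a newer module is worth the trouble, something like this (eth0 again taken from your log):

# Driver name, driver version and NIC firmware version actually in use
ethtool -i eth0

# Version of the e1000e module shipped with the running kernel
modinfo e1000e | grep -E '^(filename|version):'

If the in-tree version turns out to be old, Intel also publishes the e1000e driver as a standalone source package that can be built against the running kernel, though whether that is a supported combination with the pve kernel is something to confirm with the vendor first.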

Also look into bonding your ethernet hardware; normally a server has two ethernet ports, onboard and/or on add-on card(s), so if one ethernet card is having this problem the other can take over. I use the word "card", but it applies to any ethernet hardware of course.
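
On a Debian-based Proxmox node an active-backup bond underneath the usual vmbr bridge is enough for that kind of failover. A minimal sketch of /etc/network/interfaces, with interface names and addresses as placeholders (the ifenslave package must be installed):

auto bond0
iface bond0 inet manual
    slaves eth0 eth1
    bond_mode active-backup
    bond_miimon 100

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

Note this only helps if the two ports are not behind the same failing chip; bonding two ports of the same onboard controller could well see both reset together.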

Replacing the ethernet hardware can also fix it: either replace the existing card or add a newer (Intel) ethernet card and use that instead. Chances are that if the issue is in the hardware/firmware, a newer card has a fix (or an older one does?).
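
Before buying anything, it is worth pinning down exactly which chipset and revision these servers have; the PCI address below is the one from your e1000e messages:

# Exact model, PCI IDs and revision of the NIC behind 0000:00:19.0
lspci -nn -s 00:19.0

With the exact device ID you can check the kernel changelogs and Intel errata for that specific chip, which makes it much easier to tell whether a "newer" card is actually different silicon.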