Network interface periodically going down with "speed changed to 0" kernel errors
Over the last couple of days, the bonded network interface on one of our servers has stopped responding.
Looking through the kernel logs, I notice that whenever the interface goes down we get many repeated errors of the form:
[76019.645601] e1000e 0000:03:00.0 p9p1: speed changed to 0 for port p9p1
[76325.575540] e1000e 0000:03:00.0 p10p1: speed changed to 0 for port p10p1
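To see how often each port is affected, the messages can be tallied per port. The following is only a minimal sketch: it assumes the log lines have the same shape as the two quoted above, and the names `EVENT_RE` and `count_speed_zero_events` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the kernel messages quoted above (assumed format):
# [timestamp] driver pci-address iface: speed changed to 0 for port iface
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+(?P<pci>\S+)\s+"
    r"(?P<iface>\S+): speed changed to 0 for port (?P<port>\S+)"
)


def count_speed_zero_events(lines):
    """Return a Counter mapping port name -> number of matching messages."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group("port")] += 1
    return counts
```

The function accepts any iterable of strings, so it could be fed an open log file or the captured output of dmesg. Lines that do not match the assumed format are ignored.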
After a quick search for similar issues, I haven't found anyone who has reported this sort of behaviour before.
To provide a few more details on the server's configuration:
- Both of the bonded network interfaces use Intel 82574L Ethernet controllers.
- The server is running Ubuntu 16.04, with Linux kernel version 4.4.0-101-generic.
- The bonded network interface has the following configuration:
auto p9p1
iface p9p1 inet manual
    bond-master bond0

auto p10p1
iface p10p1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 10.0.0.10
    gateway 10.0.0.1
    netmask 255.255.255.0
    bond-mode 4
    bond-miimon 100
    bond-lacp-rate 1
    bond-slaves p9p1 p10p1
    dns-nameservers 10.0.0.2 10.0.0.3
When the network interface goes down, restarting the networking service on the server seems to remedy the issue:
service networking restart
Has anyone experienced similar issues before, and/or does anyone have suggestions for debugging the cause of something like this?
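One way to gather more evidence while debugging is to look at the per-slave status that the Linux bonding driver exposes under /proc/net/bonding/. The sketch below is not a definitive tool: it assumes the usual "Key: value" layout of that file with a "Slave Interface" line starting each slave section, and the function names are hypothetical.

```python
from pathlib import Path


def parse_bond_slaves(text):
    """Parse bonding status text into {slave_name: {field: value}}.

    Only "Key: value" lines that follow a "Slave Interface" line are kept.
    If a key repeats within one slave section, the first value wins.
    """
    slaves = {}
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue  # blank lines and section headers without a value
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is not None:
            current.setdefault(key, value)
    return slaves


def read_bond_slaves(bond="bond0"):
    """Read and parse /proc/net/bonding/<bond>; return {} if it is absent."""
    path = Path("/proc/net/bonding") / bond
    return parse_bond_slaves(path.read_text()) if path.exists() else {}
```

Comparing fields such as "MII Status", "Speed" and "Link Failure Count" for p9p1 and p10p1 before and after an outage may show whether one slave or both are dropping.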
Solution 1:
It would appear that, for me, these issues were likely caused by a known bug in the Linux kernel v4.4.0-97-generic on Ubuntu 16.04: e1000e in 4.4.0-97-generic breaks 82574L under heavy load.
Since applying the patched test kernel v4.4.0-98, which the bug's assignee posted on the Ubuntu Linux package bug tracker, I have not seen the erroneous behaviour return, even after a weekend of fairly heavy load testing of the bonded interface.
Solution 2:
I just hit the same error messages, but in my case the issue wasn't on the server side at all. The messages were logged not only for the e1000e NIC but for all four NICs, and disconnecting and reconnecting a cable reproduced them. So NICs with different drivers showed the same behaviour. After debugging the software on the server and then the cabling (replacing the cables with new ones), what remained was the top-of-rack switch.
A switch reboot solved it.
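The reasoning here, that messages across NICs with different drivers point away from any single driver, can be checked roughly from the logs. This is a sketch under the assumption that each message starts with the driver name after the timestamp, as in the e1000e lines from the question; `drivers_affected` and `spans_multiple_drivers` are illustrative names.

```python
import re

# Assumed message shape: "[ts] driver pci-address iface: speed changed to 0 for port ..."
DRIVER_RE = re.compile(
    r"\]\s+(?P<driver>\S+)\s+\S+\s+(?P<iface>\S+): speed changed to 0 for port"
)


def drivers_affected(lines):
    """Return {driver_name: set of interface names} for matching messages."""
    affected = {}
    for line in lines:
        match = DRIVER_RE.search(line)
        if match:
            affected.setdefault(match.group("driver"), set()).add(match.group("iface"))
    return affected


def spans_multiple_drivers(lines):
    """True if the messages come from NICs using more than one driver."""
    return len(drivers_affected(lines)) > 1
```

If `spans_multiple_drivers` returns True, a bug in one driver is a less likely explanation, and cabling or the switch is worth checking.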