Network interface periodically going down with "speed changed to 0" kernel errors

Over the last couple of days, the bonded network interface on one of our servers has stopped responding.

Looking through the kernel logs, I notice that when the interface goes down, we get many repeated errors of the form:

[76019.645601] e1000e 0000:03:00.0 p9p1: speed changed to 0 for port p9p1
[76325.575540] e1000e 0000:03:00.0 p10p1: speed changed to 0 for port p10p1

Having had a quick search around for similar issues, I haven't been able to find anyone who has reported this sort of behaviour before.

To provide a few more details on the server's configuration:

  • Both of the bonded network interfaces are backed by Intel 82574L Ethernet controllers.
  • The server is running Ubuntu 16.04, with Linux kernel version 4.4.0-101-generic.
  • The bonded network interface has the following configuration:

    auto p9p1
    iface p9p1 inet manual
    bond-master bond0
    
    auto p10p1
    iface p10p1 inet manual
    bond-master bond0
    
    auto bond0
    iface bond0 inet static
    address 10.0.0.10
    gateway 10.0.0.1
    netmask 255.255.255.0
    bond-mode 4
    bond-miimon 100
    bond-lacp-rate 1
    bond-slaves p9p1 p10p1
    dns-nameservers 10.0.0.2 10.0.0.3
    
  • When the network interface goes down, restarting the networking service on the server (by running service networking restart) seems to remedy the issue.
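
As a possible starting point for debugging, the per-slave state that the bonding driver exposes in /proc/net/bonding/bond0 can be summarised before and after a failure. The following is only a minimal sketch: the SAMPLE text is made-up illustrative data (not output captured from this server), and the function name parse_bond_status is hypothetical.

```python
# Sketch: summarise per-slave link state from bonding status text.
# SAMPLE is illustrative data only, not output from the server above.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: p9p1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 3

Slave Interface: p10p1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 5
"""


def parse_bond_status(text):
    """Return {slave: {"mii_status", "speed", "link_failures"}} from status text."""
    slaves = {}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
            slaves[current] = {}
        elif current and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key == "MII Status":
                slaves[current]["mii_status"] = value
            elif key == "Speed":
                slaves[current]["speed"] = value
            elif key == "Link Failure Count":
                slaves[current]["link_failures"] = int(value)
    return slaves


if __name__ == "__main__":
    for name, info in parse_bond_status(SAMPLE).items():
        print(name, info)
```

On the server itself, the text would come from reading /proc/net/bonding/bond0 instead of SAMPLE. A rising link failure count on one slave only would point towards that port, cable or switch port, whereas both slaves failing together would point elsewhere.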

I was wondering if anyone has experienced similar issues before and/or has any suggestions for debugging the cause of something like this?


Solution 1:

It would appear that, for me, these issues were likely caused by a known bug in the Linux kernel v4.4.0-97-generic on Ubuntu 16.04: e1000e in 4.4.0-97-generic breaks 82574L under heavy load.

Having applied the patched test kernel (v4.4.0-98) that the bug's assignee posted on the Ubuntu Linux package bug tracker, I have not seen the erroneous behaviour return after a weekend of fairly heavy load testing of the bonded interface.

Solution 2:

I just hit the same error messages, but in my case the issue wasn't on the server side at all. The messages were printed not only for the e1000e NIC but for all 4 NICs, and they could be reproduced by disconnecting and reconnecting a cable, so different drivers show the same behaviour. After we had debugged the software on the server and then the cabling (replacing the cables with new ones), what was left was the top-of-rack switch.

A switch reboot solved it.
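
To check whether the messages are confined to one driver or show up on every NIC, the kernel log lines can be tallied per driver and port. This is a rough sketch that assumes the lines look like the ones quoted in the question; other drivers may format the prefix differently, and the name tally_speed_zero is hypothetical.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted in the question:
# "[timestamp] driver pci-address iface: speed changed to 0 for port iface"
PATTERN = re.compile(
    r"\[\s*\d+\.\d+\]\s+(?P<driver>\S+)\s+(?P<pci>\S+)\s+(?P<iface>\S+):"
    r"\s+speed changed to 0 for port"
)


def tally_speed_zero(lines):
    """Count 'speed changed to 0' messages per (driver, interface) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("driver"), match.group("iface"))] += 1
    return counts


if __name__ == "__main__":
    log = [
        "[76019.645601] e1000e 0000:03:00.0 p9p1: speed changed to 0 for port p9p1",
        "[76325.575540] e1000e 0000:03:00.0 p10p1: speed changed to 0 for port p10p1",
    ]
    for (driver, iface), n in sorted(tally_speed_zero(log).items()):
        print(driver, iface, n)
```

If the tally shows several different drivers, as it did for me, a single driver bug is a less likely explanation than something shared such as cabling or the switch.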