Ethernet bond interface fails after NIC replacement

I've replaced a failing 10GbE adapter in a Dell PowerEdge server that runs a bonded pair of Cat6 cables to an aggregation switch. The old NIC and its replacement are both Intel Ethernet Controller 10-Gigabit X540-AT2 cards, so I assumed (probably incorrectly) that all I'd need to do was install the card, find the names of its new ports, update the port names in the bond configuration, jack in the existing Cat6 cables, and be off. No dice.

Physically, the network ports on both the new NIC and the switch show link lights but no activity lights. On the server, I've confirmed that the new NIC is recognized by the system using lshw -class network -short:

[lshw output]
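
(For anyone comparing notes, a healthy result looks roughly like the lines below. The hardware paths are invented for illustration; only the device names match my setup.)

# illustrative output only -- H/W paths are hypothetical
H/W path          Device      Class      Description
====================================================
/0/100/3/0        enp7s0f0    network    Ethernet Controller 10-Gigabit X540-AT2
/0/100/3/0.1      enp7s0f1    network    Ethernet Controller 10-Gigabit X540-AT2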

Running ip -br -c link show provides the following additional interface status info:

[ip output]
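
(The key symptom was the NO-CARRIER flag on both slave interfaces and on bond0 itself. The output looked roughly like the following; the MAC address is invented for illustration:)

# illustrative output only -- MAC address is hypothetical
enp7s0f0    DOWN    aa:bb:cc:dd:ee:f0 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
enp7s0f1    DOWN    aa:bb:cc:dd:ee:f0 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
bond0       DOWN    aa:bb:cc:dd:ee:f0 <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP>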

Following that, I checked the systemd journal and can see the bond0 interface enslaving the slave ports at startup and pushing its MTU setting down to them. Beyond this, however, I'm not sure which of the log entries I've found are helpful for rooting out why the ports and the bond interface are down on the new NIC. I can see that the interfaces exist, and that when they fail, they fail together, as expected.

root@tsoukalos:~# journalctl | egrep 'enp7s0f*|bond0'
Feb 04 16:59:47 tsoukalos kernel: ixgbe 0000:07:00.0 enp7s0f0: renamed from eth5
Feb 04 16:59:47 tsoukalos kernel: ixgbe 0000:07:00.1 enp7s0f1: renamed from eth2
Feb 04 16:59:55 tsoukalos systemd-udevd[733]: Could not generate persistent MAC address for bond0: No such file or directory
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.0: registered PHC device on enp7s0f0
Feb 04 16:59:55 tsoukalos kernel: bond0: (slave enp7s0f0): Enslaving as a backup interface with a down link
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.1: registered PHC device on enp7s0f1
Feb 04 16:59:55 tsoukalos kernel: bond0: (slave enp7s0f1): Enslaving as a backup interface with a down link
Feb 04 16:59:55 tsoukalos kernel: ixgbe 0000:07:00.0 enp7s0f0: changing MTU from 1500 to 9000
Feb 04 16:59:56 tsoukalos kernel: ixgbe 0000:07:00.1 enp7s0f1: changing MTU from 1500 to 9000
Feb 04 16:59:56 tsoukalos kernel: vmbr2: port 1(bond0) entered blocking state
Feb 04 16:59:56 tsoukalos kernel: vmbr2: port 1(bond0) entered disabled state
Feb 04 16:59:56 tsoukalos kernel: device bond0 entered promiscuous mode
Feb 04 16:59:56 tsoukalos kernel: device enp7s0f0 entered promiscuous mode
Feb 04 16:59:56 tsoukalos kernel: device enp7s0f1 entered promiscuous mode
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device enp7s0f0
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device enp7s0f1
Feb 04 16:59:57 tsoukalos kernel: 8021q: adding VLAN 0 to HW filter on device bond0
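
(A quick way to cross-check the kernel's physical link view for one port, independent of the journal; interface name as in my config:)

# Prints "Link detected: yes" or "no" for the port
ethtool enp7s0f0 | grep -i 'link detected'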

For additional info, here's what I currently have for the relevant interfaces under /etc/network/interfaces:

auto enp7s0f0
iface enp7s0f0 inet manual
# Intel Ethernet Controller 10-Gigabit X540-AT2

auto enp7s0f1
iface enp7s0f1 inet manual
# Intel Ethernet Controller 10-Gigabit X540-AT2

auto bond0
iface bond0 inet manual
    bond-slaves enp7s0f0 enp7s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2
    mtu 9000

auto vmbr2
iface vmbr2 inet static
    address XXX.XXX.XXX.15/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
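
(While digging around I also learned that the kernel exposes the live state of the bond, including the bonding mode, each slave's MII link status, and the LACP aggregator details, which is handy for checking whether the config above actually took effect:)

# Live bond state: mode, per-slave MII status, LACP aggregator info
cat /proc/net/bonding/bond0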

I'm pretty new to both Linux and Ethernet bonding, so any tips here to get this server back online are much appreciated!

Chris


Solution 1:

Thanks to the tips above, I was able to resolve the top-level problem and then work through the issues that surfaced afterward. The reason for the "NO-CARRIER" interface status was, embarrassingly, that I had configured the bond to use NIC interfaces from an older 10GbE card in the same server and had then plugged the Ethernet cables into the interfaces on my new card! Running the ethtool blink command (ethtool -p [interface]) on the ports verified this mistake, which I corrected. Incidentally, the ports on the new card did not blink at any point, so I identified them by elimination.
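
(For anyone repeating this: ethtool's identify command takes an optional duration in seconds, which is handy when you need time to walk over to the rack. The interface name here is just an example:)

# Blink the port's identification LED for 15 seconds
ethtool -p enp7s0f0 15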

At this point, I configured the bond to use the correct interfaces and observed that one bonded port was reporting a link speed of 1000Mb/s and the other 10000Mb/s. Checking the system logs revealed that there was now an ixgbe firmware error involving the new card, likely due to a slightly different hardware revision, as @BrandonXavier suggested:

[44650.577580] ixgbe 0000:05:00.0: Warning firmware error detected FWSM: 0x00000000

Running ethtool again showed that the firmware on these ports was quite old. I downloaded and installed the latest ixgbe firmware from Intel, following the steps here, rebooted the server, and that cleared up all remaining issues!
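
(For reference, both of those checks come from ethtool; the interface name below is just an example, so substitute the ports on your own card:)

# Per-port driver and firmware versions (driver, version, firmware-version, bus-info)
ethtool -i enp5s0f0

# Negotiated link speed appears on the "Speed:" line of the full output
ethtool enp5s0f0 | grep -i speed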