Phantom NIC issue causing eth0/1 to drop out

We are experiencing a very strange and frustrating problem. Our company has servers here in Massachusetts as well as in California, and the issues we are seeing are only on the CA hardware. Out in CA we have several hundred Dell R300 and Dell R310 servers, all connected to four HP ProCurve 4208vl switches: two switches for each server model, one for the front-end network and one for the back-end network. These systems are arranged in clusters and are used for various tests of the software OS we are developing. Many of these tests require successive and/or repeated reboots, and many, if not most, re-provision the nodes with the OS again. The problem is that, given enough time and seemingly at random, one (or many) of these systems will come up with a downed eth0 or eth1 interface.

The issue is that a node will intermittently boot up with no connectivity on eth0 or eth1, and sometimes both. The workaround is to SSH in via the back-end network (if eth0 is down) or the front-end network (if eth1 is down) and run ifdown/ifup on the downed interface.

List of workarounds:

  • service network restart
  • ifdown eth1 (or eth0), then ifup eth1 (or eth0)
  • reseat the network cables
  • reboot the server

This is a huge pain for the development team, as it stops entire clusters from running their tests until someone intervenes manually.

The worst case is when a node boots into BusyBox for an OS install and eth0 drops out: the node is then completely unreachable, since we don't have eth1 in BusyBox, and the install can't proceed because the node can't reach the PXE server to pull down the latest OS image (eth0 being down). Nodes that fall into this state stay stuck until the next time I can get someone in CA on the phone to manually reboot them.

The following has been done to attempt to resolve this seemingly random and irreproducible issue:

  • Both the ProCurve switch firmware and the R310 firmware have been updated to the latest revisions available.
  • Both switches and servers are set to autonegotiate (1000 Mb/s, full duplex).
  • We're seeing this across 4 different HP switches and roughly 200-400 Dell servers (they were all purchased at different times, so it's not just a bad lot).
  • We do not have this issue on other hardware in CA, including Dell 860s and 750s plugged into their own HP ProCurve switch.
  • The issue does not appear to happen when the nodes are plugged into a different switch (although we lack the hardware to test fully on a different switch).

Before the firmware upgrade, the HP ProCurve switch logs showed:

  • excessive broadcasts detected on port x
  • high collision or drop rate on port x
  • excessive CRC/alignment errors on port x

After the firmware upgrade we see fewer of these errors, but they still persist.

For troubleshooting, I have been logging the usual info:

{ ifconfig; for n in 0 1; do ethtool eth$n; ethtool -i eth$n; ethtool -k eth$n; ethtool -S eth$n; done; dmesg | egrep 'eth|bnx|e1000'; cat /var/log/messages; } > /tmp/eth_issues
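
Since anything installed on the node is wiped at the next provisioning, I just push that dump straight off the box whenever I catch a broken node; a rough sketch (the collection host and path below are only placeholders):

# copy the capture to a machine that never gets re-provisioned (host/path are made up)
scp /tmp/eth_issues loghost:/srv/eth-debug/$(hostname)-$(date +%F-%H%M%S).txt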

Here are some examples of output:

# ethtool -i eth0
driver: bnx2
version: 2.1.6
firmware-version: 6.4.5 bc 5.2.3 NCSI 2.0.11
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on

 # ethtool -S eth0
 NIC statistics:
 rx_bytes: 0
 rx_error_bytes: 0
 tx_bytes: 5676016
 tx_error_bytes: 0
 rx_ucast_packets: 0
 rx_mcast_packets: 0
 rx_bcast_packets: 0
 tx_ucast_packets: 0
 tx_mcast_packets: 7
 tx_bcast_packets: 10495
 tx_mac_errors: 0
 tx_carrier_errors: 0
 rx_crc_errors: 0
 rx_align_errors: 0
 tx_single_collisions: 0
 tx_multi_collisions: 0
 tx_deferred: 0
 tx_excess_collisions: 0
 tx_late_collisions: 0
 tx_total_collisions: 0
 rx_fragments: 0
 rx_jabbers: 0
 rx_undersize_packets: 0
 rx_oversize_packets: 0
 rx_64_byte_packets: 0
 rx_65_to_127_byte_packets: 0
 rx_128_to_255_byte_packets: 0
 rx_256_to_511_byte_packets: 0
 rx_512_to_1023_byte_packets: 0
 rx_1024_to_1522_byte_packets: 0
 rx_1523_to_9022_byte_packets: 0
 tx_64_byte_packets: 1054
 tx_65_to_127_byte_packets: 7
 tx_128_to_255_byte_packets: 0
 tx_256_to_511_byte_packets: 0
 tx_512_to_1023_byte_packets: 9441
 tx_1024_to_1522_byte_packets: 0
 tx_1523_to_9022_byte_packets: 0
 rx_xon_frames: 0
 rx_xoff_frames: 0
 tx_xon_frames: 0
 tx_xoff_frames: 0
 rx_mac_ctrl_frames: 0
 rx_filtered_packets: 0
 rx_ftq_discards: 0
 rx_discards: 0
 rx_fw_discards: 0

We've spent countless hours on the phone with Dell and HP and still can't figure out what is causing this issue. At first we thought the firmware upgrades would fix it, but after that went nowhere, each company now claims it cannot support the other's hardware and refuses to help any further.

Can someone help me track this issue down to its root cause? Keep in mind that I never know when or which system will be the culprit, and the OS gets re-provisioned a lot, so installing software to help log this is useless since it will be lost at the next provisioning. Any help or insight you could provide would be appreciated; any hunches or thoughts are welcome, too. Please let me know if you need more details or output posted. Thanks.


The answer is: get a better NIC and note to self to never buy Broadcom again:

http://blog.serverfault.com/2011/03/04/broadcom-die-mutha/


Honestly, I doubt it's an issue with hardware at this point; it's more likely an issue with the underlying driver in the OS you're trying to boot. In my own experience the bnx2 driver is notorious for being pretty terrible, as it's written by Broadcom to try and make open-source users happy, but not much more than that. Have you tried downloading and building the drivers directly from Broadcom? It would also be interesting to see what's in that insane amount of broadcast packets (read: try capturing packets between the NIC and the switch) and throw that at Broadcom for feedback. The old switch(es) may not have complained because they didn't bother dealing with the flood of bad packets, which would explain the high number of errors reported on the new switch.
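
For the capture, something as simple as this run on an affected node would be a start (the output file and packet count are arbitrary):

# grab a sample of broadcast frames off eth0 for offline analysis
tcpdump -i eth0 -w /tmp/eth0-bcast.pcap -c 10000 broadcast

If the ProCurve supports port mirroring, a monitor port pointed at the suspect server port would give you the switch-side view of the same traffic.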


We have a number of R300s and R310s and have never had an issue after booting them. BTW, what does Dell support say about your case?

So my guess is that something is wrong on the network side of the hardware (the ProCurve switches). However, if I were you, I would write a simple workaround:

An init script that runs at a late stage of boot and does an ifdown/ifup if no link is detected on eth0 or eth1.
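
A rough sketch of what I mean, assuming a RHEL-style system where ethtool and the ifup/ifdown scripts are available (call it from rc.local or a late init script):

#!/bin/sh
# Late-boot link check: bounce any onboard interface that came up without a carrier.
for n in 0 1; do
    if ethtool "eth$n" 2>/dev/null | grep -q "Link detected: no"; then
        logger "eth$n: no link detected after boot, running ifdown/ifup"
        ifdown "eth$n"
        ifup "eth$n"
    fi
done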

BTW: are eth0 and eth1 both onboard? Then both should be able to PXE boot. (I am not at work right now, so I am not sure about the number of onboard interfaces; I usually use the bigger brothers, the R510, R710, ...)