Unstable 10Gb copper links, Broadcom and Intel cards to Cisco 4900M switches
We bought some Dell PowerEdge R730 servers with QLogic/Broadcom BCM57810 PCI Express cards and connected them to Cisco 4900M switches, but the 10Gb links don't work reliably. They sometimes don't connect at all, sometimes connect after a few minutes, and when they do connect they drop several times a day. A disconnect can last anywhere from 4 minutes to 2 hours.
The Cisco switches have existing 10Gb copper links to Dell PowerVault SANs, which have been stable and working for many months.
I see the disconnects in the VMware logs as messages like:
bnx2x 0000:82:00.1: vmnic5: NIC Link is Down
and
Lost network connectivity on virtual switch "vSwitch2". Physical NIC vmnic5 is down.
I can't see any helpful error codes or prior messages, only messages caused by the link drops themselves. On Windows the card simply shows as disconnected, and on the switch the port shows as down.
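For anyone wanting to reproduce the host-side check: the link state, driver and firmware that the host sees can be read from the ESXi shell (vmnic5 is just the uplink named in the log messages above; substitute the affected NIC):

    # list all uplinks with their driver, link state and speed
    esxcli network nic list
    # driver and firmware details for one uplink
    esxcli network nic get -n vmnic5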
When the links do come up, they work: jumbo-frame pings succeed, iSCSI sessions establish, and datastores appear with all paths found. But the connections are intermittent.
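For reference, the jumbo-frame test is just a don't-fragment ping at full payload size from the ESXi shell; vmk1 and 10.0.0.10 below are placeholders for an iSCSI vmkernel port and a SAN target address:

    # 8972-byte payload + headers = a 9000-byte frame; -d sets don't-fragment
    vmkping -d -s 8972 -I vmk1 10.0.0.10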
We've checked:
- The cables:
  - Originally a single Cat5e cable, now Cat6 structured cabling. The overall cable length is <7m.
  - Also connected with a new cable directly from host to switch, with no patches/joints and no other cables nearby.
- The drivers/OS:
  - Originally VMware ESXi 5.5 U2 Dell build ("ESXi 5.5.0, 2068190") with the bnx2x driver version 2.710.39.v55.2
  - Then the updated driver from vmware.com, bnx2x version 2.710.70.v50.7
  - Then ESXi 6.0, Dell build ("ESXi 6.0.0 2494585"), which has bnx2x version 2.712...
  - Then Windows Server 2012 R2 with the latest driver from Dell's site.
- The QLogic/Broadcom network card firmware; it's the latest from Dell, FFv7.12.17.
- The switch port configuration; it is simply:
    mtu 9000
    switchport access vlan NNN
  (a fuller sketch of the interface block is after this list)
- The switch ports
  - These are 8-port 10Gb RJ45 modules (WS-X4908-10G-RJ45), one per switch. The SANs take up the first four ports in each module and the new servers take up the remaining four. The problem affects all of the ports we're using for the new servers, so it's not one failing port or one failing module.
  - I haven't tried disrupting the SAN connections to test those ports; without a specific reason to think ports 1-4 are more reliable than 5-8, that would be a last resort.
- The switch interface counters, which show no errors apart from the disconnects themselves (example show commands are after this list).
- Disabling various offload capabilities in the Windows QLogic/Broadcom driver, enabling EnergyEfficientEthernet, and forcing the cards to 10Gb instead of autodetect.
- Connecting the same hosts to the same switches on 1Gb ports, which works fine; they connect very quickly every time.
- Cross-connecting two hosts directly: they connect quickly at 10Gb and hold a stable connection for days.
- We bought an Intel X540-T2 card and tried that; it behaves the same.
- Since then, we've bought Cat 6a patch cables and tested those; no change.
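For reference, the switch-side pieces above look roughly like this; TenGigabitEthernet1/5 is a placeholder for one of the affected server ports and NNN is the access VLAN as above:

    ! per-port configuration on the 4900M 10Gb copper module
    interface TenGigabitEthernet1/5
     mtu 9000
     switchport access vlan NNN

    ! from enable mode: port state, error counters and any link up/down log entries
    show interfaces TenGigabitEthernet1/5 status
    show interfaces TenGigabitEthernet1/5 counters errors
    show logging | include TenGigabitEthernet1/5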
We raised a call with Dell support; they've found nothing wrong and suggest the switches are at fault. But the switches already run stable 10Gb copper connections to the Dell PowerVault storage, and as far as I can tell from our switch monitoring logs and the SAN event logs those links never drop, so I'm reluctant to believe the Cisco switches are the problem.
They are running IOS 15.1(1)SG2, which is not the latest, but the switches are live and stable and I don't want to casually change the firmware "just in case".
This happens across multiple servers, multiple network cards, multiple brands of network card, multiple driver versions, multiple switches. It can't be a single faulty piece of hardware. It's all in an air-conditioned, power-conditioned rack.
This is the first time we've tried VMware host to switch connections at 10Gb, so we have no other configuration we can compare with or hardware we can connect to.
What else can we check?
-- Edit: We were looking at upgrading the switch firmware, but I've just found a related link. This appears to be a known issue between the Cisco WS-X4908-10G-RJ45 module and the Broadcom BCM57810 cards, dependent on the IOS version: https://supportforums.cisco.com/discussion/11755141/4900m-ws-x4908-10g-rj45-port-startup-delay has a lot of relevant discussion, and leads to:
https://tools.cisco.com/bugsearch/bug/CSCug68370
WS-X4908-10G-RJ45 and Broadcom 57810S 10Gb BASE-T interoperability issue
Symptom: 10Gbps BASE-T ports (on WS-X4908-10G-RJ45) connected to Dell 820 servers with Broadcom 57810S DP 10Gb BASE-T. On a reload of the switch, or removal/re-install of the cable, ports come up after a long time (up to 1 hour) or do not come up at all.
Conditions: 1) Module WS-X4908-10G-RJ45 2) Versions 15.0(2)SG through 15.0(2)SG7, 15.1(2)SG through 15.1(2)SG3
Workaround: Downgrade to 12.2(54)SG
That's not exactly the same server model, and it doesn't mention Intel cards, but the problem is a pretty spot-on match.
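Checking the running release and the installed module against the bug's conditions is quick from the switch CLI:

    ! confirm the running IOS release and the 10Gb copper module
    show version | include IOS Software
    show module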
Solution 1:
Please update your ESXi hosts. This is the one thing you've really missed in the troubleshooting steps.
Your 5.5 installation is almost 1 year old!!
As of this writing, the current version of ESXi 5.5 is 2718055. The current ESXi 6.0 build number is 2809209.
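You can confirm what a host is actually running from the ESXi shell:

    # report the installed ESXi version and build number
    vmware -vl
    esxcli system version get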
Dell, HP, it doesn't matter... you're still supposed to keep your ESXi installations updated. Many people overlook this, and it's the second most common cause of unplanned downtime in the environments I see.
Solution 2:
Well, it looks like it was Cisco bug https://tools.cisco.com/bugsearch/bug/CSCug68370 - and upgrading to one of the "known fixed" IOS versions (15.1(2)SG4) seems to have fixed it.
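After the upgrade it's worth confirming the running release and watching the 10Gb ports come up, for example:

    ! confirm the new release and check that the 10Gb ports now link promptly
    show version | include IOS Software
    show interfaces status | include Te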