ESXi Beacon Probing Limitation - Three Switches Required?

We are currently using Link Status Only for our NIC Teams on our VM hosts, but recently ran into an issue where one of our two switches had a memory error and stopped passing traffic. All of our VM hosts went down, and most of the guests (who had been using that path out already) also stopped responding until we shut off that switch manually.

In Linux bonding, you can use arp_interval as another way to detect link failure, but in VMware there is only Beacon Probing. BP differs from arp_interval in that you don't choose a host to test connectivity against, and it needs three or more interfaces to work.

All of our VM hosts have four NICs, so the three-NIC requirement shouldn't be too much trouble. However, while the documentation only states that at least three separate physical NICs (pNICs) are required, every example also shows three separate physical switches, and it doesn't state whether that's also a requirement. While looking for an answer, I came across this blog, which states:

"Don’t use Beacon Probing if more than one pNIC in the vSwitch is connected to the same pSwitch. This could result in the same MAC address being presented on two or more ports on the pSwitch which is “a very bad thing”"

To add to the problem, we don't have three switches in our configuration, and in some of my preliminary testing I saw unexplained link flapping that may be related to the pNICs being plugged into the same switch.

So are three separate physical switches also a requirement for beacon probing? Am I relegated to link status only for my configuration? And, semi-rhetorically, why don't they have arp_interval as an option in their NIC teaming?


Solution 1:

With beacon probing, it is recommended to use at least 3 pNICs, because that is how the beacon works best. ESXi sends a broadcast beacon packet out of each physical NIC in the team. The other pNICs in the same vSwitch then listen for those packets. ESXi treats any pNIC that fails to receive the beacons from its peers as a down link. With only two pNICs, a missed beacon cannot tell ESXi which of the two links has failed, which is why a third is needed.
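To make the "why three?" reasoning concrete, here is a minimal Python sketch of the idea. It is a toy model, not ESXi's actual implementation: the function names, the vmnic labels, and the rule that a beacon arrives only when both the sender's and the receiver's paths are passing traffic are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def beacons_heard(path_ok: Dict[str, bool]) -> Dict[str, int]:
    """Count, per pNIC, how many peers' beacons it receives.

    Simplifying assumption: a beacon from A reaches B only when both
    A's and B's uplink paths are passing traffic.
    """
    heard = {}
    for rx, rx_ok in path_ok.items():
        heard[rx] = sum(
            1 for tx, tx_ok in path_ok.items()
            if tx != rx and tx_ok and rx_ok
        )
    return heard


def classify(path_ok: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return ("all-ok" | "failed" | "ambiguous", suspect pNICs)."""
    heard = beacons_heard(path_ok)
    peers = len(path_ok) - 1
    if all(count == peers for count in heard.values()):
        return "all-ok", []
    silent = sorted(nic for nic, count in heard.items() if count == 0)
    if len(silent) == len(path_ok):
        # Nobody heard anybody: no way to single out the bad link.
        return "ambiguous", silent
    return "failed", silent


# Three pNICs, one broken path: the odd one out is identifiable.
print(classify({"vmnic0": True, "vmnic1": False, "vmnic2": True}))
# -> ('failed', ['vmnic1'])

# Two pNICs, one broken path: both go silent, so the result is ambiguous.
print(classify({"vmnic0": True, "vmnic1": False}))
# -> ('ambiguous', ['vmnic0', 'vmnic1'])
```

In this model the third pNIC acts as a tie-breaker: two healthy links still hear each other, so the silent one stands out.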

Attaching all 3 pNICs to the same switch and using beacon probing is a waste of resources, because link status would do the same job and is simpler: is the link on or off? Keep in mind, though, that configuration issues (STP or blocked ports) won't show up with link status.

Beacon probing was designed for pNICs attached to different pSwitches, because its purpose is to "test" downstream switches, that is, switches beyond the ones the pNICs are directly attached to. If, say, the 3rd pSwitch downstream on the way to the iSCSI SAN fails, link status won't detect it, but BP should, and the ESXi server can then decide how to react. With link status, the host would keep trying to send packets to the SAN even though it is unreachable.
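The difference between the two detection methods can be sketched in a few lines of Python. Again, this is only an illustrative model under stated assumptions: the three-pNIC, three-switch layout, the switch names, and both function names are hypothetical, and the healthy switches are assumed to be interconnected so that beacons can cross between them. It mimics a switch that keeps its ports up but stops passing traffic, as in the question.

```python
from typing import Dict, List

# Hypothetical layout: each pNIC is cabled to its own physical switch.
UPLINKS: Dict[str, str] = {
    "vmnic0": "switchA",
    "vmnic1": "switchB",
    "vmnic2": "switchC",
}


def link_status_usable(link_up: Dict[str, bool]) -> List[str]:
    """Link Status Only: a pNIC counts as usable whenever its port has link."""
    return sorted(nic for nic, up in link_up.items() if up)


def beacon_usable(link_up: Dict[str, bool],
                  forwarding: Dict[str, bool]) -> List[str]:
    """Beacon-style check: usable only if at least one peer's beacon arrives."""
    def passes(nic: str) -> bool:
        # Traffic flows only if the port is up AND the switch forwards frames.
        return link_up[nic] and forwarding[UPLINKS[nic]]

    return sorted(
        rx for rx in UPLINKS
        if passes(rx) and any(passes(tx) for tx in UPLINKS if tx != rx)
    )


# switchA has a fault: ports stay up, but it no longer forwards frames.
link_up = {"vmnic0": True, "vmnic1": True, "vmnic2": True}
forwarding = {"switchA": False, "switchB": True, "switchC": True}

print(link_status_usable(link_up))         # ['vmnic0', 'vmnic1', 'vmnic2']
print(beacon_usable(link_up, forwarding))  # ['vmnic1', 'vmnic2']
```

Link status keeps vmnic0 in the team because its port never drops, while the beacon-style check removes it because no beacon makes it through switchA.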