When using Hyper-V Server 2016 with a Scale-Out File Server (the VHDX files are stored on the file server), the following error keeps appearing in the hypervisor's Event Log (SMB Client - Connectivity):

Failed to establish a network connection.

Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.

Server name: storage.DOMAIN
Server address: IP_OF_STORAGE2:445
Connection type: Wsk

Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA adapter, can also cause this issue.
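Following that guidance, basic TCP reachability of the SMB endpoint can be checked from the Hyper-V host before digging deeper (a sketch using the server name from the error above; port 5445 only matters for iWARP RDMA adapters):

```powershell
# Check TCP reachability of the SMB endpoint from the Hyper-V host.
Test-NetConnection -ComputerName storage.DOMAIN -Port 445

# Review the SMB client's own recent connectivity events.
Get-WinEvent -LogName 'Microsoft-Windows-SmbClient/Connectivity' -MaxEvents 20
```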

The environment is unstable: virtual machines keep reporting I/O errors, so storage access is indeed failing intermittently.

The environment looks like this:

  • Hyper-V Server 2016 with NIC teaming (dual 10 GbE interfaces) and VLAN tagging
  • Dual storage head servers running Hyper-V Server 2016 with File Services enabled, combined into a Failover Cluster hosting the Scale-Out File Server role (Storage1 and Storage2). The storage backend is an EMC array connected to the head nodes via iSCSI.

The nodes are connected through Cisco Nexus switches with active EtherChannel/LACP on the teamed interfaces.

I'd be more than happy to provide any information if needed.

The only relevant hit I found while googling was this TechNet thread, which has no solution: https://social.technet.microsoft.com/Forums/en-US/ef3e9243-5a22-4020-97a0-219595666cd7/smbclient-errors?forum=winserver8gen


Solution 1:

Mixing iSCSI and LACP is a bad idea. Un-team the iSCSI-facing connections and use MPIO instead of link aggregation for that traffic.
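A sketch of that change using the built-in Windows MPIO support, assuming the iSCSI NICs have been removed from the team first (standard feature and cmdlet names; a reboot is required after installing the feature):

```powershell
# Install the MPIO feature on both storage head nodes (reboot required).
Install-WindowsFeature -Name Multipath-IO

# Let the Microsoft DSM automatically claim all iSCSI-attached disks.
Enable-MSDSMAutomaticClaim -BusType iSCSI

# Balance I/O round-robin across the now un-teamed iSCSI paths.
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR
```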

Solution 2:

I agree with the previous answer: MPIO is your best bet if performance is your first concern. As for the configuration in general, you could make it less complicated, more reliable and, most importantly, more performant by using local storage in your nodes instead of a physical SAN box. Take StarWind Free and let it synchronize the data across the nodes; that should give a decent performance increase, since your clients would have the shortest path to storage (data locality means low latency).

Solution 3:

We decided to take the suggestions we got here and modify our network based on them:

  • We added a second VLAN-tagged interface to the LBFO team, which we used to enable SMB Multichannel
  • We changed the team's load-balancing algorithm from the default Dynamic to Address Hash
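The two changes above roughly correspond to the following (a sketch; the team name and VLAN ID are placeholders, not our actual values):

```powershell
# Add a second VLAN-tagged team interface to the existing LBFO team
# ("Team1" and VLAN 20 are placeholders).
Add-NetLbfoTeamNic -Team "Team1" -VlanID 20

# Switch the load-balancing algorithm from the default Dynamic to
# Address Hash (exposed as TransportPorts in PowerShell).
Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm TransportPorts

# Verify that SMB Multichannel now uses both interfaces.
Get-SmbMultichannelConnection
```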

We made these modifications a week ago; since then the error has not reappeared, and in general the SMB Client event log contains far fewer messages.

Thank you!