Windows Server 2016 guests in WSFC cluster randomly quarantined due to dropped heartbeat routes

Solution 1:

I just had the same problem with a Windows Server 2019 Failover Cluster (for Hyper-V 2019). I usually also disable IPv6 on my servers, and that is what caused the problems. The cluster threw lots of critical errors and sometimes did a hard failover, even though a file share witness was also in place and working (?!).

Errors and warnings I observed in the event log were:

FailoverClustering Event IDs:

  • 1135 (Cluster node '....' was removed from the active failover cluster membership)
  • 1146 (The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted)
  • 1673 (Cluster node '....' has entered the isolated state.)
  • 1681 (Virtual machines on node '....' have entered an unmonitored state.)

Service Control Manager Event IDs:

  • 7024 (A quorum of cluster nodes was not present to form a cluster.)
  • 7031 (The Cluster Service service terminated unexpectedly.)

FailoverClustering-Client Event IDs:

  • 81 (Extended RPC error information)

Thanks to your research I got an important clue: the hidden adapter still uses IPv6. Since the article you linked to said that disabling IPv6 on the hidden adapter was neither recommended nor mainstream, while disabling it on all other adapters was supported and tested, I wondered what was stopping it from working.

Using the following command I pulled the cluster logs (also thanks for the hint! I was not aware of this useful command):

# -Destination (Folder) in my case changed to be not on the "C:\" SATADOM (this thing is slow and has few write cycles)
# -TimeSpan (in minutes) limited to one of the Failovers because these logs get HUGE otherwise.
Get-ClusterLog -Destination "E:\" -TimeSpan 5

Unfortunately, I found the same log entries you had already posted.

I re-enabled IPv6 on all adapters and reverted my tunnel-related adapters/configs with:

Set-Net6to4Configuration -State Default
Set-NetTeredoConfiguration -Type Default
Set-NetIsatapConfiguration -State Default

That did not do the trick... Looking further, I noticed that I also always disable "those unneeded" IPv6-related firewall rules... And that turned out to be the actually important change! Those rules seem to affect the invisible adapter too.
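If you want to check whether you are affected, you can first list the state of those built-in ICMPv6 core networking rules (a quick sketch; I am assuming the "CoreNet-ICMP6-*" ID pattern matches all the relevant Neighbor Discovery rules):

```powershell
# Show the built-in ICMPv6 core networking rules and whether they are enabled
Get-NetFirewallRule -ID "CoreNet-ICMP6-*" |
    Format-Table Name, DisplayName, Enabled, Direction
```

If the Neighbor Discovery Solicitation/Advertisement rules show up as disabled here, that matches the broken state described below.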

The thing seems to be: IPv6 does not use ARP to find the MAC addresses of its communication partners. It uses the Neighbor Discovery Protocol (NDP), and NDP does not work if you disable the associated ICMPv6 firewall rules. While you can check the IPv4 ARP entries with:

arp -a

This won't show you the MAC addresses for IPv6 addresses. For those you can use:

netsh interface ipv6 show neighbors level=verbose

If you want, you can filter the output to your IPv6 adapter addresses like this:

netsh interface ipv6 show neighbors level=verbose | sls ".*fe80::1337:1337:1234:4321.*" -Context 4 |%{$_.Line,$_.Context.PostContext,""}
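If you prefer PowerShell over netsh, the Get-NetNeighbor cmdlet (available on Server 2012 and later) should show the same neighbor cache, including the NDP state:

```powershell
# PowerShell view of the IPv6 neighbor cache (the NDP counterpart of 'arp -a')
Get-NetNeighbor -AddressFamily IPv6 |
    Format-Table IPAddress, LinkLayerAddress, State, InterfaceAlias
```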

Doing that, I found that those entries are very short-lived. The state of the entry for the link-local address of the cluster partner's Microsoft "Failover Cluster Virtual Adapter" was constantly toggling between "Reachable" and "Probe". I never caught the moment in which it was "Unreachable", though, but after re-enabling the IPv6 rules the problem went away:
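To watch that toggling yourself, a small polling loop over the neighbor cache can help (a hypothetical monitoring sketch using the Get-NetNeighbor cmdlet; the link-local address below is a placeholder, substitute your cluster partner's):

```powershell
# Poll the IPv6 neighbor cache once per second for a minute and print the
# NDP state of the cluster partner's link-local address.
$partner = "fe80::1337:1337:1234:4321"   # placeholder - use your partner's address
1..60 | ForEach-Object {
    Get-NetNeighbor -AddressFamily IPv6 |
        Where-Object { $_.IPAddress -like "$partner*" } |
        ForEach-Object { "{0:HH:mm:ss}  {1,-12}  {2}" -f (Get-Date), $_.State, $_.IPAddress }
    Start-Sleep -Seconds 1
}
```

With the ICMPv6 rules disabled you should see the state flapping; after re-enabling them it should settle on "Reachable".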

Get-NetFirewallRule -ID "CoreNet-ICMP6-*" | Enable-NetFirewallRule

Somehow this MAC address seems to be exchanged between the cluster partners in some other way (probably because it is the "virtual remote" address and not a real one?), so it keeps reappearing, leading to those wild failover / quarantine / isolated states.

Probably disabling IPv6 on the invisible adapter would have helped too, but since that is not recommended, I have now decided to stop disabling IPv6-related things altogether. It's the future anyway :-)

Hope this helps another fellow IPv6-disabler!