Should an HA failover occur in this scenario?

Solution 1:

You seem to be confusing vMotion and HA, which are different features that do different things.

vMotion is a feature that migrates a running virtual machine from one physical host to another with no downtime and minimal (milliseconds) disruption in service. It is typically used ahead of maintenance, and it requires the VM and both the source and destination hosts to be in a healthy state already.

HA is a feature that restarts failed virtual machines (or inaccessible virtual machines, if host isolation is configured). It does result in downtime for the VM, since the entire virtual machine is powered off and restarted.

Important take-away: a vMotion is not an HA failover. An HA failover is an HA failover.

vMotions are triggered by the following things:

  1. A user initiates a vMotion
  2. DRS initiates a vMotion in response to load conditions (thresholds set by the DRS aggressiveness setting), affinity rule violations, or host updates triggered through VUM

HA failovers are triggered by the following things:

  1. A host in your HA cluster has detected that another host in the cluster has failed and is not responding to HA heartbeats using either the configured management networks or heartbeat datastores
  2. Isolation response is configured to shut down or power off VMs, and the host can no longer speak to a majority of cluster nodes, triggering a VM shutdown and subsequent HA failure detection from the remaining majority of the cluster (if there is one, which is one of the dangers of isolation response)
  3. The cluster/VM are configured for VM Monitoring through VMware Tools, the hypervisor has not received a heartbeat for a specific amount of time, and no disk or network activity has occurred for 120 seconds
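
To make the third trigger concrete, here is a rough sketch of the VM Monitoring decision as described above: both conditions have to hold before the VM is considered failed. This is only an illustration, not VMware's code. The names `VmStatus` and `should_reset_vm` are made up, and the 30-second heartbeat timeout is an assumed placeholder for the "specific amount of time", which depends on the configured monitoring sensitivity. Only the 120-second I/O window comes from the description above.

```python
from dataclasses import dataclass

# From the description above: no disk or network activity for 120 seconds.
IO_QUIET_WINDOW_S = 120.0


@dataclass
class VmStatus:
    """Hypothetical snapshot of what the hypervisor knows about one VM."""
    seconds_since_tools_heartbeat: float
    seconds_since_disk_or_network_io: float


def should_reset_vm(status: VmStatus, heartbeat_timeout_s: float = 30.0) -> bool:
    """Return True only if heartbeats stopped AND the VM has gone quiet on I/O.

    heartbeat_timeout_s is an assumed value for illustration; the real
    interval depends on the cluster's monitoring sensitivity setting.
    """
    heartbeat_lost = status.seconds_since_tools_heartbeat >= heartbeat_timeout_s
    io_quiet = status.seconds_since_disk_or_network_io >= IO_QUIET_WINDOW_S
    return heartbeat_lost and io_quiet


if __name__ == "__main__":
    # Heartbeats stopped, but the guest is still doing I/O: leave it alone.
    print(should_reset_vm(VmStatus(60, 10)))
    # Heartbeats stopped and no I/O for over 120 seconds: treat as failed.
    print(should_reset_vm(VmStatus(60, 200)))
```

The I/O check is there so that a guest whose VMware Tools service has hung, but which is still doing work, does not get reset.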

Bottom line: vMotions occur because of performance events, and HA failovers happen because of availability events.

What you've done is pull the disk out from underneath a running VM. The standard behavior of vSphere, and most hypervisors, in this instance is to leave the virtual machine alone and let it handle its own disk issues. There are several good reasons for this:

  1. Some operating systems/distros (e.g. pfSense) will work just fine if the underlying disk stops responding
  2. A few dozen VMs starting up at the same time tends to create a "thundering herd" problem -- doing this on storage that's already questionable may not end up being the best idea
  3. Like swapping, the operating system (and applications) will usually do a better job of dealing with storage issues than the hypervisor will
  4. Sometimes storage just hangs -- it's the most failure-prone component in most virtualized environments. Best to try to detect it and alert on it and let an administrator figure out what to do with it before you kick over an entire environment

On the other hand, for many workloads (databases come to mind), it's a good idea to shut down as soon as there's a chance corruption or lost transactions might occur. In a best-case scenario, though, since you can't cleanly quiesce the database without the disk, you're probably ending up in an inconsistent state anyway.

Ultimately: there are some good use cases for having HA respond to unreliable storage, but it doesn't do that today, and the behavior you're seeing is totally normal.