How to avoid VMware stunning a client during imaging with Veeam

Solution 1:

This is an HP ProLiant server running with a Smart Array RAID controller without a Flash-backed cache module.

As a result, you have no write cache (or read cache), and operations like snapshots of virtual machines will suffer. You've experienced the effect of this. The current configuration is unsuitable for most workloads, especially virtualization.

Your best option is to simply buy a cache module and battery/FBWC; HP parts 631681-B21, 631679-B21, or 631069-B21.

This will accelerate performance and eliminate the problem you're seeing.

Also see:

FBWC and Zero Memory (ZM) RAID Controller on HP DL360p

BBWC: in theory a good idea but has one ever saved your data?

What is the memory module on a RAID card needed for?

Solution 2:

Answering my own question from research. (I will only accept my own answer if one of these approaches actually works and it's before someone else's suggestion.)

This (older) article, "WHAT ARE THE DANGERS OF SNAPSHOTS AND HOW TO AVOID?", mentions a few possible causes and three preventative measures. Interestingly, it mentions how the issue similarly affects MS SQL Server and other server products.

If you do not want to stun / pause the virtual machine you can set snapshot.maxIterations to 20 (or higher). This means vSphere will do more tries (iterations) to commit the snapshot files. More information in this KB article.

It then goes on to describe the risks and downsides of this approach.

Secondly it suggests:

Alternatively you can set snapshot.maxConsolidateTime to 60 seconds. This means you can accept a pause of the virtual machine for 60 seconds to do a synchronous consolidate. This is often a better option than waiting for the snapshot file to grow so big that the virtual machine has to be stunned for a much longer time.

But I do not know the difference between "stun" and "pause".

And lastly:

ESXi 4.1 has an update which added the parameter snapshot.asyncConsolidate.forceSync = "FALSE", which needs to be added to the VMX file. This setting disables synchronous consolidate and the virtual machine will never be stunned. More info in this KB.

It doesn't describe the potential drawbacks with these solutions, but I'd presume there are some, else they'd be default.

I haven't yet checked if these parameters or solutions are still relevant in v5.
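For anyone wanting to script the change, here is a minimal Python sketch of how the three settings quoted above might be written into a VM's .vmx file. The values are the ones from the article; the function name set_vmx_options and the sample file contents are made up for illustration, and it assumes the usual key = "value" line format with straight quotes. I'd expect the VM to need to be powered off (or the host restarted) before the edited file takes effect, but check that against the KB.

```python
import re

# Values taken from the quoted article; whether they suit a given ESXi
# version is an assumption to verify against the VMware KB.
SNAPSHOT_OPTIONS = {
    "snapshot.maxIterations": "20",
    "snapshot.maxConsolidateTime": "60",
    "snapshot.asyncConsolidate.forceSync": "FALSE",
}


def set_vmx_options(vmx_text, options):
    """Return vmx_text with every key in options set, replacing or appending."""
    seen = set()
    result = []
    for line in vmx_text.splitlines():
        # A VMX entry is assumed to look like: key = "value"
        match = re.match(r'\s*([^#=\s]+)\s*=', line)
        key = match.group(1) if match else None
        if key in options:
            # Overwrite an existing entry in place (keys matched case-sensitively).
            result.append('{} = "{}"'.format(key, options[key]))
            seen.add(key)
        else:
            result.append(line)
    # Append any setting that was not already present in the file.
    for key, value in options.items():
        if key not in seen:
            result.append('{} = "{}"'.format(key, value))
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = 'displayName = "vm1"\nsnapshot.maxIterations = "10"\n'
    print(set_vmx_options(sample, SNAPSHOT_OPTIONS))
```

Working on a copy of the file and keeping a backup of the original seems sensible, since a malformed .vmx can stop the VM from powering on.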

UPDATE: Veeam have recommended we make the changes mentioned above, as listed in this KB, which is relevant to v4 and v5 of ESXi: When removing a snapshot virtual machines become unresponsive for over 30 minutes (2039754)

UPDATE2: We are making these configuration changes tonight and rebooting the host, as it's cheaper and quicker than waiting for the cache. We will then monitor for a few days to see if this alone resolves it for us.
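For the monitoring part, one option is to pull stun durations out of the VM's vmware.log. The following is only a sketch: it assumes the log records each stun with a line containing "Checkpoint_Unstun: vm stopped for N us", which is the shape I've commonly seen but should be checked against your own logs. The function name stun_times and the sample line are made up for illustration.

```python
import re

# Assumed line shape, based on entries commonly seen in vmware.log:
#   <timestamp>| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1299603 us
STUN_RE = re.compile(r"^(\S+?)\|.*Checkpoint_Unstun: vm stopped for (\d+) us")


def stun_times(log_text):
    """Return (timestamp, seconds) pairs for each stun recorded in log_text."""
    results = []
    for line in log_text.splitlines():
        match = STUN_RE.search(line)
        if match:
            # The log reports microseconds; convert to seconds.
            results.append((match.group(1), int(match.group(2)) / 1e6))
    return results


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = ("2014-03-10T20:43:52.261Z| vcpu-0| I120: "
              "Checkpoint_Unstun: vm stopped for 1299603 us\n")
    for timestamp, seconds in stun_times(sample):
        print("{}  stunned for {:.2f} s".format(timestamp, seconds))
```

Comparing the durations from before and after the configuration change should show whether the settings alone made a difference.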