I am encountering a very stubborn VM (2008 R2, VMware Tools slightly outdated, the ones that shipped with 5.5U3a) on an ESXi 6.0U2 cluster running on Dell R630 servers. From the outside, the VM becomes unresponsive after some time - might be a day, might be a week - and it no longer responds to pings, connection requests and so on (it runs an industrial application and some MSSQL). The same behaviour could already be observed when the cluster ran 5.5U3a, though.

So I try to restart the VM via the Web Client or the fat client. Nothing happens - for hours. Next escalation step:

esxcli vm process kill -w <worldID> -t soft

No response, no change. I skip -t hard and go directly to

esxcli vm process kill -w <worldID> -t force

No response either. The VM keeps chugging along, still unresponsive, but the world simply refuses to be killed. There's no error message, either. Rebooting the host the VM runs on is the last resort.
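For reference, this is roughly the sequence (the world ID comes from esxcli vm process list; tailing the VMkernel log in a second SSH session is just a sanity check to see whether the kill request is acknowledged at all - <worldID> is a placeholder):

esxcli vm process list                                     # confirm the VM's world ID and that it is still listed
tail -f /var/log/vmkernel.log                              # in a second session: watch for any reaction to the kill
esxcli vm process kill --world-id=<worldID> --type=force   # retry the kill while watching the log
esxcli vm process list                                     # the wedged world is still listed afterwards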

How can I identify the root cause for this very strange behaviour?


Solution 1:

How can I identify the root cause for this very strange behaviour?

The scientific method is your friend.

  1. Define the problem you want to solve. It looks like you have two (possibly interrelated) issues: the VM becomes unresponsive, and ESXi can't kill it.

  2. Gather data. Look in the logs, your monitoring, etc. for relevant information (see the sketch after this list).

  3. Analyse the data.

  4. Make changes based on your analysis.

  5. Verify the changes work. If they don't, go back to 2 or 3 and gather more data / re-analyse.

  6. Document your findings.
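For step 2 on the ESXi host itself, a minimal data-gathering pass could look like the sketch below. The log paths are the ESXi 6.x defaults; <datastore> and <vmname> are placeholders for wherever the VM's files live.

less /var/log/vmkernel.log                          # VMkernel messages around the hang and the refused kill
less /var/log/vmkwarning.log                        # warnings only, quicker to scan
less /var/log/hostd.log                             # hostd handles the power ops issued from the clients
less /vmfs/volumes/<datastore>/<vmname>/vmware.log  # the VM's own log, next to its .vmx
vm-support                                          # full support bundle if you plan to involve VMware anyway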

Solution 2:

Once you have identified the right process using ps | grep vmx, you can terminate it abruptly via kill -9 <pid>.

Be very careful to select (and kill) the right process. For more information, have a look here.
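A rough sketch of that sequence, assuming an SSH session on the host (<vmx_pid> is a placeholder for the PID of the VM's vmx process; BusyBox ps output differs between ESXi builds, so check the columns before killing anything):

ps | grep vmx        # identify the vmx process belonging to the stuck VM
kill <vmx_pid>       # ask it to terminate gracefully first
sleep 30             # give it some time
kill -9 <vmx_pid>    # then terminate it abruptly
ps | grep vmx        # verify the process is actually gone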

If nothing works then, according to VMware's own documentation, you have to reboot the ESXi host.
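If it does come to that, a sketch of the least disruptive way, assuming the remaining healthy VMs can be vMotioned off first (maintenance mode will typically not complete while the stuck VM is still powered on, so the reboot has to be issued with it running):

# vMotion / migrate all other VMs off the host via the Web Client or DRS first,
# then reboot the host from the shell (or the DCUI):
reboot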