Virtual Machine Reset By HA-VMware Tools Heartbeat failure

Today while investigating one production issue, I came across an incident where a virtual machine restart was performed by HA.

Navigating through task and events for the VM in vCenter, I was seeing message similar to

This virtual machine reset by HA. Reason: VMware Tools heartbeat failure. A screenshot is saved at /vmfs/volumes/4c4850e4-0dcce710-28d9-00215-a5d36b8/XYZ/xyz-0.png

On investigating further, I came to find out that VMware tools running on VM was out of date.

vm-1

I checked for HA settings related to VM monitoring on the cluster  and did not see any aggressive value set there which might have caused the vm restart. The VM monitoring settings were set at 30 secs for failure interval which means that if for continuous 30 seconds VM is unable to send any heartbeat to HA, a restart of VM will be performed.

Keep in mind that this is not related to host failure where all VM running on failed host is restarted on remaining nodes of the cluster. In case of single VM failure, HA will restart the VM on same host.

vm-monitoring-settings

I started my investigation with checking hostd.log file of the Esxi host where VM was running and observed following messages which hints on VM restart

2016-11-06T06:00:52.281Z info hostd[50A40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_ON -> VM_STATE_CREATE_SCREENSHOT)
2016-11-06T06:00:52.353Z info hostd[5114CB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_CREATE_SCREENSHOT -> VM_STATE_ON)
2016-11-05T00:26:30.020Z info hostd[50F44B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Send config update invoked
2016-11-05T00:26:30.049Z info hostd[502C6B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request
2016-11-05T01:58:44.821Z info hostd[50507B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request
2016-11-05T01:58:49.097Z info hostd[50244B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Upgrade is required for virtual machine, version: 10

Next was to check vmware.log for VM XYZ. I observed following log entries for VM reset

2016-11-01T18:13:53.321Z| vmx| I120: Vix: [43327 vmxCommands.c:685]: VMAutomation_Reset. Trying hard reset
 2016-11-01T18:13:53.321Z| vmx| W110:
 2016-11-01T18:13:53.321Z| vmx| W110+
 2016-11-01T18:13:53.321Z| vmx| W110+ VMXRequestReset
 2016-11-01T18:13:53.321Z| vmx| I120: Vigor_Reset: Attaching to reset.
 2016-11-01T18:13:53.321Z| vmx| I120: Stopping VCPU threads...

On digging vmware.log more, I found the messages regarding VMware tools version out of date

2016-10-27T04:25:34.389Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible
2016-10-27T04:25:37.656Z| vcpu-3| I120: ToolsSetVersionWork did nothing; new tools version (9344) matches old Tools version
2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersion: Status is supported old because this is status of monolithic version, and vsock_win2k3 and 25 more components are missing from the guest.
2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersionGetStatusWorkerThread: Tools status 2 derived from environment
2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Compared tools manifest from host and from the guest. Status = 2.
2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Updating the manifest info.
2016-10-27T04:25:37.832Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible
2016-10-27T04:25:37.832Z| vmx| I120: TOOLS installed legacy version 9344, available legacy version 9541
2016-10-27T04:25:37.832Z| vmx| I120: TOOLS manifest update status is 2
2016-10-27T04:25:37.832Z| vmx| I120: TOOLS can be autoupgraded.

Also found VMware tools heartbeat timing out which eventually forced HA to reset the VM

2016-10-27T04:33:10.583Z| vcpu-0| I120: Tools: Tools heartbeat timeout.
2016-10-27T04:34:26.229Z| vcpu-3| I120: GuestMsg: Channel 0, Protocol error, state: 0
2016-10-27T04:34:26.230Z| vcpu-3| I120: GuestMsg: Cannot close channel 0: it is not opened
2016-10-27T04:34:28.597Z| vcpu-0| I120: Guest: toolbox: Version: build-1280544
2016-10-27T04:34:28.597Z| vcpu-0| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2016-10-27T04:34:28.597Z| vcpu-0| I120: GuestRpc: Channel 0, guest application toolbox.
2016-10-27T04:34:29.924Z| vcpu-1| I120: TOOLS autoupgrade protocol version 2

From the above logs it was evident that vmware tools were unable to send heartbeat to HA and which caused HA to perform a VM reset.

VMware KB-1027734 was very helpful to me in carrying out my investigation.

Conclusion: From the above I can conclude that Vmware tools update is needed on the VM so that HA should not try to reset VM again.

I hope this post is informational to you. Feel free to share this on social media if it is worth sharing. Be sociable 🙂

One Comment