Objective 4.3 of the VCAP6-Deploy exam covers the following topics:
- Analyze and resolve DRS/HA faults
- Troubleshoot DRS/HA configuration issues
- Troubleshoot Virtual SAN/HA interoperability
- Resolve vMotion and storage vMotion issues
- Troubleshoot VMware Fault Tolerance
We will discuss each topic one by one.
Analyze and resolve DRS/HA faults
DRS faults can be viewed in the Web Client under Cluster > Monitor > vSphere DRS > Faults.
HA issues can be viewed in the Web Client under Cluster > Monitor > vSphere HA > Configuration Issues.
The Issues tab also lists HA and DRS issues together.
Common DRS faults are:
- Virtual Machine is Pinned: Occurs when DRS can’t move a VM because DRS is disabled on that VM.
- Virtual Machine Not Compatible with Any Host: Occurs when DRS can’t find a host that can run the VM. This might mean that there are not enough physical compute resources or disk available to satisfy the VM’s requirements.
- VM/VM DRS Rule Violated When Moving to Another Host: Occurs when more than one virtual machine running on the same host share an affinity rule with each other and cannot all be moved to another host.
- Host has Virtual Machine That Violates VM/VM DRS Rules: Occurs when moving or powering on a VM that has a VM/VM DRS rule. The VM can still be moved or powered on manually; vCenter Server just will not do it automatically.
- Host has Insufficient Capacity for Virtual Machine: Occurs when the host does not have enough CPU or memory capacity to satisfy the VM’s requirements.
- Host in Incorrect State: Occurs when a host is entering maintenance or standby state while it is needed for a DRS action.
This document from VMware has a full list of DRS faults
Troubleshoot DRS/HA configuration issues
Common HA issues and their resolution
1: Agent Unreachable State: The vSphere HA agent on a host is in the Unreachable state for a minute or more.
Solution: Right-click the host and reconfigure HA. If this does not resolve the issue, check the network configuration.
2: Agent is in the Uninitialized State: The vSphere HA agent on a host is in the Uninitialized state for a minute or more.
Solution: Check the host’s Events tab and look for the “vSphere HA Agent for the host has an error” event. Additionally, check for a firewall issue and verify that no other service on the host is using port 8182. If one is found, stop that service and reconfigure vSphere HA on the host.
3: Agent is in the Initialization Error State: The vSphere HA agent on a host is in the Initialization Error state for a minute or more.
Possible causes and solution:
- Host communication error: Check for network issues between hosts.
- Timeout errors: Possible causes include that the host crashed during the configuration task, the agent failed to start after being installed, or the agent was unable to initialize itself after starting up. Verify that vCenter Server is able to communicate with the host.
- Lack of resources: There should be at least 75 MB of free disk space on the host. If the failure is due to insufficient unreserved memory, free up memory on the host by either migrating VMs to another host or reducing their memory reservations. Once that is done, try reconfiguring HA on the ESXi host.
- Reboot pending: Reboot the host and reconfigure HA.
4: Agent is in the Uninitialization Error State: The vSphere HA agent on a host is in the Uninitialization Error state.
Solution: Remove the host and re-add it to vCenter Server. The host can be added directly to a cluster, or added as a standalone host and later moved into an HA-enabled cluster.
5: Agent is in the Network Partitioned/Isolated State: The vSphere HA agent on a host is in the Network Partitioned state
Solution: Check for networking issues
6: Configuration of vSphere HA on Hosts Times Out: Installation of the HA agent on the host did not complete in a timely manner.
Solution: Add the parameter “config.vpxd.das.electionWaitTimeSec” to the vCenter Server advanced options and set its value to 240.
The following logs are very helpful for troubleshooting HA-related issues
Ports needed by HA:
- vSphere 5.x and later (FDM agent): TCP/UDP 8182 between hosts
- Legacy ESX/ESXi 4.x (AAM agent): inbound TCP/UDP 8042-8045, outbound TCP/UDP 2050-2250
Troubleshoot Virtual SAN/HA interoperability
For vSAN to work, we need a dedicated VMkernel portgroup enabled for vSAN traffic. When both vSAN and HA are enabled on the same cluster, the HA inter-agent traffic flows over the vSAN network rather than the management network. The management network is used by vSphere HA only when Virtual SAN is disabled.
Note: Virtual SAN can only be enabled when vSphere HA is disabled. To disable vSAN on a cluster, we need to disable HA first; only then can vSAN be disabled.
When the network configuration of the vSAN network is changed, the vSphere HA agents do not automatically pick up the new network settings. So if you ever need to make changes to the vSAN network, follow this sequence of steps:
- Disable Host Monitoring for the vSphere HA cluster.
- Make the Virtual SAN network changes.
- Right-click all hosts in the cluster and select Reconfigure for vSphere HA.
- Re-enable Host Monitoring for the vSphere HA cluster.
The screenshot below, from the VMware documentation, shows the networking differences when HA and vSAN coexist
Resolve vMotion and Storage vMotion Issues
Troubleshooting vMotion Issues
Let’s first see how vMotion works. The image below shows a high-level overview of the vMotion workflow:
Graphic thanks to Altaro.com
Before starting to use vMotion, make sure you have met all the prerequisites mentioned in this article
VMware KB-1003734 is a very helpful article when you start troubleshooting vMotion issues. Let’s discuss a few troubleshooting tips.
1: Make sure every host in the cluster has a vMotion-enabled VMkernel port
If vMotion is not enabled on a VMkernel portgroup of an ESXi host and you try to live migrate a VM on that host, you will get the error below
2: vMotion fails at 10%.
See VMware KB-1013150 for more details.
Verify that the advanced setting Migrate.Enabled is set to 1, which vMotion requires. Some backup software temporarily sets Migrate.Enabled to 0 so that VMs are not vMotioned during the backup window. The software normally sets the value back to 1 once backups complete, but a network or storage outage during the backup can prevent that, leaving vMotion disabled.
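On the ESXi shell, the current value can be read with `esxcli system settings advanced list -o /Migrate/Enabled` and set back with `esxcli system settings advanced set -o /Migrate/Enabled -i 1`. As a rough sketch, the Python helper below parses the text of the `list` output and reports whether the setting is 1. The sample output and the function name `migrate_enabled` are assumptions for illustration only; check the exact field layout on your own host.

```python
import re

# Assumed sample of the `esxcli system settings advanced list -o /Migrate/Enabled`
# output. The field layout on a real host may differ.
SAMPLE_OUTPUT = """\
   Path: /Migrate/Enabled
   Type: integer
   Int Value: 0
   Default Int Value: 1
   Description: Enable hot migration support
"""


def migrate_enabled(esxcli_output):
    """Return True if the 'Int Value' field is 1, False if it is any other
    integer, and None if the field is not present in the text."""
    # Anchor at the start of a line so 'Default Int Value' is not matched.
    match = re.search(r"^\s*Int Value:\s*(-?\d+)\s*$", esxcli_output, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) == 1
```

Calling `migrate_enabled(SAMPLE_OUTPUT)` on the sample above returns False, which is the situation a backup job can leave behind after an outage.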
3: Verify VMkernel network connectivity
If VMkernel network connectivity is broken, vMotion will fail. To test connectivity between two hosts, log in to the source ESXi host via SSH and run vmkping against the IP of the VMkernel portgroup used for vMotion on the destination host.
[root@esxi01:~] vmkping -I vmk1 192.168.108.11
PING 192.168.108.11 (192.168.108.11): 56 data bytes
64 bytes from 192.168.108.11: icmp_seq=0 ttl=64 time=0.939 ms
64 bytes from 192.168.108.11: icmp_seq=1 ttl=64 time=0.382 ms
64 bytes from 192.168.108.11: icmp_seq=2 ttl=64 time=1.080 ms
--- 192.168.108.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.382/0.800/1.080 ms
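When several hosts need checking, the summary line of the vmkping output above can be evaluated in a script instead of by eye. This is only a sketch: the function name `packet_loss_percent` is hypothetical, and it assumes the summary keeps the "N% packet loss" wording shown in the run above.

```python
import re


def packet_loss_percent(ping_output):
    """Extract the packet-loss percentage from ping/vmkping summary text.

    Returns a float, or None if no 'N% packet loss' phrase is found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


# Summary line copied from the vmkping run shown above.
SUMMARY = "3 packets transmitted, 3 packets received, 0% packet loss"
```

Any result other than 0.0 for the vMotion VMkernel interface is worth investigating before retrying the migration.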
Also verify that the source ESXi host can connect to the destination host on the vMotion network port (TCP 8000)
[root@esxi01:~] nc -z 192.168.108.11 8000
Connection to 192.168.108.11 8000 port [tcp/*] succeeded!
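If nc is not available where you are testing from, the same TCP reachability check can be sketched in Python. The function name `tcp_port_open` and the default timeout are assumptions for illustration. Unlike `vmkping -I`, this does not pin the source VMkernel interface, so it only proves that a TCP connection can be opened from wherever the script runs.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For the hosts in this post the call would be `tcp_port_open("192.168.108.11", 8000)`.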
4: Dismount unused ISOs
If a virtual machine has a mounted ISO image residing on storage not accessible by the ESXi host where you want the VM migrated to, vMotion will fail. VMware KB-1003780 explains this issue in detail.
To fix this issue, dismount the ISO from the VM before migrating it.
Troubleshooting SDRS and svMotion issues
This article from VMware lists all issues related to SDRS and svMotion. Let’s discuss some of the issues and their possible resolutions.
1: Storage DRS is Disabled on a Virtual Disk: There can be many reasons for SDRS being disabled on a virtual disk, including:
- A specific location is set for the VM’s swap file
- Storage vMotion is disabled
- The main disk of the VM is protected by HA and relocating will cause loss of HA protection
- The disk is independent
- VM is a template
- VM has Fault tolerance enabled
- VM is sharing files between disks
2: Datastore Can’t Enter Maintenance Mode: One or more disks can’t be migrated with Storage vMotion, which prevents the datastore from entering maintenance mode. Reasons for this can be:
- Storage vMotion is disabled on a disk
- Storage DRS rules are preventing Storage DRS from making migrations
Solution:
- Remove or disable any rules that are preventing SDRS from migrating the disks.
- Alternatively, set the SDRS advanced option IgnoreAffinityRulesForMaintenance to 1.
3: Storage DRS Cannot Operate on a Datastore: This can happen for the following reasons:
- Datastore is shared across multiple datacenters
- Datastore is connected to an unsupported host
- Datastore is connected to a host that is not running Storage I/O Control
Solution:
- Make sure the datastore is connected only to hosts that are in the same datacenter.
- Verify all ESXi hosts connected to the datastore cluster are running version 5.0 or above.
- Verify all hosts connected to the datastore cluster have Storage I/O Control enabled.
4: Applying Storage DRS Recommendations Fails: Possible causes are:
- The Thin Provisioning Threshold Crossed alarm is triggered for the target datastore, which means the datastore is running out of space.
- The target datastore might be in maintenance mode
Solution:
- Rectify the issue that caused the Thin Provisioning Threshold Crossed alarm
- Remove the target datastore from maintenance mode
Troubleshoot VMware Fault Tolerance
This article from VMware lists all issues related to FT. Some common issues with fault tolerance are:
1: Compatible Hosts Not Available for Secondary VM: You will see a message when trying to enable FT on VM “Secondary VM could not be powered on as there are no compatible hosts that can accommodate it”
Solution: Make sure the ESXi hosts support FT. Check the VMware HCL to verify this. Alternatively, if hardware virtualization is disabled on the host, enable it.
2: Increased Network Latency Observed in FT Virtual Machines
Solution: Use at least a 1 Gbit NIC for FT logging. As a best practice, a 10 Gbit NIC is advisable.
3: Access to FT metadata datastore was lost: Access to the Fault Tolerance metadata datastore is essential for the proper functioning of an FT VM. Loss of this access can cause a variety of problems including:
- FT can terminate unexpectedly.
- If both the Primary VM and Secondary VM cannot access the metadata datastore, the VMs might fail unexpectedly. Typically, an unrelated failure that terminates FT must also occur when access to the FT metadata datastore is lost by both VMs. vSphere HA then tries to restart the Primary VM on a host with access to the metadata datastore.
- The VM might stop being recognized as an FT VM by vCenter Server. This failed recognition can allow unsupported operations such as taking snapshots to be performed on the VM and cause problematic behavior.
Solution: When planning your FT deployment, place the metadata datastore on highly available storage. While FT is running, if you see that access to the metadata datastore is lost on either the Primary VM or the Secondary VM, promptly address the storage problem before the loss of access causes one of the problems above.
4: Enabling FT on a VM fails: When you select Turn On Fault Tolerance for a powered-on VM, the operation fails and you see an Unknown error message.
Solution: Free up memory resources on the host to accommodate the VM’s memory reservation and the added overhead, or move the VM to a host with ample free memory resources and try again.
5: FT failover fails due to partial storage and network failures or misconfiguration
Solution: Please see this article by VMware
1: Troubleshooting vMotion Connectivity Issues
2: Top 5 vMotion Configuration Mistakes
I hope you find this post informative. Feel free to share it on social media if it is worth sharing.