VCAP6-DCV Deploy Objective 4.3

Objective 4.3 of the VCAP6-Deploy exam covers the following topics:

  • Analyze and resolve DRS/HA faults
  • Troubleshoot DRS/HA configuration issues
  • Troubleshoot Virtual SAN/HA interoperability
  • Resolve vMotion and storage vMotion issues
  • Troubleshoot VMware Fault Tolerance

We will discuss each topic one by one.

                                             Analyze and resolve DRS/HA faults

DRS faults can be viewed from the Web Client by selecting Cluster > Monitor > vSphere DRS > Faults.

clus-1.PNG

HA issues can be viewed from the Web Client by selecting Cluster > Monitor > vSphere HA > Configuration Issues.

clus-2.PNG

Also, the Issues tab shows HA and DRS issues collectively.

clus-4.PNG

Common DRS faults are:

  • Virtual Machine is Pinned: Occurs when DRS can’t move a VM because DRS is disabled on that VM.
  • Virtual Machine Not Compatible with ANY Host: Occurs when DRS can’t find a host that can run the VM. This might mean that there are not enough physical compute resources or disk available to satisfy the VM’s requirements.
  • VM/VM DRS Rule Violated When Moving to Another Host: Occurs when moving the VM to another host would violate a VM/VM DRS (affinity) rule it shares with other virtual machines on the same host.
  • Host has Virtual Machine That Violates VM/VM DRS Rules: Occurs when moving or powering on a VM that has a VM/VM DRS rule. The machine can still be moved manually; vCenter just will not do it automatically.
  • Host has Insufficient Capacity for Virtual Machine: Occurs when the host does not have enough CPU or memory capacity to satisfy the VM’s requirements.
  • Host in Incorrect State: Occurs when a host is entering maintenance or standby mode when it’s needed for DRS.

This document from VMware has a full list of DRS faults.

                                    Troubleshoot DRS/HA configuration issues

Common HA issues and their resolutions:

1: Agent Unreachable State: The vSphere HA agent on a host is in the Unreachable state for more than a minute.

Solution: Right-click the host and select Reconfigure for vSphere HA. If this does not resolve the issue, check the network configuration.
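If reconfiguring does not help, you can also check the HA (FDM) agent state directly from the ESXi shell. A minimal sketch; note that the vmware-fdm init script is only present once HA has been configured on the host:

  # Check whether the FDM (vSphere HA) agent is running
  /etc/init.d/vmware-fdm status

  # Restart a hung agent before trying the reconfigure again
  /etc/init.d/vmware-fdm restart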

2: Agent is in the Uninitialized State: The vSphere HA agent on a host is in the Uninitialized state for a minute or more.

Solution: Check the host’s Events tab and look for a “vSphere HA Agent for the host has an error” event. Additionally, check for a firewall issue: verify that no other service on the host is using port 8182. If one is found, stop that service and reconfigure vSphere HA on the host.

clus-5.PNG
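A quick way to check whether anything is already listening on port 8182 is from the ESXi shell (a minimal sketch using esxcli):

  # List open network connections and filter for the vSphere HA agent port
  esxcli network ip connection list | grep 8182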

3: Agent is in the Initialization Error State: The vSphere HA agent on a host is in the Initialization Error state for a minute or more.

Possible causes and solutions:

  • Host communication errors: Check for network issues between the hosts.
  • Timeout errors: Possible causes include that the host crashed during the configuration task, the agent failed to start after being installed, or the agent was unable to initialize itself after starting up. Verify that vCenter Server is able to communicate with the host.
  • Lack of resources: There should be at least 75 MB of free disk space on the host (a quick check is shown after this list). If the failure is due to insufficient unreserved memory, free up memory on the host by either migrating VMs to another host or reducing their memory reservations. Once that is done, try reconfiguring HA on the ESXi host.
  • Reboot pending: Reboot the host and reconfigure HA.
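As referenced above, free disk space can be verified quickly from the ESXi shell (a minimal sketch):

  # Show free space on the host's mounted volumes
  df -h

  # Show free space on the host ramdisks
  vdf -h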

4: Agent is in the Uninitialization Error State: The vSphere HA agent on a host is in the Uninitialization Error state.

Solution: Remove the host from vCenter Server and re-add it. The host can be added directly to a cluster, or added as a standalone host and later moved into an HA-enabled cluster.

5: Agent is in the Network Partitioned/Isolated State: The vSphere HA agent on a host is in the Network Partitioned or Network Isolated state.

Solution: Check for networking issues between the host and the rest of the cluster.

6: Configuration of vSphere HA on Hosts Times Out: Installation of the HA agent on the host did not complete in a timely manner.

Solution: Add the parameter config.vpxd.das.electionWaitTimeSec to the vCenter Server advanced options and set its value to 240.

clus-6.PNG

The following logs are very helpful when troubleshooting HA-related issues:

  • /var/log/fdm.log
  • /var/log/vmkernel.log
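For example, you can follow the HA agent log live while reproducing the problem (a minimal sketch from the ESXi shell):

  # Watch the vSphere HA (FDM) agent log in real time
  tail -f /var/log/fdm.log

  # Search the log for errors and warnings
  grep -iE "error|warning" /var/log/fdm.log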

Ports needed by HA:

  • Inbound TCP/UDP 8042-8045
  • Outbound TCP/UDP 2050-2250
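The ESXi firewall ships with an fdm ruleset covering these ports; you can verify it is enabled from the shell (a minimal sketch):

  # Confirm the vSphere HA (fdm) firewall ruleset is enabled
  esxcli network firewall ruleset list | grep -i fdm

  # Show the individual port rules within the ruleset
  esxcli network firewall ruleset rule list -r fdm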

                                     Troubleshoot Virtual SAN/HA interoperability

For vSAN to work, we need a dedicated VMkernel port group enabled for vSAN traffic. When both vSAN and HA are enabled on the same cluster, the HA inter-agent traffic flows over the vSAN network rather than the management network. The management network is used by vSphere HA only when vSAN is disabled.

Note: Virtual SAN can only be enabled when vSphere HA is disabled. Likewise, to disable vSAN on a cluster, we need to disable HA first; only then will we be able to disable vSAN.

When any network configuration is changed on the vSAN network, the vSphere HA agents do not automatically pick up the new network settings. So if you ever want to make changes to the vSAN network, here is the sequence of steps:

  1. Disable Host Monitoring for the vSphere HA cluster.
  2. Make the Virtual SAN network changes.
  3. Right-click all hosts in the cluster and select Reconfigure for vSphere HA.
  4. Re-enable Host Monitoring for the vSphere HA cluster.
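After making the change, you can confirm which VMkernel interfaces carry vSAN traffic from the ESXi shell (a minimal sketch):

  # List the VMkernel interfaces tagged for vSAN traffic
  esxcli vsan network list

  # Check the host's vSAN cluster membership state
  esxcli vsan cluster get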

The screenshot below, from the VMware documentation, shows the networking differences when HA and vSAN coexist.

HA-vSAN.png

                                           Resolve vMotion and Storage vMotion Issues

Troubleshooting vMotion Issues

Let’s first see how vMotion works. The image below shows a high-level overview of the vMotion workflow:

vMotion-workflow.png

                                                     Graphic thanks to Altaro.com

Before starting to use vMotion, make sure you have met all prerequisites as mentioned in this article.

VMware KB-1003734 is a very helpful article when starting to troubleshoot vMotion issues. Let’s discuss a few troubleshooting tips.

1: Make sure every host that is part of the cluster has a vMotion-enabled VMkernel port

If vMotion is not enabled on a VMkernel port group of an ESXi host and you try to live migrate a VM to or from that host, you will get the error below.

vmotion-1.PNG
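You can verify, and if needed enable, the vMotion tag on a VMkernel interface from the ESXi shell (a minimal sketch; vmk1 is a placeholder for your vMotion VMkernel interface):

  # Show which services (Management, VMotion, etc.) are tagged on vmk1
  esxcli network ip interface tag get -i vmk1

  # Tag vmk1 for vMotion traffic if the tag is missing
  esxcli network ip interface tag add -i vmk1 -t VMotion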

2: vMotion fails at 10%.

See VMware KB-1013150 for more details.

Verify that Migrate.Enabled is set to 1 for vMotion to work. Backup software configured in your environment tends to change Migrate.Enabled to 0 temporarily so that VMs are not vMotioned during the backup window. Although the backup software changes this value back to 1 once backups are completed, any kind of network or storage outage can prevent the value from being changed back.

vmotion-2.PNG
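You can check and reset this value from the ESXi shell (a minimal sketch):

  # Read the current value of Migrate.Enabled (1 = vMotion allowed)
  esxcfg-advcfg -g /Migrate/Enabled

  # Set it back to 1 if a backup job left it at 0
  esxcfg-advcfg -s 1 /Migrate/Enabled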

3: Verify VMkernel network connectivity

If VMkernel network connectivity is not right, vMotion will definitely fail. To test connectivity between two hosts, log in to ESXi via SSH and perform a vmkping to the IP of the VMkernel port specified for vMotion on the destination host.

Also verify that the source ESXi host can connect to the destination host over the vMotion network port (8000). For example:
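A minimal sketch from the source host; vmk1 and 192.168.10.20 are placeholders for your vMotion VMkernel interface and the destination host’s vMotion IP:

  # Ping the destination's vMotion IP through the local vMotion VMkernel port
  vmkping -I vmk1 192.168.10.20

  # On jumbo-frame vMotion networks, also test with a large, non-fragmented packet
  vmkping -I vmk1 -s 8972 -d 192.168.10.20

  # Verify the destination is reachable on the vMotion port (8000)
  nc -z 192.168.10.20 8000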

4: Dismount unused ISOs

If a virtual machine has a mounted ISO image residing on storage that is not accessible to the ESXi host you want the VM migrated to, vMotion will fail. VMware KB-1003780 explains this issue in detail.

Make sure to dismount the ISO from the VMs to fix this issue.

Troubleshooting SDRS and svMotion issues

This article from VMware lists all issues related to SDRS and svMotion. Let’s discuss some of the issues and their possible resolutions.

1: Storage DRS is Disabled on a Virtual Disk: There can be many reasons for Storage DRS being disabled on a virtual disk, including:

  • The VM’s swap file location is explicitly specified for the VM.
  • Storage vMotion is disabled.
  • The main disk of the VM is protected by HA, and relocating it would cause loss of HA protection.
  • The disk is independent.
  • The VM is a template.
  • The VM has Fault Tolerance enabled.
  • The VM is sharing files between disks.

2: Datastore Can’t Enter Maintenance Mode: One or more disks can’t be migrated with Storage vMotion, which prevents the datastore from entering maintenance mode. Reasons for this can be:

  • Storage vMotion is disabled on a disk
  • Storage vMotion rules are preventing Storage DRS from making migrations

Solution:

  • Remove or disable any rules that are preventing Storage DRS migrations.
  • Set the SDRS advanced option IgnoreAffinityRulesForMaintenance to 1.

3: Storage DRS Cannot Operate on a Datastore: This can happen for the reasons below:

  • Datastore is shared across multiple datacenters.
  • Datastore is connected to an unsupported host.
  • Datastore is connected to a host that is not running Storage I/O Control.

Solution:

  • The datastore should be connected only to hosts that are in the same datacenter.
  • Verify that all ESXi hosts connected to the datastore cluster are running version 5.0 or above.
  • Verify that all hosts connected to the datastore cluster have Storage I/O Control enabled.

4: Applying Storage DRS Recommendations Fails: This can happen for the reasons below:

  • The Thin Provisioning Threshold Crossed alarm is triggered for the target datastore, which means the datastore is running out of space.
  • The target datastore might be in maintenance mode.

Solution:

  • Rectify the issue that caused the Thin Provisioning Threshold Crossed alarm.
  • Remove the target datastore from maintenance mode.

                                                  Troubleshoot VMware Fault Tolerance

This article from VMware lists all issues related to FT. Some common issues with Fault Tolerance are:

1: Compatible Hosts Not Available for Secondary VM: When trying to enable FT on a VM, you will see the message “Secondary VM could not be powered on as there are no compatible hosts that can accommodate it”.

Solution: Make sure the ESXi hosts support FT; check the VMware HCL to verify this. Alternatively, if hardware virtualization is disabled on the host, enable it in the BIOS.
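You can check hardware virtualization (HV) support from the ESXi shell (a minimal sketch; an HV Support value of 3 indicates hardware virtualization is supported and enabled):

  # Check the hardware virtualization support status on the host
  esxcfg-info | grep "HV Support"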

2: Increased Network Latency Observed in FT Virtual Machines

Solution: Use at least a 1 Gb NIC for FT logging. As a best practice, it is advisable to use 10 Gb NICs.
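You can confirm which VMkernel interface is tagged for FT logging from the ESXi shell (a minimal sketch; vmk2 is a placeholder for your FT logging interface):

  # Show the service tags on the interface used for FT logging
  esxcli network ip interface tag get -i vmk2

  # Tag the interface for Fault Tolerance logging if needed
  esxcli network ip interface tag add -i vmk2 -t faultToleranceLogging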

3: Access to FT metadata datastore was lost: Access to the Fault Tolerance metadata datastore is essential for the proper functioning of an FT VM. Loss of this access can cause a variety of problems, including:

  • FT can terminate unexpectedly.

  • If both the Primary VM and Secondary VM cannot access the metadata datastore, the VMs might fail unexpectedly. Typically, an unrelated failure that terminates FT must also occur when access to the FT metadata datastore is lost by both VMs. vSphere HA then tries to restart the Primary VM on a host with access to the metadata datastore.

  • The VM might stop being recognized as an FT VM by vCenter Server. This failed recognition can allow unsupported operations such as taking snapshots to be performed on the VM and cause problematic behavior.

Solution: When planning your FT deployment, place the metadata datastore on highly available storage. While FT is running, if you see that the access to the metadata datastore is lost on either the Primary VM or the Secondary VM, promptly address the storage problem before loss of access causes one of the previous problems.

4: Enabling FT on a VM fails: When you select Turn On Fault Tolerance for a powered-on VM, the operation fails and you see an “Unknown error” message.

Solution: Free up memory resources on the host to accommodate the VM’s memory reservation and the added overhead, or move the VM to a host with ample free memory resources and try again.

5: FT failover fails due to partial storage and network failures or misconfiguration

Solution: Please see this article by VMware.

Additional Reading

1: Troubleshooting vMotion Connectivity Issues

2: Top 5 vMotion Configuration Mistakes

I hope you found this post informative. Feel free to share it on social media if you think it is worth sharing. Be sociable 🙂
