VCAP6-DCV Deploy Objective 2.3

Objective 2.3 of the VCAP6-Deploy exam covers the following topics:

  • Analyze and resolve storage multi-pathing and failover issues
  • Troubleshoot storage device connectivity
  • Analyze and resolve Virtual SAN configuration issues
  • Troubleshoot iSCSI connectivity issues
  • Analyze and resolve NFS issues
  • Troubleshoot RDM issues

Let's discuss each topic one by one.

                               Analyze and resolve storage multi-pathing and failover issues

There can be hundreds of reasons for multipathing and failover issues, and troubleshooting them comes only with experience. Multipathing issues can be caused by problems on the storage side (SAN switch, fibre configuration, etc.) or on the vSphere side. In this post we will focus only on vSphere-side troubleshooting.

In my lab I am using an Openfiler appliance for shared storage, and my vSphere hosts are configured to use software iSCSI to reach Openfiler. Each host has 2 physical adapters mapped to two distinct portgroups configured for the iSCSI connection, and both portgroups are compliant with the iSCSI Port Binding settings.

VMware KB-1027963 explains the storage path failover sequence in vSphere in great detail. Messages about path failover are recorded in /var/log/vmkernel.log.

Change multipathing policy and Enable/disable paths manually

Multipathing policies can be changed and path failover can be triggered manually via the Web Client or the ESXi shell.

Changing the Multipathing Policy: Select an ESXi host from the inventory, navigate to Manage > Storage > Storage Devices, and select a device from the list.

Go to the Properties tab and select Edit Multipathing.

mp-1.PNG

Select one of Fixed, Most Recently Used (MRU), or Round Robin and hit OK. To learn more about these policies in detail, please refer to this article.

mp-2.PNG

Refresh the Web Client to ensure the policy change has taken effect.

Enable/Disable a Path: To enable or disable a path manually, go to the Paths tab instead of the Properties tab of the selected storage device.

Select a path and, if it is Active, click the Disable button. If a path is already disabled, the Enable button will be highlighted instead.

mp-3.PNG

Change MultiPathing Policy from Command Line

Connect to the ESXi host over SSH, log in as root, and run the command below to change the multipathing policy:

# esxcli storage nmp device set -d <naa_id_of_device> -P <path_policy>

For example, to change the multipathing policy of a LUN from MRU to Fixed, you need to run the command below:

# esxcli storage nmp device set -d t10.F405E46494C45425645447059546D2E6256413D213E61776 -P VMW_PSP_FIXED

Note: The device identifier/naa_id can be obtained from the Web Client or via the command: esxcli storage nmp device list
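The -P argument takes the NMP plugin name rather than the friendly policy name shown in the Web Client. Here is a small helper (an illustrative sketch, not an official tool) mapping the display names used above to the in-box PSP plugin names esxcli expects:

```shell
# Map the path policy display name to the NMP plugin name that
# `esxcli storage nmp device set -P` expects.
psp_name() {
  case "$1" in
    "Fixed")              echo VMW_PSP_FIXED ;;
    "Most Recently Used") echo VMW_PSP_MRU ;;
    "Round Robin")        echo VMW_PSP_RR ;;
    *) echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

psp_name "Round Robin"   # -> VMW_PSP_RR
```

VMW_PSP_FIXED, VMW_PSP_MRU, and VMW_PSP_RR are the three standard Path Selection Plugins shipped with ESXi.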

Disable a path via command line

To disable a path via the CLI, run the command below:

# esxcli storage core path set --state=off --path=vmhba33:C0:T0:L0

and you will see the path go dead:

mp-4.PNG

To enable the path again, run command:

# esxcli storage core path set --state=active --path=vmhba33:C0:T0:L0

In /var/log/vmkernel.log you will see the following log entry for this event:

2017-12-07T15:19:02.813Z cpu3:102867 opID=87ff36cf)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from NONE to vmhba33:C0:T0:L0 for device "t10.F405E46494C45425645447059546D2E6256413D213E61776".

In hostd.log you will see the following:

2017-12-07T15:29:27.965Z info hostd[6DA80B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 302 : Path redundancy to storage device t10.F405E46494C45425645447059546D2E6256413D213E61776 (Datastores: iscsi-3) restored. Path vmhba33:C0:T0:L0 is active again.
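When digging through these logs on a busy host, a quick grep can pull out just the runtime path names involved in failover events. The sketch below filters the two log lines shown above (embedded here as sample data so it can run anywhere); on a live host you would pipe /var/log/vmkernel.log or /var/log/hostd.log into the same filter:

```shell
# Extract runtime path names (vmhbaX:C#:T#:L#) from log lines.
extract_paths() {
  grep -o 'vmhba[0-9]*:C[0-9]*:T[0-9]*:L[0-9]*' | sort -u
}

# Sample data: the vmkernel.log and hostd.log entries shown above.
sample_logs() {
cat <<'EOF'
2017-12-07T15:19:02.813Z cpu3:102867 opID=87ff36cf)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from NONE to vmhba33:C0:T0:L0 for device "t10.F405E46494C45425645447059546D2E6256413D213E61776".
2017-12-07T15:29:27.965Z info hostd[6DA80B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 302 : Path redundancy to storage device t10.F405E46494C45425645447059546D2E6256413D213E61776 (Datastores: iscsi-3) restored. Path vmhba33:C0:T0:L0 is active again.
EOF
}

sample_logs | extract_paths   # -> vmhba33:C0:T0:L0
```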

Changing the Default Pathing Policy

Check the current policy:

# esxcli storage nmp satp list

mp-5.PNG

Change the Default Path Policy: 

# esxcli storage nmp satp set --default-psp=<policy> --satp=<satp_name>

Monitoring Storage Performance

Storage performance can be monitored via esxtop. VMware KB-1008205 lists the steps to identify storage performance issues.

HBA view in my lab:

esxtops-1

Device view:

esxtops-2

                                               Troubleshoot storage device connectivity

The VMware vSphere Troubleshooting guide explains in detail what to check when troubleshooting storage connectivity. A few of the checks are:

  • Check cable connectivity
  • Check zoning for FC
  • Check access control config – for iSCSI, additionally check that CHAP, IP-based filtering, and initiator name-based access control are set up correctly
  • Make sure the cross patching between storage controllers is correct
  • After any changes, rescan the HBA/host

The maximum queue depth can be changed on an ESXi host. If an ESXi host generates more commands to a LUN than the LUN queue depth can handle, the excess commands are queued in the VMkernel, which increases latency.

To change the queue depth on an FC HBA, run the following and reboot the host:

esxcli system module parameters set -p parameter=value -m module

For iSCSI, run the following:

esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=value

Some useful CLI commands:

  • Get info about the storage device you are trying to troubleshoot: esxcli storage core device list
  • Check the multipath configuration with the esxcfg-mpath command, for example: esxcfg-mpath -b -d naa.60014054ddcf82083c44f8da7394198a
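The runtime path names these commands print (such as vmhba33:C0:T0:L0) encode the adapter, channel, target, and LUN. A tiny helper to split one apart (a sketch for scripting convenience, not an official tool):

```shell
# Split a runtime path name (vmhbaN:C#:T#:L#) into its components
# using POSIX parameter expansion.
split_path() {
  adapter=${1%%:*}        # e.g. vmhba33
  rest=${1#*:}
  channel=${rest%%:*}     # e.g. C0
  rest=${rest#*:}
  target=${rest%%:*}      # e.g. T0
  lun=${rest#*:}          # e.g. L0
  echo "adapter=$adapter channel=$channel target=$target lun=$lun"
}

split_path vmhba33:C0:T0:L0
# -> adapter=vmhba33 channel=C0 target=T0 lun=L0
```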

                                           Analyze and Resolve vSAN Configuration Issues

Host with the vSAN Service enabled is not in the vCenter Cluster: Add the host to the vSAN Cluster

Host is in a vSAN-enabled cluster but does not have the vSAN service enabled: Verify that the network on the host is correct and configured properly.

vSAN network is not configured: Configure the vSAN network on the virtual switches that will connect the vSAN cluster.

Host cannot communicate with all other nodes in the vSAN enabled cluster: Check for network isolation.

The following commands are useful for retrieving vSAN info:

# esxcli vsan network list

vSAN VMkernel portgroup settings:

# esxcli network ip interface list -i <vSAN VMkernel PG>

Check the vSAN cluster status:

# esxcli vsan cluster get

                                                Troubleshoot iSCSI Connectivity Issues

Common issues that can arise in an environment based on iSCSI storage are:

1: No targets from an array are seen by all ESXi hosts, by a subset of hosts, or by a single host.

2: Targets on the array are visible but one or more LUNs are not visible

3: An iSCSI LUN is not visible

4: An iSCSI LUN cannot connect

5: There are connectivity issues to the storage array

6: A LUN is missing

The basic troubleshooting steps for diagnosing the above-mentioned issues are outlined below.

1: Verify connectivity between host and storage array

[root@esxi01:~] vmkping -I vmk2 192.168.106.6
PING 192.168.106.6 (192.168.106.6): 56 data bytes
64 bytes from 192.168.106.6: icmp_seq=0 ttl=64 time=0.432 ms
64 bytes from 192.168.106.6: icmp_seq=1 ttl=64 time=0.450 ms
64 bytes from 192.168.106.6: icmp_seq=2 ttl=64 time=0.378 ms

2: Verify the ESXi host can reach the storage array on port 3260

[root@esxi01:~] nc -z 192.168.106.6 3260
Connection to 192.168.106.6 3260 port [tcp/*] succeeded!

3: Verify Port Binding Settings

For iSCSI-based storage, verify the port binding configuration and make sure the paths are active and compliant.

port-binding

4: Verify that large packets can be sent to the storage array (if jumbo frames are configured)

[root@esxi01:~] vmkping -s 1500 192.168.109.6 -d
PING 192.168.109.6 (192.168.109.6): 1500 data bytes
sendto() failed (Message too long)
sendto() failed (Message too long)
sendto() failed (Message too long)
--- 192.168.109.6 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
[root@esxi01:~] vmkping -s 1200 192.168.109.6 -d
PING 192.168.109.6 (192.168.109.6): 1200 data bytes
1208 bytes from 192.168.109.6: icmp_seq=0 ttl=64 time=0.736 ms
1208 bytes from 192.168.109.6: icmp_seq=1 ttl=64 time=0.390 ms
1208 bytes from 192.168.109.6: icmp_seq=2 ttl=64 time=0.373 ms
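The first vmkping above fails because -s sets the ICMP payload size, and with -d (don't fragment) the payload plus the 20-byte IPv4 header and 8-byte ICMP header must fit within the link MTU. The largest payload that fits is therefore MTU minus 28, which is easy to compute (assuming plain IPv4 with no extra encapsulation overhead):

```shell
# Largest ICMP payload that fits in one unfragmented frame:
# MTU minus 20-byte IPv4 header minus 8-byte ICMP header.
max_vmkping_payload() {
  echo $(( $1 - 20 - 8 ))
}

max_vmkping_payload 1500   # standard MTU  -> 1472
max_vmkping_payload 9000   # jumbo frames  -> 8972
```

So with jumbo frames end to end, `vmkping -d -s 8972 <target_IP>` is the usual test; anything larger will fail with "Message too long", exactly as shown above.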

5: Ensure that the LUNs are presented to the ESXi/ESX hosts.

On the array side, ensure that the LUN IQNs and access control list (ACL) allow the ESXi/ESX host HBAs to access the array targets. For instructions, please refer to VMware KB-1003955.

6: Ensure that the host ID on the array for the LUN does not exceed 255. The maximum LUN ID is 255, and any LUN with a host ID greater than 255 may not show as available under Storage Adapters.
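When auditing array-side configuration with a script, this range constraint is a one-liner to check (a trivial sketch):

```shell
# Succeeds (exit 0) only if the LUN/host ID is within the 0-255 range
# that ESXi supports.
lun_id_ok() {
  [ "$1" -ge 0 ] && [ "$1" -le 255 ]
}

lun_id_ok 100 && echo "LUN 100 ok"
lun_id_ok 300 || echo "LUN 300 out of range"
```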

7: Verify that the host HBAs are able to access the shared storage. For instructions, please refer to VMware KB-1003973.

8: Verify CHAP authentication settings

If CHAP is configured on the array, ensure that the authentication settings for the ESXi/ESX hosts are the same as the settings on the array. VMware KB-1004029 has instructions for checking CHAP settings.

9: Verify that the storage array being used is listed on the Storage/SAN Compatibility Guide. For instructions, please refer to VMware KB-1003916.

                                              Analyze and resolve NFS issues

Common issues that can arise with NFS-based storage are:

1: The NFS share cannot be mounted by the ESX/ESXi host.

2: The NFS share is mounted, but nothing can be written to it. The following log entries can be seen for this issue:

a) NFS Error: Unable to connect to NFS server

b) WARNING: NFS: 983: Connect failed for client 0xb613340 sock 184683088: I/O error

c) WARNING: NFS: 898: RPC error 12 (RPC failed) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (xxx.xxx.xxx.xxx)

Since NFS is also IP-based storage, troubleshooting NFS issues starts with checking:

1: Test NFS server connectivity: Verify that the ESX host can vmkping the NFS server. Also verify that the NFS server can ping the VMkernel IP of the ESX host.

2: Verify the port on the NFS server is open. The default port is 2049:

# nc -z <NFS-Server-IP> <NFS-Port> 

3: Verify the firewall on both the NFS server side and the ESXi host side; on both sides the firewall should be configured to pass NFS traffic. VMware KB-1007352 explains more on this.

4: Check the vSwitch configuration and verify that the correct VLAN and MTU are specified on the portgroup designated for NFS traffic. If the MTU is set to anything other than 1500 or 9000, test the connectivity using the vmkping command:

# vmkping -I vmkN -s <mtu_size> <NFS_Server_IP>

5: Verify permission to mount the NFS exported filesystem.

6: Check the export configuration on the NFS server. The NFS mount should have rw access for the ESXi host's subnet.

7: Verify proper authentication is set if using NFS 4.1.

8: If NFS is configured inside a Windows server, verify it is configured properly. Use VMware KB-1004490 for troubleshooting steps.

9: For troubleshooting mount-related issues, enable the nfsstat3 service for enhanced logging. By default this is disabled.

Verify the current setting:

[root@esxi01:~] esxcfg-advcfg -g /NFS/LogNfsStat3
Value of LogNfsStat3 is 0

Enable the nfsstat3 service:

[root@esxi01:~] esxcfg-advcfg -s 1 /NFS/LogNfsStat3
Value of LogNfsStat3 is 1
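When checking this setting across many hosts, the numeric value can be parsed out of the esxcfg-advcfg output shown above. A sketch of that filter, run here against the literal output string so it works anywhere:

```shell
# Pull the numeric value out of esxcfg-advcfg output of the form:
#   Value of LogNfsStat3 is 0
advcfg_value() {
  awk '{print $NF}'   # last whitespace-separated field is the value
}

echo "Value of LogNfsStat3 is 0" | advcfg_value   # -> 0
```

On a live host the same filter would be fed directly: `esxcfg-advcfg -g /NFS/LogNfsStat3 | advcfg_value`.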

                                                     Troubleshoot RDM issues

Storage vendors might require that VMs with RDMs ignore SCSI INQUIRY data cached by ESXi. When a host first connects to the target storage device, it issues the SCSI INQUIRY command to obtain basic identification data from the device, which ESXi caches; that cached data then remains unchanged.

To configure a VM with an RDM to ignore the SCSI INQUIRY cache, add the following to the .vmx file:

scsix:y.ignoreDeviceInquiryCache = "true"

where x is the SCSI controller number and y is the SCSI target number of the RDM.
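If you script VM reconfiguration, the .vmx line can be generated from the controller and target numbers. A simple sketch (the 0 and 1 below are example values; use the numbers from the RDM's virtual device node):

```shell
# Build the .vmx entry for a given SCSI controller (x) and target (y).
inquiry_cache_line() {
  printf 'scsi%s:%s.ignoreDeviceInquiryCache = "true"\n' "$1" "$2"
}

inquiry_cache_line 0 1
# -> scsi0:1.ignoreDeviceInquiryCache = "true"
```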

And that’s it for this post.

I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing. Be sociable 🙂