VM replication state going to ‘Not Active’ as soon as vm is configured for replication

After a long back I dived into my lab yesterday and was working again on setting up DR to vCloud Air.I have wrote few blog posts in past on vSphere Replication and DR to vCloud Air using VR. If you are interested in reading those posts, you can read them from below links

This time in my lab I noticed that all my VM’s replication state were going to ‘Not Active’ as soon as its configured for replication. This was a brand new setup and I was sure that I followed all required steps in deploying and configuring the replication appliance. This was strange as I never faced this issue before.

in hms.log on my on-prem VR appliance I was seeing the same messages about replication not active

hms.log snippet

***************************************************************************

2017-03-12 09:22:21.928 TRACE hms.replication.primaryGroup.status [hms-main-thread-13] (..db.entities.PrimaryGroupEntity) operationID=f7c007c1-53d1-4264-a5aa-d283adcb2451-HMS-2625 | The status of group CGID-5f550000-fce6-48d0-8424-9420f788b2c1 changed from [configuring; -1] to [notActive; 6]

2017-03-12 09:22:25.156 DEBUG com.vmware.hms.gateway.vcd.impl.VcdHmsTunnelServiceImpl [hms-eventlog-commons-thread-2] (..vcd.impl.VcdHmsTunnelServiceImpl) | hmsEvent:SecondaryGroupStatusUpdatedEvent[sourceType: SecondaryGroupEntity; sourceId: CGID-5f550000-fce6-48d0-8424-9420f788b2c1; sourceVersion: 6; sourceSiteId: 6456b4af-3dbb-4317-9d3a-8db9f1ec5a01; destinationSiteId: 18ea3b8b-7770-4052-853a-6e50de32be6e; groupMoid: CGID-5f550000-fce6-48d0-8424-9420f788b2c1; status: notActive]

***************************************************************************

Next was to check state of replication fropm the Esxi host where the problematic VM was residing. Following commands were used to determine replication state/status

[root@galvatron:~] vim-cmd hbrsvc/vmreplica.getState 38
Retrieve VM running replication state:
The VM is configured for replication. Current replication state: Group: CGID-5f550000-fce6-48d0-8424-9420f788b2c1 (generation=43489863512893282)
Group State: full sync (0% done: checksummed 0 bytes of 40 GB, transferred 0 bytes of 0 bytes)
DiskID RDID-2129ebf8-8ca4-4711-80f4-be263741f27c State: full sync (checksummed 0 bytes of 40 GB, transferred 0 bytes of 0 bytes)

[root@galvatron:~] vim-cmd hbrsvc/vmreplica.queryReplicationState 38
Querying VM running replication state:
Current replication state:
State: syncing
Progress: 0% (checksum: 0/42949672960; transfer: 0/0)

[root@galvatron:~] vim-cmd hbrsvc/vmreplica.getConfig 38
Retrieve VM replication configuration:
The VM is configured for replication with the following options:
VM Replication ID = CGID-5f550000-fce6-48d0-8424-9420f788b2c1
Destination IP Address = 192.168.109.100
Destination Port = 10000
Recovery Point Objective = 240
Quiesce Guest OS = true
Enable Opportunistic Updates = false
Network Compression = true
Paused for Replication = false

Disk scsi0:0 is configured for replication:
Device key = 2000
Replication ID = RDID-2129ebf8-8ca4-4711-80f4-be263741f27c

I started troubleshooting issue at the host level. First I verified that necessary ports were open and VR VIB was installed on host:

1: Esxi host wasreachable to VR appliance on port 10000

[root@galvatron:~] nc -z 192.168.109.100 10000-10010
Connection to 192.168.109.100 10000 port [tcp/*] succeeded!

2: Firewall was configured to allow Replication traffic

[root@galvatron:~] esxcli network firewall ruleset list | grep Replication
Replication-to-Cloud Traffic true

3: Port 31031/44046 were allowed in firewall

firewall.xml snippet from Esxi host

*******************************************************

<service id=’0020′>
<id>HBR</id>
<!– vSphere Replication traffic initiated from VMkernel on
port 31031 (initial sync), and port 44046 (on-going replication).
See PR 610578. –>
<rule id=’0000′>
<direction>outbound</direction>
<protocol>tcp</protocol>
<porttype>dst</porttype>
<port>31031</port>
</rule>
<rule id=’0001′>
<direction>outbound</direction>
<protocol>tcp</protocol>
<porttype>dst</porttype>
<port>44046</port>
</rule>
<enabled>true</enabled>
<required>false</required>
</service>

*********************************************************

4: Replication VIB Installed on host

[root@galvatron:~] esxcli software vib list | grep vr2c
vr2c-firewall 6.1.0.12929-3656323 VMware VMwareCertified 2017-03-09

5: Verified from esxupdate.log that vr2c-firewall VIB was installed without any errors

2017-03-09T16:52:58Z esxupdate: esxupdate: INFO: —
Command: update
Args: [‘update’]
Options: {‘nosigcheck’: None, ‘retry’: 5, ‘loglevel’: None, ‘cleancache’: None, ‘viburls’: [‘https://192.168.109.100:8043/vib/vr2c-firewall.vib’], ‘meta’: None, ‘proxyurl’: ‘vcentersrv01.alex.local:80’, ‘timeout’: 30.0, ‘cachesize’: None, ‘hamode’: True, ‘maintenancemode’: None}

2017-03-09T16:52:59Z esxupdate: downloader: DEBUG: Downloading from https://192.168.109.100:8043/vib/vr2c-firewall.vib

2017-03-09T16:52:59Z esxupdate: Transaction: INFO: Final list of VIBs being installed: VMware_bootbank_vr2c-firewall_6.1.0.12929-3656323

2017-03-09T16:52:59Z esxupdate: imageprofile: INFO: Adding VIB VMware_bootbank_vr2c-firewall_6.1.0.12929-3656323 to ImageProfile (Updated) ESXi-6.0.0-20160302001-standard

2017-03-09T16:52:59Z esxupdate: HostImage: INFO: Attempting to download VIB vr2c-firewall

2017-03-09T16:52:59Z esxupdate: downloader: DEBUG: Downloading from https://192.168.109.100:8043/vib/vr2c-firewall.vib…

2017-03-09T16:52:59Z esxupdate: LiveImageInstaller: INFO: Firewall Refresh Trigger is required to install/remove VMware_bootbank_vr2c-firewall_6.1.0.12929-3656323

2017-03-09T16:53:00Z esxupdate: LiveImageInstaller: DEBUG: Starting to enable VIBs: VMware_bootbank_vr2c-firewall_6.1.0.12929-3656323

2017-03-09T16:53:00Z esxupdate: LiveImageInstaller: DEBUG: Live installing vr2c-firewall-6.1.0.12929-3656323

2017-03-09T16:53:00Z esxupdate: LiveImageInstaller: DEBUG: Enabling VIBs is done.

Next was to verify things are fine on VR appliance. Performed following checks/troubleshooting on VR appliance

1: VR appliance is listening on 10000

vrms01:/opt/vmware/logs/hms # netstat -tunlp | grep 100[01][0-9]
tcp 0 0 192.168.109.100:10000 :::* LISTEN 3630/java

2: VR appilance is able to hit Cloud provider address

vrms01:/opt/vmware/logs/hms # curl -kv https://au-south-1-15.vchs.vmware.com/api/compute
* About to connect() to au-south-1-15.vchs.vmware.com port 443 (#0)
* Trying 144.130.48.10… connected
* Connected to au-south-1-15.vchs.vmware.com (144.130.48.10) port 443 (#0)
* successfully set certificate verify locations:

3: VR appliance able to hit VCD and Cloud Proxy VIP

vrms01:/opt/vmware/logs/hms # curl -kv https://p15v3-vcd.vchs.vmware.com:443

* About to connect() to p15v3-vcd.vchs.vmware.com port 443 (#0)
* Trying 144.130.48.27… connected
* Connected to p15v3-vcd.vchs.vmware.com (144.130.48.27) port 443 (#0)
* successfully set certificate verify locations:

vrms01:/opt/vmware/logs/hms # curl -kv https://p15v3-cldprxy.vchs.vmware.com:443

* About to connect() to p15v3-cldprxy.vchs.vmware.com port 443 (#0)
* Trying 144.130.48.40… connected
* Connected to p15v3-cldprxy.vchs.vmware.com (144.130.48.40) port 443 (#0)
* successfully set certificate verify locations:

4: Server status reported up on cloud side

vrms01 # curl -sk https://p15v3-vcd.vchs.vmware.com/api/server_status
Service is up.

vrms01 # curl -sk https://p15v3-cldprxy.vchs.vmware.com/api/server_status
Service is up.

There was nothing standing out on VR appliance, so I focussed my troubleshooting back on Esxi host where my VM was residing.

First I checked hostd.log and did not found any log messages thre which would suggest what went went wrong. Following relevant messages were observed in hostd.log

2017-03-12T03:52:21.671Z info hostd[3C980B70] [Originator@6876 sub=Hbrsvc] ReplicationGroup added disk (groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1)
(diskID=RDID-2129ebf8-8ca4-4711-80f4-be263741f27c)

2017-03-12T03:52:21.678Z info hostd[3C980B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk extracted state: sync (pauseFlag=0 needFullSync=0 attachAction=full sync) (diskID=RDID-2129ebf8-8ca4-4711-80f4-be263741f27c) (groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1) (lastOpID=0)

2017-03-12T03:52:21.679Z info hostd[3C980B70] [Originator@6876 sub=Hbrsvc] ReplicatedDisk: Extract state succeeded (diskID=RDID-2129ebf8-8ca4-4711-80f4-be263741f27c) (vmID=38) (groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1) (state=full sync)

2017-03-12T03:52:21.679Z info hostd[3C980B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 1104 : Full sync started by System for virtual machine WinSrv-2012 on host galvatron in cluster galvatron. in ha-datacenter.

2017-03-12T03:52:21.679Z info hostd[3C980B70] [Originator@6876 sub=Hbrsvc] ReplicationGroup initialized replication successfully (state=full sync) (groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1)

2017-03-12T03:52:21.715Z info hostd[3C580B70] [Originator@6876 sub=Vimsvc.TaskManager opID=f7c007c1-53d1-4264-a5aa-d283adcb2451-HMS-2625-33-75-a83e user=vpxuser:ALEXvcadmin] Task Completed : haTask–vim.HbrManager.enableReplication-3100361 Status success

I was started to get impatient now as log files were not suggesting anything. I decided to take help of Dr. Google now and came across VMware KB-2066230

As per this KB “After upgrading to VMware ESXi 5.5, virtual machines protected with vSphere Replication display a Non-Active status“. In my lab I did not upgraded any host as they were freshly deployed with Esxi v6.0U2, but I decided to give a shot to troubleshooting steps mentioned in this KB.

After reading the KB, I checked vmkernel.log on the Esxi host and observed same errors as mentioned in KB.

2017-03-12T05:43:06.817Z cpu2:63352327)WARNING: Hbr: 549: Connection failed to 192.168.109.100 (groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1): Timeout

2017-03-12T05:43:06.817Z cpu2:63352327)WARNING: Hbr: 4573: Failed to establish connection to [192.168.109.100]:10000(groupID=CGID-5f550000-fce6-48d0-8424-9420f788b2c1): Timeout

So clearly my Esxi host was not able to talk to my VR appliance.

As per workaround suggested in KB, I removed the line reading : /net/vmkernelnic/child[000#]/tags/4 = “true” from esx.conf file and rebooted my Esxi host and post reboot as soon as I placed my problematic VM “WinSrv-2012” on this host, I saw full sync started for this VM and after some time it was replicated to vCloud Air without any further issues.