This post is very similar to issue described in my last post. The only difference in last issue and this was I was not able to redeploy edge gateway to get rid of stubborn Org Networks whereas in previous case Edge redeploy fixed the issue quite comfortably.
Let me start with a little bit background of how was this issue discovered and what challenges I faced. I was working investigating a failed deprovision issue when this issue was discovered. Deprovision tasks in our environment are fully automated and we have some portal where these tasks arrives and there is a Resume button which when clicked, kicks the deprovision process.
When the Resume button is clicked that portal initiates API calls to vCD and start deleting stuffs. It starts with deleting vApps, vApp Templates and then proceed to Org Network deletion and then the edge gateway and at last deletes the Org vDC and Org.
Sometimes stuffs at vCD level are in inconsistent state and thus API calls are unable to delete that element and deprovision is halted in portal.
During my investigation I checked the logs and found that API calls were unable to remove one of the Org Network.
Following errors were visible in vCD UI for network deletion failure
[ 695e10af-1677-4c64-bbe1-42250b6c249d ] Cannot delete organization VDC network default-routed (0694f25a-78b9-45b0-be44-e5c8ccda4b91) Failed to delete interface of edge gateway urn:uuid:5286e85d-afb0-4821-b4f4-db87b390ba11 - Failed to delete interface of edge gateway urn:uuid:5286e85d-afb0-4821-b4f4-db87b390ba11 - com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (202): The requested object : vm-3768 could not be found. Object identifiers are case sensitive.
From the logs it was very clear that there are issues with edge backing VM’s. I went ahead with performing edge gateway redeploy without checking the edge VM’s status in vCenter. I was thinking that redeploy fixes this issue 9 out of 10 times so just give it a shot.
To my surprise edge gateway redeploy also failed and also I observed that redeploy task took around 20 minutes (usually it takes 5-7 minutes) and eventually timed out.
Errors related to edge redeploy task failing was
[ e04b76e6-7bb1-4d97-a85c-0df2813a06be ] Cannot redeploy edge gateway M738162563-11503 (urn:uuid:5286e85d-afb0-4821-b4f4-db87b390ba11) com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10020): Failed to deploy edge appliance vse-xxxxx-0. (The name 'vse-xxxxx-0' already exists.) - com.vmware.vcloud.fabric.nsm.error.VsmException: VSM response error (10020): Failed to deploy edge appliance vse-xxxxx-0. (The name 'vse-xxxxx-0' already exists.) - VSM response error (10020): Failed to deploy edge appliance vse-xxxxx-0. (The name 'vse-xxxxx-0' already exists.)
I quickly checked vCenter Server to see what went wrong.I found that in vCenter the deploy template task was complaining that edge VM’s already exists.
Note: Those who are new to NSX, when an edge gateway redeploy is kicked, the existing edge VM’s are deleted from backend and they are freshly deployed and pulls all the edge configuration from NSX manager.
So in this case what was happening was that the existing edge VM’s did not got deleted and when a new edge VM deployment was pushed, the name was conflicting with existing VM’s and thus the redeploy task was failing. Also both the edge VM’s were in shutdown state (weird case)
I have seen this issue quite a bit of time in our environment and the fix of this issue usually was to delete both edge VM’s (as HA was enabled on edge) from backend and then again try redeploying the edge from vCD or Web-Client.
I was getting very happy from inside that aah I figured out the issue very quickly and now it will be resolved in next 10 minutes. But life has planned some more surprises for me today.
I deleted both edge VM’s and kicked redeploy task again from vCD and after waiting for 15 minutes, task got timed out again and I was seeing exactly same error in vCenter as observed earlier. To rule out possibilities of issues with vCD itself, I decided to kick redeploy task once again and this time from vCenter Web-Client.
But I guess today was not my day and again redeploy of damn edge. I wanted to yell at edge gateway but did not as I don’t wanted to give false indication to people that I have lost my mind.
At this point of time I started getting frustrated as in past I was able to resolve these issue in one or two shot, but nothing was working for me today (Poor support engineer 🙁 )
I discussed issue with my peer and explained him all the troubleshooting steps and errors observed and was looking for his advise. After talking to him I got some confidence and again resumed my troubleshooting. These were my troubleshooting steps:
1: Since both the edge VM’s were in shutdown state, I powered them on one by one in vCenter and tried doing force sync from Web-Client but it timed out. I was thinking force sync might help re-establishing edge HA post reboot of edge VM’s.
2: Tried disabling HA on edge from Web-Client and it errored out saying edge VM with haindex-0 is in shutdown state (although I powered it on in vCenter).
This clearly meant that edge VM’s were not at all talking with NSX manager.
3: Deleted one edge backing VM (vse-xxxx-0)from vCenter and tried redeploying and again it failed. This time vse-xxxx-0 VM got deployed but never came online.
4: Renamed both edge VM in vCenter and tried redeploying edge, again it failed. 2 new edge VM’s got deployed but never came online.
5: Powered-on both VM and tried disabling HA on edge via REST API. Observed same error as observed earlier “vm with haindex -0 is in powered off state.
Tailed NSX manager logs (Show log manager follow) while redeploy edge task was going on and observed following errors:
2017-07-04 09:27:29.085 GMT ERROR TaskFrameworkExecutor-28 PublishTask:856 - Failed to revert to last known version for edge edge-10 and job jobdata-275582 com.vmware.vshield.edge.exception.VshieldEdgeException: vShield Edge:10110:Failed to perform operations on Virtual Machine vm-7271 for edge edge-10. 2017-07-04 09:27:29.091 GMT INFO TaskFrameworkExecutor-28 AuditingServiceImpl:143 - [AuditLog] UserName:'AD\mjha', ModuleName:'edge.appliance', Operation:'CONFIG UPDATE', Resource:'vse-xxxx', Time:'Tue Jul 04 09:27:29.088 GMT 2017' 2017-07-04 09:27:29.093 GMT ERROR TaskFrameworkExecutor-28 RollbackStrategy:29 - Rollback of Task Instance taskinstance-1407059 failed java.lang.NullPointerException 2017-07-04 09:27:29.101 GMT INFO TaskFrameworkExecutor-28 JobWorker:243 - Updating the status for jobinstance-1155088 to FAILED
Now I was fed up of this edge gateway. I went for a smoke break and was thinking about what can be done next as I already have tried N number of things. Then a thought struck my mind that I have not checked NSX manager yet.
The moment i came back at my desk, I opened NSX manager UI and bingo I found the issue. vCenter Server was disconnected in the NSX manager.
I bounced the NSX manager management service and vCenter was connected back. Now When I triggered redeploy task once again, both edge VM (old) were deleted and a fresh set of VM’s were deployed and HA was established back between both VM’s.
I deleted one of the Org Network and it got deleted without giving me any further surprises.
Now when I resumed the deprovision from portal, Rest of the stuffs got deleted without any further issues.
Man this one took 4 hours but teached me a good lesson.
I hope you enjoyed reading this post. Feel free to share this on social media if it is worth sharing. Be sociable