Troubleshooting NFS Mount Issue During SDDC Bringup in vCF 3.7

Recently, while playing with vCF 3.7.2 in my lab, I hit an issue where the SDDC bringup process halted because of an NFS mount problem.

If you have worked with vCF, you will know that during the bringup process an NFS share from the SDDC Manager VM is mounted as an NFS datastore named “lcm-bundle-repo” on the hosts of the management domain.

On checking hostd.log on the ESXi host, I saw the following log entries:

vmkernel.log was full of the error messages below:

Clearly, the NFS server (SDDC Manager) was denying the ESXi host's requests to mount the NFS share “/nfs/vmware/vcf/nfs-mount”.

During troubleshooting I came across VMware KB 1005948, and my issue was exactly the one described there.

As per KB 1005948:

You may see this issue if you have more than one vmkernel port on the same network segment. VMware recommends only having one vmkernel port per network segment unless port binding is being used.

Since this was a nested lab, I had kept all portgroups (Management/vMotion/vSAN) on the same subnet, and to my surprise the host's default gateway was pointing to vmk2 instead of vmk0.
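To confirm which vmkernel port carries the default route, the routing table can be read from the ESXi shell. The snippet below is a minimal sketch and not something taken from my original troubleshooting session: default_route_vmk is a hypothetical helper, and it assumes the output of esxcli network ip route ipv4 list keeps its usual column order (Network, Netmask, Gateway, Interface, Source).

```shell
# Hypothetical helper: reads the routing table from stdin and prints the
# vmkernel interface that carries the default route.
# Assumes the usual column order: Network Netmask Gateway Interface Source.
default_route_vmk() {
    awk '$1 == "default" { print $4 }'
}

# Only call esxcli where it exists (i.e. on an ESXi host).
if command -v esxcli >/dev/null 2>&1; then
    # List the IPv4 settings of every vmkernel port (vmk0, vmk1, ...).
    esxcli network ip interface ipv4 get
    # Print the interface used by the default route.
    esxcli network ip route ipv4 list | default_route_vmk
fi
```

If the helper prints anything other than vmk0, NFS traffic on the shared subnet may leave the host from an IP that SDDC Manager does not expect.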

So one part was sorted out. Still, I was wondering why SDDC Manager was refusing the mount request, since ESXi and SDDC Manager were on the same subnet and I was not using any firewall to restrict traffic.

Next I checked the /etc/exports file on the SDDC Manager VM and immediately got the answer to my question. By default, the vmk0 IP address of every host in the management domain is explicitly whitelisted in /etc/exports.
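A quick way to check whether a particular vmkernel IP is allowed is to search the exports file for it. The sketch below is illustrative only: client_in_exports is a hypothetical helper name, and it does exact-match lookups on host entries, so it will not recognise subnets, wildcards or hostnames.

```shell
# Hypothetical helper: client_in_exports <ip> [exports-file]
# Returns 0 if <ip> appears as a client entry on any active (uncommented)
# line of the exports file, 1 otherwise. Exact IP match only.
client_in_exports() {
    ip=$1
    file=${2:-/etc/exports}
    awk -v ip="$ip" '
        /^[[:space:]]*#/ { next }            # skip commented-out lines
        {
            # Field 1 is the exported path; the rest are client(options).
            for (i = 2; i <= NF; i++) {
                host = $i
                sub(/\(.*/, "", host)        # drop the "(ro,sync,...)" part
                if (host == ip) { found = 1 }
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Example usage on the SDDC Manager VM (replace <vmk-ip> with the IP to check):
# client_in_exports <vmk-ip> && echo "allowed" || echo "not whitelisted"
```

Running it once with the vmk0 IP and once with the vmk2 IP of the same host should show the difference described here.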

So when ESXi tried to mount the NFS share, it did so from the IP configured on vmk2, and that is why the NFS server rejected the request. To fix this, I could have replaced the vmk0 IPs with the vmk2 IPs in the above file, but I took a shortcut and allowed the entire subnet in the NFS configuration.

I commented out all the existing lines and added the line below:

/nfs/vmware/vcf/nfs-mount 10.62.x.0/24(ro,sync,no_subtree_check)

Then I re-exported the share by running the exportfs -ar command and restarted the NFS services:

# systemctl restart nfs-mountd.service

# systemctl restart nfs-server.service
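The manual edit above can also be scripted. The following is a rough sketch under a few assumptions: allow_subnet is a hypothetical helper, it keeps a .bak copy of the file, comments out every active line, and appends a single read-only entry for the subnet with the same options used in this post. Review it before pointing it at the real /etc/exports.

```shell
# Hypothetical helper: allow_subnet <exports-file> <export-path> <subnet>
allow_subnet() {
    exports_file=$1
    export_path=$2
    subnet=$3

    # Keep a backup so the original per-host whitelist can be restored.
    cp "$exports_file" "$exports_file.bak" || return 1

    # Comment out every non-empty line that is not already a comment.
    awk '/^[^#]/ { print "#" $0; next } { print }' \
        "$exports_file.bak" > "$exports_file" || return 1

    # Append one entry that allows the whole subnet (read-only).
    printf '%s %s(ro,sync,no_subtree_check)\n' \
        "$export_path" "$subnet" >> "$exports_file"
}

# Example usage on the SDDC Manager VM, as root (left commented out on purpose):
# allow_subnet /etc/exports /nfs/vmware/vcf/nfs-mount 10.62.x.0/24
# exportfs -ar    # re-export everything listed in /etc/exports
# exportfs -v     # show the active exports and their options
```

Allowing a whole subnet is a lab shortcut. Restoring the .bak file brings back the original per-host entries.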

I retried the SDDC bringup task and it completed without any further issues.

Final Thoughts

In a production deployment you are less likely to hit this issue, as you would have proper networking in place, with VLANs defined for your Management, vMotion, vSAN and VXLAN traffic and uplinks trunked on the physical switches.

If you are doing a POC/lab setup and running everything on the same subnet, you may well hit this issue. However, if the ESXi host's default gateway does not move to another vmkernel port during the bringup process, you will not face it.

If you have any other thoughts, do leave a comment and I will be happy to discuss further.

I hope you find this post informative. Feel free to share it on social media if you think it is worth sharing :)