16 Biggest HA/DRS configuration mistakes

Today I watched one of the VMworld 2012 session videos, “16 of the biggest HA and DRS configuration mistakes” by Greg Shields. It was an awesome session in which Greg discussed the mistakes we VMware admins commonly make while installing and configuring HA/DRS clusters.

After watching the session I decided to write it up, so that we can avoid these mistakes when deploying HA/DRS clusters in production and fix any of them we may already have made unknowingly. Here are the important points from that session:

1: Not planning for hardware change. Server hardware changes very rapidly these days, as vendors keep adding new features to their servers to make more efficient use of VMware features.

The servers you deploy today may be the latest models, but suppose that two years from now you expand your cluster with new servers. I am quite sure you will not buy an old server that matches the capabilities of your existing ones; you will go with whatever latest model is on the market. Since it is a new model, it may have features that were not present in the servers you deployed two years earlier. Now suddenly you face issues: your VMs will not vMotion, or DRS is unable to move VMs onto your new server. Why?

The answer is very straightforward: you forgot to enable EVC (Enhanced vMotion Compatibility) on your cluster. Enabling EVC and setting a baseline that matches your present infrastructure can save you from these troubles.

The golden rule when expanding your cluster is to use hardware with very similar processors. Do not mix Intel and AMD processors. Even within one manufacturer, newer processors support additional instructions that are not available on older models, so pay attention to the CPUs you buy.
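To make the idea of a baseline concrete, here is a minimal Python sketch. It is not how vCenter implements EVC, and the host names and feature flags are made up for illustration; it only shows that a common baseline is the set of CPU features every host supports, and which features a newer host would have to hide.

```python
# Illustrative only: host names and feature flags are hypothetical,
# and this is not how vCenter computes EVC modes internally.

def common_baseline(host_features):
    """Return the CPU features supported by every host in the cluster."""
    feature_sets = list(host_features.values())
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def features_hidden_by_baseline(host_features):
    """For each host, list the features a common baseline would mask."""
    baseline = common_baseline(host_features)
    return {
        host: sorted(features - baseline)
        for host, features in host_features.items()
    }


cluster = {
    "old-host-1": {"sse4.1", "sse4.2"},
    "old-host-2": {"sse4.1", "sse4.2"},
    "new-host-1": {"sse4.1", "sse4.2", "aes", "avx"},
}

print(sorted(common_baseline(cluster)))
print(features_hidden_by_baseline(cluster)["new-host-1"])
```

In this toy cluster the new host supports two extra features; a VM that started using them there could not be moved back to the older hosts, which is the situation a baseline is meant to prevent.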

2: Not planning for svMotion (Storage vMotion). Maybe you are not using svMotion today (due to license constraints or some other reason), but in the future you will very likely upgrade and plan to use it. The gotcha is that in our day-to-day work we do things like cloning, creating templates and taking snapshots. Snapshots are good, but they can be evil sometimes, and you should NOT use them except in rare situations.

Make sure your VMDKs are in persistent mode, or use RDMs. The hosts must be able to see both the source and target datastores, and the cluster must have enough resources to briefly hold two copies of the VM at the same time.

3: Not enough cluster hosts. As soon as we deploy servers in a cluster we start filling them with VMs, and over time the utilization of each host may reach 80-90%. You are very happy that you are making full use of your hardware. But think about how HA will restart your VMs if one of your hosts goes down. The VMs have no place to go, because all the remaining servers are already running at full capacity.

So what went wrong here? You enabled HA but forgot the very simple logic that HA needs some reserved resources in the cluster to perform a failover. The golden rule is to plan for adequate cluster resources and build in a reserve, typically one full server’s worth of resources. The solution is to use an admission control policy and set “Host failures the cluster tolerates” to “1”.
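The reserve idea can be checked with a rough back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions: it looks at memory only, uses hypothetical host names and figures, and ignores CPU, reservations and fragmentation, all of which real HA admission control takes into account.

```python
def survives_one_host_failure(hosts):
    """hosts maps host name -> (capacity_gb, used_gb).

    Returns True if, for every possible single-host failure, the free
    memory on the surviving hosts covers the failed host's used memory.
    Simplified aggregate check: ignores CPU, reservations, fragmentation.
    """
    for failed, (_, failed_used) in hosts.items():
        free_elsewhere = sum(
            capacity - used
            for name, (capacity, used) in hosts.items()
            if name != failed
        )
        if free_elsewhere < failed_used:
            return False
    return True


# Hypothetical three-host clusters, memory in GB.
busy = {"esx1": (256, 230), "esx2": (256, 225), "esx3": (256, 220)}
with_reserve = {"esx1": (256, 170), "esx2": (256, 165), "esx3": (256, 160)}

print(survives_one_host_failure(busy))          # False
print(survives_one_host_failure(with_reserve))  # True
```

The second cluster runs each host at roughly two thirds of its memory, which in a three-host cluster leaves about one host’s worth of capacity free.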

4: Disabling admission control. This is probably one of the biggest mistakes we VM admins make. Never, ever, ever do this! Keeping admission control enabled is what guarantees that your VMs can be restarted successfully in the event of a host failure.

5: Setting the host failures tolerated to “1”. You know the importance of admission control, so you have enabled it on your cluster with the “Host failures the cluster tolerates” policy and set it to 1. Now you relax, thinking you have reserved enough resources in your cluster and no longer have to worry about how HA will perform a failover on the bad day when you lose a host.

Enabling admission control is good, but it can sometimes lead to a lot of wasted resources, especially if you are using the policy mentioned above and certain VMs in your cluster are configured with large reservations. If that is the case, you have to fine-tune your cluster with the following advanced HA settings:

das.slotCpuInMHz: the CPU slot size, in MHz.

das.slotMemInMB: the memory slot size, in MB.
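A small sketch shows why one large reservation hurts under the slot-based policy. It assumes the commonly described behavior that the slot size follows the largest reservation among powered-on VMs and that the advanced settings act as an upper bound on it; the function names, VM figures and host sizes are hypothetical, and memory overhead is ignored.

```python
def slot_size(reservations, cap=None):
    """Slot size = largest reservation, optionally capped (assumed behavior)."""
    size = max(reservations)
    return min(size, cap) if cap is not None else size


def slots_per_host(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb):
    """A host holds as many slots as its scarcer resource allows."""
    return min(cpu_mhz // slot_cpu_mhz, mem_mb // slot_mem_mb)


# Hypothetical VMs: most reserve 1 GB of memory, one reserves 16 GB.
mem_reservations_mb = [1024, 1024, 1024, 16384]
cpu_reservations_mhz = [500, 500, 500, 500]
host_cpu_mhz, host_mem_mb = 20000, 131072  # hypothetical 128 GB host

uncapped = slots_per_host(
    host_cpu_mhz, host_mem_mb,
    slot_size(cpu_reservations_mhz),
    slot_size(mem_reservations_mb),
)
capped = slots_per_host(
    host_cpu_mhz, host_mem_mb,
    slot_size(cpu_reservations_mhz),
    slot_size(mem_reservations_mb, cap=2048),  # stands in for das.slotMemInMB
)

print(uncapped)  # 8
print(capped)    # 40
```

With the single 16 GB reservation driving the slot size, the host offers only 8 slots; capping the memory slot at 2 GB raises that to 40, at which point CPU becomes the limiting resource.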

Also, not all VMs are tier-1, and not all of them deserve the same restart priority. There may be some VMs (test or dev VMs) for which you can afford some downtime. Reserving a full host’s worth of capacity for failover can be wasteful.

So what should you do now? You may be confused that I recommended this policy earlier and am now telling you it is not a good one.

The answer is to avoid the “Host failures the cluster tolerates” policy (unless you have a very specific use case) and use the “Percentage of cluster resources” option instead, configured with a percentage that is less than a single host’s contribution to the cluster, based on the number of hosts. For example, in a four-node cluster each server contributes 25%, so set the percentage to something like 20% or 15%.

6: Not prioritizing VM restarts. As discussed earlier, not all the VMs in a cluster are tier-1 and they do not all deserve the same restart priority. There may be some VMs (like your domain controller, DNS server or web server) that you want back online as soon as possible after a host failure. There may also be VMs that have to be restarted in the correct order for an application to function properly.

Not setting a restart priority for your VMs is another big mistake. If you operate a very large environment and each of your ESXi hosts runs more than 50 VMs, you are definitely in trouble, because HA can restart a maximum of 32 VMs concurrently. If you have not prioritized the restart order, you may have to wait quite a while before your critical VMs come back online, because HA may be busy restarting non-critical VMs first.
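The effect of priorities on a busy host can be pictured with a toy scheduler. This is only an illustration, not HA’s real scheduling logic: the VM names are invented, the priority labels are assumed to be high/medium/low, and the batch size of 32 is the concurrency figure quoted above.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}
MAX_CONCURRENT_RESTARTS = 32  # concurrency figure quoted in the text


def restart_batches(vms, batch_size=MAX_CONCURRENT_RESTARTS):
    """vms is a list of (name, priority) tuples.

    Returns the VM names grouped into restart batches, highest
    priority first. sorted() is stable, so VMs with equal priority
    keep their original relative order.
    """
    ordered = sorted(vms, key=lambda vm: PRIORITY_ORDER[vm[1]])
    names = [name for name, _ in ordered]
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


# 48 hypothetical low-priority test VMs listed before two critical ones.
vms = [(f"test-{i:02d}", "low") for i in range(48)]
vms += [("dc01", "high"), ("dns01", "high")]

batches = restart_batches(vms)
print(len(batches))    # 2
print(batches[0][:2])  # ['dc01', 'dns01']
```

Without the sort, the two critical VMs in this example would sit at the end of the list and land in the second batch, behind 32 test VMs.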

7: Not updating the % policy. This is again a mistake we commonly make in our environments. If you are using the percentage-based admission control policy, you need to recalculate the reserved percentage as your cluster grows over time; otherwise you are again wasting resources in your cluster.

For example: when you configured the cluster you had 4 hosts, and to be on the safe side you set the reserved CPU and memory capacity to 25%. Over time you added 6 new hosts and forgot to change the percentage. For 10 hosts the value should ideally be 10%, but because you forgot to update it you are reserving an extra 15% of CPU and memory. Those resources are being wasted.
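The arithmetic in this example is easy to script. The sketch below assumes equal-sized hosts and a target of one host’s worth of reserve; the function names are made up for illustration.

```python
def one_host_percentage(host_count):
    """Share of cluster resources contributed by one host (equal-sized hosts)."""
    if host_count < 1:
        raise ValueError("a cluster needs at least one host")
    return 100 / host_count


def over_reserved(configured_pct, host_count):
    """Percentage points reserved beyond one host's worth of capacity."""
    return configured_pct - one_host_percentage(host_count)


print(one_host_percentage(4))   # 25.0
print(one_host_percentage(10))  # 10.0
print(over_reserved(25, 10))    # 15.0
```

Re-running a calculation like this whenever hosts are added or removed is a cheap way to keep the configured percentage in step with the cluster.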

8: Using heterogeneous servers in the cluster. Suppose your environment has hosts of different capacities, some with 128 GB of RAM and others with 256 GB or more, and you have chosen the “Host failures the cluster tolerates” policy. That policy does its calculation on the basis of the biggest server in the cluster, so you may end up wasting a lot of resources. You are also making life a bit tougher for DRS and DPM.

The golden rule is to always keep the hosts in your cluster homogeneous, or at least of similar capacity.
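A quick calculation shows how much a single larger host can cost. This is a simplified sketch that looks at memory only and assumes the policy has to cover the loss of the biggest host; the host sizes are hypothetical.

```python
def reserved_fraction_for_largest_host(host_memory_gb):
    """Fraction of cluster memory set aside to cover losing the biggest host."""
    return max(host_memory_gb) / sum(host_memory_gb)


homogeneous = [128, 128, 128, 128]
mixed = [128, 128, 128, 256]

print(f"{reserved_fraction_for_largest_host(homogeneous):.0%}")  # 25%
print(f"{reserved_fraction_for_largest_host(mixed):.0%}")        # 40%
```

Swapping one 128 GB host for a 256 GB host raises the reserve from a quarter of the cluster to 40% of it in this example.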

9: Neglecting host isolation response. You can configure the cluster’s isolation response to shut down the guest VMs when a host becomes isolated, and you can override that setting on a per-VM basis for critical apps that you don’t want to go down accidentally.

There are 3 configurable options for the isolation response. VMware used to recommend shutting down the VMs as one of the best options, because it protects you from a split-brain situation in which a VM ends up running on 2 different hosts. Shutting down the VMs releases the VMFS locks, so FDM can restart them on one of the surviving nodes in the cluster.

With vSphere 5.0, VMware introduced the datastore heartbeat concept to make isolation handling more intelligent and fine-tuned HA to deal with the split-brain scenario, so you can go ahead and select “Leave VM powered on” as the isolation response. But remember: if you have a converged network where your management network and storage network are on the same IP subnet, don’t use this option.

10: Assuming that datastore heartbeats prevent isolation events. This is one of the biggest misunderstandings VMware admins have. We set up the heartbeat datastores and think: yay, I have datastore heartbeats, so I don’t have to worry about my host getting isolated, because the heartbeats are there to protect me. But we are very wrong here.

Datastore heartbeats are used by the master to determine the state of an unresponsive host. They let the master decide whether the host has failed completely or is merely isolated. The isolation response, however, is always triggered by the slave. Whenever a host is isolated it enters the election state, elects itself master (since it is alone) and pings its isolation addresses. If it gets no ping response, it declares itself isolated and then triggers the isolation response.
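The slave-side decision described above can be sketched as a small function. This is a loose model of the steps in the text, not FDM’s actual code; the function and parameter names are invented for illustration.

```python
def slave_declares_isolation(hears_other_ha_agents, isolation_ping_results):
    """Return True if the host would declare itself isolated.

    hears_other_ha_agents: True if any HA traffic from other hosts
        still arrives on the management network.
    isolation_ping_results: one boolean per configured isolation
        address, True meaning the ping was answered.
    """
    if hears_other_ha_agents:
        # Not alone on the network, so this is not an isolation event.
        return False
    # Alone: the host elects itself master, then pings its isolation
    # addresses. No answer from any of them means it is isolated.
    return not any(isolation_ping_results)


print(slave_declares_isolation(False, [False, False]))  # True
print(slave_declares_isolation(False, [False, True]))   # False
print(slave_declares_isolation(True, [False, False]))   # False
```

Note that datastore heartbeats are not an input to this function at all. In the model they only help the master classify the host from the outside, which is why they cannot stop the isolation response from firing.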

11: Overdoing reservations, limits and affinities. Setting large reservations on a VM can delay or even prevent its restart during an HA event. If VMs have large reservations and none of the remaining hosts in the cluster has enough free resources to satisfy them, HA will delay the restart and ask DRS to defragment the cluster so that the VMs can be restarted. If the hosts still cannot hold those VMs after defragmentation, HA is not going to restart them.

Also, if you have “must” rules (affinity or anti-affinity) configured between VMs, or between a VM and a host, particular VMs may not get restarted during an HA event, because HA can’t break “must” rules.

The golden rule: prefer shares over reservations, and limit the use of affinities (or anti-affinities). These restrictions affect the DRS calculations and can hurt both performance and HA operations.

12: Setting memory limits on a VM. Don’t ever do this, ever, ever! When you set a limit on a VM, the VM can use physical memory only up to that limit, and all the allocated memory above the limit comes from swapping. If you want to stop an application (like SQL) from eating up all the allocated memory, limit the memory from inside the application and not at the VM level.

13: Thinking you are smarter than DRS. Don’t play with the DRS settings unnecessarily. No human can calculate all of the variables and come up with the right answer. Let the software do its job.

14: Being too liberal. Migrations take resources, be they network bandwidth or CPU time. Don’t have DRS continually moving workloads between servers. Configure the migration threshold so that DRS makes sensible migrations only when resources really are out of balance. vMotion was cool 5 years ago, but there is no need to have DRS continually move workloads just to be cool.

15: Combining VDI workloads and server workloads in the same cluster. ESXi hosts running VDI workloads tend to experience more load than those running server workloads, which forces DRS to work more often and much harder.

The golden rule here is to put VDI workloads in a separate cluster of their own.

16: Too many cluster hosts. Although the technical limit is 32 hosts per cluster, the sweet spot is 16-24 hosts. In anything larger, the calculations DRS performs every five minutes become very complex and consume more and more resources.