Storage Design Considerations, Part 2: Designing for Availability

In my last post, Storage Design Considerations- Overview, I discussed a few points that are crucial and deserve serious attention when designing the storage infrastructure.

In this post I will discuss why the availability of the storage component matters in a virtual infrastructure, along with the important design considerations:

Availability of the storage component is crucial. No vSphere or SAN admin ever wants to face a complete outage on the storage side.

Performance and capacity issues are not as disruptive, and if they are well planned for and monitored, they can be rectified without any downtime.

To avoid a complete outage, we should build redundancy into each and every component of the storage infrastructure. All your physical servers may be connected to one large storage device, and its failure means chaos for you. So every component and connection should have sufficient redundancy to ensure that there are no single points of failure.

Whenever we talk about availability, we talk in terms of 9s. Here is what that means:

Most common service-level agreements (SLAs) use the term 9s. The 9s refer to availability expressed as a percentage of uptime in a year, as shown in the table below.

[Table: availability percentage (number of 9s) and the corresponding downtime per year]
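The mapping from a number of 9s to downtime per year is simple arithmetic. The following is a minimal sketch, assuming a 365-day year (525,600 minutes); the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Two, three, four and five 9s
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(percent)
        print(f"{percent}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

For instance, three 9s (99.9%) allows about 8.76 hours of downtime per year, while five 9s (99.999%) allows only about 5.26 minutes.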

Using the 9s concept gives you a quantitative measure of the desired level of availability. The uptime or availability percentage is calculated as follows:

For each item, estimate how frequently it will fail (mean time between failures [MTBF]) and how quickly it can be brought back online after a failure (mean time to recover [MTTR]). Then use the formula below to calculate the applicable 9s value:

Availability = ((minutes in a year − average annual downtime in minutes) ∕ minutes in a year) × 100

The example below is taken from the vSphere Design book by Scott Lowe, and it helps illustrate the calculation:

Suppose a router fails on average once every 3 years (MTBF) and takes 4 hours to replace (MTTR). It can then be said to have an average annual downtime of roughly 75 minutes. This equates to

Availability = ((525600 − 75) ∕ 525600) × 100 = 99.986%
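The same calculation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the downtime helper assumes one failure per MTBF interval spread evenly over the years. Note that dividing the 4-hour repair evenly over 3 years gives 80 minutes per year, so the 75-minute figure in the example is best read as an approximation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def annual_downtime_minutes(mtbf_years, mttr_hours):
    """Average downtime per year, assuming one failure every `mtbf_years` years."""
    return mttr_hours * 60 / mtbf_years


def availability_percent(downtime_minutes):
    """Availability = ((minutes in a year - annual downtime) / minutes in a year) x 100."""
    return (MINUTES_PER_YEAR - downtime_minutes) / MINUTES_PER_YEAR * 100


if __name__ == "__main__":
    # Figure used in the text: 75 minutes of downtime per year
    print(f"{availability_percent(75):.3f}%")
    # Figure derived from MTBF = 3 years and MTTR = 4 hours
    derived = annual_downtime_minutes(mtbf_years=3, mttr_hours=4)
    print(f"{derived:.0f} minutes/year -> {availability_percent(derived):.3f}%")
```

Either way the router lands at three 9s: roughly 99.986% with 75 minutes and 99.985% with 80 minutes.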

As soon as you introduce a second item into the mix, the combined availability becomes the product of the two individual availabilities. Unless you’re adding a 100 percent rock-solid, infallible piece of equipment, the percentage drops, and your solution can be considered less available.

As an example, if you have a firewall in front of the router, with the same chance of failure, then a failure in either will create an outage. The expected downtime of that solution doubles: availability drops to about 99.972 percent, which means an average downtime of 150 minutes every year.
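The router-plus-firewall case can be sketched as components in series. The code below is a minimal illustration that assumes the two devices fail independently; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def series_availability(*availabilities):
    """Availability (as a fraction) of a chain in which every component must be up.

    Assumes the components fail independently of each other.
    """
    combined = 1.0
    for availability in availabilities:
        combined *= availability
    return combined


if __name__ == "__main__":
    router = 1 - 75 / MINUTES_PER_YEAR  # 75 minutes of downtime per year
    firewall = router                   # same chance of failure as the router
    combined = series_availability(router, firewall)
    downtime = MINUTES_PER_YEAR * (1 - combined)
    print(f"combined availability: {combined * 100:.2f}%")
    print(f"expected downtime: {downtime:.0f} minutes per year")
```

The combined figure comes out at roughly 99.97 percent, with the last digit depending on whether you multiply the rounded or unrounded percentages, and the expected downtime is about 150 minutes per year.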

While designing the infrastructure, we should be aware of the items that are most likely to fail.

When multiple items are needed to handle a load and the failure of any one of them can create an outage, adding more nodes does not reduce the potential for failure. It increases it, because there are now more components that can fail.

Availability doesn’t always mean uptime. If your design is such that the failure of one component degrades performance so badly that the architecture is no longer workable, then no one will be impressed by your zero-downtime figures for the month.
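To make the difference concrete, the sketch below compares nodes that are all required against nodes that are truly redundant. It assumes independent failures and uses a hypothetical per-node availability of 99.9%; both function names are invented for the example.

```python
def all_required(availability, count):
    """Every one of `count` nodes must be up (no spare capacity)."""
    return availability ** count


def any_one_sufficient(availability, count):
    """The service survives as long as at least one of `count` nodes is up."""
    return 1 - (1 - availability) ** count


if __name__ == "__main__":
    node = 0.999  # hypothetical availability of a single node
    for count in (1, 2, 4, 8):
        needed = all_required(node, count) * 100
        redundant = any_one_sufficient(node, count) * 100
        print(f"{count} nodes: all required {needed:.3f}%, redundant {redundant:.7f}%")
```

When every node is required, availability falls as nodes are added. It only rises when the extra nodes are spare capacity that can take over the load.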

Typically, for a SAN storage device we can build redundancy at the following levels:

1: Fabric Switch Level: Having more than one fabric switch is very common in a SAN environment. Both fabric switches must be interconnected, and the storage processors must have connectivity to both fabric switches.

2: Physical Server Level: Each HBA card on a physical server should connect to both fabrics.

3: Rack Level: The rack where your storage device is kept must draw power from two separate power sources, and the PDUs on the storage device can in turn be connected to different racks.

4: Software Level: Use multipathing so that the storage array is reachable via multiple paths at any given time.
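One way to sanity-check a design like this is to list every path from a host to the array and look for a component that appears in all of them. The sketch below is only an illustration: the component names (hba0, fabric-A, SP-A and so on) are hypothetical, and it models each path as a simple HBA, fabric switch, storage processor chain.

```python
from itertools import product


def single_points_of_failure(paths):
    """Return the components present in every path.

    If a component appears in all paths, losing it cuts off all access,
    so it is a single point of failure.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


if __name__ == "__main__":
    hbas = ("hba0", "hba1")
    storage_processors = ("SP-A", "SP-B")

    # Every HBA reaches both fabrics, and both fabrics reach both storage processors
    redundant = list(product(hbas, ("fabric-A", "fabric-B"), storage_processors))
    print(single_points_of_failure(redundant))

    # Same hosts and array, but with only one fabric switch in the middle
    one_switch = list(product(hbas, ("fabric-A",), storage_processors))
    print(single_points_of_failure(one_switch))
```

The fully redundant layout returns an empty set, while the single-switch layout reports fabric-A as the component whose loss takes down every path.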

If you intend to use iSCSI and rely on traditional Ethernet connectivity, redundancy can be achieved at the following levels:

1: iSCSI Switch Level: Like SAN switches, your iSCSI switches must be redundant and should be interconnected. The traditional switches that provide Ethernet connectivity to your iSCSI switches should also be redundant.

2: Software Level: Multiple port groups can be configured for iSCSI connectivity, and the uplinks connected to these port groups must come from redundant iSCSI switches.

3: Network Infrastructure Level: For iSCSI environments, we should use a separate network infrastructure (a dedicated non-routable VLAN) to connect to storage devices, and avoid reusing an existing network that serves some other purpose, such as carrying the other networks running in your environment. This minimizes the risk of the management network and the storage network being disrupted at the same time.
