At the Ready

High-availability storage hinges on being ready for system failure.

Greg Schulz, author of several books on storage, is founder of the StorageIO Group, an IT industry analyst consultancy whose blog can be found at storageioblog.com and at twitter.com/storageio.

There’s no such thing as a data recession, nor is there any trend in sight indicating that data and associated IT resources will need to be less available in the future. Quite the contrary: As more data is generated and stored for longer periods of time, it will need to be more available and accessible, not less.

The design of a high-availability data storage infrastructure can be as varied as the environments and applications that it supports.

There are several techniques, technologies and best practices that can be aligned to diverse needs and budgets to counter threat risks and meet a university’s high-availability requirements. But six essential steps will help enable HA for storage environments:

1. Develop strategies to address issues, threats and mission objective requirements.
2. Establish a plan that includes applicable technologies, techniques and ongoing activities.
3. Implement a plan that includes technology deployment, configuration and day-to-day support.
4. Document and integrate with change control and continuity-of-operations processes.
5. Measure and rely on recurring testing to validate the plan and the technologies.
6. Use problem determination, isolation, resolution and post-mortems to avoid future issues.

The basic tenets of HA for storage (which also apply to networks and servers) are fault isolation and fault containment. That is, eliminate single points of failure (SPOFs) and, where a SPOF cannot be eliminated, configure systems so that any resulting fault or error condition is contained to prevent a rolling disaster. For example, you could configure a pair of networking or storage adapters to have separate paths to a shared storage system; in the event of a failure, you would still have access to the storage through the surviving adapter.
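The dual-adapter example can be sketched in a few lines of Python. This is only an illustration of the failover idea: the `StoragePath` class, the adapter names and the `read_with_failover` helper are hypothetical, and in practice this work is normally done by the operating system's multipath driver rather than by application code.

```python
class PathDown(Exception):
    """Raised when a path to the shared storage system is unavailable."""


class StoragePath:
    """Hypothetical stand-in for one adapter's path to shared storage."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self, block):
        if not self.healthy:
            raise PathDown(self.name)
        return f"block {block} via {self.name}"


def read_with_failover(paths, block):
    """Try each path in order and return the first successful read."""
    failed = []
    for path in paths:
        try:
            return path.read(block)
        except PathDown as exc:
            # Contain the fault: note it and move on to the surviving path.
            failed.append(str(exc))
    raise RuntimeError("all paths failed: " + ", ".join(failed))


if __name__ == "__main__":
    paths = [StoragePath("adapter-a", healthy=False), StoragePath("adapter-b")]
    print(read_with_failover(paths, 7))
```

The fault on the first adapter stays contained: the caller still gets its data, and an error surfaces only when every path is down.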

Keep in mind that HA is a balancing act between the availability needed to protect against the most likely scenarios (or scenarios that would have the most dire impact) and your budget. There is the perception that components that have more “nines” of availability will enable HA. More nines of availability are good if you can afford them, but more important is how well the components work together. Overall availability is determined by all of the pieces working together, not by any single component.
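One way to see why the pieces matter more than any single component is to multiply their availabilities together. The sketch below assumes that every component in the chain must work and that failures are independent; the component names and figures are made up for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be working.

    Each value is a fraction between 0 and 1; failures are assumed independent.
    """
    return prod(availabilities)


# Hypothetical figures, not vendor data.
chain = {
    "server": 0.9999,
    "adapter": 0.9999,
    "switch": 0.99999,
    "storage system": 0.99999,
}
overall = series_availability(chain.values())

if __name__ == "__main__":
    print(f"weakest component: {min(chain.values()):.5%}")
    print(f"end to end:        {overall:.5%}")
```

Under these assumptions the end-to-end figure is always lower than that of the weakest component, which is why adding nines to one device does little if the rest of the chain is left as is.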

Measuring Availability

Availability	Nines	Annual downtime
99%	Two	3.65 Days
99.9%	Three	8.77 Hours
99.99%	Four	52.6 Minutes
99.999%	Five	5.26 Minutes
99.9999%	Six	31.56 Seconds
99.99999%	Seven	3.16 Seconds
99.999999%	Eight	0.32 Seconds
Availability expressed in nines

Availability is often discussed in terms of five nines, six nines or higher. It is important to understand that overall availability depends on all components and their configuration, combined with design for fault isolation and containment. The fraction of a year lost to downtime is calculated as (100% – A)/100%, in which A is the availability percentage; multiplying that fraction by the length of a year gives the annual downtime. For example, 99.999% availability leaves 0.001% of the year, or about 5.26 minutes, as downtime. How much availability you need and can afford will be a function of your environment, applications, requirements and objectives.
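The downtime arithmetic is easy to check with a short script. The following sketch assumes an average year of 365.25 days, which matches the rounded figures in the table; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def annual_downtime_seconds(availability_percent):
    """Expected downtime per year, in seconds, for an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        minutes = annual_downtime_seconds(pct) / 60
        print(f"{pct}% available -> {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowable downtime by a factor of 10, which helps explain why each step up tends to cost more than the last.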

The reality is that applications can be looked at from the standpoint of a specific layer or resource, or from end to end, which is what a user of IT services sees.

Anticipate and Prepare for Failure

Availability is only as good as the weakest link. In the case of a data center, that weakest link could be the applications, software, servers, storage, network, facilities, or processes and best practices. Virtual data centers rely on physical resources to function; a good design can help eliminate unplanned outages by compensating for individual component failures. A good design removes complexity while providing scalability, stability, ease of management and maintenance, as well as fault containment and isolation.

As part of the configuration, costs could be saved by using a single switch, but even with five or six nines of availability, that switch and its firmware or software still present a single point of failure. You should therefore configure a pair of switches, each on its own network, to guard against device failure, software or configuration errors, and network disruptions.
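A rough calculation shows what the second switch buys. The sketch below assumes the two switches fail independently and that either one can carry the load alone; the 99.99% figure is hypothetical.

```python
def redundant_availability(component_availability, copies=2):
    """Availability of a group that stays up while at least one copy works.

    Assumes independent failures; availability is a fraction between 0 and 1.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    single = 0.9999  # hypothetical switch availability
    pair = redundant_availability(single, copies=2)
    print(f"single switch: {single:.6%}")
    print(f"switch pair:   {pair:.6%}")
```

The independence assumption is the important caveat. A shared firmware bug or a configuration error pushed to both switches at once would defeat the calculation, which is why the pair should sit on separate networks.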

There is a tendency to try to reduce costs by replacing multiple smaller devices with a single, larger higher-availability device — for instance, using a large switch in place of two separate switches. In that scenario, even with a manufacturer who boasts support for more nines than the competition, the physical frame itself might have a common SPOF — for example, a backplane that creates the potential for multiple component failures.
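The effect of a shared backplane can be illustrated the same way. In this sketch all figures are hypothetical and failures are assumed independent: two separate switches are compared with one chassis whose redundant modules both depend on a single backplane.

```python
def parallel(a, b):
    """Either of two independent components is enough to stay up."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def series(a, b):
    """Both components must be working."""
    return a * b


# Hypothetical availabilities, expressed as fractions.
SWITCH = 0.9999
MODULE = 0.9999
BACKPLANE = 0.99999

two_switches = parallel(SWITCH, SWITCH)
single_chassis = series(BACKPLANE, parallel(MODULE, MODULE))

if __name__ == "__main__":
    print(f"two separate switches: {two_switches:.7%}")
    print(f"one large chassis:     {single_chassis:.7%}")
```

However many redundant modules the chassis holds, under these assumptions it can never be more available than the backplane they share.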

The bottom line: If something can fail, it will; it’s just a matter of time. HA is about mitigating risk while balancing the PACE — performance, availability, capacity and economics — of business requirements.

Any technology (hardware, software, network or service) can fail at some point, whether because of the technology itself, a configuration or other error, or acts of nature or man. Most manufacturers will claim that their products have no single points of failure, and thus will not fail. But they also typically describe how to implement fault isolation and other capabilities so that if and when their products fail, they do so gracefully and predictably.

Look for a storage system that is resilient, yet scales with stability and flexibility — meaning that as performance increases availability does not suffer, or as availability increases, performance and capacity do not suffer. Likewise, combine individual component availability with sound configuration best practices, keeping in mind that even highly available components can break down because of technical or human error. In the end, it’s how you configure components that reduces the impact of a failure and maintains HA.