Gilbert McCallister and Greg Johnson of VCU Health System say continuity of operations for the university's hospital can be a matter of life and death.

Aug 06 2010

Why Disaster Recovery Systems Are a Matter of Life and Death

Colleges and universities depend on IT today for everything from classroom technology to medical facilities and daily business tasks, which is why more schools are willing to invest in disaster recovery systems.

For the Virginia Commonwealth University Health System in Richmond, continuity of operations is a matter of life and death.

“If our clinicians don't have electronic access to patient data when and where they need it, you're potentially placing a real person's health in jeopardy,” says Greg Johnson, director of technology and engineering services and chief technology officer for VCU Health System. “Having a system down for hours or even minutes is simply not acceptable.”

For this reason, Johnson and his team at VCU invested heavily in technologies that keep the hospital network running in the face of any disaster, whether it's a simple electrical outage or something more serious, such as an event that can cause physical harm to students, staff or campus buildings.

VCU's strategy was to upgrade the overall network architecture and build in redundancy, data availability and failover. To do this, the medical center relies on two data centers that actively back each other up, and it virtualizes its servers with VMware. VCU also upgraded its storage architecture to a single-vendor platform, implementing an IBM XIV open disk storage system, IBM Tivoli Storage Manager and an IBM Virtual Tape Library.

Surprisingly, as recently as five years ago, VCU Health System still relied on traditional tape backups located at an offsite recovery center. When Johnson arrived in his position a year later, he set his sights on a better approach.

“It simply is impractical to rebuild a complex environment from a series of tapes on leased equipment,” Johnson says. “So one of my first decisions was to change the paradigm entirely and think more along the lines of anticipating disasters and focusing on continuing business rather than trying to recover after the fact.”

The new system is about 90 percent complete and has already passed its first test. When a brownout occurred in the primary data center a year ago, the servers automatically moved their content and operations over to the secondary data center.

“None of our end users even knew there had been a blip,” Johnson says. “The changeover happened seamlessly, and everything just kept going, which is our goal no matter the circumstance.”

A Real Trend

With its stringent requirements for 24x7 uptime, VCU Health System may seem unique, but a majority of colleges and universities now recognize the need for a robust disaster recovery and continuity of operations system.

“I think all schools have [disaster recovery] plans now or are putting them in place,” says Ron Bonig, research director for higher education at Gartner. “Not all of them can afford a second data center, obviously, but whatever plan they are looking at seems to be much more robust and practical than before.”

Disaster recovery and continuity of operations systems were a hard sell to higher education officials in the 1990s and early 2000s. However, Bonig says that reluctance changed dramatically following several wake-up calls, most notably the Sept. 11 attacks, Hurricane Katrina and the threat of pandemic flu.

Another important driver, says John Snider, manager of disaster recovery for the University of Minnesota, is that colleges now depend on IT for more than just administrative tasks. IT is a vital classroom tool today, which is why administrators are more willing to make the required capital investments in disaster recovery – even in an era of strained budgets.

“Students, faculty and other individuals want to access their systems anytime, from anywhere,” Snider says, noting that what is termed “a disaster” is often simple, such as the loss of a controller board or some other equipment failure.

“But the bottom line is that infrastructures do fail,” Snider says. “It's really not a question of if, but a question of when. And so it is absolutely necessary to respond appropriately, efficiently and effectively in a timely manner when something does go wrong.”

For its part, the university upgraded from a tape backup system to an integrated disaster recovery system that relies on a variety of tools and tactics to achieve redundancy. These include data replication at the storage and host levels, clustering similar data and virtualizing servers. The university also takes frequent and scheduled point-in-time images of storage and file systems (making it faster and easier to find the latest data points after an outage) and relies on extra switches on the storage area network to ensure multiple paths for the transmission of data.
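To illustrate why scheduled point-in-time images make it faster to find the latest data after an outage, here is a minimal Python sketch. It is not the university's tooling: the function name, the six-hour schedule and the timestamps are all hypothetical, chosen only to show the lookup.

```python
from bisect import bisect_right
from datetime import datetime


def latest_snapshot_before(snapshots, outage_time):
    """Return the most recent snapshot taken at or before outage_time.

    Returns None when no snapshot predates the outage.
    """
    ordered = sorted(snapshots)
    # bisect_right gives the count of snapshots taken at or before the outage.
    idx = bisect_right(ordered, outage_time)
    return ordered[idx - 1] if idx else None


# Hypothetical schedule: one image every six hours on a single day.
snapshots = [datetime(2010, 8, 6, hour) for hour in (0, 6, 12, 18)]
outage = datetime(2010, 8, 6, 14, 30)

print(latest_snapshot_before(snapshots, outage))  # 2010-08-06 12:00:00
```

In this sketch the gap between the chosen image and the outage (two and a half hours) is the most data that could be lost, which is why more frequent images narrow the exposure.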

“We want to have all of these options available to us for disaster recovery purposes, and it's all about adding layers of protection,” explains Jacqueline Campbell, manager of the storage and backup team within UM's IT office.

That's why the university stores mission-critical servers and data in more than one place, Campbell says.

“With this strategy, if we're at a critical point – say, during student registration – and something happens with the system, we're not down for three days. We can fail those systems over very quickly and bring the data back online so those systems are available for those key resources.”

A Long Haul

Implementing a disaster recovery and continuity of operations system can be expensive, time-consuming and complex, but it has become easier in recent years through advances in virtualization, storage, networking and encryption. The IBM XIV system used at VCU, for example, goes beyond traditional RAID technologies and uses a virtual grid architecture that delivers instant redundancy and recovery and an almost unlimited ability to scale, says Gilbert McCallister, technology and vendor relations consultant for VCU Health System.

IT officials can also choose from a range of strategies to ensure data redundancy, replication and failover – from virtualizing database servers to implementing redundant switches on storage area networks to investing in a secondary data center at an offsite location. Some universities have even tried “twinning,” an arrangement between two higher education institutions in which each backs up the other's data and systems.

However, twinning can pose unforeseen risks. Bowdoin College in Brunswick, Maine, found this out when it entered into a tentative memorandum of understanding with Los Angeles–based Loyola Marymount University that called for each institution to act as the other's failover site. The two IT departments went ahead and took steps to cross-train their staff members, standardize e-mail, learn applications and phone systems and create a knowledge base in a wiki to ensure that the respective help desks could support both operations.

In the middle of that effort, the CIO at Loyola left to take another job and took important staff members with her. “What we realized is that in that type of arrangement, you're not going to be the first priority and you're too vulnerable to personnel changes and other things that are out of your control,” explains Mitchel Davis, CIO for Bowdoin.

The percentage of organizations with disaster recovery plans that have had to act on those plans on at least one occasion

Source: 2009 Disaster Recovery Research Report, Symantec

The college has since turned to a new disaster recovery strategy. The DR team increased the percentage of servers running VMware in the primary data center from 60 percent to 96 percent and outsourced the storage of backup tapes to a company that specializes in the service. It is also now selecting an offsite location for a new continuity of operations/disaster recovery solution.

For all the difficulty, VCU's Johnson says disaster recovery is simply a cost of doing business today, an unintended consequence of our reliance on IT. He says IT departments at universities have done such an excellent job building up technology infrastructure over the past decade that today the risk of losing a data network is far greater than in the past.

“You've got to be ready for that risk and that unforeseen disaster, because no one can survive having their data and systems out of commission for days or even hours,” Johnson says. “The potential consequences are just too immediate and too great.”

Mastering Disaster

Implementing a robust disaster recovery and continuity of operations system is not an easy or brief endeavor. Colleges and universities that have been through the process say the following five steps can smooth the way.

  1. Engage: All stakeholders need to understand why the institution is investing in a disaster recovery and continuity of operations solution. One way to start is by creating a formal disaster recovery team to focus on the issue.
  2. Calculate: It's important to convey in conversations with the institution's top officials what a disaster recovery system will cost. They need to understand not just the initial capital outlay, but what it will cost to maintain the system.
  3. Plan: It's not enough to virtualize servers or build a second data center. Colleges and universities need to assess their infrastructure and understand how everything works together so they can plan and prioritize redundancy levels and response times for various systems.
  4. Test: Don't wait for a disaster to see if a recovery and continuity solution holds up under the pressure. Test your system to discover any gaps in coverage or documentation and to make sure that everyone understands their roles and responsibilities in an emergency.
  5. Monitor: Implement a change control process to monitor and identify any changes that might affect the ability to perform disaster recovery. Review and update the overall plan annually or whenever a significant change occurs.
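The planning step above, prioritizing response times for various systems, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a method described by any of the institutions quoted: the system names, the `rto_minutes` field (a target for the longest tolerable downtime) and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    rto_minutes: int  # hypothetical target: longest tolerable downtime


def recovery_order(systems):
    """Order systems so the least downtime-tolerant are restored first.

    Ties are broken alphabetically by name to keep the order stable.
    """
    return sorted(systems, key=lambda s: (s.rto_minutes, s.name))


# Hypothetical inventory; a real plan would come from an infrastructure assessment.
inventory = [
    System("campus e-mail", 240),
    System("patient records", 5),
    System("student registration", 60),
]

for system in recovery_order(inventory):
    print(f"{system.name}: restore within {system.rto_minutes} minutes")
```

Writing the targets down this way also gives the testing step something concrete to measure against.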
David Stover