After a decade of essentially being ignored, Lock Haven University's disaster recovery plan had become woefully inadequate.
"We'd never really had a problem," says Bo Miller, telecommunications manager at the Pennsylvania university.
In hindsight, Lock Haven's good fortune was also a curse, Miller says: It became increasingly difficult to get anyone outside of the IT department to make disaster recovery and continuity of operations planning (COOP) a priority.
"People think, 'Why are we spending a lot of money on this? It never happens,' " he recalls. "Well, it did."
In July 2011, the primary cooling system in Lock Haven's data center gave out. "We had a severe data loss," Miller says.
The IT team managed to recover 90 percent of the lost data from backup tapes, but that still left a 10 percent loss of original content for distance-learning courses — content that the faculty had to re-create. From that perspective, Miller says, "90 percent's not good enough."
When a heat wave crippled Lock Haven's data center, the campus administrators learned a lesson that more and more colleges and universities are figuring out: Disaster recovery isn't just about hurricanes, fires or shootings. DR ultimately is more about resilience than it is about disaster, regardless of what caused the problem.
Institutions strive for the elusive "five nines" (uptime of 99.999 percent) when it comes to individual hardware or software components, whereas traditional disaster recovery centers on whole-site failure. Today's DR plans merge these two ways of thinking, says Rachel Dines, a senior analyst at Forrester Research. "It's more of an evolution than a revolution," she says of current views about disaster recovery.
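To put "five nines" in concrete terms, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the 365-day year are assumptions made here, not figures from Forrester or any of the campuses.

```python
# Illustrative arithmetic only; assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.999 percent uptime leaves room for a little more than five minutes of downtime per year, which helps explain why the target is so elusive.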
By focusing more on recovery and less on disaster, campuses have revamped their approach to DR to ensure continuity of operations no matter the type of crisis. Here's a look at how the IT teams at Lock Haven, the University of Colorado at Colorado Springs and the University of Texas at Austin now tackle COOP — without fixating on the disasters that could befall their campuses.
Ready for Anything: Lock Haven University
"The bottom line is that you really need to explore what your vulnerabilities are, and I'm not talking end-of-the-world scenarios here," Miller explains. "If you think broadly about what causes disasters, then you're missing the most obvious things."
80%: The percentage of organizations across all industries that have disaster recovery plans. Of the 20% that don't, most are actively creating them.
SOURCE: Forrester Research
Lock Haven, for instance, had considered what would happen if a plane crashed into a building but not what it would do if the air conditioner died in 105-degree heat. "We don't normally plan for that kind of thing in central Pennsylvania," Miller says. A good DR plan should take into account the unexpected but mundane occurrences that can bring a data center to a screeching halt.
Now the Lock Haven IT team plans for whatever may come their way by repeatedly asking two simple questions:
- How capable would the IT staff be at restarting operations if key systems were to go down?
- If the answer is "not very," how willing is the IT team to expose the campus to that kind of risk?
Based on their initial answers, the IT team prioritized the systems by risk and then analyzed what types of infrastructure investments would be necessary to make the risk levels tolerable. Lock Haven worked with a team of CDW•G engineers to create failover and infrastructure work-arounds for system failures that posed the greatest risks.
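One way to picture that kind of prioritization is a simple scoring exercise. The sketch below is hypothetical and is not Lock Haven's actual method: the system names, the 1-to-5 scores, the multiplication rule and the tolerance threshold are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    recovery_difficulty: int  # hypothetical scale: 1 = easy to restart, 5 = very hard
    campus_impact: int        # hypothetical scale: 1 = minor, 5 = campus-wide outage

    @property
    def risk(self) -> int:
        # Assumed scoring rule: harder recovery and wider impact both raise risk.
        return self.recovery_difficulty * self.campus_impact


def prioritize(systems, tolerance):
    """Return the systems whose risk exceeds the tolerance, highest risk first."""
    over_tolerance = [s for s in systems if s.risk > tolerance]
    return sorted(over_tolerance, key=lambda s: s.risk, reverse=True)


# Made-up inventory and scores, for illustration only.
systems = [
    System("campus print server", 2, 1),
    System("distance-learning content storage", 4, 4),
    System("data center cooling", 5, 5),
]

for s in prioritize(systems, tolerance=8):
    print(f"{s.name}: risk {s.risk}")
```

The first score stands in for the question of how capable the staff would be at restarting a system, and the tolerance stands in for how much risk the team is willing to accept. Anything above the threshold becomes a candidate for failover or infrastructure investment.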
The IT team reconstructed the school's data center, deploying redundant APC cooling and power units and virtualizing servers and desktops using VMware. Lock Haven also installed two NetApp storage arrays that replicate data and reduce the campus's recovery time objective. Miller says the IT team has now begun designing a new fiber-optic network and plans to construct a fully redundant data center across campus from the existing one.
Miller's team had never worked with blade servers, virtual switching or some of the other technologies rolled out during the upgrade. "In many ways, we were headed into uncharted territory," he says.
In hindsight, Miller acknowledges that some good came from the situation. "It really did put the exclamation point on why we needed to go forward with disaster recovery planning," he says. "It became an important issue to everyone, not just those of us who stay up at night thinking, 'What if?' "
Beyond Campus: University of Colorado at Colorado Springs
The University of Colorado at Colorado Springs has fine-tuned its disaster recovery plan multiple times since the Sept. 11 terrorist attacks in 2001. But in all that time, school officials never anticipated the situation they faced this summer. When making adjustments to the campus DR plan, the administration had focused its attention on emergencies that might occur on campus, keeping the school community safe, and continuing classes and operations.
"I don't think we ever thought we'd be in a position where we needed to support the city," says Jerry Wilson, the university's chief technology officer.
Then came the wildfires this summer. Fire raged through the Colorado mountains, eventually sweeping down into the neighborhoods of Colorado Springs. Because school was out for the summer, some residents fleeing their homes were allowed to move into university housing.
The Air Force Academy also evacuated faculty and students to the campus, and offered classes in UCCS buildings. As a result, Wilson's team had to give the Air Force faculty a crash course on how to use UCCS systems.
The city also held public meetings on campus because Mayor Steve Bach needed a large space equipped with a sound system, maps and other critical equipment. When the mayor contacted the university, Wilson asked him when he needed the system to go live; the mayor replied, "In, like, two hours."
Meanwhile, the media also set up an incident command center at UCCS, and they sought wireless access. Wilson and the IT staff reconfigured networks to provide the bandwidth.
"There were a lot of those kinds of things, which is interesting because we had never really planned for that kind of stuff," Wilson says. "Yet, everybody fell into place and took on their roles. We stepped up to the plate for the city, and things went really well."
The reason the university was so effective in a crisis that it had never anticipated, Wilson says, is that its disaster recovery plan had a foundation — a tech-saturated emergency operations center, a well-defined and updated DR plan, and a prepared response team ready to tackle situations as they arose. The team had regularly put its plan and tools through their paces by simulating disasters in the UCCS EOC, a designated space on campus that they know to report to in an emergency. It has all the technology and equipment they might need during a disaster: desks, mobile and landline phones, computers, projectors, TVs and so on.
"If an issue comes up, you can deal with it. But if you don't have some structure in place, it's going to be tough," Wilson says.
Thinking Long Term: University of Texas at Austin
The University of Texas at Austin has faced its fair share of tragedies and close calls, most recently in September 2011, when wildfires burned northwest and southeast of the city, destroying more than 1,600 homes. Austin sits in the heart of the state that Kiplinger identified this summer as the third most at risk of disaster in the country.
Regardless of the situation, UT's IT Services Organization has a disaster recovery plan that focuses on the results of disasters rather than their causes, explains Michael Cunningham, IT services director at UT Austin's University Data Centers. For instance, instead of planning for tornadoes, which are relatively frequent in central Texas, the plan looks at the potential results of any number of weather-related disasters, such as loss of electricity and flooding.
"So you build your plan around, 'What if I lose that?' " Cunningham explains. "Then if it's a hurricane, tornado or a big truck that plows into a pole, you're ready for the loss of electricity." Key portions of infrastructure and services at UT are redundant, so if one system goes down, many components can fail over to their redundant side or to a second system at a backup site two miles away from its primary data center.
The organization also responds to crises by activating its incident command and crisis-management response team, which includes each of the eight IT directors and the CIO at the university. At the start of any potential disaster, the team gets on a conference call to plot the direction it plans to take. If phones are down, the team has a list of alternative communication channels it can use to stay in touch.
The team regularly tests and updates its plan by holding mock drills and tabletop exercises and, in some cases, by failing over systems to make sure they work and don't affect other systems. By running the exercises cross-functionally, team members can spot gaps that one department might not anticipate, Cunningham says.
It's important to make time for planning on a regular basis, Cunningham adds: "The easiest mistake is to assume, 'Oh, this will never happen to me.' "