Both Katy Independent School District in Texas and Beaverton School District in Oregon employ experienced and talented IT staffs. Both staffs take data center uptime seriously. And both districts have proactively invested in disaster recovery solutions for years.
But when calamity struck, one experienced relatively few problems, while IT officials in the other were blindsided by a deluge of unforeseen events, some of which fell outside their realm of responsibility.
Layered DR Solutions Create Resiliency
When Hurricane Harvey hit Texas in August, 6 inches of water flooded the building that houses Katy Independent School District’s data center.
“We lost power to the whole facility,” recalls Joe Christoffersen, technical operations director for the district. “The data center went dark.”
Katy ISD, however, was never in real danger of losing access to its data. The district had invested in several layers of DR solutions. In fact, the floor of the data center was raised 18 inches above ground, meaning that the floodwaters didn’t approach the level of server racks and other infrastructure equipment.
Water did seep into the conduits for one of the data center’s uninterruptible power supplies, but the data center had a second UPS that remained unaffected, allowing the district to get resources back up and running quickly. Even if the district had lost both UPS systems, much of its critical data was backed up at a nearby colocation center. And if the storm had also knocked out that facility, the district could restore its data for key systems from the cloud within a day or two.
“You have that peace of mind,” Christoffersen says. “But there’s still an urgency to restore those systems if something does happen.”
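One way to picture the layering described above is as an ordered list of fallbacks, where recovery uses the first layer that is still standing. The following is a minimal Python sketch under stated assumptions, not a description of Katy ISD's actual tooling: the `RecoveryTier` class, the `first_available_tier` function and the restore-time figures are all hypothetical, apart from the cloud tier's rough "day or two" taken from the account above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryTier:
    """One layer of a hypothetical DR plan (names and figures are illustrative)."""
    name: str
    available: bool
    est_restore_hours: float


def first_available_tier(tiers: Sequence[RecoveryTier]) -> Optional[RecoveryTier]:
    """Return the first tier, in priority order, that is still usable."""
    for tier in tiers:
        if tier.available:
            return tier
    return None  # every layer failed: no recovery path remains


# Illustrative scenario: one UPS is flooded, the second is still working.
tiers = [
    RecoveryTier("primary UPS", available=False, est_restore_hours=0.0),
    RecoveryTier("secondary UPS", available=True, est_restore_hours=1.0),
    RecoveryTier("colocation backup", available=True, est_restore_hours=8.0),
    RecoveryTier("cloud restore", available=True, est_restore_hours=48.0),
]

chosen = first_available_tier(tiers)
print(chosen.name if chosen else "no recovery path")
```

The point of ordering the tiers is that each additional layer only matters when every layer ahead of it has failed, which is why the slower, more distant options sit at the end of the list.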
Small Backup Errors Cost Districts Time, Money
Unlike Katy ISD, Beaverton School District wasn’t hit with the costliest tropical storm to make landfall in history, but it battled a far more insidious enemy, one that produced a crisis district officials now simply call “the event.”
Just before the start of a recent school year, IT staffers came to work one day and discovered that their entire data center was down, for no reason that they could immediately identify. When they tried to reboot, about half of the hard drives failed, including those dedicated to the district’s student information system, email, finance system and a number of other applications.
Eventually, the IT shop brought most of its systems back, but one major problem remained: An employee had made an erroneous configuration change a year earlier, and as a result, the district had no full backup of its human resources and financial data.
“We had 17 days to pay 5,000 people,” recalls Beaverton CIO Steven Langford. “We didn’t know how much money they made. We didn’t know their withholdings for taxes. We didn’t know their vacation sick leave balances. We didn’t even know who had stopped working for us — and this was the day before school started, with 40,000 students coming back.”
IT officials eventually found a company that was able to recover payroll data from a heat-damaged backup tape. The district still had to build back reports from scratch, though, and the impact of the incident continued to be felt throughout the school year. Once things calmed down, officials took a step back to examine why their DR plans had failed.
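A gap like the one described above, a backup that silently stopped running, is the kind of problem a routine freshness check can surface. The following Python sketch is illustrative only and assumes the district can list, for each system, when its last verified full backup finished. The `find_stale_backups` function, the system names and the one-day threshold are hypothetical and not drawn from Beaverton's environment.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def find_stale_backups(
    last_success: Dict[str, Optional[datetime]],
    max_age: timedelta,
    now: datetime,
) -> List[str]:
    """Return systems whose last good full backup is missing or older than max_age."""
    stale = []
    for system, finished_at in sorted(last_success.items()):
        if finished_at is None or now - finished_at > max_age:
            stale.append(system)
    return stale


# Hypothetical inventory: system name -> time of last verified full backup.
now = datetime(2024, 8, 26, 6, 0)
inventory = {
    "student-information-system": now - timedelta(hours=20),
    "email": now - timedelta(days=2),
    "payroll": None,  # no full backup on record
}

for system in find_stale_backups(inventory, max_age=timedelta(days=1), now=now):
    print(f"ALERT: no recent full backup for {system}")
```

A check like this only confirms that a backup job reported success recently. It does not prove the data can be restored, so periodic test restores would still be needed.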
DR Needs a Holistic Approach
While these catastrophes typically result in hard-won lessons that help districts better prepare for the next disaster, careful planning and testing can help schools skip these trials by fire and give officials confidence that DR solutions will perform well when called upon.
Phil Goodwin, research director for data protection, availability and recovery at IDC, says that DR plans must be as comprehensive as possible if they’re to survive a real-life disaster scenario.
“Disaster recovery is the classic triumvirate of people, process and technology,” Goodwin says. “We technologists too often focus on the technology — how we’re going to migrate data and bring up workloads. But when you get into a disaster, the people who would manage those systems may not be available. The employees are as affected by the disaster as the school district is.
“Disaster recovery is really not an IT activity — it’s an organizational activity. It really has to take into consideration the entire organization, not just the IT group.”
Infrastructure Pays Off in Peace of Mind
Often, the smallest factors can make the difference between a school district that emerges from a disaster unscathed and one that experiences significant system failures. Too frequently, IT managers don’t know how their DR and continuity of operations plans will perform until they’re tested in real-life situations where sensitive and valuable data is in peril.
Until a couple of years ago, Maine Township High School District 207 in Illinois routinely faced resource downtime as a result of ransomware and distributed denial of service cyberattacks. To resolve the issue, the district installed private fiber and invested in SimpliVity infrastructure that allows affected resources to automatically shift their workloads to other, unaffected resources throughout the district.
“Now, if we get any kind of CryptoLocker or other malware that disrupts one of our servers, we’re able to restore resources across our WAN in seconds, rather than hours,” says Jonathan Urbanski, director of technology systems for the district.
In the wake of the hurricane, Katy school officials had plenty to worry about, even without a significant IT outage. In fact, the availability of IT systems and staffers was crucial to helping the district recover. During the storm itself, the schools turned high school and junior high buildings into temporary shelters, and officials relied on the continued availability of the district’s phone system to allow people to call in with donations and to check on friends and family. One school had to be relocated due to flooding, requiring extensive IT work.
“We used the whole IT team for the relocation,” Christoffersen says. “They were building a new network and setting up computers, printers and phones. If they were rebuilding servers and network infrastructure in the data center, that would have been an additional burden.”
Good DR Supports Continuity of Operations
Many of the best DR solutions are those that not only allow organizations to recover data after a large disaster, but also support continuity of operations during more routine outages.
While Beaverton’s schools had a number of DR tools in place, a cascade of problems set the district on its heels. The fire suppression system protecting the data center had been recalled, but district IT officials didn’t know that. The system malfunctioned, setting off gas canisters throughout the data center and causing vibrations throughout the room.
“For hard drives spinning at 15,000 rpm, it was like a sonic boom,” says Langford. But alarms weren’t sent to either district IT officials or to the local fire department. Then, the failure to back up payroll data for an entire year went unnoticed until the IT team tried to recover it.
“We learned that we really didn’t have systems set up — both in the technology vein, but also the human systems around our data center — to be successful,” Langford says.
“We started looking at the systems that we had protecting our data and data center, and we realized we needed a lot of help,” he says. “We didn’t have the right governance. We didn’t have the right planning.”
Beaverton schools partnered with CDW•G and its partner Wipfli, whose solution architects and consultants brought a continuity of operations plan to the district. “The consultants helped us approach it from a business point of view,” Langford says. “We have a business to run. We have assets to protect, and we have requirements we must meet, so we built a plan around that. In the education world, people were not used to thinking like that.”
That process resulted in a robust continuity of operations plan that goes beyond technology, accounting for the people- and process-based factors that can make the difference when disaster strikes. “If an earthquake happens, or we have another natural disaster or data center crisis, we can execute the plan,” Langford says.
Now, there is more knowledge across the organization about what role each department is expected to play in protecting important data.
“Hopefully, we’re never going to need it,” Langford adds. “But the planning work was very valuable in increasing staff confidence that we’re going to be ready. If this happens again, I know I’m going to have my systems up, I know how fast we can do it, and I know what data we have that we need to recover.”