At George Mason University, careful planning, phased deployment and plenty of testing help stave off disaster.
The disaster recovery program at George Mason University, in Fairfax, Va., has evolved dramatically over the past 25 years. Originally, James Madison University, 100 miles away in Harrisonburg, was our sister school for disaster recovery. The plan was that we would support each other in the event of a disaster. But logistically, the plan wasn't very realistic: Our system configurations were different, and neither institution had a lot of extra capacity.
Twenty years ago, the buzzword in DR was rapid replacement. In the event of a disaster, the goal was for equipment manufacturers to deliver new servers and computing equipment within 48 hours. The problem with this plan was that auditors insisted on a test to prove that the new equipment worked, but it is very difficult to simulate a disaster.
The next evolution of DR support was the megacenter, where service providers hosted a redundant version of an organization's system for a fee. Yet, this was expensive, and keeping the service provider's system updated remotely proved challenging.
Today, with the rise of virtual servers and more affordable storage area networks, it's possible to build and manage a university DR site at a remote campus. Mason's site is housed on its Prince William campus in Manassas. Based on our experience managing this program, here are the top five best practices we follow:
1. Include disaster recovery funding in new IT project budgets.
At Mason, every new IT project includes DR from the start. Too often, DR costs are ignored or overlooked, and new systems are deployed without an adequate DR architecture. If an auditor asks what happens to a new system in the event of a disaster, you have to be able to answer, even if a retrofit is still possible.
2. Deploy in phases, but plan to replace the technical infrastructure periodically.
Developing a DR infrastructure for a medium to large organization is complex because the many interdependencies require significant staff time and collaboration. A phased deployment can simplify the process.
Start with a basic network, storage and server architecture that is scalable, and then build on it. Target mission-critical systems first, and end each phase with rigorous testing. As the deployment progresses, testing should include the entire DR infrastructure.
Also, be prepared to upgrade, replace and even re-engineer some or most of the DR technical infrastructure. The best approach is to replace equipment incrementally. Our basic equipment lifecycle is five years, and we stagger the work: one year we upgrade the switches, and in later years the security equipment, the servers and the storage equipment. Factor in the cost of replacement gear for the DR site when planning the normal cycle for the production system.
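As a rough illustration of incremental replacement, the sketch below staggers the four equipment categories one year apart on a five-year cycle. The five-year lifecycle and the category names come from the text; the one-year spacing, the function name `replacement_years` and the relative year numbering are assumptions made for the example, not Mason's actual schedule.

```python
from typing import Dict, List

LIFECYCLE_YEARS = 5  # from the article: basic equipment lifecycle is five years

# Categories named in the article; the order and one-year spacing are assumed.
CATEGORIES = ["switches", "security equipment", "servers", "storage equipment"]


def replacement_years(first_year: int, horizon: int) -> Dict[str, List[int]]:
    """Return the years in which each category is due for replacement.

    Categories start one year apart and repeat every LIFECYCLE_YEARS,
    covering the years first_year .. first_year + horizon - 1.
    """
    schedule: Dict[str, List[int]] = {}
    for offset, category in enumerate(CATEGORIES):
        start = first_year + offset
        schedule[category] = list(
            range(start, first_year + horizon, LIFECYCLE_YEARS)
        )
    return schedule


if __name__ == "__main__":
    # Plan ten years ahead, counting the first planning year as year 1.
    for category, years in replacement_years(1, 10).items():
        print(f"{category}: years {years}")
```

Spreading the categories out this way keeps the replacement cost roughly level from year to year, which makes it easier to budget DR gear alongside the production cycle.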
3. Choose your configuration carefully, with an eye to needs and costs.
An active-passive system design is less complicated and more cost-effective than an active-active configuration, but the effect on fall-back time is significant.
In an active-passive system, data from the production system replicates to the DR site with servers standing by to be activated during an emergency. Active-active is a more complex clustered environment that requires more computer hardware, additional software licensing and significant systems administration. The advantage of an active-active architecture is that the failover is automatic, and the fall-back time after the DR event has subsided is minimal compared with the active-passive architecture.
With active-passive, failover can take a few hours, and falling back to the primary production environment requires a significant amount of work because everything must be copied back to the production systems. For example, if a disaster occurs and the organization begins running on the DR site, all the work that takes place – from editing documents to filling out time sheets – ultimately must be copied back to the primary site.
With Mason's active-passive configuration, fall-back for our systems takes about 48 hours. In the clustered active-active scenario, fall-back time would be less, but the downside is cost, complexity and maintenance support.
Mason uses active-passive primarily because active-active is too expensive. The reality is that most universities cannot afford active-active, nor do they require that level of protection. Organizations that need to have their production systems up and running right away, such as banks and insurance companies, can usually afford an active-active configuration.
4. Test your systems frequently and rigorously.
It is better to find a configuration problem while testing than in the middle of a crisis. Testing also keeps everyone current on DR processes and procedures so that when a real emergency occurs, the technical staff knows what to do.
Typically, we start with basic testing when we replace a server or any networking gear. We also run configuration tests and test the operating system and application components. It also makes sense to test a new service on its own and then test the overall DR system. Use a copy of the production databases so the tests can be run without accidentally destroying any production data. As a rule, we operate in an active-passive configuration, but directory services such as the Lightweight Directory Access Protocol (LDAP) are configured as multimasters and are deployed in an active-active configuration.
Careful planning is a requirement, but don't go overboard with detailed DR documentation because it becomes outdated quickly. If you experience a real emergency, your staff won't have time to read through it anyway.
5. Use risk mitigation techniques to guard against preventable situations.
Preventive practices and technologies can reduce the risk of having to go live with your DR plan. These include 24x7 maintenance support for production systems; RAID disk subsystems; system backups (onsite and offsite); redundant servers (managed by load balancers); uninterruptible power supplies and emergency generators; and a data center smoke detection and fire suppression system. These are all standard IT measures that should be included in every data center environment.
These risk mitigation techniques won't protect the campus in the event of a natural disaster such as flooding, but they can increase the odds that the university's systems will continue to operate under normal conditions, and they offer a first line of defense should disaster occur.
Disaster Recovery Failover Process
George Mason University shares a step-by-step summary of its failover plan:
- Shut down local databases (if possible).
- Disable data replication from the storage area network (SAN) to the remote DR SAN.
- Initiate VMware Site Recovery via Site Recovery Manager (SRM).
- After the SRM failover is finished, initiate network routing changes.
- Present DR SAN storage to the remote DR servers (SRM automatically provides this function for VMware servers).
- Promote DR Active Directory domain controller.
- Check status of DR LDAP, DR Kerberos and DR Active Directory.
- Run scripts to switch the domain name system (DNS) for servers or update the DNS manually if necessary. This includes the emergency static web page.
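The checklist above lends itself to an ordered runbook. The sketch below is one possible way to encode it, not Mason's actual tooling: the `Step` class, `run_failover` and the `placeholder` action are all hypothetical, and the placeholder performs no real work. Treating only the database shutdown as optional reflects the "(if possible)" note in the first step; halting at any other failed step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success
    optional: bool = False      # optional steps may fail without halting


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first required step that fails."""
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        try:
            ok = step.action()
        except Exception as exc:
            ok = False
            log.append(f"{number}. {step.name}: error ({exc})")
        else:
            log.append(f"{number}. {step.name}: {'ok' if ok else 'failed'}")
        if not ok and not step.optional:
            log.append("halted: required step did not complete")
            break
    return log


def placeholder() -> bool:
    """Stand-in for a real operation; always reports success."""
    return True


# Step names follow the checklist; every action here is a placeholder.
FAILOVER_STEPS = [
    Step("Shut down local databases", placeholder, optional=True),
    Step("Disable SAN replication to the DR SAN", placeholder),
    Step("Initiate recovery via Site Recovery Manager", placeholder),
    Step("Apply network routing changes", placeholder),
    Step("Present DR SAN storage to DR servers", placeholder),
    Step("Promote DR Active Directory domain controller", placeholder),
    Step("Check DR LDAP, Kerberos and Active Directory", placeholder),
    Step("Switch DNS, including the emergency static web page", placeholder),
]

if __name__ == "__main__":
    for line in run_failover(FAILOVER_STEPS):
        print(line)
```

Keeping the order in a single list makes it easy to rehearse the same sequence during DR tests, with each placeholder swapped for whatever script or manual confirmation the site actually uses.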