Aug 20 2007

Backup Booster

De-duplication squeezes the redundancy out of data.

As students and faculty grow more technically savvy and data volumes continue to expand, universities and colleges find it increasingly difficult to conduct daily backups in the backup window allotted. Even worse, the explosion of data means more complex tape library and virtual tape library (VTL) systems, making data recovery a long and often laborious process. And that’s if they’re able to recover the data at all.

To improve backups, higher education institutions are turning to a new technology called data de-duplication. Through a series of complex algorithms, data de-duplication identifies redundant data and prevents it from being backed up more than once. This reduces the amount of data that gets backed up, shortens backup and restore times, and requires less storage capacity.
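To make the idea concrete, here is a minimal Python sketch of one common approach: split the data into chunks, fingerprint each chunk with a hash, and store a chunk only the first time its fingerprint is seen. It is an illustration under stated assumptions, not any vendor's algorithm. The fixed 4KB chunk size, the SHA-256 fingerprint, and the names `backup`, `restore` and `store` are all hypothetical choices made for this example.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def backup(data: bytes, store: Dict[bytes, bytes]) -> List[bytes]:
    """Store each unique chunk once and return the list of fingerprints
    (the "recipe") needed to rebuild this backup."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).digest()
        # Only chunks that have not been seen before consume new storage.
        if fingerprint not in store:
            store[fingerprint] = chunk
        recipe.append(fingerprint)
    return recipe


def restore(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data by looking up each fingerprint in order."""
    return b"".join(store[fp] for fp in recipe)


# Two "nightly" backups: the second repeats the first and adds one new chunk.
store: Dict[bytes, bytes] = {}
monday = b"A" * 8192 + b"B" * 4096
tuesday = monday + b"C" * 4096

monday_recipe = backup(monday, store)
tuesday_recipe = backup(tuesday, store)

print("chunks referenced:", len(monday_recipe) + len(tuesday_recipe))
print("chunks stored:", len(store))
```

Both backups can be restored in full from the shared store, even though the repeated chunks were written only once.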

De-duplication may eventually find its way into primary storage, reducing the capacity a storage network requires to hold all of an institution’s data. But for now it is used primarily for backups, and — taken to its next logical step — disaster recovery.

Data de-duplication uses disk as the backup medium, which is faster and easier to restore from than tape. De-duplication is rapidly becoming an integral part of VTLs, which work like tape backups but move data to disk instead.

“In the context of backup data, the same data keeps getting stored and backed up over and over again, consuming more storage space that impacts cost and creates a chain of inefficiency,” says Tony Asaro, senior analyst for the Enterprise Strategy Group analyst firm in Boston. Asaro says he sees de-duplication moving to other areas of storage besides backup.

Pomona College in Claremont, Calif., and Trinity Western University in Langley, British Columbia, are among the institutions using de-duplication. Trinity Western employs Data Domain VTLs at separate sites for disaster recovery. Pomona uses EMC software to simplify local and offsite backups in its data center on campus.

Pomona Director of Network Services Anthony Nguyen estimates that, before using de-duplication, it took nearly 12 hours to do an incremental backup and 24 hours to conduct a full weekly backup. Now, with redundant data eliminated, he says he can do two full backups per day in less than four hours for each cycle.

Close the Window

“It got to the point where the backup window was growing so much that it was bleeding over into the business day,” Nguyen says. “And if a backup failed, we often wouldn’t find out until the next day, requiring us to redo the backup anyway. It severely affected performance.”

The multiple backups per day now let Nguyen take snapshots of the storage network at 12-hour increments, enabling his staff to recover files to a set time, almost to the hour requested.

“Every time we look for something, it’s there,” Nguyen said. “Before, it was hit or miss. We couldn’t guarantee the file would be recoverable.”

Nguyen eventually plans to use de-duplication to replicate data to a second site for disaster recovery.

Trinity Western University, a liberal arts college near Vancouver, British Columbia, uses a Data Domain appliance and its replicator software to facilitate disaster recovery. Trinity replicates data from its IBM storage area network to an offsite disaster recovery site across town, says Ryan Hanawalt, former director of IT. The nightly backup cycle is now seven hours shorter and sends a fraction of the raw data over the wide area network.

Hanawalt says de-duplication lets Trinity Western get the benefits of doing a full backup each night without sending the entire 15 terabytes of data in the SAN over the network. Incremental backups, while able to reduce the amount of data backed up each night, do not provide the same full level of protection. And as in the case of Pomona College, recoveries are much easier and much faster. According to Hanawalt, it now takes him a quarter of the time to recover a file, folder or mailbox. Replicating data from one data center to another gives the institution copies of all data at both sites, so either can be used as the main center if the other one crashes.

In addition to simplifying the backup cycle, making it more reliable and speeding time to recovery, data de-duplication technology can help reduce storage expenditures. Nguyen estimates that he’s able to save nearly $300 per week on tape costs alone. Hanawalt says his backup administrator saves 10 hours per week by not having to load, manage and check tapes. He now spends that time on more proactive projects.

Real Deal Compression Rates

Data de-duplication vendors claim they can drastically compress data, but those claims usually come with a disclaimer that compression rates vary with the type of data being compressed. Trinity Western University determined the compression ratios it achieved with a Data Domain DD460 appliance by recording the total size of the storage network and the amount of data backed up each night. The table below shows that on June 10, the storage network held just over 14 terabytes but took up only 1.6TB of backup capacity. And during that night’s backup cycle, only 18.4GB of the day’s 195.3GB had to be written to backup storage. Compression ratios ranged from 8.46:1 to 8.59:1 for cumulative backups and went as high as 14.1:1 for daily backups.

| Date | Cumulative Volume (GB) | Cumulative Backup Volume (GB) | Cumulative Compression Ratio | Daily Volume (GB) | Daily Backup Volume (GB) | Daily Compression Ratio |
|---|---|---|---|---|---|---|
| June 10 | 14034.7 | 1597.5 | 8.46 | 195.3 | 18.4 | 10.6 |
| June 11 | 14472.5 | 1628.6 | 8.56 | 437.8 | 31.1 | 14.1 |
| June 12 | 14680.1 | 1646.9 | 8.59 | 207.6 | 18.3 | 11.3 |
| June 13 | 14798.5 | 1663.1 | 8.57 | 118.4 | 16.2 | 7.3 |
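The daily ratios in the table can be checked by dividing each day's volume by the amount actually written to backup storage. The short Python sketch below does that for the two daily columns only. The function name `compression_ratio` is an illustrative choice, and the figures are copied from the table.

```python
# (date, daily volume in GB, daily backup volume in GB), copied from the table
daily = [
    ("June 10", 195.3, 18.4),
    ("June 11", 437.8, 31.1),
    ("June 12", 207.6, 18.3),
    ("June 13", 118.4, 16.2),
]


def compression_ratio(raw_gb: float, stored_gb: float) -> float:
    """Ratio of data presented for backup to data actually stored."""
    return raw_gb / stored_gb


# Round to one decimal place, matching the precision used in the table.
ratios = {
    date: round(compression_ratio(raw, stored), 1)
    for date, raw, stored in daily
}

for date, ratio in ratios.items():
    print(f"{date}: {ratio}:1")
```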

Data De-Duplication in Disaster Recovery

Trinity Western University is setting up a remote disaster recovery facility near Ottawa that uses data de-duplication technology to eliminate the need to recover data from tape. The facility, more than 2,600 miles from the main data center near Vancouver, would complement Trinity Western’s existing local offsite facility and allow quick, reliable data restores from disk after a major regional disaster such as an earthquake or wide-scale flooding.

A Data Domain DD460 Restorer backup appliance will replicate data from the offsite data center in Vancouver to the facility in Ottawa, using Data Domain’s global compression technology to identify duplicate data and avoid sending it. Because the volume of data being migrated is significantly reduced, Trinity Western’s entire storage network (15TB of data) will be recoverable online, saving the school bandwidth and infrastructure expenditures.
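The bandwidth saving comes from sending only data the remote site does not already hold. The Python sketch below illustrates that general idea and is not Data Domain's actual replication protocol. The remote site is modeled as an in-memory dictionary keyed by SHA-256 fingerprint, and `fingerprint`, `replicate` and `remote_store` are hypothetical names used only for this example. A real system would exchange fingerprints over the network rather than share a dictionary.

```python
import hashlib
from typing import Dict, List


def fingerprint(chunk: bytes) -> bytes:
    """Identify a chunk by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(chunk).digest()


def replicate(chunks: List[bytes], remote_store: Dict[bytes, bytes]) -> int:
    """Copy chunks to the remote store and return the number of bytes sent.

    A chunk is sent only if the remote site does not already hold a chunk
    with the same fingerprint.
    """
    bytes_sent = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in remote_store:
            remote_store[fp] = chunk
            bytes_sent += len(chunk)
    return bytes_sent


# Night 1 contains a repeated chunk; night 2 repeats night 1 plus one new chunk.
remote_store: Dict[bytes, bytes] = {}
night1 = [b"a" * 1024, b"b" * 1024, b"a" * 1024]
night2 = night1 + [b"c" * 1024]

sent1 = replicate(night1, remote_store)
sent2 = replicate(night2, remote_store)

print("night 1: sent", sent1, "of", sum(len(c) for c in night1), "bytes")
print("night 2: sent", sent2, "of", sum(len(c) for c in night2), "bytes")
```

In this toy run the second night presents a full copy of the data, yet only the one new chunk crosses the link.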

“We’re now able to achieve our disaster recovery requirements more easily,” said Ryan Hanawalt, former director of IT.