The Virginia Commonwealth University Health System runs the largest medical center in the state. With 31,000 admissions, 500,000 outpatient visits and 80,000 emergency patients every year, plus the education and research demands of a teaching hospital, the Richmond-based facility generates a lot of data.
To be exact, 110 terabytes at any given moment, spread across more than 100 business and clinical systems, and all of it needs to be backed up and immediately available in an emergency. "Our weakest point is the backup," says Greg Johnson, chief technology officer and director of engineering services for the health system. "We have to keep backups at least 90 days, and even with incremental backups that 110TB can quickly grow to three or four times that amount."
Rather than invest in and maintain half a petabyte of disk storage just for backups, Johnson is evaluating data deduplication options. He's considering upgrading to the latest version of IBM Tivoli, which includes deduplication capabilities.
"With deduplication, we're looking at getting anywhere from a five- to 10-times reduction in the amount of space we need," Johnson says. "Even a conservative five-times reduction means that 110TB shrinks down to 22TB of actual storage."
IBM Tivoli is one of a growing number of software systems that include deduplication. Today, nearly all major backup software packages offer deduplication, including Acronis Backup & Recovery 10; BakBone NetVault: Backup and NetVault: SmartDisk; Barracuda Networks Backup Service; CA ARCserve; and EMC Avamar.
Deduplication is a relatively new technology, but the principle is fairly simple. In every organization, there are pieces of data that are repeated dozens, hundreds or even thousands of times across all the files stored on a network. These could include whole files – such as a memo sent to everyone in the organization and saved to every hard drive on every computer – but much of the replication occurs within files; for instance, a signature block appended to every outgoing e-mail or a logo embedded in every PowerPoint.
[Sidebar stat: average time in which a data deduplication system pays for itself in reduced storage needs, improved IT productivity and shorter backup and restore times. Source: IDC, 2010]
Rather than save these scraps of data over and over again, deduplication scans every file for redundancy and replaces repeated data with a pointer to the original. “It's like a bouncer at a club,” says Mike Fisch, senior contributing analyst at The Clipper Group. “To get in, you have to be original.”
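The mechanism behind the bouncer analogy can be sketched in a few lines of Python. This is a toy illustration, not how Tivoli or Avamar actually implement it: it splits data into fixed-size chunks, stores each unique chunk exactly once, and keeps an ordered list of hashes that act as the "pointers to the original."

```python
import hashlib

def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks; store each unique chunk once
    and keep an ordered list of hashes that reconstructs the original."""
    store = {}     # hash -> chunk bytes (the single stored "original")
    pointers = []  # ordered hashes referencing chunks in the store
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # the "bouncer": duplicates stay out
        pointers.append(h)
    return store, pointers

def restore(store, pointers):
    """Rebuild the original data by following the pointers."""
    return b"".join(store[h] for h in pointers)

# A file with heavy repetition: the same 4KB block repeated 100 times
data = b"x" * 4096 * 100
store, pointers = dedupe(data)
print(len(pointers), "chunks referenced,", len(store), "stored")
# -> 100 chunks referenced, 1 stored
assert restore(store, pointers) == data
```

Production systems use variable-size, content-defined chunking and much stronger index structures, but the principle is the same: only one copy of each unique chunk ever reaches the disk.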
Deduplication offers a number of benefits when integrated with a backup strategy. First, it reduces the size of individual backups by eliminating redundant data. It also reduces the storage capacity required for subsequent backups because today's backup image likely shares much, if not most, of its data with yesterday's. With deduplication, disk backups can represent many times more data than the space they actually consume. "You can easily get to 20 times, or even 50 times [the amount of data]," says Fisch. "So you can back up a lot more data to disk."
For William Salsbery, manager of infrastructure operations at Temple University in Philadelphia, deduplication offers a second benefit that's just as compelling as the reduced size. Because deduplicated backups are smaller than traditional backups, they run more quickly and are easily transferred over the network or to offsite storage. That means less bandwidth and overhead consumed by backup, and less time needed for recovery.
Salsbery says the campus has two data centers separated by three or four blocks. “The systems in the primary data center get backed up to the servers in the secondary data center, and the ones in the secondary center get backed up to the primary center.”
Without dedupe, backing up across the network like that would be not only time-consuming but also a huge drain on bandwidth and server resources. Smaller backups also help Salsbery meet the university's requirement to maintain 30 days of backup on disk. With full backups averaging about 32TB, storing 30 days' worth would require nearly a petabyte of storage. Using deduplication features in EMC Avamar, all 30 daily backups consume less than 22TB.
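The arithmetic behind Temple's numbers is worth making explicit. Using the figures from the article (32TB full backups, 30 days of retention, under 22TB actually consumed):

```python
full_backup_tb = 32   # average size of one full backup (from the article)
days_retained = 30    # university's on-disk retention requirement

raw_tb = full_backup_tb * days_retained  # storage without dedupe
print(raw_tb, "TB raw")                  # -> 960 TB raw, "nearly a petabyte"

deduped_tb = 22                          # actual consumption reported
print(round(raw_tb / deduped_tb, 1), "x reduction")  # -> 43.6 x reduction
```

A 40-plus-times effective reduction is well within the "20 times, or even 50 times" range Fisch describes for repeated full backups of largely unchanged data.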
Deduplication is catching on for good reason – it saves time, money and hassle, things that overburdened higher education IT workers can greatly appreciate. “In this economy, you can't just keep throwing money at disk space,” Virginia Commonwealth's Johnson says. “If you're going to get the most out of it, then deduplication becomes absolutely critical.”
Make the Most of Data Deduplication
1. Keep backups longer: The more backups you have, the more likely you are to find the redundancies that make deduplication work. Subsequent backups will get smaller and smaller – incidentally freeing up the space you need to keep more backups.
2. Know your data: Video, photographs, scanned documents and audio tend not to yield much gain. If much of your backup comprises these file types, consider bypassing deduplication to save overhead.
3. Don't encrypt or compress before dedupe: Encryption and compression eliminate many of the patterns that deduplication looks for. Apply them after the data has been deduplicated.
4. Dedupe as widely as possible: The more data deduplication has to work with, the more opportunities for finding redundant data, so include as many computers, servers and virtual devices as possible in your backups.
5. Let the stats worry about themselves: While it is satisfying to see you are saving 80 percent, 90 percent or even more of your disk space by using deduplication, don't chase a better ratio. Configure your software in the way that works best for you and let the statistics fall where they may.
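Tip 3 can be demonstrated directly. In this sketch (a simplified model, using fixed-size chunking rather than any vendor's actual algorithm), two versions of a file that differ by a single byte share almost all of their raw chunks, but once each version is compressed the streams diverge and chunk-level matching collapses:

```python
import hashlib
import zlib

def chunk_hashes(data: bytes, size: int = 4096):
    """Hash fixed-size chunks, as a simple dedup engine might."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def shared(a: bytes, b: bytes) -> int:
    """Count chunk positions where both versions hash identically."""
    return sum(x == y for x, y in zip(chunk_hashes(a), chunk_hashes(b)))

# Two versions of a "backup" differing by one byte: 256KB of patterned data
v1 = bytes(range(256)) * 1024
v2 = bytearray(v1)
v2[100] = 0
v2 = bytes(v2)

print(shared(v1, v2), "of", len(chunk_hashes(v1)), "raw chunks match")
# -> 63 of 64 raw chunks match (only the chunk containing the edit differs)
print(shared(zlib.compress(v1), zlib.compress(v2)), "compressed chunks match")
```

Compressing (or encrypting) first scrambles the byte patterns the deduplicator matches on, so virtually nothing lines up afterward; applied after dedup, the same transforms cost nothing in matching.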