Deduplication removes redundant data from storage — originally, duplicate files, but more recently, duplicate blocks or even sub-blocks, called “chunks.” If a new file is stored and parts of it are identical to a file already on the storage system, only the different parts are stored.
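The chunk-level approach can be illustrated with a minimal sketch, assuming fixed-size chunks and SHA-256 fingerprints (real systems typically use variable-size, content-defined chunking and persist chunks on disk; the class and chunk size here are hypothetical):

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size

class ChunkStore:
    """Toy chunk store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (stored once)

    def write(self, data):
        """Store data; return the list of fingerprints (the file's 'recipe')."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:   # only new chunks consume space
                self.chunks[fp] = chunk
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Reassemble a file from its chunk recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)

store = ChunkStore()
a = b"A" * 8192 + b"B" * 4096
b_ = b"A" * 8192 + b"C" * 4096   # shares its first two chunks with a
ra = store.write(a)
rb = store.write(b_)
# Six logical chunks were written, but only three unique chunks are stored.
```

The two files differ only in their final chunk, so the shared prefix is stored once and each file keeps only a list of fingerprints pointing into the store.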
Hardware that deduplicates near-line storage intended for backups has been available for several years. These systems work well because data varies little from one backup to the next. However, deduplication of online or primary storage is relatively new, and not all types of data benefit from the process. Here are some pointers storage administrators should keep in mind to get the most from deduplicated storage.
1. Understand the varied approaches.
The two types of deduplication present some trade-offs. In-line processing intercepts all data as it is written to storage, removes any bits that have already been stored and writes the rest. This requires heavy-duty processing power and can introduce latency because the data must be compared before it is written. It also generally costs more than a post-processing system.
In contrast, post-processing deduplication requires a “landing zone” to hold incoming data until it can be processed, which can consume as much as 50 percent of the available raw capacity. An array with a raw capacity of 20 terabytes will present a usable capacity of 10TB, with the other 10TB reserved for the landing zone. Once the deduplication pass runs, the processed data is moved to the other half of the array.
2. Dedupe the right kinds of files.
Deduplication works most effectively when there are multiple copies of the same sets of files. For example, the .vmdk or .vhd virtual hard-disk files used by virtualization platforms such as VMware or Hyper-V often contain the same basic sets of operating system files and are ripe candidates for deduplication. In contrast, video files or databases tend to consist of large sets of unique data and will not see much reduction from deduplication. However, if a video workflow keeps multiple working copies of the same footage, then deduplication would be appropriate.
3. Allow time for deduplication.
Post-processing deduplication examines data saved to the storage system to find duplicate blocks or chunks of data. Typically, this scheduled process runs overnight; hence the need for a landing zone to store data until it can be processed. If the system is being used 24x7, you may need to move to an in-line deduplication system, which generally requires a greater upfront investment.
While in-line deduplication can be subject to latency, this should only be a concern for data that requires high-performance storage.
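The overnight post-processing pass described above can be sketched as follows. This is a minimal illustration only, assuming files accumulate in a landing-zone directory; the function name and fixed chunk size are hypothetical, and real systems track recipes in persistent metadata rather than returning them:

```python
import hashlib
import os

CHUNK = 4096  # hypothetical fixed chunk size

def postprocess(landing_dir, store):
    """Scheduled pass: move unique chunks from the landing zone into
    the dedup store, then free the landing-zone space.

    store: dict mapping fingerprint -> chunk bytes.
    Returns a dict mapping filename -> recipe (list of fingerprints).
    """
    recipes = {}
    for name in os.listdir(landing_dir):
        path = os.path.join(landing_dir, name)
        recipe = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                fp = hashlib.sha256(chunk).hexdigest()
                store.setdefault(fp, chunk)  # store each chunk once
                recipe.append(fp)
        recipes[name] = recipe
        os.remove(path)  # reclaim landing-zone capacity
    return recipes
```

Because the hashing and lookups happen after the write completes, the application sees raw disk speed at write time; the cost is the landing-zone capacity and the window during which data sits undeduplicated.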
4. Build in additional redundancy.
If the deduplication engine were to fail, the stored chunks would still remain on disk, but the system might lose the metadata that maps files to their chunks and no longer be able to reassemble the original data. Therefore, IT should either mirror the data to a second system or perform snapshot backups that capture the full set of data before deduplication.
5. Prepare for some guesswork.
Predicting reduction ratios can be difficult. Deduplication doesn’t actually compress data; it removes duplicates. Effective reduction ratios of 100:1 or more are possible with the right kinds of data. But user directories or other types of data may contain relatively little duplication, which makes it hard to estimate in advance how much reduction you’ll achieve. Run the deduplication system for a while to gain a real sense of how well it works in your environment.
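One way to reduce the guesswork is to scan a sample of your own data and count unique chunks. The sketch below assumes fixed-size chunking, so it gives only a rough lower bound on what a product using content-defined chunking might achieve; the function name is hypothetical:

```python
import hashlib
import os

CHUNK = 4096  # hypothetical fixed chunk size

def dedupe_ratio(root):
    """Walk a directory tree and estimate logical-to-physical ratio
    under fixed-size chunking with SHA-256 fingerprints."""
    logical = 0
    unique = {}  # fingerprint -> chunk length
    for dirpath, _, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                while chunk := f.read(CHUNK):
                    logical += len(chunk)
                    fp = hashlib.sha256(chunk).digest()
                    unique.setdefault(fp, len(chunk))
    physical = sum(unique.values())
    return logical / physical if physical else 1.0
```

Running this over a representative sample (a home-directory share, a VM datastore) gives a ballpark figure, though it ignores metadata overhead and the chunking differences of real products.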
6. Try before you buy.
Make sure to test a system thoroughly before you purchase it. Some storage systems offer deduplication as a standard feature; in these cases, the only additional cost of using deduplication may be the storage lost to the landing zone. If you believe your environment can benefit from deduplication, you can deploy one of these devices, and if it doesn’t work out, you’ll still have a functional SAN appliance that can be used like any other.