Backup Metrics for HPC Environments

Georgia Tech’s PACE team shares tips for avoiding costs and downtime during required backups.

Neil Bright is a research scientist and chief HPC architect at the Georgia Institute of Technology’s Office of IT.

The challenges of backing up high-performance computing (HPC) data are not substantially different from those of any other environment with large volumes of data.

Managers must ensure mechanisms are in place to protect the integrity, confidentiality and availability of the backups, and they likely must also work within limits on the time available to perform backups (along with the inevitable financial considerations). Most important, of course, teams must be able to extract data from the backup solution and restore files as needed.

The philosophical approaches, methods and technical implementations used by the PACE — Partnership for an Advanced Computing Environment — team at Georgia Tech provide cost-effective daily backups of the two petabytes of storage available to researchers and further support all of the IT infrastructure in our HPC environment. All of our backups are file-based, and use a disk-to-disk method.

Backups

Our team views the backup window essentially as planned downtime, and any downtime means lost productivity. The key to reducing or eliminating downtime for backups is to create an asynchronous process — to decouple backup windows from the time it takes to actually transfer the data. When teams can't perform backups on live data, file system snapshots are an excellent way to accomplish decoupling. It takes only seconds to create a temporary snapshot, which provides the backup process with an unchanging source. The team gains additional flexibility in deciding how often to perform backups.

Backup frequency is tied directly to the recovery point objective (RPO), the period of data that may be lost should an incident occur: The more frequent the backups, the shorter the period of exposure. Set an upper limit for this frequency by determining the rate at which data can be copied from primary storage to the backup solution. The snapshot approach described previously allows for short backup windows (for instance, only the time required to shut down and restart a database) because data replication runs in the background as soon as service is restored. As a result, teams can lower their backup solution's performance requirements, and with them its costs.
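As a rough illustration of that upper limit, the short Python sketch below estimates how long one backup pass takes and how many passes fit in a day. The function names and the example figures (9 TB of changed data, 500 MB/s of sustained throughput) are hypothetical and are not PACE measurements. The sketch also assumes each pass moves a fixed amount of changed data and ignores the time rsync-style tools spend scanning for changes.

```python
def min_backup_interval_hours(changed_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy `changed_tb` terabytes at a sustained transfer rate."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    changed_mb = changed_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return changed_mb / throughput_mb_per_s / 3600  # seconds -> hours


def max_backups_per_day(changed_tb: float, throughput_mb_per_s: float) -> float:
    """Upper limit on backup frequency: a new pass cannot start before the last one ends."""
    return 24 / min_backup_interval_hours(changed_tb, throughput_mb_per_s)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    hours = min_backup_interval_hours(changed_tb=9, throughput_mb_per_s=500)
    print(f"One pass takes about {hours:.1f} hours")
    print(f"At most {max_backups_per_day(9, 500):.1f} passes per day")
```

If the resulting interval is longer than the exposure the organization can tolerate, either the transfer rate has to rise or the amount of data moved per pass has to fall.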

Our team also must consider retention time, or the planned duration for which a backup copy of data must be maintained. Generally, we determine this time according to business need, any required compliance controls and the available budget for the backup solution. Designing the solution for easy expansion is critical in any dynamic environment. We prefer expandability over capacity in the initial implementation, which allows for more precise "right-sizing" of storage capacity under real-world conditions after the backup solution has been assessed. On the surface, the storage required for the backup solution seems easy to calculate as a simple product of the number of backups retained and the amount of data that changes between them. As we've found, however, human nature can make the calculation far less predictable. When primary storage is undersized, individual files tend to have shorter lifetimes because users are apt to delete less-used files in favor of the data du jour. That, in turn, requires our team to transfer and retain more data during each backup period.
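The on-the-surface calculation can be sketched in a few lines of Python. This is a simplified model, not our sizing tool: it assumes one full copy of primary storage plus one day's worth of changed data kept for every day of retention, and it ignores compression. Only the two-petabyte figure comes from our environment; the 1 percent and 3 percent daily change rates and the 30-day retention are hypothetical.

```python
def backup_capacity_tb(primary_tb: float, daily_change_fraction: float,
                       retention_days: int) -> float:
    """Estimate backup storage: one full copy plus retained daily changes."""
    if not 0 <= daily_change_fraction <= 1:
        raise ValueError("daily_change_fraction must be between 0 and 1")
    if retention_days < 0:
        raise ValueError("retention_days must not be negative")
    full_copy = primary_tb
    retained_changes = primary_tb * daily_change_fraction * retention_days
    return full_copy + retained_changes


if __name__ == "__main__":
    # 2,000 TB of primary storage; the change rates and retention are hypothetical.
    for change in (0.01, 0.03):
        needed = backup_capacity_tb(2000, change, retention_days=30)
        print(f"{change:.0%} daily change -> about {needed:,.0f} TB")
```

Under these assumptions, tripling the daily change rate from 1 percent to 3 percent raises the estimate from about 2,600 TB to about 3,800 TB, which is why churn caused by undersized primary storage matters so much.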

Restores

All the backups in the world are useless if they're unavailable when needed. It is essential for teams to regularly test the entire backup/restore process to ensure backups are taking place and that there are no problems. Storage will fail in mysterious ways, so a file system that makes heavy use of internal checksums may help to guarantee end-to-end data integrity.
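One way to make a restore test concrete is to compare checksums of the restored files against the originals. The Python sketch below is an application-level check that complements, but is not the same as, the internal checksums a file system may keep. The function names are hypothetical, and the sketch assumes both directory trees are mounted and readable on the machine running the test.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_root: Path, restored_root: Path) -> list:
    """Return a list of files that are missing or differ in the restored tree."""
    problems = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue  # skip directories and special files
        relative = source_file.relative_to(source_root)
        restored_file = restored_root / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"mismatch: {relative}")
    return problems
```

An empty list means every file under the source tree was restored with identical contents. On live data, files that changed after the backup was taken will legitimately show up as mismatches, so a test like this is best run against a snapshot.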

Pay attention to the recovery time objective (RTO), the time period in which data must be restored after an incident. If your team takes the definition to mean that all data must be restored within that period, the cost of the backup solution will rise rapidly. It may help to think of RTO as having two components. The first targets the most common use of restores: users who accidentally remove a small number of files. By designing for the common case, acceptable RTO values are achieved easily at low cost.

The second RTO component considers the case of a substantial or even complete loss of primary storage. If the backup solution is designed to function as the primary storage, a team can restore full functionality without requiring large copies of data. That approach is not without risk, however. Teams can implement both options far more easily with file-based solutions as opposed to block-based solutions.

Our Implementation

Within PACE, our user-facing storage servers use a POSIX (Portable Operating System Interface) file system that includes snapshot capabilities, and they export data via the Network File System (NFS) protocol. We use many individual storage servers for user-facing data to provide increased aggregate bandwidth to storage, to compartmentalize failures, and to implement more practical performance and capacity guarantees. The team decreases support burdens by configuring each instance nearly identically, differing only when it comes to capacity. For users who require a much higher degree of performance, we provide a separate pool of storage for temporary use.

Likewise, we maintain multiple backup servers, configured as very high-capacity versions of the user-facing storage servers. Our backup servers also implement file system compression, which maximizes available storage.

rsync, freely available software at rsync.samba.org, is a key component of our backup solution. When the time comes for a backup, a backup server initiates access to the appropriate user-facing storage volumes, uses rsync to replicate any changes since the last backup, and then takes a local snapshot. We name snapshots with a time and date stamp, which makes for easy removal once our retention objectives are met. Multiple backup servers and multiple user-facing servers allow us to meet desired RPOs.
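The replicate-then-snapshot cycle can be outlined as follows. This is a minimal sketch rather than our production scripts: the naming format, function names and paths are hypothetical, and the snapshot command itself is left out because it depends on the file system in use. The rsync options shown (-a and --delete) are standard, but a real deployment would likely need more.

```python
from datetime import datetime, timedelta

# Hypothetical naming scheme: a fixed prefix plus a date and time stamp.
SNAPSHOT_FORMAT = "backup-%Y-%m-%d_%H%M"


def rsync_command(source: str, destination: str) -> list:
    """Build an rsync invocation that mirrors source into destination."""
    # -a recurses and preserves permissions, times, ownership and symlinks;
    # --delete removes destination files that no longer exist on the source.
    return ["rsync", "-a", "--delete", source, destination]


def snapshot_name(when: datetime) -> str:
    """Name a snapshot after the moment it was taken."""
    return when.strftime(SNAPSHOT_FORMAT)


def expired_snapshots(names, now: datetime, retention_days: int) -> list:
    """Return snapshot names older than the retention period."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            taken = datetime.strptime(name, SNAPSHOT_FORMAT)
        except ValueError:
            continue  # not named by this scheme, so leave it alone
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    # Hypothetical paths; pass the list to subprocess.run() to execute it.
    print(rsync_command("/mnt/user-storage/", "/backup/user-storage/"))
    print(snapshot_name(datetime.now()))
```

Because the date is part of the name, deciding what to remove needs nothing more than the list of snapshot names and the retention period.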

Finally, by choosing a file-based backup strategy, we can employ a backup server itself in the event of a catastrophic failure of user-facing storage — simply enable NFS services and export the appropriate set of files. While that reduces the ability of the backup server to perform its primary function, it can serve as a temporary bridge while the user-facing server is repaired.

Truly risk-averse teams should consider a backup for the backup, and tape-based solutions are a good option. Teams can manage RPOs and RTOs easily as the data is already decoupled from user access, thereby lowering the costs of the backup-backup solution. Tape offers another nice advantage in that teams can easily transfer cartridges off-premises for an extended duration. Cloud-based solutions can be considered as well, but these depend greatly on adequate external networking.
