6 Tips for Managing a High-Performance Computing Cluster
The face of supercomputing has changed a lot. A new breed of commodity servers connected via ultrafast networks, such as InfiniBand, has supplanted the monolithic mainframes of yore.
Unlike common enterprise workloads, where each server may solve a multitude of problems concurrently, the individual servers in a high-performance computing (HPC) system exchange messages with one another at high rates in order to collaboratively solve a single, but huge, problem.
Even in a university environment, HPC deployments can grow to thousands of individual servers. Consider the following six strategies to ease administrative burdens and provide a better user experience.
Tip 1: Allow configurations to diverge.
A management platform that allows configurations of individual machines or groups of machines to diverge from the standard — as long as those settings are managed — will allow customization for specific tasks or users. That philosophy also allows system administrators to perform small, incremental upgrades without the need for extended downtime to provision systems from scratch.
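One way to picture managed divergence is as layered settings: a standard base, then overrides per group and per machine, with a report of what differs from the standard. The sketch below is only an illustration under that assumption; the setting names, the group "gpu" and the host "node042" are hypothetical and do not come from any particular management platform.

```python
# Hypothetical settings; not taken from any specific management platform.
BASE = {"os_image": "base-2024.1", "kernel": "5.14", "scratch_mount": "/scratch"}

# Managed overrides for groups of machines and for individual machines.
GROUP_OVERRIDES = {"gpu": {"kernel": "5.15", "gpu_driver": "535"}}
HOST_OVERRIDES = {"node042": {"scratch_mount": "/local/scratch"}}


def effective_config(host, group=None):
    """Layer group overrides, then host overrides, on top of the standard."""
    config = dict(BASE)
    config.update(GROUP_OVERRIDES.get(group, {}))
    config.update(HOST_OVERRIDES.get(host, {}))
    return config


def divergence(host, group=None):
    """Return settings that differ from the standard as (standard, actual)."""
    config = effective_config(host, group)
    return {k: (BASE.get(k), v) for k, v in config.items() if BASE.get(k) != v}


if __name__ == "__main__":
    for key, (standard, actual) in divergence("node042", "gpu").items():
        print(f"{key}: standard={standard!r} actual={actual!r}")
```

Because every deviation lives in an override table rather than on the machine itself, an administrator can see at a glance which nodes differ from the standard and why.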
Tip 2: Use multiple versions of applications.
This goes hand in hand with diverging configurations.
In a large environment, you will inevitably encounter a situation where one research group absolutely requires one version of an application, and another group simply can't live without a different version. Ensuring that the management platform can deploy both versions on the same compute resources is key.
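A common way to let two versions coexist is to install each under its own directory and have each user's environment select one, similar in spirit to what environment-module tools do. The following is a minimal sketch of that idea, not a real tool: the `/apps/<name>/<version>/bin` layout, the `load` function and the application name "solver" are all assumptions made for illustration.

```python
import posixpath

# Hypothetical install layout: /apps/<name>/<version>/bin
APPS_ROOT = "/apps"


def load(env, name, version):
    """Return a copy of env with the chosen version first on PATH.

    Any other version of the same application is removed from PATH so that
    two versions never shadow each other within one session.
    """
    bin_dir = posixpath.join(APPS_ROOT, name, version, "bin")
    app_root = posixpath.join(APPS_ROOT, name) + "/"
    kept = [
        p for p in env.get("PATH", "").split(":")
        if p and not p.startswith(app_root)
    ]
    new_env = dict(env)
    new_env["PATH"] = ":".join([bin_dir] + kept)
    return new_env


if __name__ == "__main__":
    group_a = load({"PATH": "/usr/bin"}, "solver", "1.2")
    group_b = load({"PATH": "/usr/bin"}, "solver", "2.0")
    print(group_a["PATH"])
    print(group_b["PATH"])
```

Both groups run on the same compute nodes; only their environments differ.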
Tip 3: Support bundles.
An easy-to-use tool that allows users to submit problem tickets can be valuable to troubleshooting efforts. Not only can this tool direct comments from the user into a preferred ticketing system, it can gather data of interest from the HPC environment. What jobs do the users have running or queued at the moment? What does their environment look like? What applications do they have loaded into their environment?
Automated gathering pays dividends by letting the user support organization work more effectively, and users will appreciate a less tedious experience.
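The questions above map naturally onto a small data-gathering routine. The sketch below shows one possible shape for such a bundle. It makes several assumptions: the `list_jobs` callable stands in for whatever query the site's scheduler offers, the `LOADEDMODULES` variable is only meaningful where a module-style tool sets it, and the list of sensitive name fragments is illustrative rather than complete.

```python
import json
import os
import platform
from datetime import datetime, timezone

# Illustrative only: variables whose names contain these are left out.
SENSITIVE = ("TOKEN", "SECRET", "PASSWORD", "KEY")


def build_bundle(comment, list_jobs=lambda user: [], environ=None):
    """Collect the user's comment plus context useful for troubleshooting.

    list_jobs is a hypothetical hook; a site would replace it with a call
    to its own scheduler.
    """
    environ = dict(os.environ) if environ is None else environ
    user = environ.get("USER", "unknown")
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "host": platform.node(),
        "comment": comment,
        "jobs": list_jobs(user),  # running or queued jobs
        "environment": {
            k: v for k, v in sorted(environ.items())
            if not any(s in k.upper() for s in SENSITIVE)
        },
        # Assumes a module-style tool exports LOADEDMODULES.
        "loaded_apps": [
            m for m in environ.get("LOADEDMODULES", "").split(":") if m
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bundle("My job fails at startup"), indent=2))
```

The resulting JSON document could then be attached to a ticket in the preferred ticketing system.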
Tip 4: Tier the deployment environment.
A multitiered testing environment brings traditional enterprise IT practices to the HPC realm. Institute a "development" sandbox so system administrators can install and configure new releases of operating systems, middleware and applications.
Once a release is ready for broader testing, move it onto a "test" platform, where regression tests are run and trusted users are invited to test the new release. After any remaining problems are fixed, deploy the release onto production resources.
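The promotion path described here can be expressed as a simple gate: a release moves up one tier only after it passes the checks for its current tier. The sketch below assumes the three tier names used in this article; the `promote` function and its `checks_passed` flag are hypothetical stand-ins for whatever regression tests and user sign-off a site actually requires.

```python
# Tiers in promotion order, named as in the text.
TIERS = ("development", "test", "production")


def promote(current_tier, checks_passed):
    """Return the tier a release should occupy next.

    A release advances one tier when its checks pass; otherwise it stays
    where it is. Production is the final tier.
    """
    if current_tier not in TIERS:
        raise ValueError(f"unknown tier: {current_tier!r}")
    if not checks_passed or current_tier == TIERS[-1]:
        return current_tier
    return TIERS[TIERS.index(current_tier) + 1]


if __name__ == "__main__":
    tier = "development"
    for passed in (True, False, True):
        tier = promote(tier, passed)
        print(tier)
```

A failed check leaves the release on its current tier, so problems are fixed there before it moves any closer to production.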
Tip 5: Virtualize.
Virtualization can be a useful tool in the HPC arena. The usual server consolidation approaches apply to IT infrastructure services. To implement remote visualization of HPC data sets, use a Virtual Desktop Infrastructure approach.
Some HPC management platforms also can manage virtual machines, providing "virtual clusters." Although those work well only for certain types of applications, they can prove quite powerful, particularly in a secure research environment using protected data.
Tip 6: Tap the cloud.
Computational demands from researchers are frequently cyclical. Like waves on a lake, the research cycles of multiple faculty members can combine into tall peaks. A management tool that can dynamically provision resources from public-cloud providers lets HPC administrators size local capacity for the lower mean demand and still respond rapidly to peaks in resource demand.
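A short worked example makes the sizing argument concrete. The weekly demand figures below are made up for illustration, and the `burst_plan` function is a hypothetical helper: it sizes the local cluster at the mean demand and counts how many nodes would have to come from a cloud provider in each period.

```python
import math
from statistics import mean


def burst_plan(demand, local_nodes):
    """Nodes to provision from the cloud in each period (never negative)."""
    return [max(0, d - local_nodes) for d in demand]


# Made-up node demand per week; the two spikes are overlapping research cycles.
demand = [40, 55, 120, 60, 45, 150, 50]

local_nodes = math.ceil(mean(demand))  # size the local cluster for the mean
cloud_nodes = burst_plan(demand, local_nodes)

if __name__ == "__main__":
    print("local nodes:", local_nodes)
    print("cloud nodes per week:", cloud_nodes)
    print("peak without cloud would need:", max(demand))
```

With these numbers, a local cluster sized for the mean is half the size of one sized for the peak, and cloud resources are needed in only two of the seven weeks.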