Achieving the Compute and Networking Balance

Careful choices in system design, continuous monitoring and mindful change management will keep the data center performing at optimal levels.

Neil Bright

Neil Bright is a research scientist and chief HPC architect at the Georgia Institute of Technology’s Office of IT.

IT organizations face numerous demands from their customers, and application performance is a major focus. Server platforms are only one of many factors that influence application performance — storage, networks, operating systems and software all greatly impact end-user experience.

Hardware errors and subtle configuration changes can quickly turn a highly tuned research or administrative platform into an unproductive service or cause missed deadlines. Careful choices in system design, continuous monitoring and mindful change management will keep your systems performing at optimal levels.

Designing Systems for Sustained Growth on a Budget

An easy way to design computing platforms is to acquire the fastest servers, the fastest networks and the fastest storage available, but doing so doesn’t guarantee the best result. A better metric for system design is price/performance: the performance delivered per dollar spent. This approach generally passes over both the least costly system and the highest-performing components in favor of the components that provide the best value.
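As a rough illustration of the price/performance idea, the sketch below ranks candidate configurations by benchmark throughput per dollar. The option names, prices and scores are made-up placeholders, not measurements of real products; in practice the scores would come from running your own applications on each candidate.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    price_usd: float   # hypothetical purchase price
    throughput: float  # hypothetical benchmark score, higher is better


def best_value(options):
    """Return the option with the highest throughput per dollar."""
    return max(options, key=lambda o: o.throughput / o.price_usd)


# Illustrative numbers only.
options = [
    Option("budget", 4000.0, 100.0),     # 0.025 per dollar
    Option("midrange", 6000.0, 180.0),   # 0.030 per dollar
    Option("flagship", 12000.0, 240.0),  # 0.020 per dollar
]

choice = best_value(options)
print(choice.name)
```

With these placeholder figures, neither the cheapest nor the fastest option wins; the middle configuration delivers the most performance per dollar.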

A balanced design, where the performance of one subsystem does not greatly exceed the performance of the others, is key to managing cost. Add in a data-driven assessment of performance requirements and the result is an effective system that achieves performance expectations while keeping costs in check. Analyze your application requirements in terms of CPUs and memory, but also look at the network and storage requirements needed to deliver a desired level of end-to-end performance.

Classically, research applications stress CPUs, but other approaches such as Hadoop and Big Data analytics that stress storage and networks to a much greater degree are becoming increasingly popular. Design the overall system to accommodate those stressors and avoid overprovisioning. For example, overprovisioning fast CPUs for a storage-bound application will result in extra cost and a lot of idle CPUs consuming your electrical and cooling budgets.

Software and Operating Systems Scale Out, Not Just Up

Gone are the days when simply replacing CPUs resulted in increased performance. Moore’s law still holds, with transistor counts doubling roughly every two years, but those transistors are going to additional processing cores, caches and other processor capabilities rather than to faster individual cores.

The world today is highly parallel, and software must scale out rather than up. Refactoring software to take advantage of parallelism in modern processors and general-purpose graphics processing units can result in substantial performance increases. Optimizing compilers, mathematical libraries, highly optimized architecture-specific routines and other middleware can all contribute to performance increases.

Although optimizing CPU utilization is by far the most common strategy, optimizing network and storage access patterns can yield substantial gains as well. Even so, IT administrators and researchers must take care to avoid overspending on optimization. If applications spend long periods of time in production runs, ramp up optimization efforts. For applications that are still evolving, the shortest time to solution may not involve significant optimization.

Operating systems also have a great impact on performance. Seemingly subtle or benign operating system updates or driver configurations can drastically change performance properties in unexpected ways — by as much as 50 percent in extreme cases. Host-based firewalls can be a hidden source of application slowdown, particularly if they perform stateful packet inspection. Consider forgoing host firewalls in favor of dedicated network devices for this task. Performance monitoring is key to preventing detrimental effects.

Monitoring Through Microkernels Keeps Readings Accurate

IT administrators have long understood the need for functional monitoring of information systems; however, monitoring performance should also be required. Microkernels — small, short-running applications that mimic behavior of key portions of important applications — are of great benefit here. Establish a suite of microkernels, then use them to establish baseline performance of individual components, subsystems and the interfaces between functional units. For best results, do not test representative samples, but rather all possible combinations.
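A minimal sketch of how a microkernel baseline might be recorded and checked follows. The kernel shown is a hypothetical stand-in (a one-megabyte memory copy), and the 10 percent tolerance is an assumed threshold; a real suite would use kernels that mimic your own applications and tolerances tuned to their normal variation.

```python
import statistics
import time


def time_microkernel(kernel, repeats=5):
    """Run kernel() several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def deviates(current, baseline, tolerance=0.10):
    """True if current differs from baseline by more than the given fraction."""
    return abs(current - baseline) / baseline > tolerance


def memory_copy_kernel():
    # Hypothetical stand-in for a kernel that mimics part of a real application.
    data = bytearray(1_000_000)
    bytes(data)


# Record this value once on a known-good system, then compare later runs to it.
baseline = time_microkernel(memory_copy_kernel)
print(f"baseline: {baseline:.6f} s")
```

Taking the median of several runs keeps a single noisy sample from skewing the baseline.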

An all-to-all test that measures the network latency and bandwidth between each pair of servers, for instance, will reveal a degraded network cable that otherwise might have gone undetected. Run such tests often, especially before and after configuration changes. Compare not only the result of the computation, but also the time to completion and any deviation from normal. Doing so will provide confidence that the configuration change did not introduce unanticipated functional or performance errors.
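The all-to-all idea can be sketched as follows: enumerate every pair of servers, then flag pairs whose latency is far above the typical value. The host names and latency figures are invented for illustration, and the factor of 2 is an assumed threshold; the measurements themselves would come from whatever network test tool you already use.

```python
from itertools import combinations
from statistics import median


def all_pairs(hosts):
    """Every unordered pair of hosts, so no link is left untested."""
    return list(combinations(hosts, 2))


def slow_pairs(latency_us, factor=2.0):
    """Pairs whose latency exceeds factor times the median latency."""
    typical = median(latency_us.values())
    return sorted(pair for pair, value in latency_us.items()
                  if value > factor * typical)


hosts = ["node1", "node2", "node3", "node4", "node5"]

# Made-up measurements in microseconds: every path touching node3 is slow,
# as it might be with a degraded cable on that server.
measured = {pair: (50.0 if "node3" in pair else 10.0)
            for pair in all_pairs(hosts)}

for pair in slow_pairs(measured):
    print(pair)
```

Because every flagged pair here includes node3, the pattern points at that server's link rather than at the network as a whole. If links can behave differently in each direction, itertools.permutations would test both directions of every pair.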

Public clouds are certainly a valuable addition to the toolkit of an IT architect, but managing performance on these platforms is a far more difficult task. Not only does the customer generally have less flexibility in choosing server capabilities, but a significant portion of the supporting infrastructure is also completely invisible. Because these workloads typically run on virtual infrastructure, the underlying hardware may change without notice, making performance even harder to reason about.

Frequent testing with the microkernel approach above can mitigate the unintended consequences of the cloud approach and help IT administrators ensure that performance targets are met. In general, this monitoring approach is useful in any situation where a critical component is outside direct control. Partners may expose some level of performance metrics, but direct monitoring via application microkernels is a much more accurate approach.

Simplify, Consolidate, Integrate

As system complexity increases, troubleshooting becomes more difficult. Converged networks simplify management, monitoring and capacity planning. External infrastructure offers limited visibility, and the integration points needed to marry those elements with internal infrastructure become critical components in their own right, further increasing complexity.

Carefully consider the benefits versus such risks when choosing to add external elements to any system. Finally, take time to thoroughly understand and document your dependencies. In addition to helping with future troubleshooting, these activities identify good points to apply microkernel monitoring.
