
Oct 14 2021

Integrated Storage Solutions Reduce Silos for Higher Ed Researchers

Universities bring the sciences to life with large-scale data solutions that streamline analysis.

In the world of academic research, gigantic data sets require extraordinary storage solutions.

Take the University of Colorado Boulder’s Laboratory for Atmospheric and Space Physics (LASP), for example. Researchers there study everything from climate change to solar physics, gathering raw telemetry from near-Earth spacecraft and far-flung sources such as the Mars Atmosphere and Volatile Evolution (MAVEN) spacecraft.

“We collect, process and archive terabytes of data a month,” says LASP Senior Systems Architect Alex Green. To handle that massive data store, the lab recently upgraded to a suite of NetApp solutions.

LASP is not alone. Across academia, research institutions are seeking out robust systems to help them collect, manage, store and distribute their oversized data tracts. The right solutions offer automation, ease of management and the extreme performance of all-flash storage, says Chris Wessells, senior higher education strategist at Dell Technologies.

From Siloed to Streamlined Research Data

At LASP, the huge scale of space-based research creates a range of challenges. It’s not just the size of the data, although that is not a trivial matter. It’s also the variety.

The organization works on numerous research grants and missions supported by multiple external partners. “That led to a lot of siloed solutions. We might have 10 to 15 active missions at a time,” each with its own storage solution, says Green. “We end up with a wide range of different technologies from different vendors that we need to manage and maintain.”

In moving to a more robust storage solution, the team aimed to consolidate 60 to 80 percent of its existing systems. That meant finding a platform that could accommodate all those heterogeneous parts.

“Part of the appeal of the NetApp approach was that we could do what I would call ‘Lego blocks,’ putting together multiple nodes in order to scale out,” Green says.

The components include NetApp’s all-flash arrays; StorageGRID, a software-defined object storage suite; and ONTAP FabricPool, a feature that automatically moves data between a high-performance local tier and a cloud tier based on access patterns.

Alex Green
“It’s vital that the scientists, engineers and researchers have access to the data however they need it.”

Alex Green, Senior Systems Architect, Laboratory for Atmospheric and Space Physics, University of Colorado Boulder

“In our primary cluster, we have three all-flash or AFF NetApps, and those are our primary storage. We also wanted a slower, cheaper tier, and the StorageGRID cluster gives us that, with direct S3 on-premises storage,” says Green.

FabricPool delivers the auto-tiering, a key capability for controlling the cost of massive data storage.

“If data is not actively being used on all-flash, it automatically goes to the slower S3 storage. Then, if you start using it again, it pulls it back into the flash,” Green says. “That was a key requirement, to have all-flash for active workloads along with some cheaper storage to drive down that cost.”
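The tiering behavior Green describes can be pictured with a toy model: data lands on flash, a background scan demotes anything idle past a cooling period to cheaper object storage, and any new access promotes it back. This is an illustrative sketch only, not NetApp's FabricPool implementation; the class, tier names and threshold are all invented for the example.

```python
import time

# Toy model of access-based auto-tiering, in the spirit of what Green
# describes. Illustration only -- NOT NetApp's FabricPool implementation.

FLASH, OBJECT = "flash", "s3-object"

class TieredStore:
    def __init__(self, cooling_period_s=2.0):
        self.cooling_period_s = cooling_period_s   # idle time before demotion
        self.blocks = {}                           # name -> (tier, last_access)

    def write(self, name):
        # New data always lands on the high-performance flash tier.
        self.blocks[name] = (FLASH, time.monotonic())

    def read(self, name):
        # Any access promotes the block back to flash ("pulls it back").
        tier, _ = self.blocks[name]
        self.blocks[name] = (FLASH, time.monotonic())
        return tier  # the tier this read was served from

    def run_tiering_scan(self):
        # Background job: demote blocks idle longer than the cooling period.
        now = time.monotonic()
        for name, (tier, last) in self.blocks.items():
            if tier == FLASH and now - last > self.cooling_period_s:
                self.blocks[name] = (OBJECT, last)

    def tier_of(self, name):
        return self.blocks[name][0]

store = TieredStore(cooling_period_s=0.1)
store.write("maven_telemetry.dat")
time.sleep(0.2)
store.run_tiering_scan()
assert store.tier_of("maven_telemetry.dat") == OBJECT  # idle data demoted
store.read("maven_telemetry.dat")                      # access pulls it back
assert store.tier_of("maven_telemetry.dat") == FLASH
```

The key design point the model captures is that demotion is driven by access patterns rather than by manual migration, which is what makes the cost savings automatic.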

The NetApp solution fulfills another key need: It offers multiple storage protocols.

“We have a wide variety of people who access the data, so we need to be able to provide their data in S3 or CIFS or NFS. Whatever they want, we can do,” Green says. “It’s vital that the scientists, engineers and researchers have access to the data however they need it. That multiple protocol support — and how it handles permissions between all those different systems — is critical.”
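The multiprotocol idea Green raises can be sketched the same way: one backing namespace, reachable either as S3-style objects or as NFS-style file paths, with a single permission check behind both. This is a hypothetical model written for this article; real multiprotocol access and permission mapping happen inside the storage platform, and every name below is invented.

```python
from pathlib import PurePosixPath

# Toy model of one dataset exposed through two access styles -- an S3-like
# bucket/key interface and an NFS-like path interface -- over a single
# backing namespace. Illustration only; this is not how ONTAP does it.

class UnifiedNamespace:
    def __init__(self):
        self._objects = {}   # canonical key -> bytes
        self._acl = {}       # canonical key -> set of allowed users

    def put(self, key, data, allowed_users):
        self._objects[key] = data
        self._acl[key] = set(allowed_users)

    def s3_get(self, bucket, key, user):
        # S3 view: the bucket name becomes the top-level prefix.
        return self._read(f"{bucket}/{key}", user)

    def nfs_read(self, path, user):
        # NFS view: same data addressed as a file path under a mount point.
        key = str(PurePosixPath(path).relative_to("/mnt/lasp"))
        return self._read(key, user)

    def _read(self, key, user):
        # One permission check backs both protocols, so access rules
        # stay consistent no matter how the data is reached.
        if user not in self._acl.get(key, set()):
            raise PermissionError(f"{user} cannot read {key}")
        return self._objects[key]

ns = UnifiedNamespace()
ns.put("maven/telemetry.bin", b"\x01\x02", allowed_users={"scientist"})
assert ns.s3_get("maven", "telemetry.bin", user="scientist") == b"\x01\x02"
assert ns.nfs_read("/mnt/lasp/maven/telemetry.bin", user="scientist") == b"\x01\x02"
```

The point of the sketch is the single `_read` path: keeping permissions in one place is what makes it safe to hand the same data to S3, CIFS and NFS clients at once.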

MORE ON EDTECH: To improve higher ed data security, address these risks in research projects.

A Storage Solution That Captures Sounds of the Earth

While LASP draws its data from the skies, other researchers work closer to home. Bryan Pijanowski, director of the Discovery Park Center for Global Soundscapes at Purdue University, is gathering the diverse sounds of our planet, from the creak of a glacier to the croak of a frog.

His project, Record the Earth, involves a study of the soundscapes of every major biome in the world. “There are 32 of them, and I’ve done 27 so far,” he says. “In terms of data, I’m in the petabyte range right now, somewhere around 4 to 5 million recordings.”

In addition to struggling with the size of the data, Pijanowski’s previous storage solutions were fragmented and cumbersome. He needed a place to consolidate field data so he could perform analysis and calculations.

RELATED: To effectively manage higher ed data, address sprawl.

“We had the best of the best, all the different pieces we needed, but they were not well integrated,” he says. “The hardware solutions, the software solutions, the people solutions were all siloed. As a result, we were getting stuck all the time.”

For a solution, Pijanowski turned to Hewlett Packard Enterprise. With Edgeline Converged Systems and ProLiant servers on the front end of the process, HPE helped the center build an environment to usher the data from ingestion to visualization via HPE Apollo systems. Files are loaded onto a server and distributed using a combination of Apache Hadoop, Spark and Kafka before landing in a MongoDB distributed database. As a final step, files move to Tableau for data visualization.
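The shape of that pipeline, ingest, stream processing, a document store, then a rollup for visualization, can be shown in a minimal single-process sketch. In production the center uses Kafka and Spark for the middle stages and MongoDB for storage; here a queue and a dict stand in so the stages are visible end to end, and the feature being computed (mean absolute amplitude) is just a stand-in for real acoustic analysis. All names are invented.

```python
import queue
import statistics

# Minimal sketch of the ingest -> process -> store -> summarize flow.
# A queue.Queue stands in for a Kafka topic, a dict for MongoDB.

def ingest(recordings, topic):
    # Stage 1: field recordings land on the server and enter the stream.
    for rec in recordings:
        topic.put(rec)

def process(topic, doc_store):
    # Stage 2: stream processing -- derive a simple per-recording feature
    # (mean absolute amplitude, a stand-in for real acoustic indices).
    while not topic.empty():
        rec = topic.get()
        feature = statistics.fmean(abs(s) for s in rec["samples"])
        doc_store[rec["id"]] = {"biome": rec["biome"],
                                "mean_amplitude": feature}

def summarize(doc_store):
    # Stage 3: roll up per-biome stats for the visualization layer.
    by_biome = {}
    for doc in doc_store.values():
        by_biome.setdefault(doc["biome"], []).append(doc["mean_amplitude"])
    return {b: statistics.fmean(vals) for b, vals in by_biome.items()}

topic, doc_store = queue.Queue(), {}
ingest([{"id": "r1", "biome": "rainforest", "samples": [0.2, -0.4, 0.6]},
        {"id": "r2", "biome": "tundra", "samples": [0.1, -0.1, 0.1]}], topic)
process(topic, doc_store)
summary = summarize(doc_store)  # what a dashboard such as Tableau would plot
```

What makes the real system work, as Pijanowski emphasizes, is that these stages are integrated rather than siloed: the output format of each stage is exactly what the next one consumes.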

Rick Merrick
“Organizations need to decide on a strategy first, one that takes into account the data needed to make your most important decisions or deliver your primary products.”

Rick Merrick, CIO, TCS Education System

Integration was the key to success — and the unifying factor that drives the usability of the data.

“I need the hardware, the software and the people to all work together,” says Pijanowski. “When I think about storage, I also have to think of the analysis and the outcomes. How do I get the research done?”

With a process that spans multiple disciplines, he says, “I need a system that works together.”

EXPLORE: Higher education turns to data analytics to bolster student success.

HPE’s contributions to the project included the necessary technology and skill set as well as a collaborative mindset, says Pijanowski.

“It became a partnership between Purdue and HPE — the data scientists, my graduate students and postdocs — all working together to make sure that what we’re developing has the needed scalability,” Pijanowski says.

The new solution gives researchers more than a place to warehouse all the ambient sounds of the planet; it gives them a way to put that data to work.

“I’m now beginning to think about a world where I’m able to get the data off the sensor by streaming it back to my lab in real time,” says Pijanowski. “I can ask, ‘Are certain birds gone? Have insects stopped singing?’ These are important to managing our natural resources.”

A Thoughtful Strategy and Design Behind Data Storage

As institutions boost data storage capabilities, a carefully designed approach is essential, says Rick Merrick, CIO at TCS Education System. Based in Chicago, the nonprofit serves more than 10,000 students at five partner schools across the country. 

“Organizations need to decide on a strategy first, one that takes into account the data needed to make your most important decisions or deliver your primary products,” Merrick says. “That strategy will include the architecture needed for the data collection, which will drive to the proper solutions.”

Future-proofing matters too, says Dell’s Wessells. “New storage solutions must have scale-out capabilities to address future growth in the data environments managed by campus IT and research teams,” he says.

Finally, it makes sense to find solutions that minimize updating work for IT teams, Green notes. In his case, having a common NetApp platform has significantly streamlined the upkeep effort.

“When we have our quarterly updates, we’re applying the same patch level across all of them,” he says. “That means that we’re not trying to patch multiple systems multiple times.”

EDITOR'S NOTE: An earlier version of this article incorrectly stated the number of students the TCS Education System serves. EdTech regrets the error. 

PHOTOGRAPHY BY WILLIE PETERSEN