Just how big is "big" when it comes to Big Data? One of the best places to understand Big Data and the challenges it poses is at the university level, where, among other computing efforts, researchers are seeking to unlock the secrets of the human genome.
"The data for one DNA sample is close to one gigabyte," says Ion Stoica, a professor in the Electrical Engineering and Computer Sciences Department (EECS) at the University of California, Berkeley College of Engineering. "In many cases, you have significantly more than that for sequencing. If you want to compare multiple DNA sequences — maybe among different patients or even from the same patient, as in the case of cancer genomics, where some cells are affected and some are not — then you have to look at many in parallel."
How much data is that?
"You can make this challenge as big as you want," Stoica says. "But we're talking many hundreds of terabytes, or even petabytes."
At the University of North Carolina at Chapel Hill, the Renaissance Computing Institute (RENCI) supports genomics research with Big Data solutions.
"The projection is for petabytes of data across different scientific disciplines," says Michael Shoffner of the RENCI Data Sciences Group.
How universities store that and other Big Data is as much a question of speed as it is of capacity. Both UC Berkeley and UNC are designing their infrastructures to strike the right balance.
In addition to his role in UC Berkeley's EECS department, Stoica is co-director of the university's AMPLab, which is working to integrate algorithms, machines and people to make sense of Big Data. The AMPLab's Big Data research supports cancer genomics and urban planning efforts. In 2012, the lab received a $10 million grant from the National Science Foundation as part of the federal Big Data Research and Development Initiative.
Much of the AMPLab's research focuses on speed: faster data processing, interactive queries and other high-performance computing tasks. "Disk storage is abundant," Stoica says, "but access to data on disk can be slow. Ultimately, we want to process and make sense of Big Data, not just retrieve it."
Even as the AMPLab has boosted its storage capacity, it has introduced two key technologies to enable faster solutions: solid-state drives (SSDs) and in-memory storage. Instead of spinning disks, SSDs have high-speed flash memory that allows for greater throughput. They're also more expensive than traditional disk drives, but Stoica argues that if researchers can produce valuable results faster, SSDs ultimately pay for themselves.
"By keeping the data you access most often on SSDs, it justifies the investment," he says.
The original storage system was built in 2006, with about 10TB across a cluster of 40 Sun Microsystems (now Oracle) servers and another 20TB in a single Sun Fire X4500 server for local, shared storage. (Oracle has since discontinued the X4500 line.) "The X4500 has proved rock solid and predictable, but it's showing its age in keeping up with growth trends," says Jon Kuroda, an engineer in UC Berkeley's EECS department who builds and manages the AMPLab's infrastructure.
41.4%: Share of CIOs at public research universities who rate IT investments to support research and scholarship "very effective" (32.6% at private universities)
SOURCE: "The 2012 Campus Computing Survey" (The Campus Computing Project)
AMPLab's new data cluster comprises 30 Intel servers, each with multiple 1TB, 2.5-inch Seagate hard disks. Ten of the servers also include almost 1TB of SSD storage each, and all 30 have 256GB of memory (the older servers topped out at 16GB). The cluster's total capacity is more than 200TB of storage and nearly 8TB of memory.
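Those totals can be checked with a few lines of Python. The sketch below uses only the figures quoted above, with two stated simplifications: "almost 1TB" of SSD is rounded up to 1TB, and "more than 200TB" is treated as a 200TB lower bound. Units are decimal (1TB = 1,000GB).

```python
SERVERS = 30
MEMORY_PER_SERVER_GB = 256
OLD_MEMORY_PER_SERVER_GB = 16   # ceiling on the older servers
SSD_SERVERS = 10
SSD_PER_SERVER_TB = 1           # "almost 1TB", rounded up (assumption)
TOTAL_STORAGE_TB = 200          # "more than 200TB", used as a lower bound

total_memory_tb = SERVERS * MEMORY_PER_SERVER_GB / 1000
memory_growth_per_server = MEMORY_PER_SERVER_GB // OLD_MEMORY_PER_SERVER_GB
total_ssd_tb = SSD_SERVERS * SSD_PER_SERVER_TB
avg_storage_per_server_tb = TOTAL_STORAGE_TB / SERVERS

print(total_memory_tb)                      # 7.68
print(memory_growth_per_server)             # 16
print(total_ssd_tb)                         # 10
print(round(avg_storage_per_server_tb, 2))  # 6.67
```

The 7.68TB result matches the "nearly 8TB" of memory, and each new server holds 16 times the memory of the most generously equipped older one.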
"Almost immediately, we started rearranging drive configurations to give some systems 20TB-plus of local disk, give others a small amount of SSD storage, and spread the SSDs throughout the rest of the systems," Kuroda says. "We will eventually settle on a standard configuration for a large part of the systems and leave the rest free for experimentation."
The clustered storage approach is possible, in part, because the AMPLab can afford higher network speeds. Its current cluster uses 10 Gigabit Ethernet and Gigabit Ethernet interfaces. But what really allows the AMPLab team to push the Big Data envelope is the collective server memory.
"It's more challenging to actually analyze large data sets than figure out where to store them," Stoica says. "You are still going to have disk drives, but if you look at data analytics workloads from sites like Facebook, Microsoft Bing and Yahoo, the inputs of over 90 percent of the jobs would fit in memory. Then you can work on ways to improve on-disk processing."
RENCI engineers are looking inward as they hone their Big Data storage. The institute currently has 2PB of spinning disk storage and tape archive capacity across its existing storage infrastructure.
"We consult with partners to predict storage based on their projected needs — how much and in what form," Shoffner says. "The majority of the data our researchers generate is semi-structured. We handle very little structured data from [relational database management systems]."
Unstructured and semi-structured data are the biggest drivers of Big Data, and making that data available at high speeds to multiple research partners can prove difficult. In addition to a new eight-node 500TB cluster, RENCI engineers are acquiring a Quantum StorNext G302 gateway appliance to speed access over Ethernet connections (up to 1.6 gigabits per second).
RENCI is using the StorNext File System (SNFS) to move data from spinning disks to a Quantum Scalar i6000 tape library as part of a tiered-storage approach. Going forward, as it scales its storage capacity, RENCI is considering SNFS as the core of its next-generation infrastructure.
"If researchers can't quickly access the data — find that needle in the haystack — then we haven't enabled discovery," Shoffner says.
Big Data Storage Requires Planning and Policy
Whether planning Big Data storage for speed or capacity, colleges and universities often face a data management challenge. Pennsylvania's Lehigh University recently undertook a strategic storage initiative to better understand its customers' needs and the state of its existing storage infrastructure.
"We used storage as a vehicle for understanding Big Data and assessing our agility to anticipate customers' needs and support new technologies," says James Young, director for administration and planning in Lehigh's Library and Technology Services department. The project's success was due to collaboration across organizational boundaries, Young says, and resulted in a cultural shift toward a new centralized storage system — one to handle IT, client services, libraries and research.
"We've got multiple silos of data all over campus," says Michael Chupa, manager of research computing at Lehigh. "Until now, there hasn't been an institutional focus on data policy standards. If we can centralize data, we can maintain oversight and drive our customers toward something we can support and scale up."
In doing so, Lehigh expects to deliver better service, with fewer data access interruptions or incidents of lost data.
The shift will take time. The central data store is roughly 100TB, and Chupa says there are potential customers on campus who already store a significant chunk of that capacity in their labs. He estimates 2.5 times more storage will be needed to "sweep up" the university's existing silos.
"We need to be able to deliver supported storage at a competitive price and offer incentives for researchers to join the central pool, which will help drive down costs" Chupa says. "We're hoping for policy backup explaining that we now have this scalable storage platform, and people should be using it."