May 07 2013

Big Data Gets Faster Thanks to Research at Rutgers and Stony Brook

Rutgers, SUNY Stony Brook at work on storage performance improvements for more agile access.

The Big Data–driven college and university may well be the future of higher education. It only makes sense that educational institutions increasingly will want to connect the data dots to analyze and address a world of challenges: raise the graduation rate by understanding which students are at higher risk of dropping out; improve teaching and learning thanks to insights gained from students' activities online; and better manage physical resources, such as classrooms, based on expected revenue.

Administrators find, however, that there's a lot of work to be done before that future becomes a reality. Advances that make it easier and more efficient to work with large, complex and frequently updated data sets can be one important first step. Help is on the way, in part through millions of dollars in grants from the National Science Foundation (in conjunction with the National Institutes of Health) for fundamental Big Data research projects that improve the ability to extract knowledge from large collections of data sets.

A $1.2 million grant recently awarded to Rutgers University and the State University of New York at Stony Brook is already at work. Led by principal investigators Martin Farach-Colton, professor in Rutgers' Department of Computer Science, and Michael Bender and Rob Johnson, associate and assistant professor, respectively, in the Department of Computer Science at Stony Brook, researchers are further optimizing how indexed data is organized on hard disks or external storage so that it can be processed more efficiently.

"The way that a lot of databases, file and other storage systems still organize data on disks is a method that was invented 40 years ago," Johnson says. "But over the last decade or so, there have been some new algorithms discovered that can perform up to 100 times faster than these old ones on certain operations. These algorithms are important for Big Data applications because the size of storage systems we have, and the size of the data we want to process, are beginning to show the limits of the 40-year-old technology."

22.7%: The percentage of senior campus IT officials who view investment in data and managerial analytics as very effective

SOURCE: "The 2012 Campus Computing Survey" (The Campus Computing Project)

Farach-Colton and Bender did the foundational research behind those new algorithms for write-optimized data structures and launched a company, Tokutek, which offers a way to take advantage of them: Its plug-in storage engine for the MySQL open-source database, TokuDB, speeds queries and database updates.
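For readers who want a feel for why write optimization can matter, here is a toy Python model. It is not the researchers' algorithm and not how TokuDB is implemented; it only illustrates the general idea of batching updates so that each disk block is rewritten once per batch instead of once per insert. The block size, the function names and the assumption that every in-place insert costs exactly one block write are all hypothetical simplifications.

```python
import random

BLOCK_SIZE = 100  # keys per simulated disk block (hypothetical value)


def block_of(key):
    """Map a key to the simulated disk block that would hold it."""
    return key // BLOCK_SIZE


def writes_in_place(keys):
    """Update-in-place model: every insert rewrites its block right away."""
    return len(keys)


def writes_buffered(keys, buffer_size):
    """Buffered model: hold inserts in memory, then rewrite each touched
    block once per flush."""
    writes = 0
    for start in range(0, len(keys), buffer_size):
        batch = keys[start:start + buffer_size]
        # Count distinct blocks touched by this batch.
        writes += len({block_of(k) for k in batch})
    return writes


def compare(num_keys=10_000, key_space=100_000, buffer_size=5_000, seed=0):
    """Return (in-place writes, buffered writes) for random inserts."""
    rng = random.Random(seed)
    keys = [rng.randrange(key_space) for _ in range(num_keys)]
    return writes_in_place(keys), writes_buffered(keys, buffer_size)


if __name__ == "__main__":
    in_place, buffered = compare()
    print("in-place block writes:", in_place)
    print("buffered block writes:", buffered)
```

With these default settings there are only 1,000 simulated blocks and two buffer flushes, so the buffered model can never issue more than 2,000 block writes, against 10,000 for the in-place model. Real write-optimized indexes are considerably more sophisticated, and the gap they achieve depends on the workload.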

Breaking Up Bottlenecks

The new grant allows the Rutgers and Stony Brook team to focus on alleviating the bottlenecks that happen in other operations because not every database or file system mechanism can yet leverage write-optimized indices. The team expects to see huge progress over the next five years, but it's also facing the ongoing challenge of successively uncovering the next bottleneck once the most prominent ones are resolved.

The group is committed to commercializing its efforts, so IT leaders in higher education can expect the research to make its way into TokuDB, and potentially also be licensed by closed-source database and file systems vendors to drive performance improvements at scale.

The Big Deal with BIGDATA

The Rutgers and Stony Brook team working to eliminate data ingestion bottlenecks and enable faster storage systems is one of 16 recipients of mid-scale research awards made to date under The Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA), a collaboration between the National Science Foundation and the National Institutes of Health.

"Scientific advances are increasingly limited by our ability to analyze and interpret extremely large data sets," says Vasant Honavar, NSF program director, Division of Information and Intelligent Systems, Directorate for Computer and Information Science and Engineering. Sources of data range from email and Internet transactions to sensor networks. "We now can gather lots of diverse types of data, but scientific advances won't take place until we can manage such data effectively and can extract useful knowledge from data."

BIGDATA solicited proposals focused on data collection and management, analytics and infrastructure for collaborative science environments. It's only one of many NSF efforts tuned in to the Big Data challenge. Another is the Science of Learning Centers program, which funds programs like LearnLab (formerly the Pittsburgh Science of Learning Center) to leverage Big Data in the improvement of instruction through online tutoring.

NSF also has plans to establish a program aimed at educating and supporting a new generation of researchers to address fundamental Big Data challenges concerning core techniques and technologies, problems and cyberinfrastructure across disciplines. This will be a new track in NSF's Integrative Graduate Education and Research Traineeship program.

A larger interagency coordinating group (of which the NSF is a part), The Big Data Senior Steering Group of the Networking and Information Technology Research and Development program, also is working to identify programs across the federal government and bring together experts to define a potential national initiative in this area.