
Digital Library Opens Avenues for Data Analysis in Academic Research

Massive virtual stacks yield insights through algorithms and artificial intelligence.

Dan Tynan is a freelance writer based in San Francisco. He has won numerous journalism awards and his work has appeared in more than 70 publications, several of them not yet dead.

At the HathiTrust Digital Library, there are no carrels, no tables, no card catalog and no reference desk. There’s almost nothing physical at all.

This collection of nearly 17 million digitized volumes from dozens of campus libraries exists entirely online. An estimated 95 percent of those volumes were originally scanned by Google when it partnered with universities to create its Google Books project starting in 2002, says Mike Furlough, executive director of HathiTrust at the University of Michigan.

“If there’s a book sitting on the shelf at the University of Michigan library, University of California, Illinois, Virginia or Harvard, there’s a very good chance we have a copy of it,” says J. Stephen Downie, associate dean for research at the University of Illinois School of Information Sciences and co-director of the HathiTrust Research Center (HTRC).

The 5.9 billion digitized pages consume nearly a petabyte of storage at U-M’s on-premises data center, with a mirrored copy at Indiana University. In addition, the fully indexed text and metadata for each volume are kept at IU for researchers using advanced text mining methods.

Just over 6.4 million of the books are in the U.S. public domain and can be downloaded by students and researchers. Those still under U.S. copyright (about 10 million) can’t be downloaded, but their contents can be mined and analyzed, which has led to several unique academic applications.

“It turns out that running algorithms against copyrighted data is totally within American copyright law,” says Downie. “As long as you’re not copying the books in some clever way, the answers you get back from analyzing those texts can be shared with researchers.”

Digital Text Analysis Opens New Avenues for Data-Driven Insights

HTRC has developed a Bookworm visualization tool that explores the evolution of words over time; an Extracted Features Dataset that tallies up words and parts of speech; and a Data Capsule feature that scoops up information based on a researcher’s custom queries and presents the results without making unauthorized copies of the work.
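To make the idea of tracking a word over time concrete, here is a minimal sketch of the kind of tally such tools rely on. It is not HTRC's code or its data format: the `volumes` list, its `year` and `tokens` fields, and the `relative_frequency_by_year` function are all hypothetical, invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for per-volume word lists.
# Real HTRC datasets have their own schema; this layout is an assumption.
volumes = [
    {"year": 1850, "tokens": ["steam", "engine", "steam", "ship"]},
    {"year": 1850, "tokens": ["horse", "carriage"]},
    {"year": 1900, "tokens": ["steam", "electric", "electric", "motor"]},
]


def relative_frequency_by_year(volumes, word):
    """Return {year: share of all tokens in that year equal to `word`}."""
    hits = defaultdict(int)    # occurrences of `word` per year
    totals = defaultdict(int)  # all tokens per year
    for vol in volumes:
        counts = Counter(vol["tokens"])
        hits[vol["year"]] += counts[word]
        totals[vol["year"]] += sum(counts.values())
    # Skip years with no tokens to avoid dividing by zero.
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


trend = relative_frequency_by_year(volumes, "steam")
```

Only the aggregate counts leave the function, not the text itself, which is the general shape of analysis the article describes for copyrighted volumes.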

For example, researchers can search for all mentions of George Washington, while excluding results related to the state or city of Washington, Downie says. Or, they can apply artificial intelligence to literary styles to determine whether books attributed to an author were actually written by that person.

Academics have used HTRC to map changes in representations of women in novels over the past four centuries, trace the evolution of steam technology in literature, and examine how the Chicago School has influenced global architecture, among other projects.

“By founding the research center, we’re able to really push the boundaries of how library collections can be used,” adds Furlough.

