Peter Gorman (from left), Irene Zimmerman and Edward Van Gemert are part of the team that is scanning 500,000 of the University of Wisconsin-Madison’s library materials.

Oct 24 2007

Digital Libraries: Turning to the Same Page

Universities take part in a Google book digitization project, collaborating on a shared repository.

Because students frequently start their research on Google instead of campus libraries, university librarians are taking the unique step of putting copies of their library books online, courtesy of the Google Book Search Library Project.

For instance, Google this spring began scanning 500,000 of the University of Wisconsin-Madison’s 7.9 million library holdings, including collections on American and Wisconsin history, medicine, engineering and genealogical materials. Once the materials are scanned, people can read the university’s public domain books online for free. For copyrighted books, Google will show a few lines of text and provide links to find the material in libraries or for purchase in online stores.

“It’s a good way to bring researchers back into our own environment,” says Irene Zimmerman, head of cataloging at UW-Madison.

UW-Madison is among 27 university and public libraries, including Harvard, Stanford and the New York Public Library, that allow Google to digitize parts or all of their collections. The Google Book Search Library Project ambitiously aims to digitize all the world’s books and increase the value and usefulness of the Google search engine by creating a comprehensive online catalog of every book ever published.

In return, Google pays for the cost of digitizing the books, giving participating libraries the opportunity to preserve their materials and make them available to a wider audience. Each university library can keep copies of its digitized materials, so they can be made available on its own Web sites.

“If people discover our materials on Google, even if it’s not the full text because of copyright laws, there will be links to our own online catalog, so students can find the physical items in our collection,” Zimmerman says.

Shared Repository

To save on storage costs, UW-Madison and 11 other Midwest universities are collaborating to house their digitized library materials in a shared data repository that will likely be hosted by the University of Michigan.

The 12 universities are members of a consortium called the Committee on Institutional Cooperation (CIC). The CIC includes all of the Big Ten Conference schools that compete fiercely in football and other sports year after year. But these same schools have collaborated on educational projects for 50 years. These projects range from sharing online subscriptions to periodicals to, most recently, using videoconferencing to allow students from any of the universities to take foreign-language classes from other schools. In June, the CIC signed a six-year deal with Google to digitize up to 10 million of the 75 million volumes the colleges have in their collections, including books, journals, periodicals and government records.

Google will scan public domain materials from 10 CIC colleges: the University of Chicago, University of Illinois, Indiana University, University of Iowa, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Penn State University and Purdue University.

The two other CIC members, University of Michigan and UW-Madison, had previously signed deals with Google to scan public domain and copyrighted materials. They’re already reaping the benefits of Google’s high-speed proprietary scanning technology.

For example, Michigan, which will make up the biggest chunk of the shared repository, is digitizing its entire 7 million volume library collection and expects to have the work done by 2010. Michigan scanned about 50,000 volumes on its own, beginning in the mid-1990s before becoming a Google partner in 2004. It would have taken more than 1,000 years for the university to scan its collection on its own, says John Price Wilkin, the associate university librarian for Library Information Technology and for Technical and Access Services at the University of Michigan Library.
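The 1,000-year figure can be checked with back-of-the-envelope arithmetic. The sketch below is only illustrative: it assumes "the mid-1990s" means 1995 and that the in-house scanning pace would have stayed constant, neither of which the university states.

```python
# Rough check of the "more than 1,000 years" estimate.
# START_YEAR is an assumption standing in for "the mid-1990s".
VOLUMES_SCANNED_IN_HOUSE = 50_000
COLLECTION_SIZE = 7_000_000
START_YEAR = 1995  # assumed
END_YEAR = 2004    # the year Michigan became a Google partner


def years_to_scan_collection(scanned: int, collection: int,
                             start_year: int, end_year: int) -> float:
    """Extrapolate total scanning time from the observed in-house pace."""
    volumes_per_year = scanned / (end_year - start_year)
    return collection / volumes_per_year


estimate = years_to_scan_collection(
    VOLUMES_SCANNED_IN_HOUSE, COLLECTION_SIZE, START_YEAR, END_YEAR
)
print(f"About {estimate:,.0f} years at the in-house pace")
```

Under those assumptions the extrapolation comes to roughly 1,260 years, consistent with Wilkin's estimate.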

Google and the CIC have not decided what will be scanned from the 10 new university participants, but it’s likely they will choose to digitize resources that each school specializes in, such as Minnesota’s Scandinavia and forestry collections or Ohio State’s psychology materials, says Mark Sandler, director of CIC’s Center for Library Initiatives.

“If we look at all the strengths within our collections and send those materials for digitization, we’ll have one incredible online library,” UW-Madison’s Zimmerman says. “The only way something like this can be done is through partnership. There’s no way any library in this country or the world would have the staffing or resources to undertake this venture alone.”

Collaboration Saves Money

The CIC is looking to Michigan to house the repository because its library is the furthest along in the Google digitization project and because it has already developed the applications to download the digitized library content from Google’s servers and make the public domain works viewable by users worldwide, Wilkin says.

The shared repository will be the biggest technology collaboration in CIC’s history. It is possible because the universities’ CIOs have recently invested in networking projects that connect the campuses, says Karen Partlow, CIC’s associate director of technology collaboration.

In 2004, the universities pooled their resources to purchase a 12-mile fiber ring in downtown Chicago, which connects the universities to National LambdaRail, Internet2 and other research networks, as well as to each other, she says. By purchasing the fiber together, the universities saved about $13 million.

In 2006, the CIC CIOs collaborated again, building a framework called the CIC OmniPoP, which allowed them to purchase services and equipment for network connectivity. This makes the shared repository for digitized library materials and other collaborations possible. The OmniPoP collaboration has saved each university $1 million in start-up costs and about $600,000 in annual recurring costs, she says.

“Collaboration is not easy, but the CIC CIOs are tight-knit and they select projects that clearly provide collaborative value, making the so-called ‘hassle factor’ of working together worth it,” Partlow says.

Michigan in the Mix

University librarians, not the CIOs, will manage the shared repository. They will meet this fall to finalize plans and decide how to build and manage it. Each university is expected to contribute about $500,000 over the six-year Google digitization project to help fund the repository, Sandler says.

One plan is to use the University of Michigan’s current storage system. For archival security, the repository will be backed up and replicated at another CIC university, Sandler says.

Michigan currently houses several dozen terabytes of digitized library materials on RAID servers, using direct-attached storage. But it recently purchased a new clustered storage system that features an initial 100 terabytes of storage and can increase to 1.6 petabytes, if necessary, Wilkin says. The clustered storage system is more reliable than the old system because there are several layers of redundancy that eliminate downtime, he says.

Michigan’s library IT staff focuses on security to prevent people from accessing copyrighted materials, Wilkin says. The hardware is housed in locked rooms. IT staff members regularly perform network port scanning and review software code to prevent anyone from hacking into the system and getting into copyrighted works.

The library IT staff has also built software programs to manage and view the digitized content, and Michigan will share them with other universities in the CIC. They include a “book reader,” which allows library patrons to flip through digitized books online, and an application that will allow specialized software to “read” digitized text to blind users. Wilkin’s staff also plans to use an open-source search engine to make the shared repository searchable by users.

Google’s Role

UW-Madison’s library staff says Google’s customer service and communication are excellent. Google employees communicate regularly by e-mail and frequently check in to make sure books are returned after scanning, says Edward Van Gemert, acting director of libraries at UW-Madison.

John Price Wilkin of the University of Michigan Library, which is in the process of becoming digital.

The UW-Madison library hired several new staffers to help with the Google project. Duties include pulling books from the collection and determining if they’re in good shape for scanning.

Google’s scanners are low-impact devices, so the process doesn’t cause deterioration, says Dan Clancy, engineering director for Google Book Search.

Michigan helped Google refine its early digitization processes and has seen the speed of digitization increase over time because of continued improvements, Wilkin says.

Google has built scanning facilities in different parts of the country, which helps speed the scanning process, Clancy says. For Michigan, Google uses trucks to haul the college’s library materials to a nearby facility. Google has never lost any materials, Clancy adds.

Google and university libraries track the books through barcodes attached to each book, which link to card catalog information.

Michigan developed software that queries Google regularly to see if new materials have been digitized. When Google’s system says materials are available, the application downloads them to Michigan’s servers. Another in-house Michigan application validates the content and sends the information to the online card catalog, which automatically updates its records and declares to the public that the material is available, Wilkin says.
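The workflow Wilkin describes (poll, download, validate, update the catalog) can be sketched in a few lines. This is a minimal illustration, not Michigan's software: the `Catalog` class, the `sync_once` function and the callables passed to it are all hypothetical stand-ins, and the article does not describe how Google's systems are actually queried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Catalog:
    """Hypothetical stand-in for the online card catalog."""
    available: Dict[str, bool] = field(default_factory=dict)

    def mark_available(self, barcode: str) -> None:
        self.available[barcode] = True


def sync_once(
    list_ready: Callable[[], Iterable[str]],  # asks the vendor what is ready
    download: Callable[[str], bytes],         # fetches one digitized volume
    validate: Callable[[bytes], bool],        # in-house content check
    store: Dict[str, bytes],                  # stand-in for local servers
    catalog: Catalog,
) -> List[str]:
    """Run one polling pass and return the barcodes newly added."""
    added = []
    for barcode in list_ready():
        if barcode in store:        # already downloaded on an earlier pass
            continue
        content = download(barcode)
        if not validate(content):   # skip anything that fails validation
            continue
        store[barcode] = content
        catalog.mark_available(barcode)
        added.append(barcode)
    return added


# Usage with fake data: one good volume and one empty (invalid) volume.
remote = {"b1": b"page data", "b2": b""}
store: Dict[str, bytes] = {}
catalog = Catalog()
added = sync_once(lambda: list(remote), remote.__getitem__,
                  lambda content: len(content) > 0, store, catalog)
print(added)
```

Running the function on a schedule would give the "queries Google regularly" behavior; volumes are keyed by barcode here because that is how the article says books are tracked.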

The software checks the copyright rules for each book and determines the type of access it should give users. The university gives only bibliographic information for copyrighted materials, he says.
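A copyright check of this kind can be reduced to a small rule. The sketch below is an assumption-laden illustration rather than Michigan's actual logic: it treats any book published before 1923 as public domain (the cutoff this article cites for public domain books) and everything else as copyrighted. The names `Access` and `access_for` are hypothetical.

```python
from enum import Enum

# Assumption: publication year alone decides copyright status, using the
# pre-1923 cutoff mentioned in the article. Real determinations involve
# more than the year.
PUBLIC_DOMAIN_CUTOFF_YEAR = 1923


class Access(Enum):
    FULL_VIEW = "full view"
    BIBLIOGRAPHIC_ONLY = "bibliographic information only"


def access_for(publication_year: int) -> Access:
    """Return the access level a user gets for a book of the given year."""
    if publication_year < PUBLIC_DOMAIN_CUTOFF_YEAR:
        return Access.FULL_VIEW
    return Access.BIBLIOGRAPHIC_ONLY


for year in (1890, 1923, 1975):
    print(year, access_for(year).value)
```

Keeping the rule in a single function means the repository can change how it classifies a book without touching the code that displays it.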

Once the shared repository is built, the CIC can contemplate building services on top of the content, such as collaborating with the University of Illinois’ National Center for Supercomputing Applications to support text mining and linguistic analysis of the data, Sandler says.

Meanwhile, UW-Madison’s Van Gemert is looking forward to the shared repository, which will let the university make its digitized materials, along with content from other universities, available to faculty and students.

“The shared repository puts us in a stronger position in years to come,” he says.

Preserving Ancient Content

“With the CIC, we want to target their world-class collections, so we get a vast range of expertise that complements what we already have,” says Dan Clancy, engineering director for Google Book Search. “What we’re doing is in line with their mission. In some cases, their materials are centuries old, and we are working to preserve this valuable content and facilitate access.”

Other Digitization Projects

Google isn’t the only game in town when it comes to book digitization projects. There’s the Open Content Alliance (OCA), launched by the nonprofit Internet Archive and Yahoo. Microsoft, which supports OCA, is also digitizing books for its own Live Search site.

While competition may play a role in the technology companies’ involvement, universities see them as collaborative partners, enabling the digitization of their library collections and giving their materials a larger audience. The University of California, for example, has partnered with all three efforts.

“These are all opportunities to take the vast resources that are in the UC system — some 32 million objects — and make them available and discoverable over the Web,” says Robin Chandler, director of data acquisitions at the California Digital Library, which focuses on digital collections for the UC library system and digitizes some of its own materials.

Google’s project, which launched in 2004, has had its share of controversy. The Association of American Publishers and the Authors Guild filed lawsuits against Google in 2005, accusing the company of copyright infringement for making digital copies of copyrighted library books. Google argues that using the snippets of copyrighted material that it makes available is fair use.

The OCA and Microsoft, whose digitization efforts began in 2005, have taken a different tack and focus on public domain books. That means digitizing books that were published before 1923 and are not subject to copyright laws.

Google and Microsoft are digitizing UC’s materials for free. Google will scan 2.5 million volumes over six years. Microsoft, which has a yearly contract with UC, scans thousands of volumes a year, Chandler says. With the OCA, the library funded the scanning of historical mathematics books.

The University of Illinois at Urbana-Champaign is also working with Google and the OCA to digitize its books. The OCA has installed two scanning stations and is scanning materials that include collections about Illinois, says Paula Kaufman, the university’s librarian and dean of libraries. The OCA project is funded through state and private funds.

“We’re eager to provide as much of our collection as possible in digital formats,” Kaufman says.

Do-It-Yourself Scanning

Despite its Google deal, the University of Wisconsin-Madison still has a do-it-yourself mentality when it comes to digitizing its library collection.

Since 2001, the university has digitized 1.4 million pages of materials at its University of Wisconsin Digital Collections Center, and it will continue to pursue its own scanning projects.

That’s because the center’s efforts are wider in scope than Google’s are. The center scans not only its own items but also the library content from throughout the University of Wisconsin system and the Wisconsin Historical Society. And unlike Google, which focuses on text, the center also digitizes audio, video, maps and photographs, says Peter Gorman, head of the UWDCC.

The center, funded by UW-Madison, the University of Wisconsin system and grants, is staffed by 12 full-time professionals and 15 to 20 student employees who scan and add “metadata,” or bibliographical information, to the materials, says Vicki Tobias, the center’s digital services librarian.

The center, which typically has a $700,000 annual budget, runs on several servers and stores its 4 terabytes of digital library content in UW-Madison’s central storage area network, Gorman says. Staffers also use 10 scanners to digitize materials and a variety of commercial, open-source and in-house software to house and organize the digitized content.