“We were interested in automating a process that is labor intensive and also providing search capabilities,” says Penn State’s C. Lee Giles (left). Penn State’s Prasenjit Mitra says he and Giles want to improve the ranking algorithm for their table search engine.
Jan 09 2008

A Search for Research

Penn State researchers develop first-of-a-kind search engine for tables.

Tables are an important data resource for most university researchers. Although some applications can identify and extract tables from text, none of the current search engines can search for tables across documents. Information technology managers, scientists, scholars, researchers, students and faculty must manually browse documents to find tables and then sift through the data for what they are seeking.

TableSeer not only identifies and extracts tables from Adobe Portable Document Format (PDF) files but also indexes them and ranks search results using factors such as a table’s title, text references to the table and date of publication.

The new search engine is the invention of C. Lee Giles and Prasenjit Mitra, two professors with similar interests who found themselves working in the same corridor at Penn State, brainstorming new ideas and eventually collaborating.

Giles and Mitra developed TableSeer to facilitate table searching in articles on chemistry, but the search engine can be used wherever data is presented in tabular form, says Giles, the David Reese Professor of Information Sciences and Technology and co-director of the IST Cyber-Infrastructure Lab.

“Tables are important pieces of information in many documents,” he says. “Most of the time, the way you get information from tables is by copying it down by hand. This is a very labor-intensive process. We were interested in automating a process that is labor intensive and also providing search capabilities.”

Mitra says he had found that search engines did a good job of indexing the textual part of documents but not the tables. “Usually tables are either badly extracted, which means the engines don’t differentiate them from the text, or they’re skipped altogether,” says Mitra, an assistant professor at Penn State’s College of Information Sciences and Technology. “We believe that tables convey some of the most important information in the document, and that was what was missing in these search engines.”

But What’s the Value?

A chief goal of their work was to create a tool that would let scientists spend more time on research and less on manual tasks, such as searching through articles for tables and relevant data points. To that end, they created a ranking algorithm, TableRank, for their table-search application.

TableRank can identify tables found in frequently cited documents and weigh that factor into the search results. “The ranking function goes in and predicts which tables are the most important ones and puts them right up on the front page,” Mitra says. “The algorithm uses information from the tables. It extracts information from the captions and from the references to the tables and text, and it finds out which part of the document is talking about a particular table. It also uses content in the rest of the document, such as what the document is about and who it was written by.”
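The article does not give TableRank's actual formula, so the following Python sketch only illustrates the general idea Mitra describes: score each table by where the query terms appear (caption, text references, rest of the document) and add a boost for frequently cited documents. The field names, the weights and the logarithmic citation boost are all assumptions made for illustration, not details of TableSeer.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class TableRecord:
    """Hypothetical record for one extracted table (not TableSeer's schema)."""
    caption: str          # table title or caption
    reference_text: str   # sentences in the document that mention the table
    document_text: str    # other document content, such as title and authors
    citation_count: int   # how often the containing document is cited


# Assumed weights, chosen only for illustration.
FIELD_WEIGHTS = {"caption": 3.0, "reference_text": 2.0, "document_text": 1.0}
CITATION_WEIGHT = 0.5


def term_hits(query: str, text: str) -> int:
    """Count occurrences of the query's words in text, ignoring case."""
    words = re.findall(r"\w+", text.lower())
    return sum(words.count(term) for term in re.findall(r"\w+", query.lower()))


def score(query: str, table: TableRecord) -> float:
    """Weighted sum of field matches plus a damped citation boost."""
    text_score = sum(
        weight * term_hits(query, getattr(table, field))
        for field, weight in FIELD_WEIGHTS.items()
    )
    if text_score == 0:
        return 0.0  # citations alone never make an unrelated table relevant
    return text_score + CITATION_WEIGHT * math.log1p(table.citation_count)


def rank(query: str, tables: list[TableRecord]) -> list[TableRecord]:
    """Return tables ordered from highest to lowest score."""
    return sorted(tables, key=lambda t: score(query, t), reverse=True)
```

Under these assumed weights, a query term found in a caption counts three times as much as the same term found elsewhere in the document, and the logarithm keeps a heavily cited paper from swamping the text match.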

According to Mitra, in tests using documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats. “We also tested it with documents from the computer science domain,” he adds.

The development of TableSeer is part of an open-source cyberinfrastructure project focusing on searching environmental-chemistry documents and is funded by the National Science Foundation.

Giles says the techniques used should work with any text-based tool. “While we designed and developed TableSeer to facilitate searching of tables occurring in articles in the chemistry domain, it can be used in any domain where data is presented in tabular form including other scientific, technical, social and business areas.”

How can TableSeer better serve colleges and universities? Mitra points to the speed factor as well as the search engine’s role in automating the manual labor part of research. “This is a new tool which allows you to go in and grab either data from tables automatically or find summaries of results much faster. It’s a productivity tool to help researchers, students and scholars,” he says.

Beyond PDFs

Table-searching presents a unique challenge because there is no standard table representation method. Tables can appear in PDFs and PowerPoints, in HyperText Markup Language files and in Microsoft Word documents. “We chose to focus on PDF documents because of their growing popularity in digital libraries and because PDF documents had been overlooked in other table-search efforts,” Giles says. “In the future, we will be working on adapting TableSeer to the other formats.”

The two professors also plan to work on extracting the data returned by TableSeer searches and gathering it in a database so that researchers can reuse the data to check the models and theories they want to study.

“We are also doing ongoing research to improve the ranking algorithm by adding additional features,” Mitra says. “It’s still in the early stages. We started looking into systematically identifying what the authors want. We need to do a large-scale user study to find out what readers and users want. This would change and perhaps improve the ranking function.”

Giles, meanwhile, is working on a search engine that can identify, extract and rank figures found in documents. This may prove more challenging than table extraction, as the researchers will be working with images rather than text.

“We find it quite promising,” he says. “We’re looking at extracting data from plots. We have some preliminary results we find encouraging. In many ways this is as exciting as tables because there is so much data out there in plots that is so difficult to get out, and it’s the only place you can find that data.”

Common Data Delivery Tool

After reviewing 10,000 chemistry, biology and computer science articles and documents, Giles and Mitra found that more than 70 percent included tables.

Spoken-Word Search Research

With lecture materials increasingly made available over the Internet, researchers at the Massachusetts Institute of Technology have developed a search engine that lets MIT students browse lectures for specific terms.

So far, the Computer Science and Artificial Intelligence Laboratory has cataloged more than 200 MIT lectures. Using voice recognition software, the lab creates transcripts of lectures professors post online. The transcripts are then broken into sections by topic. For searches, the engine uses a mathematical formula to compare a search query against 100-word chunks of text.
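The article does not say which mathematical formula the MIT engine uses. As a rough sketch of the chunk-matching idea, the Python below splits a transcript into fixed-size word chunks and compares a query against each one using TF-IDF cosine similarity, a common choice that is assumed here rather than taken from the MIT project. The function names and the smoothing in the IDF term are likewise illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def chunk_words(words, size=100):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, transcript, size=100):
    """Return (chunk_index, score) pairs, best match first."""
    chunks = chunk_words(tokenize(transcript), size)
    n = len(chunks)
    # Document frequency: the number of chunks that contain each term.
    df = Counter(term for chunk in chunks for term in set(chunk))

    def vectorize(tokens):
        counts = Counter(tokens)
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    query_vec = vectorize(tokenize(query))
    scored = [(i, cosine(query_vec, vectorize(chunk)))
              for i, chunk in enumerate(chunks)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Calling `search("entropy", transcript)` on a lecture transcript would return the chunk positions ordered by similarity, which a front end could then map back to a point in the lecture.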

The MIT project, like TableSeer, is funded by the National Science Foundation. To check out the lecture search engine, go to web.sls.csail.mit.edu/lectures.