Who wrote that scientific paper?


If your best friend's name is Bill Gates, you will probably have trouble finding him online with a search engine: too many results will point you to the richest man on the planet. In the scientific world, things can be even worse. Imagine a researcher named 'John Doe' who has been published in several journals, each with a different naming policy. His name might appear as 'John Doe,' 'Doe John,' 'J. Doe,' 'Doe J.' or even 'Doe, J.' How will you find the papers he really wrote without knowing the university he works for? Now, computer scientists at Penn State University have developed a system that solves the 'who is J. Smith' puzzle. They found a way to 'disambiguate' authors with similar names, and it works well: their system was able to identify the authors of more than 90% of papers written by almost 500 different authors.

The development of this system was led by C. Lee Giles, professor at the College of Information Sciences and Technology, with the help of two doctoral students, Jian Huang and Seyda Ertekin.

Here is a brief explanation of how this system works.

"The system works by using machine-learning methods to cluster together names that the system believes to be similar. If you think there’s another parameter that’s relevant, you can change the algorithm and include it," Giles said.

In the figure below, you can see the process used for name disambiguation. Given a research paper, each author appearance in the paper is associated with a metadata record consisting of a set of attributes. The goal is to find a function that maps each of these records to a single person. (Credit: Jian Huang, Seyda Ertekin, C. Lee Giles, Penn State)

Process used for name disambiguation
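
To make the idea of a metadata record more concrete, here is a minimal Python sketch. It is not the Penn State implementation: the record fields (name, affiliation, coauthors), the helper names and the 0.5/0.5 weights are all assumptions chosen for illustration, and a hand-written score stands in for whatever function the real system learns. It shows how name variants such as 'John Doe' and 'Doe, J.' can be treated as compatible, and how the other attributes can then suggest whether two records belong to the same person.

```python
import re
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class AuthorRecord:
    """One author appearance in one paper (fields are illustrative only)."""
    name: str
    affiliation: str = ""
    coauthors: frozenset = frozenset()


def tokens(name):
    """Lowercase a name and split it on punctuation and whitespace."""
    return [t for t in re.split(r"[^\w]+", name.lower()) if t]


def token_match(x, y):
    """True if the tokens are equal or one is a one-letter initial of the other."""
    if x == y:
        return True
    short, long_ = sorted((x, y), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(a, b):
    """True if some ordering of b's tokens matches a's tokens one by one."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    return any(
        all(token_match(x, y) for x, y in zip(ta, perm))
        for perm in permutations(tb)
    )


def jaccard(a, b):
    """Share of elements the two sets have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(r1, r2):
    """Hypothetical hand-weighted score in [0, 1]; 0 if the names cannot match."""
    if not names_compatible(r1.name, r2.name):
        return 0.0
    score = 0.5 * jaccard(r1.coauthors, r2.coauthors)
    if r1.affiliation and r1.affiliation == r2.affiliation:
        score += 0.5
    return score


a = AuthorRecord("John Doe", "Penn State", frozenset({"A. Smith"}))
b = AuthorRecord("Doe, J.", "Penn State", frozenset({"A. Smith", "B. Lee"}))
c = AuthorRecord("Jane Doe", "Penn State", frozenset({"A. Smith"}))
print(similarity(a, b))  # 0.75
print(similarity(a, c))  # 0.0
```

Note that 'J. Doe' is compatible with both 'John Doe' and 'Jane Doe' under this check, which is exactly why the name alone is not enough and the other attributes are needed.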

This second figure shows the system architecture. It starts with the metadata extraction module, which extracts the author metadata records from each paper, and ends with the DBSCAN module, which builds clusters of papers by different authors. (Credit: Jian Huang, Seyda Ertekin, C. Lee Giles, Penn State)

Disambiguation system architecture
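
The final clustering step can be illustrated with a small, self-contained version of DBSCAN, a density-based clustering algorithm. This is only a sketch under stated assumptions: the sample records, the coauthor-overlap distance and the `eps` and `min_pts` values are invented for the example and do not come from the paper, which is described above as using machine-learning methods to decide which records are similar.

```python
def dbscan(items, distance, eps, min_pts):
    """Plain DBSCAN: return one cluster label per item, with -1 meaning noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(items)

    def neighbors(i):
        # Every item within eps of item i (this includes i itself).
        return [j for j in range(len(items))
                if distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels


def coauthor_distance(r1, r2):
    """Illustrative distance: 1 minus the Jaccard overlap of coauthor sets."""
    a, b = r1[1], r2[1]
    return 1.0 - len(a & b) / len(a | b)


# Invented records: (name as printed, set of coauthors).
records = [
    ("J. Doe",   {"a. smith", "b. lee"}),
    ("John Doe", {"a. smith", "b. lee", "c. wu"}),
    ("Doe, J.",  {"a. smith", "c. wu"}),
    ("J. Doe",   {"x. chen", "y. kim"}),
    ("Doe J.",   {"x. chen", "y. kim", "z. ali"}),
    ("J. Doe",   {"q. solo"}),
]

labels = dbscan(records, coauthor_distance, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

In this toy run, the first three records end up in one cluster and the next two in another, suggesting two different people who both publish as 'J. Doe'. The last record shares no coauthors with anyone and is left as noise. One reason a density-based method suits this task is that the number of distinct authors does not have to be known in advance.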

For more information, you can read the paper written by Giles, Huang and Ertekin, which was presented at the recent 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases in Berlin. The paper is called "Efficient Name Disambiguation for Large-Scale Databases" (PDF format, 9 pages, 568 KB).

Sources: Penn State University news release, December 14, 2006; and various websites
