
Text mining the New York Times

Written by Roland Piquepaille

Text mining is a computer technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method called topic modeling, computer scientists from the University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, apartment prices in Brooklyn, or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate professionals, and even by you.

Let's start with the introduction of this UCI news release -- and forget the marketing hype.

Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

Here is a quote from one of the researchers.

"We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

Now, let's look at a real example and see how the team discovered links between topics and people. Below is a graph showing "topic-model-based relationships between entities and topics. A link is present when the likelihood of an entity in a particular topic is above a threshold." (Credit: UCI)
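To make the thresholding rule in that caption concrete, here is a minimal Python sketch. The probabilities, the topic labels and the cut-off value are all invented for illustration; the news release does not say which threshold the UCI team used.

```python
# Hypothetical p(entity | topic) values, invented for illustration only.
entity_given_topic = {
    "Tour de France": {"Lance Armstrong": 0.21, "Jan Ullrich": 0.03, "Brooklyn": 0.001},
    "Real estate": {"Brooklyn": 0.15, "Lance Armstrong": 0.0005},
}

# Assumed cut-off; the source does not state the value actually used.
THRESHOLD = 0.01

# Keep a (topic, entity) link only when the entity's likelihood in that
# topic is above the threshold.
links = [
    (topic, entity)
    for topic, probs in entity_given_topic.items()
    for entity, p in probs.items()
    if p > THRESHOLD
]

for topic, entity in links:
    print(f"{topic} -- {entity}")
```

Raising the threshold would thin out the graph, and lowering it would add weaker links, so the picture you get depends on where you set the cut-off.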

Discovering topics in the NYT archives

Here is another example picked from the UCI news release.

For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich." From this, researchers were easily able to identify that topic as the Tour de France. By examining the probability of words appearing in stories about the Tour de France, researchers learned that Armstrong was written about seven times as much as Ullrich.
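The "seven times as much" comparison comes down to a ratio of word probabilities within one topic. Here is a small Python sketch of that arithmetic. The counts are made up and were chosen only so that the ratio mirrors the figure quoted above; they are not the UCI data.

```python
# Hypothetical counts of word tokens assigned to a single topic (not the UCI data).
topic_counts = {"rider": 120, "bike": 90, "race": 110, "armstrong": 70, "ullrich": 10}

total = sum(topic_counts.values())

# p(word | topic): the share of the topic's tokens taken by each word.
word_prob = {word: count / total for word, count in topic_counts.items()}

# How much more often one name shows up than the other within this topic.
ratio = word_prob["armstrong"] / word_prob["ullrich"]
print(f"Armstrong appears {ratio:.1f} times as often as Ullrich in this topic")
```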

But what exactly is 'topic modeling'?

Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning. In contrast, topic modeling, a type of unsupervised learning, doesn't need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.
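If you want to see the idea at work, here is a toy Python sketch of one common formulation of topic modeling, latent Dirichlet allocation fitted by collapsed Gibbs sampling. This is an assumption on my part and not necessarily the exact method used in the UCI work. The function name `fit_topics`, the parameter values and the four tiny "documents" are all hypothetical.

```python
import random
from collections import Counter

def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topic usage per document and word usage per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start by giving every word token a random topic.
    assign = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(num_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(row)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by how well it fits the document and the word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # Report up to three of the most frequent words assigned to each topic.
    return [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]

# Four tiny made-up documents; no category labels are supplied.
docs = [
    "rider bike race rider bike".split(),
    "bike race rider race".split(),
    "apartment price brooklyn apartment".split(),
    "price brooklyn apartment price".split(),
]
topics = fit_topics(docs, num_topics=2)
for k, words in enumerate(topics):
    print(f"topic {k}: {', '.join(words)}")
```

Note that nobody tells the program which words belong to cycling or to real estate. It only sees which words occur together, which is what makes the approach unsupervised. Because the sampler is randomized, the grouping it finds on such a small corpus can vary with the seed.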

Newman and his colleagues presented this research at the IEEE Intelligence and Security Informatics Conference (ISI 2006), which was held in May in San Diego. Their technical paper is titled "Analyzing Entities and Topics in News Articles Using Statistical Topic Models" (PDF format, 12 pages, 248 KB). The graph above was extracted from this paper.

For more information about the topic modeling technique used by these scientists, you can look at the work of Mark Steyvers and his Memory and Decisions Laboratory (MADLAB).

In particular, you can try the software available in the Topic Modeling Toolbox. And since you probably don't have the archives of the New York Times at your disposal for experiments, start with something smaller and see what kind of topics you discover -- using the contents of this blog, for example.

Sources: University of California - Irvine, July 26, 2006; and various web sites

