YORKTOWN HEIGHTS, NY—At Columbia University's Institute for Data Sciences and Engineering, institute director Kathleen McKeown teaches machines to answer human queries based on online information. That means McKeown's computers not only have to be able to read; they also have to be able to write. She calls the field that gives computers these skills "natural language processing."
Addressing an audience of cognitive computing enthusiasts at yesterday's IBM colloquium, she described how she's teaching machines to process information gleaned from social media.
“The language of online discussion,” she said, “provides unedited perspectives from the everyday person, often in the form of dialogue. It's opinions, viewpoints, and emotion.”
McKeown wants her computers to be able to answer open-ended questions based on socially generated accounts. She showed a slide with an example question and the answer produced by her program.
Q: What is the effect of Hurricane Sandy on New York City?
A: It's dark, there is minor price gouging. There are restaurants selling hot food through their bay windows. The police are doing an amazing job with traffic concerns and directing traffic. Many stores have set up recharging stations for people to recharge their electronics. People are meeting other people in various parts of the city and offering showers. Bars are cash only. Banks have waived ATM fees. Stores with non-perishables are open and selling goods. The buses are running great, just like any other day. Everyone is walking around with flashlights or phone camera lights on.
“It's hard,” she said of this kind of task, “because there is no word overlap between the input and the answers.”
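To see why the lack of word overlap is a problem, consider a simple lexical similarity measure. The snippet below (a hypothetical illustration, not McKeown's system) computes the Jaccard overlap between the example question and the first line of the answer; the score is close to zero, so any method that relies on keyword matching between question and answer would fail here.

```python
# Toy illustration of the "no word overlap" problem: lexical similarity
# between the question and answer above is near zero.

def overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

question = "what is the effect of hurricane sandy on new york city"
answer = "it's dark, there is minor price gouging"
print(overlap(question, answer))  # well below 0.1
```

A real system would need semantic signals (topic, context, source) rather than shared vocabulary to connect the two.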
At this point her team takes a semi-supervised approach, meaning the computers have a bit of help on their responses. The computers get access to posts that have been manually annotated.
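One common semi-supervised pattern is self-training: start from a small manually annotated set, then fold confidently classified unlabeled posts back into the training data. The sketch below is a deliberately simplified, hypothetical version of that idea (keyword scoring, toy data); McKeown's actual pipeline is not described in this detail.

```python
# Self-training sketch: a toy semi-supervised relevance classifier.
# The data, keyword scoring rule, and threshold are all hypothetical.

def keyword_score(post, keywords):
    """Fraction of known keywords that appear in the post."""
    words = set(post.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def self_train(labeled, unlabeled, threshold=0.5, rounds=3):
    """Grow a keyword vocabulary from confidently classified posts."""
    # Seed the vocabulary from manually annotated relevant posts.
    keywords = set()
    for post, is_relevant in labeled:
        if is_relevant:
            keywords.update(post.lower().split())
    for _ in range(rounds):
        confident = [p for p in unlabeled
                     if keyword_score(p, keywords) >= threshold]
        for post in confident:
            keywords.update(post.lower().split())
            unlabeled.remove(post)
    return keywords

labeled = [("power outage downtown", True), ("great pizza recipe", False)]
unlabeled = ["power outage in brooklyn", "stores open downtown", "cat video"]
vocab = self_train(labeled, unlabeled)
print("power" in vocab)  # prints True
```

The point of the pattern is that a handful of annotations can bootstrap coverage of a much larger unlabeled stream, at the cost of compounding any early mistakes.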
“Now when is a post reliable?” McKeown asked. “When should we select it to include in the answer to a question? One factor we're looking at for that is influence. We want to be able to detect online influencers.”
They do that by identifying situational influence. That comes from dominance in conversation, she clarified, not the number of followers a user has.
For example, she shows the edit history of a Wikipedia article. One editor in particular seems to be answering other editors' questions, and no one corrects her actions. McKeown's program would therefore deem that editor reliable, and her posts would be weighted as especially strong sources. In this way, McKeown says, her program has gained intuition.
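The Wikipedia example suggests a simple scoring scheme: an editor who answers many questions and is rarely corrected dominates the conversation. The snippet below is a toy, hypothetical rendering of that "situational influence" idea, not McKeown's actual algorithm.

```python
# Toy "situational influence" score: answers given, discounted by
# corrections received. Events and formula are hypothetical.

from collections import defaultdict

def influence_scores(events):
    """events: (actor, action, target) tuples, where action is 'answer'
    (actor answered target's question) or 'revert' (actor reverted
    target's edit)."""
    answers = defaultdict(int)   # answers each editor gave
    reverted = defaultdict(int)  # times each editor was corrected
    for actor, action, target in events:
        if action == "answer":
            answers[actor] += 1
        elif action == "revert":
            reverted[target] += 1
    editors = set(answers) | set(reverted)
    # More answers given and fewer corrections received -> more influence.
    return {e: answers[e] / (1 + reverted[e]) for e in editors}

events = [
    ("alice", "answer", "bob"),
    ("alice", "answer", "carol"),
    ("dave", "answer", "bob"),
    ("carol", "revert", "dave"),
]
scores = influence_scores(events)
print(scores["alice"] > scores["dave"])  # prints True
```

Note the score deliberately ignores follower counts, matching McKeown's distinction between conversational dominance and audience size.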
At this point, she says, her natural language processing machines achieve about 65% accuracy in answering open-ended questions about events.
“Our ultimate goal,” McKeown says, “is to be able to combine all of these resources into one.” She pulls up a mock-up example. It's a map of New York City. Text bubbles linked to different points on the map pop up sequentially, marking the important events of Hurricane Sandy in the order they occurred. The text bubbles provide summaries pulled from online news coverage, journal articles, and social media accounts.
“What is the outlook for natural language processing?” she asked. “Well, there's huge potential. Human language is the currency of communication. The web contains troves of language examples that can be exploited for learning.”
“There are many application areas in which it's important,” McKeown said. In finance, she explained, people want to be able to understand public sentiment toward companies in order to make investment decisions. In medicine, machines like Watson can read through journal articles to find important information that can improve an individual patient's care. And in politics, natural language processing can analyze social media responses to gauge the impact of different speakers.
McKeown noted that currently one of her team's biggest challenges has been to secure government funding. “I would note that there's a fine line between research and applications,” she said, suggesting that government funders may be less than cognizant of the real world uses of her work.
She noted that the next goal of her team's research will be to use natural language processing at scale, with the full breadth of data that the internet has to offer.