When a child is learning to speak, no one bothers explaining the difference between subjects and verbs, or where they fall in a sentence. That is, however, how humans teach computers to understand language: We annotate sentences to describe the structure and meaning of words, and then we use those sentences to train syntactic and semantic parsers. These parsers help voice-recognition systems like Amazon's Alexa understand natural language. It's a time-consuming process and one that's especially difficult for less common languages.
This week, researchers from MIT are presenting a paper that describes a new way to train parsers. Mimicking the way a child learns, the system observes captioned videos and associates the words with recorded actions and objects. It could make it easier to train parsers, and it could potentially improve human interactions with robots.
For instance, a robot equipped with this parser could observe its environment to reinforce its understanding of a verbal command -- even if the command wasn't clear.
"People talk to each other in partial sentences, run-on thoughts, and jumbled language," said Andrei Barbu, a researcher at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds, and Machines (CBMM). "You want a robot in your home that will adapt to their particular way of speaking... and still figure out what they mean."
Barbu is a co-author of the paper being presented this week at the Empirical Methods in Natural Language Processing conference. To train their parser, the researchers combined a semantic parser with a computer vision component trained in object, human and activity recognition in video.
Next, they compiled a dataset of about 400 videos depicting people carrying out actions such as picking up an object or walking toward an object. Participants on the crowdsourcing platform Mechanical Turk wrote 1,200 captions for those videos, 840 of which were set aside for training and tuning. The rest were used for testing.
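The split described above can be sketched in a few lines. This is a generic illustration, not the authors' actual pipeline; the caption strings and the fixed 840/360 division are stand-ins for the Mechanical Turk data.

```python
import random

def split_captions(captions, n_train=840, seed=0):
    """Shuffle captions and split them into training/tuning and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = captions[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 1,200 placeholder captions standing in for the crowdsourced annotations
captions = [f"caption {i}" for i in range(1200)]
train, test = split_captions(captions)
print(len(train), len(test))  # 840 360
```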
By associating the words with the actions and objects in a video, the parser learns how sentences are structured. With that training, it can accurately predict the meaning of a sentence without a video.
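The intuition behind this kind of weak supervision can be shown with a toy example: among several candidate readings of a caption, prefer the one that best matches what the vision component detected in the paired video. This is a simplified illustration, not the paper's model; the detection labels and the overlap score are hypothetical.

```python
def score(interpretation, detections):
    """Fraction of an interpretation's symbols found among the video detections."""
    return sum(1 for sym in interpretation if sym in detections) / len(interpretation)

detections = {"person", "cup", "pick_up"}      # hypothetical vision output for one video
candidates = [
    ("person", "pick_up", "cup"),              # reading consistent with the video
    ("person", "walk_toward", "cup"),          # reading with the wrong action
]
best = max(candidates, key=lambda c: score(c, detections))
print(best)  # ('person', 'pick_up', 'cup')
```

The video thus acts as the training signal: no human ever marks which parse is correct, yet the parser is pushed toward interpretations the visual evidence supports.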
Since captioned videos are easier to produce than annotated sentences, this approach should make it easier to train parsers.
Meanwhile, this approach could even help us better understand the way young children learn to speak, the researchers said.
"A child has access to redundant, complementary information from different modalities, including hearing parents and siblings talk about the world, as well as tactile information and visual information, [which help him or her] to understand the world," said co-author Boris Katz, a principal research scientist and head of the InfoLab Group at CSAIL. "It's an amazing puzzle, to process all this simultaneous sensory input. This work is part of bigger piece to understand how this kind of learning happens in the world."