MIT ups the ante in getting one AI to teach another

Researchers at MIT used two machine learning networks working in tandem to improve speech processing via image recognition, an example of "multi-modal" training where vision and sound reinforce one another. It could radically simplify the task of natural language processing.
Written by Tiernan Ray, Senior Contributing Writer

Computers have gotten so good at recognizing images via machine learning that it raises a question: why not use that ability to teach the computer other things? That's the spirit of new research from the Massachusetts Institute of Technology, which hooked up natural language processing to image recognition.

MIT coordinated the activity of two machine learning systems, one for image recognition and another for speech parsing. Simultaneously, the image network learned to pick out the exact place in a picture where an object is, and the speech network picked out the exact moment in a sentence containing a word for that object in the picture.

The two networks learned together, reinforcing one another until they converged on a joint answer that represents the union of the location of the object and the moment of the spoken word. They "co-localized," as it's put, spatially and temporally.


The paper, "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input," was presented this week at the European Conference on Computer Vision by MIT researcher David Harwath and colleagues Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass, all of MIT's Computer Science and Artificial Intelligence Laboratory, CSAIL.

The authors were inspired by the learning process of babies. Babies learn to associate an object they see with the word an adult says to them. The baby's process is a messy one, with lots of "noise" in the different ways objects appear in the world and the different ways that different adult voices sound. Although the scientists weren't trying to decipher the human learning process, they found it intriguing to subject a neural network to a similar kind of challenge, namely, learning with only minimal supervision.

As with babies, the training employed none of the standard supervision used in similar prior research. All the data was submitted to the computer in raw form. As the authors write, "Both the speech and images are completely unsegmented, unaligned, and unannotated during training, aside from the assumption that we know which images and spoken captions belong together."

The work aims at an area of machine learning that has struggled. While ML has made great strides in image recognition, natural language speech processing has lagged. Systems such as Apple's Siri assistant require extensive training via text transcriptions of speech, and use explicit "segmentation" of the audio stream to pick out and memorize words.

The hope here is to lessen the need for transcripts, and thus make possible speech training beyond just the "major languages of the world" such as English. There are over 7,000 spoken human languages, the authors observe, and having to train with transcripts isn't going to scale to that many tongues.


Harwath and colleagues constructed two convolutional neural networks, or CNNs, one for image detection and one for speech detection. Interestingly, the raw audio waveforms are converted into spectrogram images so that the speech can be processed visually, just like the pictures.
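The waveform-to-image step can be sketched in a few lines. This is a minimal illustration using a plain log-magnitude short-time Fourier transform; the paper itself uses mel-scale filterbank features, and the frame sizes below are illustrative defaults, not the authors' settings.

```python
import numpy as np

def spectrogram(waveform, frame_len=400, hop=160, eps=1e-10):
    """Turn a 1-D audio waveform into a 2-D log-magnitude spectrogram
    that a CNN can treat like a grayscale image. Illustrative sketch:
    25 ms frames with a 10 ms hop, assuming 16 kHz audio."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (frames, freq bins)
    return np.log(mag + eps).T                  # (freq bins, frames)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (201, 98): frequency x time, image-like
```

Once the audio is in this form, the "speech" network can reuse the same convolutional machinery that works so well on pictures.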

The networks were trained on 400,000 pairs of images and corresponding audio clips describing them. The clips were obtained by recruiting people on Amazon's Mechanical Turk service to speak a description of what's in each picture.

False captions, ones that don't belong to a given image, were also fed into the networks as negative examples, to reinforce the correct matches. The networks kept processing images and audio until they achieved the best match between a small patch of the image and a small segment of the audio.
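The matching idea can be sketched as follows. A grid of dot products between image-patch embeddings and audio-frame embeddings scores every patch against every moment of speech (the paper calls this grid a "matchmap"), and mismatched captions are pushed below matched ones by a margin loss. The pooling choice and all dimensions here are illustrative; the paper evaluates several pooling variants.

```python
import numpy as np

rng = np.random.default_rng(0)

def matchmap_score(img_feats, aud_feats):
    """Score an (image, spoken caption) pair. img_feats: (patches, d)
    patch embeddings; aud_feats: (frames, d) audio-frame embeddings.
    The dot-product grid is the matchmap; here it is pooled by taking
    the best patch for each audio frame, then averaging over time."""
    matchmap = img_feats @ aud_feats.T          # (patches, frames)
    return matchmap.max(axis=0).mean()

def triplet_loss(img, pos_aud, neg_aud, margin=1.0):
    """A matched pair should outscore a mismatched one by a margin."""
    return max(0.0, margin - matchmap_score(img, pos_aud)
                    + matchmap_score(img, neg_aud))

# Toy embeddings: a 7x7 grid of image patches and 20 audio frames (d=64).
img = rng.normal(size=(49, 64))
pos = img[:20]                       # caption frames aligned with patches
neg = rng.normal(size=(20, 64))      # frames from an unrelated caption
print(matchmap_score(img, pos) > matchmap_score(img, neg))  # True
```

The peak of the matchmap is what gives the co-localization: it names both the patch in the picture and the moment in the audio that best correspond.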

Future directions are already being explored, such as using pictures to translate between languages.


In a separate paper that builds upon the first, titled "Vision as an Interlingua," Harwath and colleagues paired an English-language caption network, and its images, with a Hindi-language network using captions for the same images, recorded by Hindi speakers. The authors were able to use a caption in one language to recall a caption in another. The visual domain, they write, acted "as an interlingua or 'Rosetta Stone' that serves to provide the languages with a common grounding."
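The retrieval trick works because training places an image and its captions in both languages near one another in a shared embedding space. The sketch below simulates that outcome with noisy copies of toy image embeddings; the embeddings and dimensions are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shared embedding space (d=64): each image and its two captions
# are assumed to have been pulled close together during training.
images  = rng.normal(size=(100, 64))
english = images + 0.1 * rng.normal(size=(100, 64))
hindi   = images + 0.1 * rng.normal(size=(100, 64))

def nearest(query, bank):
    """Index of the bank vector with the highest dot-product similarity."""
    return int(np.argmax(bank @ query))

# Cross-lingual retrieval with no parallel text: English caption 7
# finds its Hindi counterpart by pivoting through the visual space.
print(nearest(english[7], hindi))  # 7
```

No English-to-Hindi dictionary or transcript is consulted at any point; the image embeddings alone provide the common ground.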

Of course, there's a ways to go before matching the linguistic complexity a baby soon masters. Harwath and company have achieved what they call "semantic alignment" between words and objects, but it's just a correspondence. As the authors acknowledge, future work should "go beyond simple spoken descriptions and explicitly address relations between objects within the scene" in order to "learn richer linguistic representations."
