Humans, cover your mouths: Lip reading bots in the wild

There are many uses for machine lip reading, from transcription in noisy contexts to resolving multi-speaker concurrent speech and improving automated speech recognition. But let's let our imaginations run wild.

For me, the scariest moment in the movie "2001" is when astronauts David Bowman and Frank Poole meet in the EVA pod to discuss the behavior of the artificially intelligent HAL 9000 computer, and HAL reads their lips. Science fiction? Not anymore!

In the paper "Lip Reading Sentences in the Wild," researchers Joon Son Chung of Oxford University and Andrew Senior, Oriol Vinyals, and Andrew Zisserman of Google tested an algorithm that bested professional human lip readers. Soon, surveillance videos may not only show your actions but also reveal the content of your speech.

Google's DeepMind

The researchers, working with Google's DeepMind, trained a neural network on thousands of hours of subtitled BBC television videos. The videos showed a broad spectrum of people speaking in a wide variety of poses, activities, and lighting, hence the "in the wild" designation.

Lip reading is an active area of AI research, and this team is not the first to work on it. But by using thousands of hours of BBC videos, their algorithm achieved the best results yet.

Their 'Watch, Listen, Attend and Spell' (WLAS) neural network learned to transcribe videos of mouth motion to characters, using over 100,000 sentences from the videos. By translating mouth movements to individual characters, the neural net spelled out words.

In training the AI, one of the team's innovations was to start out with single words, and gradually increase the length of samples to reach complete sentences. That sped up training and dramatically improved test performance.
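To make the idea concrete, here is a minimal sketch of that kind of staged training schedule. It is not the team's code: the function name `curriculum_stages` and the particular word limits are hypothetical, chosen only to show samples growing from single words toward full sentences.

```python
from typing import Iterator, List


def curriculum_stages(samples: List[str], word_limits: List[int]) -> Iterator[List[str]]:
    """Yield one training set per stage; each stage admits longer samples.

    `word_limits` is a hypothetical schedule, e.g. [1, 2, 8]: first only
    single words, then two-word phrases, then sentences up to 8 words.
    """
    for limit in word_limits:
        # Keep every sample whose word count fits within this stage's limit.
        yield [s for s in samples if len(s.split()) <= limit]


samples = ["hello", "good morning", "the weather is fine today"]
for stage_number, stage in enumerate(curriculum_stages(samples, [1, 2, 8]), start=1):
    print(stage_number, stage)
```

Each stage is a superset of the previous one, so the model keeps seeing the short samples it started with while longer ones are phased in.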

The videos were 120x120 pixel images, showing only the lips, sampled every 40 milliseconds.
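A frame every 40 milliseconds works out to 25 frames per second. The short sketch below does that arithmetic and shows how many 120x120 frames a clip of a given length would contain. The helper names and the 3-second clip are illustrative assumptions, and the channel count is left out because it is not stated here.

```python
FRAME_INTERVAL_MS = 40   # one frame every 40 ms, as described above
FRAME_SIZE = (120, 120)  # height and width in pixels, lips only


def frames_per_second(interval_ms: int = FRAME_INTERVAL_MS) -> float:
    """Convert a sampling interval in milliseconds to a frame rate."""
    return 1000 / interval_ms


def clip_shape(duration_s: float, interval_ms: int = FRAME_INTERVAL_MS):
    """Return (frames, height, width) for a clip of `duration_s` seconds."""
    frames = int(duration_s * 1000 // interval_ms)
    return (frames, FRAME_SIZE[0], FRAME_SIZE[1])


print(frames_per_second())  # 25.0
print(clip_shape(3.0))      # (75, 120, 120)
```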

Results

The researchers found that a professional lip reader correctly deciphered less than one-quarter of the spoken words. Their Watch, Attend and Spell (WAS) model, the lips-only variant of the network, deciphered half of the spoken words, significantly better than professional lip readers.
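How do you score a transcript against what was actually said? One common yardstick in speech recognition is word error rate: the number of word substitutions, insertions, and deletions needed to turn the guess into the reference, divided by the reference length. The sketch below is an illustration of that metric only; this article does not spell out exactly how the figures above were computed, and the function name `word_error_rate` is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("open the pod bay doors", "open a pod day doors"))  # 0.4
```

In that example two of the five words are wrong, so the error rate is 0.4 and three of the five words, or 60 percent, were read correctly.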

The Storage Bits take

There are many practical uses for machine lip reading, such as transcription in noisy contexts, dubbing and/or transcribing silent films, resolving multi-speaker concurrent speech, and improving automated speech recognition.

But let's let our imaginations run wild. With chatbots inventing their own languages, and AI surpassing human intelligence someday this century, it's clear that humans will have to cover their mouths -- as David and Frank should have -- in order to have secure verbal communication during the coming war with our AI overlords.

Or maybe we'll all just have to mumble.

Courteous comments welcome, of course.

Previous and related coverage

Google DeepMind wins again: AI trounces human expert in lip-reading face-off

Google DeepMind and Oxford University researchers have developed an automated lip-reader that far outperforms a human expert.
