Open the pod bay doors, please, HAL: Meta's AI simulates lip-reading

The software uses the popular approach of "attention" to make videos of speakers' lips and audio waveforms reinforce one another in ways much more efficient and accurate than prior approaches.
Written by Tiernan Ray, Senior Contributing Writer

"Although you took very thorough precautions in the pod against my hearing you, I could see your lips move."

It is a fact widely known that people hear speech not just by listening with their ears but also by picking up cues from the mouth movements they observe on the part of speakers. 

Similarly, combining visual observation with audio could conceivably help a computer parse human speech better. In a sense, computer programs can read lips, though it is a laborious task to engineer.

Recent AI work by Meta, the parent of Facebook, Instagram and WhatsApp, suggests a more efficient path to a day when computers can read lips just as well as HAL 9000 did when Dr. David Bowman and Dr. Frank Poole tried to evade its audio sensors inside the pod in the movie "2001."

Last Friday, Meta's artificial intelligence scientists published a research report in which they were able to dramatically reduce the effort needed to engineer software to parse the words of the lip movements of speakers in recorded videos. The work was also able to use lip-reading technology to meaningfully improve speech recognition in noisy environments.

The program is "75% more accurate than the best audio-visual speech recognition systems (which use both sound and images of the speaker to understand what the person is saying)," the authors state.

Of course, there's a Metaverse angle here: Not only could the program be used for instantaneous translation, someday, it could also "help generate realistic lip movements in virtual reality avatars, in order to deliver a true sense of presence -- that feeling of being there with someone even if they're on the other side of the world."

The work represents an advance along two lines. One is self-supervised learning, which eschews specific clues, such as text transcripts, and instead has the program spontaneously divine structure in data. The other area of development is so-called multimodal neural networks, which combine data of different kinds in a way where they reinforce one another. 

The result, called AV-HuBERT, the "AV" standing for audio-visual, the "Hu" standing for "hidden unit," combines auditory and visual signals to detect words from lip movements. 

Lead author Bowen Shi and colleagues Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed of Facebook posted their paper, "Learning Audio-Visual Speech Representation By Masked Multimodal Cluster Prediction," on the arXiv preprint server last Friday. The authors also wrote up a blog post that you may find easier to digest.

As Shi & Co. explain, previous work has also been multimodal, combining visual data (video frames) with audio data (waveform snippets) to train a neural network to predict how they match up.

But such programs have tended to rely on some kind of additional, prepared clues, such as a transcription of videos of speakers into text sentences that then serve as labels. The new work is going the self-supervised route, putting together patterns spontaneously without external structure.

"It is the first system to jointly model speech and lip movements from unlabeled data -- raw video that has not already been transcribed," the authors write in their blog post.

Many prior models require "word-level annotated lip-reading videos" to train, "which is costly to collect since they require word boundary information. In contrast to these models, our models are fully pre-trained from scratch using the proposed approach."

The AV-HuBERT program they've invented builds on an audio-only program called HuBERT introduced last year by Hsu and colleagues. As the name implies, HuBERT uses the bi-directional Transformer neural network approach developed at Google in 2018.

By "masking" parts of an audio recording, meaning leaving out sections of an audio waveform, the HuBERT neural network in its training phase had to reconstruct which pieces of audio go with one another. 
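To make the idea of masking concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the HuBERT code: the function name mask_spans, the span length, and the use of a constant fill value are all hypothetical choices made for the example.

```python
import random

def mask_spans(frames, span_len, num_spans, mask_value=0.0, seed=0):
    """Blank out a few contiguous spans of a frame sequence.

    Assumption: masked frames are overwritten with a constant; a real
    model would more likely swap in a learned placeholder vector.
    """
    rng = random.Random(seed)
    masked = list(frames)
    is_masked = [False] * len(frames)
    for _ in range(num_spans):
        # Pick a start so the whole span fits inside the sequence.
        start = rng.randrange(0, len(frames) - span_len + 1)
        for i in range(start, start + span_len):
            masked[i] = mask_value
            is_masked[i] = True
    return masked, is_masked

# Ten toy "audio frames", each a single non-zero number.
frames = [0.1 * i for i in range(1, 11)]
masked, is_masked = mask_spans(frames, span_len=3, num_spans=2)
```

The network sees only the masked sequence, while the is_masked flags record which positions it will later be asked to fill in.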

Now, in AV-HuBERT, Shi and team "fuse" bits of audio with frames from videos of people speaking. The training phase of the neural network proceeds in essentially two stages. First, like the original audio-only HuBERT, they use the attention approach to mask the audio and then group those audio waveforms into clusters, which are groups of examples that are in some way near to each other in their attributes.
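For readers who want a feel for what grouping into clusters means, the following is a rough sketch using plain k-means on toy two-number feature vectors. The choice of k-means, the evenly spaced initialisation, and every name here (kmeans, nearest, features) are assumptions made for illustration; the article does not spell out the clustering procedure.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )

def kmeans(points, k, iters=10):
    # Assumption: initialise centroids from evenly spaced input points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move each centroid to the mean of its members.
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return [nearest(p, centroids) for p in points], centroids

# Six toy frame features forming two obvious groups.
features = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(features, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each frame ends up with a cluster ID, and it is those IDs, rather than human-written transcripts, that serve as the training target.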

Those groupings then become a target for the second stage of the neural network. The multimodal part of AV-HuBERT simultaneously masks both the images of speakers' lips and the audio waveform and then tries to match them to the clusters established in the first stage. In this way, the program calculates which lip configurations correspond to which audio waveforms, thereby "learning" the correlation of mouth movement and audio output.
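The second stage can be sketched in a few lines of Python, with heavy caveats. The sketch assumes that audio and video features are fused by simple per-frame concatenation and that the training loss is a cross-entropy counted only on masked frames; both are simplifying assumptions, and the names fuse and masked_cross_entropy, along with all the numbers, are hypothetical. The predicted probabilities are hard-coded where a real system would have a Transformer produce them.

```python
import math

def fuse(audio_feats, video_feats):
    """Concatenate per-frame audio and video feature vectors."""
    return [a + v for a, v in zip(audio_feats, video_feats)]

def masked_cross_entropy(pred_probs, targets, is_masked):
    """Average -log p(target cluster) over masked frames only."""
    losses = [
        -math.log(probs[t])
        for probs, t, m in zip(pred_probs, targets, is_masked)
        if m
    ]
    return sum(losses) / len(losses)

# Three frames; frame 1 is masked (zeroed) in both streams.
audio = [[0.2, 0.1], [0.0, 0.0], [0.4, 0.3]]
video = [[1.0], [0.0], [0.5]]
is_masked = [False, True, False]
fused = fuse(audio, video)

# Cluster IDs of the kind the first stage produces, one per frame.
targets = [0, 1, 1]

# Stand-in for the network's output: a distribution over 2 clusters per frame.
pred_probs = [[0.9, 0.1], [0.25, 0.75], [0.5, 0.5]]
loss = masked_cross_entropy(pred_probs, targets, is_masked)
```

Because only the masked frame counts toward the loss in this sketch, the program is rewarded for inferring the hidden frame's cluster from the surrounding lip images and audio.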

That is, effectively, a self-supervised approach that divines structure without explicit clues.


The structure of the AV-HuBERT program, starting with visual and audio data entering at the bottom, and being output into final "clusters" at the top.

Meta 2022

The fusion means that the attention placed on image frames and that placed on audio waveforms reinforce one another to produce better clusters than either would alone. Those clusters become the "target" of subsequent tasks, such as lip-reading and speech recognition.

As the authors explain, 

"AV-HuBERT simultaneously captures linguistic and phonetic information for unmasked regions from both the lip-movement and audio streams into its latent representations, then encodes their long-range temporal relationships to solve the masked-prediction task."

Once AV-HuBERT has been self-trained in this way, the authors fine-tune it by introducing actual labeled video, hours of it, with formal transcripts that tell the machine where the words are in the video.

The main dataset used to test and to train the AV-HuBERT program is LRS3, developed in 2018 by Triantafyllos Afouras and colleagues at Oxford, which is "the largest publicly available sentence-level lip reading dataset to date. It consists of over 400 hours of video, extracted from TED & TEDx talks in English from YouTube."

As a result of the self-supervised training of AV-HuBERT, it can predict the words from the videos of speakers better than all prior attempts, write Shi and company.


Test results on lip reading for the "proposed" Meta system, AV-HuBERT, bottom, and previous best-in-class programs.

Meta 2022

But more important than the raw score is the vast reduction in the amount of labeled data needed to train the program.

"AV-HuBERT achieves state-of-the-art using 433 hours of text transcriptions, two orders of magnitude less than the 31,000 hours of labeled data used in the prior best approach," they write. 

With far less data required, it's possible to do lip-reading tasks on languages that have a lot less data than others, so-called low-resource languages. (Think of languages other than English, French and German, for example.)
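The data comparison the authors cite, 433 hours versus 31,000 hours, is easy to check. The quick calculation below uses only those two figures from the quote; the variable names are made up for the example. It suggests the gap is roughly 72-fold, or a little under two full orders of magnitude.

```python
import math

labeled_hours_prior = 31_000    # prior best approach, per the authors
labeled_hours_avhubert = 433    # AV-HuBERT, per the authors

ratio = labeled_hours_prior / labeled_hours_avhubert
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.1f}x less labeled data")
print(f"about {orders_of_magnitude:.2f} orders of magnitude")
```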

The authors observe that "As future work, AV-HuBERT can be applied for multilingual lip-reading in low-resource languages," and that the same "approach can be extended to other applications of visual speech representation, such as speech enhancement and generation."

Shi and colleagues added to their findings with a second paper posted last week describing the use of AV-HuBERT for automatic speech recognition. Here, the focus is on how to do better parsing of speech in the context of noise. 

Speech recognition "deployed in meeting scenarios is subject to babble noise, while one used in a home environment naturally encounters music, cooking, or vacuums machine noises." Their inquiry is whether such ambient noise can be overcome by AV-HuBERT.

During training, Shi and the team mix in noise clips with AV-HuBERT's video frame and audio waveform samples. The result, they write, is that the program gets good at getting around the babble. So much so that AV-HuBERT garners a 50% reduction in the word-error rate, or WER, the proportion of mistaken words, versus previous speech recognition systems. 
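Word-error rate is simple enough to compute by hand. The sketch below uses the standard word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference transcript. The function names and the sample sentences are invented for illustration, and the 20% and 10% figures in the last line are hypothetical, not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(old_wer, new_wer):
    """Fraction by which the error rate fell, relative to the old rate."""
    return (old_wer - new_wer) / old_wer

wer = word_error_rate(
    "open the pod bay doors please hal",
    "open the pod day doors hal",
)
print(f"{wer:.3f}")  # 0.286: one substitution and one deletion over 7 words

# Hypothetical numbers: falling from 20% to 10% WER is a 50% reduction.
print(relative_reduction(0.20, 0.10))
```

Note that a 50% reduction is relative: it halves whatever the earlier error rate was rather than subtracting 50 percentage points.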

"Our future work includes applying audio-visual speech recognition in real-world low-resource and multilingual settings," they write.

So, how real is something such as HAL 9000's lip reading? The notion that AI is now better than humans at lip reading has been written about in recent years with previous AI work. The word-error rate in AV-HuBERT's best showing, 26.9%, is indeed far better than that of professional human lip readers. Apparently, the best most human lip readers manage is a word-error rate of only 40% (they're wrong four times in ten). Obviously, for things such as transcribing talks after the fact, this could be a huge boost to software programs.

In practice, though, there's a big caveat. This is really simulating lip reading. The AV-HuBERT results pass a test on canned video, not a live, free-form, in-the-wild conversation such as Bowman and Poole's in the movie. 

For the moment, you may still be safe inside the pod.
