Machine learning could help millions of hearing-impaired people read what's being said in the world around them.
Researchers at Oxford University and Google DeepMind have developed an artificial intelligence system trained on thousands of hours of BBC video broadcasts that far outperforms a professional lip-reader.
The human lip-reader, who's provided lip-reading services for use in court, was shown a random sample of 200 videos from the BBC test set and correctly deciphered less than a quarter of the spoken words. The AI system was able to decipher half of the words from the same set, the researchers say in a new paper.
New Scientist also reports the professional was only able to annotate 12.4 percent of words without an error, while the AI annotated 46 percent without error.
The technology may one day show up on phones as a new way of instructing a voice assistant like Siri, or it could be used to enhance audio-based speech-recognition systems.
"A machine that can lip read opens up a host of applications: 'dictating' instructions or messages to a phone in a noisy environment; transcribing and redubbing archival silent films; resolving multi-talker simultaneous speech; and, improving the performance of automated speech recognition in general," the researchers write.
As with most machine-learning efforts, a massive dataset was required to train the machines to lip-read. The researchers had access to nearly 5,000 hours of talking faces on six BBC shows, such as Newsnight, BBC Breakfast, and Question Time, according to New Scientist.
In total the BBC data gave them 118,000 sentences and just over 17,500 unique words, dwarfing other large public lip-reading datasets, such as GRID, which was used to train another recent automated lip-reading system from Oxford called LipNet.
While LipNet also beat humans at lip-reading, the system was constrained by its training data.
The Oxford and DeepMind researchers call their network 'Watch, Listen, Attend and Spell', a name that describes the modules it uses to discern speech from lip movements. Together, these modules transcribe speech into characters, learning to predict the sentence being spoken from a video of a talking face.
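At the heart of "attend and spell" style decoders is an attention step: at each character the decoder scores the encoded video frames against its current state, and uses the resulting weights to focus on the relevant moment of lip movement. The sketch below illustrates that single step with toy random vectors; all names, shapes, and the output projection are illustrative assumptions, not the actual WLAS architecture, where every component is learned end-to-end.

```python
import numpy as np

# Toy sketch of one attention step in an attend-and-spell style decoder.
# Everything here is illustrative: real systems learn these vectors
# end-to-end from thousands of hours of video.

rng = np.random.default_rng(0)

T, d = 5, 8                        # 5 encoded video frames, 8-dim features
frames = rng.normal(size=(T, d))   # hypothetical "watch" encoder outputs
query = rng.normal(size=(d,))      # decoder state while "spelling" a character

# Score each frame against the decoder state, then softmax to weights
scores = frames @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: the weighted sum of frame features the decoder attends to
context = weights @ frames

# A made-up output projection maps the context to character logits
vocab = list("abc ")               # tiny stand-in character set
W_out = rng.normal(size=(d, len(vocab)))
logits = context @ W_out
predicted_char = vocab[int(np.argmax(logits))]

print(weights.round(3), repr(predicted_char))
```

In a full system this step repeats once per output character, with the predicted character fed back into the decoder state, so the network gradually "spells" out the sentence while shifting its attention across the video.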
The other benefit of the BBC data is the variety of human voices. It contains roughly 1,000 different speakers, making the trained system flexible enough to read lips no matter whose they are, compared with GRID, which has just 34 speakers uttering 1,000 phrases that follow a fixed formula.
The researchers note their work focused on "unconstrained natural language sentences" and "in-the-wild videos", whereas previous work has targeted recognition on a limited number of words or phrases.
DeepMind and Oxford plan to make the data publicly available to help other researchers in the field.