Google researchers have developed a deep-learning audio-visual model that can isolate one speaker's voice in a cacophony of noise.
The 'cocktail party effect' -- the ability to mute all voices in a crowd and focus on a single person's voice -- comes easily to humans but not machines.
It's an obstacle to a Google Glass application I'd personally like to see developed one day: a real-time speech-recognition and live-transcription system to support hearing-aid wearers.
Hearing aids help my wife hear better, but often with so much static and crackling that she's mostly lip-reading. At the age of 15, she could hear perfectly. Now her hearing gets worse each year and will almost certainly disappear at some point in the future. Only stem-cell magic could reverse the situation.
I thought my Glass idea was a great fallback until I wondered how it would pick the right voice out of a crowd -- the scenario she finds it hardest to hear in -- to live-transcribe the target speaker.
Apparently voice separation is a hard nut to crack, but Google's AI researchers may have a part of the answer to my Glass dream in the form of a deep-learning audio-visual model that can isolate speech from a mixture of sounds.
The scenario they present is of two speakers standing side by side, jabbering simultaneously. The technique hasn't been proven in a real-world crowd, but it does work on a video with two speakers on a single audio track.
Video: Google's research combines the auditory and visual signals to separate speakers. Source: Google/YouTube
They also used the technique to erase the background noise around a single person speaking in a noisy cafeteria, which, Glass idea aside, could produce much clearer sound for hearing-aid wearers.
"All that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context," write Inbar Mosseri and Oran Lang of Google Research.
The researchers don't mention Glass at all in the research paper, but they note that the technique could be especially helpful to hearing-aid wearers in multi-speaker scenarios. It could also aid video conferencing and the enhancement and recognition of speech in videos.
The Glass visual hearing aid is probably some way off, but Google's application of the technique to speech recognition and video captioning gives hope it will be possible.
You can test the impact of the voice separation technique using YouTube's closed-caption service on the videos they've cleaned up.
The key to their voice-separation technique is using visual cues, such as the movements of a speaker's mouth, to correlate each speaker with the sounds they're making and so identify the audio belonging to them.
"The visual signal not only improves the speech separation quality significantly in cases of mixed speech, compared to speech separation using audio alone, as we demonstrate in our paper, but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video."
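As a toy illustration of that last point (this is not Google's method, where the association is learned by the model itself), the intuition is that a speaker's visible mouth motion correlates with the loudness of their speech over time. A rough sketch, in which the function name and the correlation heuristic are my own assumptions:

```python
import numpy as np

def assign_tracks_to_speakers(separated_tracks, mouth_motion):
    """Toy heuristic: match each separated audio track to the on-screen
    speaker whose mouth-motion signal best correlates with the track's
    loudness envelope. Both inputs are lists of equal-length 1-D arrays."""
    assignments = {}
    for i, track in enumerate(separated_tracks):
        envelope = np.abs(track)  # crude loudness envelope
        scores = [np.corrcoef(envelope, motion)[0, 1] for motion in mouth_motion]
        assignments[i] = int(np.argmax(scores))  # best-matching face index
    return assignments
```

In the real system no such post-hoc matching is needed: because the network is conditioned on each face's visual features, the separated tracks come out already tied to the visible speakers.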
To create the speech-separation model, the researchers drew on thousands of hours of talking-head video clips on YouTube to create "synthetic cocktail parties", which became the training data for the neural network.
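The paper's actual pipeline operates on spectrograms and pairs each mixture with per-frame face embeddings, but the basic data-generation step, summing clean single-speaker tracks into an artificial "cocktail party" mixture, can be sketched roughly like this (the function name, gain range, and noise weighting are illustrative assumptions, not details from the paper):

```python
import numpy as np

def make_synthetic_mixture(clean_tracks, noise=None, rng=None):
    """Mix several clean single-speaker waveforms (plus optional
    background noise) into one synthetic cocktail-party signal.
    The clean tracks serve as ground-truth separation targets;
    the mixture becomes the network's input."""
    rng = rng if rng is not None else np.random.default_rng()
    length = min(len(t) for t in clean_tracks)
    mixture = np.zeros(length, dtype=np.float64)
    for track in clean_tracks:
        gain = rng.uniform(0.5, 1.0)  # vary relative loudness per speaker
        mixture += gain * track[:length]
    if noise is not None:
        mixture += 0.3 * noise[:length]
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 0 else mixture  # normalize to [-1, 1]
```

Because the clean sources are known exactly, the network can be trained with a straightforward supervised objective on effectively unlimited mixtures.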
The researchers believe the technique will have a whole range of applications, and they're currently exploring where it can be incorporated into Google products.
Previous and related coverage
Google's deep-learning algorithm could offer a simpler way to identify factors that contribute to heart disease.
Google's Vision Kit lets you build your own computer-vision system for $45, but you'll need your own Raspberry Pi.
One AI has the highest IQ of them all, but it's still low by human standards.
Google is taking a modular approach to accelerating deep-learning research.
Google's Speech-to-Text now includes improved phone call and video transcription, automatic punctuation, and recognition metadata.
Machine learning proves its worth for new video effects tech: distinguishing between faces and backgrounds at 100 frames per second.