Google AI can pick out a single speaker in a crowd: Expect to see it in tons of products

Google reveals its speech-separating tech, which could, for example, transform hearing aids.

Written by Liam Tung, Contributing Writer April 13, 2018 at 6:09 a.m. PT

Google researchers have developed a deep-learning audio-visual model that can isolate one speaker's voice in a cacophony of noise.

The 'cocktail party effect' -- the ability to tune out all the other voices in a crowd and focus on a single person's voice -- comes easily to humans but not to machines.

It's an obstacle to an application of the Google Glass smart glasses that I personally would like to see developed one day: a real-time speech-recognition and live-transcription system to support hearing-aid wearers.

Hearing aids help my wife hear better, but often with so much static and crackling she's mostly lip-reading. At the age of 15, she could hear perfectly. Now her hearing is getting worse each year and will almost certainly disappear at some point in future. Only stem-cell magic could reverse the situation.

I thought my Glass idea was a great fallback until I wondered how it would pick the right voice out of a crowd -- the scenario she finds it hardest to hear in -- to live-transcribe the target speaker.

Apparently voice separation is a hard nut to crack, but Google's AI researchers may have a part of the answer to my Glass dream in the form of a deep-learning audio-visual model that can isolate speech from a mixture of sounds.

The scenario they present is two speakers standing side by side, jabbering simultaneously. The technique hasn't been proven in a real-world crowd, but it does work on a video with two speakers on a single audio track.

Video: Google's research combines the auditory and visual signals to separate speakers. Source: Google/YouTube

They also used the technique to erase the background noise around a single person speaking in a noisy cafeteria, which, Glass idea aside, could produce a much clearer sound for hearing-aid wearers.

"All that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context," write Inbar Mosseri and Oran Lang of Google Research.

The researchers don't mention Glass at all in the research paper, but they note that the technique could be of particular help to hearing-aid wearers in multi-speaker scenarios. It could also help in video conferencing and with the enhancement and recognition of speech in videos.

The Glass visual hearing aid is probably some way off, but Google's application of the technique to speech recognition and video captioning gives hope it will be possible.

You can test the impact of the voice separation technique using YouTube's closed-caption service on the videos they've cleaned up.

The key to their voice-separation technique is the use of visual cues, such as the movements of a speaker's mouth. These are correlated with the sounds being made to identify which audio belongs to which speaker.

"The visual signal not only improves the speech separation quality significantly in cases of mixed speech, compared to speech separation using audio alone, as we demonstrate in our paper, but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video."
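As a toy illustration of why a visual cue helps tie a sound to a face, the sketch below scores each candidate audio track by how closely its loudness follows a speaker's mouth opening over time. This is not Google's model, which is a deep neural network trained on video; the function names, the per-frame "mouth opening" values and the loudness values are all made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no timing information
    return cov / math.sqrt(var_x * var_y)

def match_track_to_face(mouth_opening, track_loudness):
    """Return (index of best-matching track, list of correlation scores)."""
    scores = [pearson(mouth_opening, track) for track in track_loudness]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical per-frame measurements (assumed values, not real data).
mouth = [0.0, 0.8, 0.9, 0.1, 0.0, 0.7, 0.6, 0.0]
tracks = [
    [0.9, 0.1, 0.0, 0.8, 0.9, 0.1, 0.2, 0.9],  # loud while the mouth is closed
    [0.1, 0.7, 1.0, 0.2, 0.0, 0.6, 0.7, 0.1],  # loud while the mouth is open
]

best, scores = match_track_to_face(mouth, tracks)
print("best matching track:", best)
```

The second track rises and falls with the mouth movement, so it is the one attributed to the visible speaker. A real system has to learn far subtler relationships than loudness alone.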

To create the speech-separation model, the researchers drew on thousands of hours of talking-head video clips on YouTube to create "synthetic cocktail parties", which became the training data for the neural network.
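A minimal sketch of what building one "synthetic cocktail party" could look like: two clean recordings are summed into a single track, and the clean originals are kept as the answers a model would be trained to recover. The function name, the sample rate, the optional noise term and the sine tones standing in for speech are assumptions for illustration, not details from Google's pipeline.

```python
import math
import random

def make_synthetic_mixture(clean_a, clean_b, noise_level=0.0, seed=0):
    """Sum two clean signals (plus optional noise) into one mixed track.

    Returns (mixture, targets): the mixture is the model input and the
    clean signals are the targets it would be trained to recover.
    """
    rng = random.Random(seed)
    n = min(len(clean_a), len(clean_b))  # trim to the shorter clip
    a, b = clean_a[:n], clean_b[:n]
    mixture = [
        a[i] + b[i] + noise_level * rng.uniform(-1.0, 1.0)
        for i in range(n)
    ]
    return mixture, (a, b)

# Stand-ins for two clean speech clips: sine tones at different pitches,
# sampled at an assumed 8,000 samples per second.
speaker_a = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
speaker_b = [math.sin(2 * math.pi * 330 * t / 8000) for t in range(6000)]

mixture, (target_a, target_b) = make_synthetic_mixture(speaker_a, speaker_b)
print(len(mixture), "mixed samples from two clean clips")
```

Because the mixture is assembled from known clean parts, every training example comes with its own ground truth, which is what makes large volumes of ordinary talking-head video usable as training data.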

The researchers believe the technique will have a wide range of applications, and they're currently looking at where it can be incorporated into Google products.

Google's AI decomposes the input audio track into clean speech tracks, one for each person detected in the video.
Image: Google

Previous and related coverage

Google AI can predict your heart disease risk from eye scans

Google's deep-learning algorithm could offer a simpler way to identify factors that contribute to heart disease.

Google offers Raspberry Pi owners this new AI vision kit to spot cats, people, emotions

Google's Vision Kit lets you build your own computer-vision system for $45, but you'll need your own Raspberry Pi.

Google AI vs Siri vs Bing: IQ tests show one is smartest by a mile

One AI has the highest IQ of them all, but it's still low by human standards.

'One machine learning model to rule them all': Google open-sources tools for simpler AI

Google is taking a modular approach to accelerating deep-learning research.

How Google is turning its Cloud Speech-to-Text AI into a real business tool (TechRepublic)

Google's Speech-to-Text now includes improved phone call and video transcription, automatic punctuation, and recognition metadata.

Google AI now can give YouTube videos a wacky background (CNET)

Machine learning proves its worth for new video effects tech: distinguishing between faces and backgrounds at 100 frames per second.
