Google says improvements to its voice search make it more accurate, even in noisy environments.
Google has detailed a number of improvements to the acoustic models that its apps on iOS and Android rely on to recognise spoken words more efficiently, for example when users ask for directions to the nearest restaurant or ask questions that can be searched on the web.
With the improvements, Google can recognise what's being said with greater accuracy, even in noisy surroundings, while requiring fewer computational resources to analyse sounds in real time, according to the company's speech team.
According to the researchers, the type of "recurrent neural network" (RNN) Google is using for its acoustic models can memorise information better than deep neural networks and can model "temporal dependencies". They illustrate this with the word 'museum', which in phonetic notation is written /m j u z i @ m/.
"When the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. RNNs can capture that," they note.
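The idea of a network carrying context from one sound to the next can be sketched with a single recurrent unit. This is a toy illustration, not Google's acoustic model: the scalar weights `w_in` and `w_rec` and the numeric phoneme codes in `CODES` are hypothetical values chosen only to show that the state reached at /u/ depends on the sounds that came before it.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Scalar Elman-style recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# Hypothetical numeric codes for phonemes, for illustration only.
CODES = {"m": 1.0, "j": -1.0, "u": 0.5, "z": -0.5}

with_context = run_rnn([CODES[p] for p in ("m", "j", "u")])
other_context = run_rnn([CODES[p] for p in ("z", "z", "u")])

# Same final input /u/, but a different hidden state because the history differs.
state_differs = abs(with_context[-1] - other_context[-1]) > 1e-6
```

A feed-forward network that only saw the /u/ frame would produce the same output in both cases; the recurrent term is what lets the earlier /m/ and /j/ influence the result.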
To reduce computations, Google has also trained the models to take in audio in larger chunks while improving recognition in noisy places by adding artificial noise to the training data.
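One common way to feed audio in larger chunks while computing less often is to stack neighbouring feature frames and then keep only every few stacked frames. The sketch below assumes that approach; the article does not give the exact scheme, and the function name `stack_and_subsample` and the `stack` and `stride` values are hypothetical.

```python
def stack_and_subsample(frames, stack=3, stride=3):
    """Concatenate `stack` consecutive frames, keeping one stacked frame every `stride` steps."""
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        chunk = []
        for frame in frames[start:start + stack]:
            chunk.extend(frame)  # widen the input instead of lengthening the sequence
        stacked.append(chunk)
    return stacked

# Nine toy frames with two features each (made-up values).
frames = [[float(i), float(i) + 0.5] for i in range(9)]
chunks = stack_and_subsample(frames)

# Nine narrow frames become three wider ones, so the model runs a third as often.
steps_before, steps_after = len(frames), len(chunks)
```

With these settings each model step sees three times as much audio, and the number of steps falls from nine to three.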
The researchers said that to achieve these gains, the speech team had to tweak the models to find an optimal balance between improved predictions and latency:
"The tricky part though was how to make this happen in real time. After many iterations, we managed to train streaming, unidirectional, models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often," they said.
"With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise."
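Adding artificial noise to training data is often done by mixing a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that general technique under stated assumptions: the sine wave standing in for speech, the Gaussian noise, the 10 dB target and the helper names `power` and `mix_at_snr` are all hypothetical, and reverberation is not covered.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)  # fixed seed so the example is repeatable
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]  # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
noisy = mix_at_snr(speech, noise, snr_db=10)

# Check: recover the added noise and measure the SNR that was actually applied.
added = [n - s for n, s in zip(noisy, speech)]
measured_snr_db = 10 * math.log10(power(speech) / power(added))
```

Training on copies of the same utterance mixed at several SNR values is one way to expose a model to a range of noise conditions.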
Those improvements gave Google a faster and more accurate acoustic model that could be used on real voice traffic.
"However, we had to solve another problem. The model was delaying its phoneme predictions by about 300 milliseconds: it had just learned it could make better predictions by listening further ahead in the speech signal," the researchers said.
"This was smart, but it would mean extra latency for our users, which was not acceptable. We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech."
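The article does not say how the timing was corrected, but the size of such a delay can be measured by comparing when the model emits each phoneme with when that phoneme occurs in a reference alignment. The sketch below is only a measurement illustration: the 10 ms frame shift, the function name `mean_delay_ms` and the frame indices are hypothetical.

```python
FRAME_MS = 10  # assumed frame shift; the source does not state one

def mean_delay_ms(reference_frames, predicted_frames, frame_ms=FRAME_MS):
    """Average lateness of predicted phoneme frames relative to the reference alignment."""
    offsets = [p - r for r, p in zip(reference_frames, predicted_frames)]
    return frame_ms * sum(offsets) / len(offsets)

reference = [12, 20, 31, 45]  # hypothetical ground-truth frame index for each phoneme
delayed = [42, 50, 61, 75]    # hypothetical model outputs, each emitted 30 frames late
delay = mean_delay_ms(reference, delayed)
```

With these made-up numbers every prediction is 30 frames late, which at 10 ms per frame corresponds to the roughly 300 milliseconds the researchers describe.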