Always on: Google AI gives Android voice recognition that works on- or offline

Google research suggests soon you won't need a network connection at all to use a smartphone for dictation and voice commands.


Google researchers have created a lightweight but accurate embedded speech-recognition system that runs locally on a Nexus 5.

Image: LG/Google

Google has developed a speech-recognition system that's small enough to run "faster than real time" on a Nexus 5 without an internet connection.

The new system, which doesn't require computation at a remote datacenter, could get around the obstacle of needing a reliable network connection to use speech recognition on a smartphone, smartwatch or any other memory-constrained gadget.

The objective, outlined in a new paper by a team of Google researchers, was to create a lightweight but accurate embedded speech-recognition system that runs locally.

By lightweight, they mean a system with a 20.3MB footprint that, when tested on a Nexus 5 with a 2.26GHz CPU and 2GB RAM, achieved a 13.5 percent word error rate on an open-ended dictation task.
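
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name and the sample sentences are made up for illustration, and it is not the scoring code the Google team used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of nine.
reference = "send an email to darnica cumberland can we reschedule"
hypothesis = "send an email to monica cumberland can we reschedule"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

On this definition, a 13.5 percent word error rate corresponds to 13.5 word-level errors for every 100 words in the reference transcript.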

Of course, as with a lot of Google's research these days, the system is underpinned by machine-learning techniques, which in this case were "long short-term memory (LSTM) recurrent neural network (RNNs), trained with connectionist temporal classification (CTC) and state-level minimum Bayes risk (sMBR) techniques".

To scrimp on system requirements, the researchers developed a single model for the two very different domains of dictation and voice commands. Using a variety of techniques, they compressed an acoustic model to a tenth of its original size.

As the researchers note, offline embedded speech-recognition systems can already handle a command such as, "Send an email to Darnica Cumberland: can we reschedule?" simply by transcribing immediately and executing later so that users don't notice. But accurate transcription requires integrating personal information, such as the contact's name.

The researchers' answer to this problem was to integrate the device's contacts list into the model.

To train their acoustic model, the researchers extracted three million utterances, amounting to 2,000 hours, from Google voice search traffic. To make the model sturdier, they also introduced noise samples from YouTube videos. The original acoustic model they developed was about 80MB in size.
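
The article does not say how the noise was blended in. One common way to do this kind of augmentation is to scale a noise clip so that the mixture hits a chosen signal-to-noise ratio (SNR) and then add it to the clean speech. The sketch below assumes that approach. The function name, the SNR value and the toy sample values are all hypothetical, and it should not be read as the researchers' actual pipeline.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB.

    Assumption for illustration: both inputs are equal-length lists of
    audio samples. Real pipelines would also crop or loop the noise clip.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must be the same length")
    if not speech:
        raise ValueError("inputs must not be empty")

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        return list(speech)  # silent noise clip: nothing to add

    # SNR(dB) = 10 * log10(speech_power / noise_power), solved for the
    # noise power we want, then converted to an amplitude scale factor.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy samples standing in for a clean utterance and a noise clip.
clean = [1.0, -1.0, 1.0, -1.0]
background = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, background, snr_db=20)
```

Training on a range of SNR values, rather than a single one, is a typical way to expose a model to both quiet and loud backgrounds.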
