After Google added 20 new languages to its Translate app yesterday, the company has detailed how it manages to squeeze deep learning onto a smartphone.
Google Translate users can now point their smartphones at text in 27 languages and have it translated live on their display, even without an internet connection. The technology came to Translate through Word Lens, a company Google acquired last year and integrated into its Translate app earlier this year.
Google started out with English, French, German, Italian, Portuguese, Russian and Spanish. Thanks to this week's update, users can live translate to and from English and Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Filipino, Finnish, Hungarian, Indonesian, Lithuanian, Norwegian, Polish, Romanian, Slovak, Swedish, Turkish and Ukrainian.
After installing the update, users will need to download an extra file for each language - a 4.8MB file for the English to Swedish package, for example.
Otavio Good, a software engineer at Google Translate, explained in a blog post that what the update actually brings to iOS and Android users is a pocket-sized deep neural network.
Doing the type of live visual translation found in Translate would be easy in a datacenter, but bringing the same capability to a low-end smartphone with a poor network connection required some engineering smarts from Google. What it came up with was a mini version of the neural net it uses for translation in its datacenters, one that could also handle real-world smartphone conditions, such as a shaky hand and no connection to the cloud.
According to Good, the live translation involves a few steps, starting with distinguishing words from background objects. To pick out text, the app looks for "blobs of pixels" of a similar color that sit near other, similar blobs. The next step is to recognize each letter, which, Good notes, is where the deep learning comes in.
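Good's post doesn't include code, but grouping similarly colored pixel blobs is essentially a connected-components pass. Here is a minimal sketch in Python; the flood fill, 4-way connectivity, and intensity tolerance are all assumptions, not Google's actual implementation:

```python
from collections import deque

def find_blobs(img, tol=10):
    """Group adjacent pixels of similar intensity into 'blobs'
    (a simplified stand-in for Translate's text-detection step).
    img is a 2-D list of grayscale values."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            # Flood-fill from this pixel, absorbing neighbours of similar color.
            queue, blob = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and abs(img[ny][nx] - img[cy][cx]) <= tol):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs
```

On an image of dark strokes on a light background, each stroke comes out as its own blob, separate from the background, which is what lets later stages treat each candidate letter independently.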
"We use a convolutional neural network, training it on letters and non-letters so it can learn what different letters look like," he said.
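The heart of such a network is the convolution operation itself: sliding a small learned kernel over the image and taking a weighted sum at each position. The sketch below shows that forward pass in plain Python (the kernel values, and the classifier layers that would sit on top, are omitted assumptions; Google's actual network architecture isn't described in the post):

```python
def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel across the image and
    take a weighted sum at each position -- the core CNN operation."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            s = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    s += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(s)
        out.append(row)
    return out

def relu(feature_map):
    """Standard nonlinearity applied between convolutional layers."""
    return [[max(v, 0.0) for v in row] for row in feature_map]
```

Training adjusts the kernel weights so that letter-shaped inputs produce strong responses and background clutter does not; stacking several such layers with a final classifier gives the letter/non-letter decision Good describes.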
Once Translate has recognized the letters, it runs an approximate dictionary lookup. "That way, if we read an 'S' as a '5', we'll still be able to find the word '5uper'," said Good.
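One simple way to make a lookup tolerant of such confusions, sketched below, is to normalize both the dictionary and the OCR output through a map of commonly confused characters. The confusion table and the normalization scheme here are illustrative assumptions; the post doesn't say how Google's lookup works internally:

```python
# Commonly confused character pairs in OCR output (an assumed, partial list).
CONFUSIONS = {'5': 's', '0': 'o', '1': 'l', '3': 'e', '8': 'b'}

def normalize(word):
    """Fold confusable characters to a canonical form."""
    return ''.join(CONFUSIONS.get(c, c) for c in word.lower())

def build_index(dictionary):
    """Index dictionary words under their normalized spellings."""
    return {normalize(w): w for w in dictionary}

def lookup(index, ocr_word):
    """Approximate lookup: '5uper' and 'Super' normalize identically."""
    return index.get(normalize(ocr_word))
```

With this scheme a misread like '5uper' normalizes to 'super' and still hits the right dictionary entry, exactly the behavior Good describes.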
The last step is rendering the translation over the original words in the same style as the original.
Finally, Good explains how the Translate team crammed all this into a pocket-sized neural network:
"We needed to develop a very small neural net, and put severe limits on how much we tried to teach it - in essence, put an upper bound on the density of information it handles. The challenge here was in creating the most effective training data. Since we're generating our own training data, we put a lot of effort into including just the right data and nothing more.
"For instance, we want to be able to recognize a letter with a small amount of rotation, but not too much. If we overdo the rotation, the neural network will use too much of its information density on unimportant things. So we put effort into making tools that would give us a fast iteration time and good visualizations. Inside of a few minutes, we can change the algorithms for generating training data, generate it, retrain, and visualize. From there we can look at what kind of letters are failing and why. At one point, we were warping our training data too much, and '$' started to be recognized as 'S'. We were able to quickly identify that and adjust the warping parameters to fix the problem. It was like trying to paint a picture of letters that you'd see in real life with all their imperfections painted just perfectly.
"To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor's SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory."
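The bounded-rotation idea from the quote above is easy to sketch: when generating a synthetic training example, rotate it by a random angle capped at a small maximum, so the network never wastes capacity on rotations it won't see in practice. The point-cloud representation and the 10-degree cap below are assumptions for illustration:

```python
import math
import random

def augment(points, max_deg=10.0, rng=random):
    """Rotate a letter's point cloud by a small random angle.
    Capping the angle keeps the training data realistic, so the
    net doesn't spend its limited capacity on extreme rotations."""
    theta = math.radians(rng.uniform(-max_deg, max_deg))
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]
```

Because the generator is this small, the fast iterate-retrain-visualize loop Good describes becomes practical: tweaking `max_deg` and regenerating data takes seconds, which is how the team caught the over-warped '$'-versus-'S' problem.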
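Fitting matrix multiplies into cache usually means loop tiling: processing the matrices in small blocks so each block's working set stays cache-resident. The sketch below shows the tiling structure in Python; real implementations like Google's would do this in native code with SIMD intrinsics, and the block size here is an arbitrary placeholder:

```python
def matmul_blocked(A, B, block=32):
    """Tiled matrix multiply: iterate over small sub-blocks so the
    data touched in the inner loops fits in cache. In production
    this structure pairs with SIMD instructions on the inner loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Multiply one block of A against one block of B.
                for i in range(i0, min(i0 + block, n)):
                    Ai, Ci = A[i], C[i]
                    for p in range(p0, min(p0 + block, k)):
                        a, Bp = Ai[p], B[p]
                        for j in range(j0, min(j0 + block, m)):
                            Ci[j] += a * Bp[j]
    return C
```

The result is identical to a naive triple loop; only the traversal order changes, which is the whole point: the optimization lives entirely in how memory is accessed.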