Microsoft's translation breakthrough: Speak, and hear your voice in Chinese

The company has demonstrated a translation technique in which an English speaker's words get played back seconds later in Mandarin, as if they were speaking that language themselves. It aims to 'completely break down language barriers' within a few years.
Written by David Meyer, Contributor

Microsoft has shown off a technology that translates someone's speech into another language, with the results being played back in the speaker's own voice.

The company's chief research officer, Rick Rashid, said on Thursday that Microsoft hopes to have "systems that can completely break down language barriers" within the next few years. In a video demonstration, Rashid spoke in English and was then echoed, in his own voice, by a Mandarin Chinese translation.

Microsoft has been working on the core speech-recognition technology, which is based on a technique called a Deep Neural Net (DNN), for the last couple of years, and it already offers it as a commercial service called inCus. However, as Rashid explained in a blog post on Thursday, the company has now taken the system a step further.

Rashid wrote the post, he said, due to interest in a speech he gave a fortnight ago at Microsoft Research Asia's 21st Century Computing event. In that speech, Rashid's words were translated on-the-fly into Mandarin, with the translated text being spoken back in a simulation of his own voice.

"The first [step] takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part," Rashid wrote. "The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages. Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful."
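The two steps Rashid describes — substituting Chinese equivalents for English words, then reordering them for Chinese grammar — can be illustrated with a toy sketch. The lexicon and the single reordering rule below are purely illustrative assumptions, not anything from Microsoft's system:

```python
# Toy sketch of the two translation steps Rashid describes:
# (1) substitute each English word with a Chinese equivalent,
# (2) reorder the result to follow Chinese word order.
# The four-word lexicon and the naive reordering rule are
# illustrative assumptions only, not Microsoft's method.

LEXICON = {"I": "我", "eat": "吃饭", "at": "在", "home": "家"}

def substitute(words):
    """Step 1: word-for-word lookup -- 'the easy part'."""
    return [LEXICON[w] for w in words]

def reorder(zh_words):
    """Step 2: Chinese places 'at home' (在家) before the verb,
    so move the 在-phrase in front of everything after the subject."""
    if "在" in zh_words:
        i = zh_words.index("在")
        phrase = zh_words[i:i + 2]              # 在 + following noun
        rest = zh_words[:i] + zh_words[i + 2:]  # subject + verb
        return rest[:1] + phrase + rest[1:]     # subject, phrase, verb
    return zh_words

zh = substitute("I eat at home".split())
print("".join(zh))           # 我吃饭在家 -- English order, ungrammatical
print("".join(reorder(zh)))  # 我在家吃饭 -- reordered for Chinese
```

Even this toy example shows why step two matters: a literal word-for-word substitution preserves English order, which is wrong in Chinese.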

For the final, text-to-speech leg of the translation process, Microsoft had to record a few hours of a native Chinese speaker's speech, and around an hour of Rashid's own voice.

Better than the competition?

Speech recognition and machine translation are fairly common technologies these days. Google uses such techniques in Google Now and its Translate apps, Apple has Siri, and Microsoft itself has Kinect.

"While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979" — Rick Rashid, Microsoft

However, these systems, which are based on a statistical technique known as hidden Markov modelling, tend to have a word error rate of between 20 and 25 percent. According to Rashid, the DNN technique reduces that error rate by around 30 percent.

"This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight," he wrote. "While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979, and as we add more data to the training we believe that we will get even better results."
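A quick check of the arithmetic behind that quote — a 30 percent relative reduction applied to the 20-25 percent baseline error rates mentioned above (the "one word in seven or eight" figure is an approximation, and the exact result depends on which baseline you start from):

```python
# Apply a 30% *relative* reduction to the 20-25% word error rates
# that hidden-Markov-model systems are said to produce.
baseline_rates = [0.20, 0.25]   # one word in five / one in four wrong
relative_reduction = 0.30

for wer in baseline_rates:
    new_wer = wer * (1 - relative_reduction)
    print(f"{wer:.0%} -> {new_wer:.1%} "
          f"(roughly 1 word in {1 / new_wer:.0f})")
# 20% -> 14.0% (roughly 1 word in 7)
# 25% -> 17.5% (roughly 1 word in 6)
```

Starting from the lower 20 percent baseline gives about one error in seven words, in line with Rashid's round numbers.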

"The results are still not perfect, and there is still much work to be done, but the technology is very promising, and we hope that in a few years we will have systems that can completely break down language barriers," Rashid added.
