Microsoft's translation breakthrough: Speak, and hear your voice in Chinese

Summary: The company has demonstrated a translation technique in which an English speaker's words get played back seconds later in Mandarin, as if they were speaking that language themselves. It aims to 'completely break down language barriers' within a few years.


Microsoft has shown off a technology that translates someone's speech into another language, with the results being played back in the speaker's own voice.

The company's chief research officer, Rick Rashid, said on Thursday that Microsoft hopes to have "systems that can completely break down language barriers" within the next few years. In a video demonstration, Rashid spoke in English and was then echoed, in his own voice, by a Mandarin Chinese translation.

Microsoft has been working on the core speech-recognition technology, which it calls Deep Neural Net (DNN) translation, for the last couple of years, and it already offers it as a commercial service called inCus. However, as Rashid explained in a blog post on Thursday, the company has now taken the system a step further.

Rashid wrote the post, he said, due to interest in a speech he gave a fortnight ago at Microsoft Research Asia's 21st Century Computing event. In that speech, Rashid's words were translated on-the-fly into Mandarin, with the translated text being spoken back in a simulation of his own voice.

"The first [step] takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part," Rashid wrote. "The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages. Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful."

For the final, text-to-speech leg of the translation process, Microsoft had to record a few hours of a native Chinese speaker's speech, and around an hour of Rashid's own voice.

Better than the competition?

Speech recognition and machine translation are fairly common technologies these days. Google uses such techniques in Google Now and its Translate apps, Apple has Siri and Microsoft itself has Kinect.

"While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979" — Rick Rashid, Microsoft

However, these systems, which are based on a statistical technique known as hidden Markov modelling, tend to have an error rate of between 20 and 25 percent. According to Rashid, the DNN technique reduces that error rate by around 30 percent.

"This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight," he wrote. "While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979, and as we add more data to the training we believe that we will get even better results."

"The results are still not perfect, and there is still much work to be done, but the technology is very promising, and we hope that in a few years we will have systems that can completely break down language barriers," Rashid added.

Topics: Emerging Tech, Microsoft

David Meyer

About David Meyer

David Meyer is a freelance technology journalist. He fell into journalism when he realised his musical career wouldn't pay the bills. David's main focus is on communications, as well as internet technologies, regulation and mobile devices.

  • Wow!

    Sci-fi has become reality (e.g. Babel Fish). This will go down in history as one of the most revolutionary and important advances in technology.

    This is nothing less than the destruction of one of the greatest barriers between human beings on this planet.

    Now, we all speak the same language.
    Tim Acheson
    • still some problems

      Tell that to Apple vs. Android, GM vs. Ford, Republicans vs. Democrats.

      This however is quite cool and will make the world a smaller place yet again. Unfortunately none of us have data plans to allow us to hold conversations when we're traveling and would need it.
      • Factions have always been with us

        But it helps to be "politically bilingual" so you can at least talk to people on the other side without being misunderstood.
        John L. Ries
    • Agreed!

      Excellent job by Microsoft - it's nowhere near ready for distribution in the wild but Microsoft has made excellent progress.
    • Nothing really exciting.

      Visit any decent IT department at any university and you will find someone interested in, and doing research on, speech recognition (or even sign language recognition).

      "Replay" is also not so hard. In fact we have very decent speech synthesizers out there.

      What is really, really, really hard is TRANSLATION, because it means understanding one implementation of an abstraction (the word we hear is just a representation of its meaning) with a many-to-many relation (one "sound" can represent many words, and one word can represent many meanings), with context as an integral part of that abstraction.
      Then you have to find an implementation of that abstraction in another language, which has vastly different rules for it, and eliminate the solutions that would destroy the context. Then you pick the single word, out of the many candidates, that can represent your abstraction.

      Even mentally healthy adults can have problems with it, and humans are way, way, way smarter and more flexible than computers.

      And humans are capable of creativity, inventiveness and learning any new language.

      In simple words, hearing and speaking are "relatively" easy for a computer. But understanding what it hears, or formulating what to say, is almost impossible in an open-ended situation.
  • Skype?

    No doubt this will be an enhancement to Skype in the future.
    • That does seem to be the next logical step.

      I might even pay MS for the privilege.
      John L. Ries
    • nope

      I have done that with Google already, so why are you waiting for Microsoft to do the same?

      Oh yeah, to hear a bad imitation of your own voice that not even your mother would recognize, because it is cool?

      That MS "innovation" is years old among computer science researchers, but it seems to be news to the general public, who believe Microsoft got there first.
      • Not everybody needs to

        use translation software to speak with their mother in the same language.

        In fact you are better off learning that language ASAP.
    • Even one error per eight words is not acceptable for wide use

      If you are going to talk with a Chinese person not for a minute but for twenty minutes, he or she will be annoyed as hell by the amount of nonsense that would accumulate over that period.

      It may take another ten to twenty years before the next breakthrough in translation quality.
  • missing something?

    Hate to burst the bubble but the only reason this is different from Google translate is that the words that come out the other end have had their pitch changed to match yours.

    That isn't going to break down language barriers any faster, Microsoft.

    So essentially the groundbreaking news here is that they have a product that, if it gets released, and if you talk to it for hours first, may have 6% fewer errors than Google Translate, which has been out for years.

    Why don't I have the urge to kiss a programmer right now?
    • Why don't you have the urge...?

      Maybe it's because you're not married to or otherwise romantically involved with one.
      John L. Ries
    • Microsoft has been working in this area

      considerably longer than Google has. And the process from speaking English to hearing Mandarin was what was done. Sure various parts of this have been done before, but this shows lots of overall improvement.
    • translation is not the important part

      The last step, having speech sound like natural speech, is the awesome part. The fact that it can't translate language perfectly is trivial. I am more interested in having PLAIN English text converted into a real voice. Imagine sending a text message and the receiver hears you reading it in your voice! It could go further than that: you could customize the text-to-speech engine on your computer and have HAL talking back to you instead of a fake voice like Siri.
      • HAL and Siri ...

        ... both speak with about the same fluency and fluidity. I'd obviously give the nod to HAL, since he's - well - fake. Someone just read those lines, whereas Siri is actually simulated human speech.

        But I agree that natural speech is a huge breakthrough here: imagine the Star Trek Universal Translator, through which anyone can talk to anyone else and have a natural conversation in their own native tongue and dialect. It could completely remove languages as a barrier.

        But an accurate translation is still vitally important: If I say "I want an apple" and the translator says "I want to kill you", that might cause some communication issues.
      • nope

        Using Siri as an example shows that you are either trolling or ignorant.
        There are much better ones available, and there have been for years.
        Example: one of the popular ones

        And Google has excellent voice quality for languages other than English (which is very strange).
        • Why?

          Why bother to comment on this article, since it is an article about Microsoft and you are such an obvious Google zealot? This shows you are trolling.
          • nope

            Why are you so ignorant, with such a need to keep a closed-minded view?

            When someone mentions Google, you attack them like crazy and call them zealots, etc.

            The claim is: Microsoft did something amazing.
            The counter-argument is that all of this has already been done by others besides Microsoft, as anyone who follows speech technology even a little would know.

            One argument was that Microsoft has a better voice than Siri.
            The counter-argument is that Siri doesn't have the best English voice; others do, and I gave an example of one that is much better, though still not the best.
            I also mentioned that Google has excellent voice quality for languages other than English, languages that are much harder to speak and use than English, because through open source Google gets technologies for them that universities have already developed.

            Taking a sample of someone's voice isn't hard or amazing.
            Applying that filter to normal text-to-speech technology isn't amazing or a breakthrough at all.
            And neither is real-time translation to another language, which is simply stupid, because you cannot translate whole sentences word by word: the meaning does not get translated, and you end up with lots of basic mistakes that even 10-year-old kids make when they start learning their second or third language.

            It is as if Microsoft fans like you have never considered that simply changing the order of words in a sentence can produce a totally different meaning in other languages, and that to translate correctly you need to change the order from the original one. So real-time machine translation from spoken English into Chinese is actually impossible, science fiction, unless the machine can read the speaker's mind and pick the correct words in the correct order before the speaker even starts speaking.

            If a computer ever manages to work the way you think Microsoft's does, it will have solved one of the hardest parts of translation, which is that you cannot translate jokes so easily.

            I bet you don't understand a joke when it is translated literally into English: a man walked to a bar a carrot.
            In Chinese it is a totally different joke and impossible to translate in real time, even if the computer did it after the sentence ended. And if the computer did it by picking a similar joke from a list of jokes, it would be even more problematic.

            And when you need to translate English into languages that are totally flexible, it just becomes funny.

            But you already knew that you are a Microsoft zealot who blindly hates Google, whatever anyone else says....
  • it is always a couple of years away

    If this works as advertised, it is a huge win for MS. They desperately need a "must have" new product. Win 8, Surface and the new Win Phones are decent products, but they need something that will jump them ahead of the market leaders. Right now MS is still in "catch up mode" on too many fronts. (When your goal is to be the number 3 OS in the mobile space, you are admitting you don't have a compelling product.)
  • Obviously

    This is still in development and nowhere near ready for usage in the wild, but it is progress. Kudos to Microsoft.