Microsoft prototypes speaking multiple languages with a single human voice

Summary: At TechFest 2012, Microsoft demoed a project from Microsoft Research Asia that turns a monolingual speaker into a multilingual voice output using machine-based text-to-speech (TTS) synthesis.

TOPICS: Microsoft, Telcos

Truly bilingual speakers are an asset to any business. Unfortunately, few speakers are truly bilingual. Microsoft Translator aims to deliver synthesised multilingual communications from a monolingual human voice.

At TechFest 2012 earlier this year, Microsoft demoed a project from Microsoft Research Asia, which aims to turn a monolingual speaker into a multilingual voice output using machine-based text-to-speech (TTS) synthesis.

Microsoft describes how this 'monolingual into multi-lingual' method works:

Out of a speaker’s monolingual recordings, our algorithm can render speech sentences of different languages for building mixed-coded, bilingual TTS systems. We have recordings of 26 languages which are used to build our TTS of corresponding languages. By using the new approach, we can synthesize any mixed language pair out of the 26 languages.

Frank Soong, Principal Researcher at Microsoft Research Asia, demonstrated the concept of TTS synthesis.

He used the example of a navigation TTS for an American driving a car in Beijing. The TTS speaks English, and its Mandarin voice is trained using the same English speech data. The key directions are spoken in English, but the landmarks and street names are in Chinese.

Seamless translation

It is difficult to find one speaker who is good enough in both languages to train the TTS to use one voice. The aim is to get a seamless transition between the human and the synthesised voices. You might be able to find a truly fluent speaker to train the TTS for one language pair, but finding one for every language would be difficult.

The translator uses a reference speaker (in this case, a Chinese speaker) to get the frequency, tone and modulation of the voice. The voice is then 'warped', or equalised, between the reference Chinese human speaker and the English multilingual 'machine' voice.
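As a rough illustration of what 'equalising' two voices can mean, here is a minimal Python sketch. It is not Microsoft's method: it swaps in a simple, widely used stand-in (shifting and scaling the reference speaker's log-pitch values so their mean and spread match the target voice). The function name `equalise_pitch` and the sample pitch values are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def equalise_pitch(reference_f0: List[float], target_f0: List[float]) -> List[float]:
    """Map the reference speaker's pitch values (Hz) into the target voice's range.

    Assumption: simple mean/variance matching in the log-pitch domain,
    used here only as a stand-in for the 'warping' the article describes.
    """
    ref_log = [math.log(f) for f in reference_f0]
    tgt_log = [math.log(f) for f in target_f0]

    ref_mu, ref_sigma = mean(ref_log), pstdev(ref_log)
    tgt_mu, tgt_sigma = mean(tgt_log), pstdev(tgt_log)

    # Avoid dividing by zero when the reference pitch is completely flat.
    scale = tgt_sigma / ref_sigma if ref_sigma > 0 else 1.0

    # Keep the shape of the reference contour, but move it to the target's range.
    return [math.exp(tgt_mu + (x - ref_mu) * scale) for x in ref_log]


# Hypothetical example: a higher-pitched reference mapped onto a lower voice.
warped = equalise_pitch([200.0, 220.0, 240.0], [100.0, 110.0, 120.0])
```

The rise and fall of the reference sentence is preserved, while the overall pitch level and range come from the target voice.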

The English language database is then broken down into pieces (five milliseconds per piece). The voice pieces which are closest to the trajectory of the 'warped' Chinese sentence are selected. The best sequence of pieces to concatenate is then calculated, and the speech is reassembled.
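The selection step described above can be sketched in Python as a small dynamic-programming search. This is a simplified illustration under stated assumptions, not the researchers' implementation: each five-millisecond piece is represented by a hypothetical feature vector, closeness is plain squared distance, and the join cost (zero when two pieces were adjacent in the original recording) is an invented stand-in.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical feature vector for one 5 ms piece


def dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def join_cost(database: List[Frame], i: int, j: int) -> float:
    """Assumed concatenation cost: free if piece j followed piece i in the recording."""
    return 0.0 if j == i + 1 else dist(database[i], database[j])


def select_pieces(target: List[Frame], database: List[Frame],
                  join_weight: float = 1.0) -> List[int]:
    """Pick one database piece per target frame, minimising
    distance to the target plus weighted join costs (Viterbi-style search)."""
    if not target or not database:
        return []
    n = len(database)
    # cost[j]: cheapest way to cover the frames so far, ending on piece j
    cost = [dist(target[0], database[j]) for j in range(n)]
    back: List[List[int]] = []
    for t in range(1, len(target)):
        new_cost: List[float] = []
        pointers: List[int] = []
        for j in range(n):
            best_i = min(range(n),
                         key=lambda i: cost[i] + join_weight * join_cost(database, i, j))
            pointers.append(best_i)
            new_cost.append(cost[best_i]
                            + join_weight * join_cost(database, best_i, j)
                            + dist(target[t], database[j]))
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final piece.
    j = min(range(n), key=lambda k: cost[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


# Hypothetical one-dimensional features: the warped target and the voice database.
warped_target = [[0.0], [1.0], [2.0]]
voice_database = [[5.0], [0.1], [1.1], [2.1], [9.0]]
print(select_pieces(warped_target, voice_database))  # [1, 2, 3]
```

Here pieces 1, 2 and 3 win because each is closest to its target frame and they were already consecutive in the recording, so joining them costs nothing.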

One voice

You can train the TTS on your own voice. It takes about one hour. In the TechFest 2012 video, Rick Rashid, Microsoft's Chief Research Officer, trained the TTS on his English voice, and after calculations to reference his voice, his English phrases are played back in Spanish.

At about 19 minutes 30 seconds into the video, there is a cool talking head of Rashid's boss, Craig Mundie, speaking in English and then in Mandarin, using his own voice with the same timbre and intonation.

Although this is still a prototype, this machine-based 'Babel Fish' brings great opportunities for businesses that only speak one language. Businesses could have the opportunity to break into new markets around the globe without the overhead of human-based translation services.

Train the synthesiser once and reproduce your training video in any one of 26 languages. Apply it to your audio communications for multilingual reach.

And even if it is only applied to car navigation systems, then it is a step in the right direction -- whatever language the directions happen to be in...

  • Great, let me know when it ships.

    Otherwise, it's just MS Courier Voice.
    • Actually, sounds like every other "text-to-voice" project

      While the world moves into "voice recognition", MS is still stuck in the 90s.
  • Best Speech Recognition App

    To learn the German language I am using SpeechTrans, and I think SpeechTrans has more advanced technology than what Microsoft is offering. SpeechTrans is the most accurate app with the most organic output voices, and works on all versions of the iPhone, 3rd Generation iPod Touch, iPads and Android devices. The app helps me a lot while travelling because SpeechTrans supports a total of 28 languages and also provides new languages to existing users for free.
    The InterprePhone service is the latest innovation and lets users communicate, without the need of an interpreter, via a telephone conference call.
    SpeechTrans apps can be used as your personal portable interpreter. Facebook chat service integration allows users to communicate in different languages with outstanding clarity and minimal translation processing delay. Learn more at