Microsoft prototypes speaking multiple languages with a single human voice

At TechFest 2012, Microsoft demoed a project from Microsoft Research Asia that turns a monolingual speaker into multilingual voice output using machine-based Text To Speech (TTS) synthesis.

Truly bilingual speakers are an asset to any business. Unfortunately, truly bilingual speakers are rare. Microsoft Translator aims to deliver synthesised multilingual communications from a monolingual human voice.

Microsoft describes how this 'monolingual into multilingual' method works:

Out of a speaker’s monolingual recordings, our algorithm can render speech sentences of different languages for building mixed-coded, bilingual TTS systems. We have recordings of 26 languages which are used to build our TTS of corresponding languages. By using the new approach, we can synthesize any mixed language pair out of the 26 languages.

Frank Soong, Principal Researcher at Microsoft Research Asia, demonstrated the concept of TTS synthesis.

He used the example of a TTS system for an American driving a car in Beijing. The TTS is trained on English recordings, yet the same English speech data is used to build a Mandarin voice: the key directions are spoken in English, while the landmarks and street names are spoken in Chinese.

Seamless translation

It is difficult to find one speaker fluent enough in both languages to train the TTS to use one voice. You might be able to find a truly fluent speaker for a single language pair, but hardly across all languages. The aim is to get a seamless transition between the human and the synthesised voices.

The translator uses a reference speaker, in this case a Chinese speaker, to capture the frequency, tone and modulation of the voice. The voice is then 'warped', or equalised, between the reference Chinese human speaker and the English multilingual 'machine' voice.
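The warping step described above can be pictured as remapping one speaker's spectral envelope along the frequency axis so it better matches another voice. A minimal sketch, assuming a simple piecewise-smooth warp of a magnitude spectrum (the real system learns its warping function from the two speakers' recordings; `warp_spectrum` and its `alpha` parameter are illustrative, not Microsoft's actual method):

```python
import numpy as np

def warp_spectrum(spectrum, alpha=0.2):
    """Apply a simple frequency warp to a magnitude spectrum,
    shifting spectral energy the way a voice can be 'equalised'
    towards another speaker's timbre.

    alpha > 0 stretches the low frequencies; alpha < 0 compresses them.
    """
    n = len(spectrum)
    # Normalised frequency axis from 0 to 1.
    freqs = np.linspace(0.0, 1.0, n)
    # A smooth, monotonic warp of the frequency axis (an illustrative
    # stand-in for the warping function learned by the real system).
    warped_freqs = freqs + alpha * freqs * (1.0 - freqs)
    # Resample the spectrum along the warped axis.
    return np.interp(freqs, warped_freqs, spectrum)

# Toy example: warp a decaying spectral envelope for a 'reference' voice.
reference = np.exp(-np.linspace(0, 4, 64))
warped = warp_spectrum(reference, alpha=0.3)
```

The endpoints of the spectrum stay fixed while the energy in between shifts, which is the essence of equalising one voice towards another.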

The English-language database is then broken down into five-millisecond pieces. The voice pieces closest to the trajectory of the 'warped' Chinese sentence are selected, and the best concatenation of those sequences is calculated and reassembled.
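The selection step above can be sketched as a nearest-neighbour search over the database of five-millisecond pieces. This toy version uses a greedy frame-by-frame match, whereas the real system calculates the best overall concatenation (typically via dynamic programming over joint costs); the function name and feature vectors here are hypothetical:

```python
import numpy as np

FRAME_MS = 5  # the article's five-millisecond unit size

def select_units(target_traj, database):
    """Greedy unit selection: for each frame of the warped target
    trajectory, pick the database frame with the smallest distance,
    then concatenate the chosen frames in order.

    target_traj : (n_frames, n_features) warped target trajectory
    database    : (n_units, n_features) frames cut from the English
                  recordings (hypothetical feature vectors)
    """
    chosen = []
    for frame in target_traj:
        # Euclidean distance from this target frame to every database frame.
        dists = np.linalg.norm(database - frame, axis=1)
        chosen.append(database[np.argmin(dists)])
    return np.vstack(chosen)

# Toy data: a database of 10 candidate frames, and a 4-frame target
# trajectory that sits close to known database units.
rng = np.random.default_rng(0)
database = rng.normal(size=(10, 3))
target = database[[2, 5, 5, 7]] + 0.01
out = select_units(target, database)
```

Because each target frame lies near a known unit, the greedy search recovers those units exactly; the production system additionally scores how well neighbouring pieces join, so the concatenated speech sounds continuous.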

One voice

You can train the TTS to recognise your voice; it takes about one hour. In the TechFest 2012 video, Rick Rashid, Microsoft's Chief Research Officer, trained the TTS to recognise his English voice, and after calculations to reference his voice, his English phrases were played back in Spanish.

At about 19 minutes 30 seconds into the video there is a cool talking head of Rashid's boss, Craig Mundie, speaking in English, then Mandarin, using his own voice with the same timbre and intonation.

Although this is still a prototype, this machine-based 'Babel Fish' brings great opportunities for businesses that speak only one language. Businesses could have the opportunity to break into new markets around the globe without the overhead of human-based translation services.

Train the synthesiser once and reproduce your training video in any one of 26 languages. Apply it to your audio communications for multilingual reach.

And even if it is only applied to car navigation systems, then it is a step in the right direction -- whatever language the directions happen to be in...
