Microsoft releases upgrades to Azure AI Speech at Build 2024

The company announced a slew of improved features to Azure AI services that support building with generative AI.
Written by Tiernan Ray, Senior Contributing Writer

At its annual Build developer conference on Tuesday, Microsoft announced new features for its Azure AI Speech service that enhance voice-enabled, generative AI-powered app development.

Azure AI Speech is already being used for "a variety of use cases including call analytics (audio, text), medical transcription (audio, vision, text), captioning (audio/video, transcription, translation) and chatbots (audio, GPT)," Microsoft said in the release. The service has numerous capabilities to date, including converting audio into text captions for a broadcast or extracting the addresses mentioned on a phone call. 

Also: Microsoft Build is this week - here's what to expect, how to watch, and why I'm excited

One highlight of OpenAI's GPT-4o reveal last week was an improved Voice Mode, which focused on the enhanced quality of the voice given to the program's responses. Running to keep up, Microsoft announced it is making Personal Voice generally available. 

The feature lets users "create and use their own AI voices for various applications, such as voice assistants, speech translation, and video content creation," the release explained. 

Another new capability is speech analytics, now available in preview. Accessible within Azure AI Studio, Microsoft's development environment, it is meant to address what the company calls the "soft" analysis of phone calls and other audio sources. A soft element of a call could be semantic content, such as how the caller seems to feel, which is subtler than the literal content of the call itself.

Sentiment analysis could detect details like the "degree of empathy shown, commitment of the participants and strength of the arguments made or even predict possible conversation flows," the release explains. 

In a transcript of a call, for example, each speaker's phrase could be labeled as "positive," "negative" or "neutral." You can check out an interactive demo here.
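To illustrate the kind of output such labeling produces, here is a toy sketch of per-phrase sentiment tagging. This is purely illustrative: it uses a hypothetical keyword list, not Azure's actual sentiment model, and the word sets and transcript are invented for the example.

```python
import re

# Hypothetical keyword lists for illustration only -- Azure's service
# uses trained models, not word matching.
POSITIVE = {"great", "thanks", "happy", "resolved"}
NEGATIVE = {"frustrated", "problem", "angry", "broken"}

def label_phrase(phrase: str) -> str:
    """Label a transcript phrase as positive, negative, or neutral."""
    words = set(re.findall(r"[a-z']+", phrase.lower()))
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

# An invented call transcript, labeled phrase by phrase.
transcript = [
    ("Agent", "Thanks for calling, how can I help?"),
    ("Caller", "My device is broken and I'm frustrated."),
    ("Agent", "I understand, let's get that resolved."),
]
for speaker, phrase in transcript:
    print(f"{speaker}: {label_phrase(phrase)}")
```

The real service would emit labels like these alongside the transcript sections, so downstream tools can aggregate sentiment per speaker or over the course of the call.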

To make quick analysis possible, Microsoft is also rolling out Fast Transcription, which the company claims is "a game changer for transcription at large" because "it can now transcribe 40x faster than real-time (real-time factor<1)." 
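To unpack the quoted figure: real-time factor (RTF) is the standard measure of processing time divided by audio duration, so an RTF of 1 means transcription takes as long as the recording itself, and 40x faster than real time corresponds to an RTF of 0.025. A quick sketch of the arithmetic (the 60-minute call is an assumed example, not a Microsoft benchmark):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds

# A 60-minute call transcribed at 40x real-time speed finishes in 90 seconds.
audio = 60 * 60            # 3600 seconds of audio
processing = audio / 40    # 90 seconds of processing
print(real_time_factor(processing, audio))  # → 0.025
```

At that rate, an hour-long recording comes back in about a minute and a half, which is what makes the near-instant call and interview analysis described below plausible.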

According to the company, Fast Transcription can save call center agents "thousands of hours" by eliminating the need to manually take notes on a call, and doctors and nurses can analyze conversations with patients in seconds. "Media and content creators can analyze and extract insights from podcasts or interviews as soon as they complete," the release continued.

Microsoft said the feature will be made available next month. 

Example of a post-call analysis with a customer.

To meet the need for disseminating content globally, Microsoft also teased automatic video dubbing, which translates content, synthesizes a voice in the target language, and syncs it to the video of the speaker. 

Additionally, the company announced updates to its multi-lingual translation feature, such as the ability to switch languages for captioning while a person is watching a broadcast.
