Microsoft's new record: Speech recognition AI now transcribes as well as a human

Microsoft speech-recognition systems now match professional human transcribers but understand nothing.
Written by Liam Tung, Contributing Writer

Microsoft is applying its work in speech recognition in services such as Speech Translator, which aims to translate presentations in real time for multilingual audiences.

Image: Microsoft

A speech-recognition system developed by Microsoft researchers has achieved a word error rate on par with human transcribers.

Microsoft on Monday announced that its conversational speech-recognition system hit an error rate of 5.1 percent, matching the error rate of professional human transcribers.
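
For readers unfamiliar with the metric, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's scoring pipeline; the function name `word_error_rate` and the sample sentences are made up for illustration, and real evaluations typically normalize text (case, punctuation, filler words) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: one substituted word out of six.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{wer:.1%}")  # prints 16.7%
```

By this measure, a 5.1 percent error rate means roughly one word in 20 differs from the reference transcript.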

Microsoft last year thought its 5.9 percent error rate had achieved human parity, but IBM researchers argued that the milestone would require a system achieving a rate of 5.1 percent, slightly lower than IBM's own lowest word error rate of 5.5 percent.

IBM's study of human transcribers allowed several humans to listen to the conversation more than once, and picked the result of the best transcriber.

As with last year's test, Microsoft's system was measured against the Switchboard corpus, a dataset consisting of about 2,400 two-sided telephone conversations between strangers with US accents.

The test involves transcribing conversations between people discussing a range of topics, from sports to politics, though the conversations are relatively formal in nature.

Unlike in last year's test, Microsoft didn't measure its system against another dataset called CallHome, which includes open-ended and more casual conversations between family members. Error rates on CallHome are more than double those on Switchboard for both humans and machines.

Still, Microsoft did manage to shave 12 percent off last year's Switchboard results after tweaking its neural-net acoustic and language models.

"We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels," said Xuedong Huang, a technical fellow at Microsoft.

"Moreover, we strengthened the recognizer's language model by using the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation."

Despite the new milestone, Microsoft acknowledges machines still find it tough to recognize different accents and speaking styles, and don't perform well in noisy conditions.

And although Microsoft was able to train its models to use context to transcribe a conversation more accurately, it has a way to go before it can train a computer to actually understand the meaning of a conversation.

Google earlier this year announced its systems achieved a 4.9 percent word error rate, though it's not known what test it used.

