Microsoft researchers have developed a system that recognizes speech as accurately as a professional human transcriptionist.
Researchers and engineers from Microsoft's Artificial Intelligence and Research group have set a new record in speech recognition, achieving a word error rate of 5.9 percent, down from the 6.3 percent reported a month ago.
The word error rate is the percentage of words in a conversation that a system, in this case a combination of neural networks, gets wrong. Microsoft's system performed as well as humans who were asked to listen to the same conversations.
Microsoft sized its machines up against professional transcribers who were tasked with transcribing the same evaluation data: recorded telephone calls that included two-way conversations between strangers and a separate set of open-ended conversations between friends and family.
Both the humans and Microsoft's automated system scored error rates of 5.9 percent and 11.3 percent on the respective test sets.
Each score is an umbrella figure combining three error types: how many times Microsoft's system and the human transcribers wrongly substituted one word for another, dropped a word from a sentence, or inserted a word that wasn't spoken.
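The metric described above can be sketched in a few lines. This is an illustration of how word error rate is conventionally computed from substitutions, deletions, and insertions via an edit-distance alignment, not Microsoft's own scoring tool; the function name and example sentences are invented for the demonstration.

```python
# Word error rate (WER): the minimum number of substitutions, deletions,
# and insertions needed to turn the hypothesis into the reference,
# divided by the number of words in the reference transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words (Levenshtein dynamic programming)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference gives a 20 percent WER.
print(word_error_rate("please call me back later",
                      "please call me back soon"))  # → 0.2
```

A 5.9 percent word error rate therefore means roughly one error of any of the three kinds per seventeen words spoken.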
As Microsoft notes in the paper, humans and the automated system mostly fumbled over the same sounds in the tests, with the exception of "uh-huh" and "uh".
Microsoft's system was confused by the sounds "uh-huh", which can be a verbal nod for someone to keep speaking, and "uh", used as a hesitation in speech. The utterances sound the same but have opposite meanings, and humans had far fewer problems telling them apart than Microsoft's system did.
The transcriptionists, for some reason, frequently dropped the word "I" from two-way conversations, and did so far more often than Microsoft's AI.
Overall, Microsoft notes, humans had a lower substitution rate and a higher deletion rate, while both humans and the machine produced a low number of insertions.
"The relatively higher deletion rate might reflect a human bias to avoid outputting uncertain information, or the productivity demands on a professional transcriber," Microsoft speculates.
Still, to achieve parity with a human in this test was an "historic achievement", said Xuedong Huang, Microsoft's chief speech scientist.
Improved automated speech-recognition systems could be used in speech-to-text transcription services and enhance Cortana's accessibility features, say, for deaf people. However, that prospect still appears to be some way off.
Microsoft used 2,000 hours of training data to equip its neural networks for the task. It claims that by parallelizing training with its Computational Network Toolkit (CNTK) across a Linux-based multi-GPU server farm, it was able to cut training times from months to under three weeks.
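The idea behind that speedup can be sketched in miniature. The following toy example, which is a conceptual illustration only and not CNTK code, simulates data-parallel training sequentially: the batch is split into shards (one per "GPU"), each shard produces its own gradient, and the gradients are averaged before the shared weight is updated. All names and numbers here are invented for the demonstration.

```python
# Conceptual sketch of data-parallel training for a 1-D linear model
# y = w * x, fitted by gradient descent on a mean-squared-error loss.
def gradient(w, shard):
    # Gradient of mean((w*x - y)^2) with respect to w over one data shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_workers, lr=0.01):
    # Split the batch into one shard per simulated worker.
    shards = [batch[i::n_workers] for i in range(n_workers)]
    grads = [gradient(w, s) for s in shards]   # each "GPU" works independently
    return w - lr * sum(grads) / len(grads)    # average gradients, then update

batch = [(x, 3.0 * x) for x in range(1, 9)]    # data generated with weight 3.0
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, n_workers=4)
print(round(w, 3))  # converges toward the true weight, 3.0
```

In a real system the shards' gradients are computed simultaneously on separate GPUs rather than in a Python loop, which is where the wall-clock savings come from; the averaged update keeps every worker's copy of the model in sync.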
Despite the milestone, Microsoft admits it's still a long way from achieving speech recognition that works well in real-life settings with lots of background noise.
For example, a live transcription service cannot yet identify and assign names to multiple speakers who may have different accents, ages, and backgrounds. However, the company says it's working on the technology, which could open up a whole set of possibilities.