Speech recognition, the unreachable frontier

Voice input has been a pipe dream since Star Trek, with crew members talking into the air and the computer understanding every spoken word. We aren't much closer to that scenario now than we were a decade ago.


The author has been actively involved with speech recognition for over ten years. This includes intensive training by IBM on its ViaVoice technology, resulting in designation as a Speech Recognition Specialist.

Ubiquitous voice input is like the paperless office

When speech recognition first hit the PC years ago, it came with the promise of a voice-controlled world. We would be able to run our computers and dictate content with our hands firmly off the keyboard.

Reality soon set in as it became clear that computers, and the apps that listened to us, weren't very accurate. It didn't help that the resource-intensive speech recognition programs required heavy-duty PCs just to work.

Computing hardware improved over the following years, and the programs got more accurate at interpreting what people said. Even so, they weren't accurate enough to get many people to put their hands in their pockets and talk to the computer.

We now have speech recognition on smartphones, tablets, and PCs, but aside from dictating short phrases, few owners are using it. Apple's introduction of Siri and her voice-centric input briefly rekindled interest in speech input. While you'd often see someone speaking to Siri in the beginning, I can't remember the last time I saw it.

Even after more than a decade of evolution, speech recognition is still not accurate enough to draw users in. Companies making the technology are quick to tell us their products are 90+ percent accurate at interpreting speech, but that's still not good enough. It amounts to an admission that 5 to 10 of every 100 spoken words will not be correctly converted to digital text.

It doesn't help that editing incorrectly interpreted speech by voice is an exercise in frustration.

I know some people who dictate text messages into their phones, and they are happy doing that. When I watch them, however, it's not uncommon to see them trash a bad recognition and do it again. Sometimes they do it over and over. In those instances it would have been faster to just thumb-type that short message into the phone.

Anything longer than a brief message fares even worse, with errors popping up regularly. The more ambient noise, the worse the interpretation. Even speech input in a totally quiet environment using a high-quality noise-cancelling headset only gets you 90-95 percent recognition accuracy.

So when are we going to see speech recognition good enough to become ubiquitous? Entering text by speech is easier than typing for some folks. It's not that they are avoiding it; it's that the technology is not very good.

I've been trying to use speech recognition for over a decade, and what I see today is only a little better than what I saw back then. The hardware is much better than it was in the early days of speech recognition, but that just gets you to the incorrectly interpreted text faster.

Talking into the phone or other device is OK for short entries like text messages, but for anything longer, all bets are off. The pipe dream of years ago is still a pipe dream, and that's a shame.