Speech recognition, the unreachable frontier

Summary: Voice input has been a pipe dream since Star Trek, with crew members talking into the air and the computer understanding every spoken word. We aren't much closer to that scenario now than we were a decade ago.


The author has been actively involved with speech recognition for over ten years, including intensive training by IBM on its ViaVoice technology, resulting in designation as a Speech Recognition Specialist.

Ubiquitous voice input is like the paperless office

When speech recognition first hit the PC years ago, it came with the promise of a voice-controlled world. We would run our computers and dictate content with our hands firmly off the keyboard.

Reality soon set in as it became clear that computers, and the apps that listened to us, weren't very accurate. It didn't help that early speech recognition programs were so resource-intensive that they required heavy-duty PCs just to run.

Computing hardware improved over the following years and the programs got more accurate at interpreting what humans spoke. Even so, it wasn't accurate enough to get many to put their hands in their pockets and talk to the computer.

We now have speech recognition on smartphones, tablets, and PCs, but aside from dictating short phrases, few owners use it. Apple's introduction of Siri and her voice-centric input briefly rekindled interest in speech input. While you'd often see someone speaking to Siri in the beginning, I can't remember the last time I saw it.

Even with over a decade of evolution in speech recognition, it's still not accurate enough to draw users in. Companies making the technology are quick to tell us their products are 90+ percent accurate at interpreting speech, but that's still not good enough. That's an admission that 5 to 10 of every 100 spoken words will not be correctly translated to digital text.

It doesn't help that editing incorrectly interpreted speech by voice is an exercise in frustration.

I know some people who dictate text messages into their phones and are happy doing so. When I watch them, however, it's not uncommon to see them discard a bad recognition and try again, sometimes over and over. In those instances it would have been faster to just thumb-type that short message into the phone.

Anything longer than a brief message fares even worse, with errors popping up regularly. The more ambient noise the worse the interpretation that results. Even speech input in a totally quiet environment using a high-quality noise-cancelling headset only gets you 90-95 percent recognition accuracy.

So when are we going to see speech recognition good enough to become ubiquitous? Entering text by speech would be easier than typing for some folks. It's not that they are avoiding it; it's that it's not very good.

I've been trying to use speech recognition for over a decade, and what I see today is only a little better than what I saw back then. The hardware is much better than it was in the early days of speech recognition, but that just gets you to the incorrectly interpreted text faster.

Talking into the phone or other device is OK for short entries like text messages, but longer than that and all bets are off. The pipe dream of years ago is still a pipe dream, and that's a shame.

  • They're a mixed blessing

    An awful lot of call centres are starting to use voice recognition as a technique to further distance the customers from their one and only objective - to talk to a human being.

    On the other hand a bad voice recognition system is usually better than trying to converse with a human in a noisy call centre, who barely speaks English in a heavy foreign accent, at a pace of someone commentating on the last furlong of a very important horse race.
  • AI

    Well, if you've studied it for ten years, you should know by now that language is a multi-level communication system with a lot of information noise in normal conditions, one that requires a lot of context and expectations of what's to come in order to interpret the input correctly. Most humans have a difficult time doing this, even though they have a much better grid of distributed computing units with an intelligent learning algorithm and a huge experience database.

    So no, there won't be any speech recognition software that will be usable in the decade to come. I doubt the following decade will make more progress in this area, because what is needed is an actual artificial intelligence that knows you, learns from you, asks for feedback, and has an expectation of what might be said next.

    Way to go...
  • As usual, I disagree...

    Both Google Now and Siri have drawn plenty of people in for speech recognition...

    Even Dragon Dictate has drawn some people in, but the reality is that most people don't see this as a need. When they see the Xbox One or PS4 doing it, they're like, oh, that's pretty cool, but I don't really want that all the time.
  • Every time the subject comes up...

    I'm reminded of the voice mail message that told the system to shutdown... and it did.
    • Why didn't Dave think of that?

      Instead of spacewalking into the processor to pull those modules out! Oh, I forgot, in the movie HAL was actually SMART! It could even lip read!
  • Um

    I agree that for text input it's just not there (although for short questions and commands Google Now is pretty good). Part of the problem is that when you pause it doesn't necessarily mean you're finished, you may just be thinking (and nobody ever wanted to include "um" in a text). It might be useful to have specific non-words to indicate that you want to pause and that you're finished.

    As far as noisy environments, over the years I've wondered if a throat mike would help but they're too expensive for testing on a whim. I'm not quite committed enough to get Moto's throat tattoo.
  • Google Now...

    Google Now has improved dramatically lately, and I have become more and more reliant on it as a tool. It understands natural language, is super accurate, and even uses some context. Although it is no Star Trek, it is the first time it feels close to getting there.
  • More emphasis on context

    I majored in foreign languages in college and spent a year in a PhD program in linguistics (the *science* of language) and speech recognition has been a particular interest of mine for a long time.

    I agree that language is highly context-based and without a *sentient* listener we will never get anywhere close to the Star Trek-type units.

    But both voice recognition *and* OCR suffer from the same basic problem: the developers are so focused on new features that they ignore obvious and easily solved problems. I hate it when OCR can't distinguish "r n" (without a space) from "m" and therefore misspells a word. How hard is it to write a routine that, when it sees that a word with a non-initial m is misspelled, substitutes "r n" and, if that produces a real word, uses it instead? The same for putting a 1 in the middle of a word instead of a lower-case l, or "andlor" instead of "and/or", or some garbled mess with l's or 1's instead of d/b/a and a/k/a.

    One of the things needed for both VR and OCR is a "likelihood" ranking. There are professional versions of OCR and VR for attorneys and medical personnel with huge amounts of specialized vocabulary. But the average person doesn't have a common-use vocabulary of tens of thousands of words. OCR/VR should focus on getting the common words right in context.

    I majored in German in college, and a while back I took a German-language article and ran it through Google Translate; I was astounded at how accurate it was (although admittedly not perfect). I read that the accuracy comes from Google having added *billions* of pages to its specialized database where highly competent translators have put the text in multiple languages. (Examples are the Bible, many modern novels, and hundreds of thousands of EU government documents that have to be put into each major EU language.) When Translate sees a phrase in the source language, it looks for the same phrase in its database and a translation into the target language.

    The same approach could be used much more effectively for VR and OCR than has been done so far. (Admittedly, Nuance's deal with Apple a while back, offering free VR to massively increase Nuance's data sample, was a start.)
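    [Editor's note: the confusion-pair routine this comment describes can be sketched in a few lines of Python. This is a hypothetical illustration, not any real OCR product's code; the toy dictionary, the confusion pairs, and the function name `correct_ocr_word` are all invented for the example.]

```python
# Hypothetical sketch of the commenter's confusion-pair idea: if a word
# containing a non-initial "m" is not in the dictionary, try substituting
# "rn" (and similarly "1" -> "l") and keep the result if it is a real word.

KNOWN_WORDS = {"corner", "burn", "modern", "milk", "learn", "form"}  # toy dictionary

# (misrecognized substring, likely intended substring)
CONFUSION_PAIRS = [("m", "rn"), ("1", "l")]

def correct_ocr_word(word, dictionary=KNOWN_WORDS):
    """Return a corrected word when one confusion-pair substitution turns
    a non-word into a dictionary word; otherwise return it unchanged."""
    if word in dictionary:
        return word
    for bad, good in CONFUSION_PAIRS:
        # Skip the first letter for "m" -> "rn", per the non-initial rule.
        idx = word.find(bad, 1 if bad == "m" else 0)
        while idx != -1:
            candidate = word[:idx] + good + word[idx + len(bad):]
            if candidate in dictionary:
                return candidate
            idx = word.find(bad, idx + 1)
    return word
```

    With the toy dictionary above, `correct_ocr_word("comer")` yields "corner" and `correct_ocr_word("mi1k")` yields "milk", while dictionary words pass through untouched. A production version would rank candidates by likelihood, as the comment suggests, rather than taking the first dictionary hit.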
  • The nightmare

    I'm still skeptical about voice recognition for the simple reason that I share an office with three other people. The idea of all four of us talking to our PCs all day fills me with dread.

    I'm also not prepared to talk to my smartphone during my commute.

    Voice recognition does have its place (in the car, or in a private office), but I'm keeping hold of my keyboard and 70wpm typing for the time being.
    Brian O'Blivion
  • Humans don't talk in soliloquies.

    Dictation is as much a learned skill for people as it is for computers. James opines that "Talking into the phone or other device is OK for short entries like text messages, but longer than that and all bets are off."

    But talking in short sentences or phrases is how most humans communicate on a daily basis. And computers know how to translate those short phrases into text rather well, and to understand those words in the proper context.

    For example, in composing this comment, I wished to use the word, "soliloquies". I didn't know how to spell it and my first guess was as clueless to the computer as it would be for most humans tasked to guess what the correct spelling would be from my initial misspelled efforts.

    However, I used OS X's built-in speech-to-text capability (technology "borrowed", so they say, from Nuance), and "Siri" accurately gave the right spelling for that word on the first try.

    As others have pointed out, I would characterize the current computer AI technology for understanding human speech as fairly advanced compared to the technology of IBM's ViaVoice for Windows (Release 10) era. BTW, I never could get that technology to work for me ten years ago. I suspect that is so for a number of reasons, not the least of which is that today's technology, as in the case of Siri, uses the power of mainframes and large data sets rather than relying on a user's own mobile or desktop hardware/software alone.

    I suspect what James really meant to convey was that consumer speech recognition technology isn't as advanced as IBM's Watson technology. But as cloud computing becomes more prevalent, Star Trek's vision should be realized by the turn of the decade, IMO.
  • Indeed

    Given how long Dragon has been trying to crack this nut, speech recognition should be (have been!) THE killer app. Ultimately I think it will be, but we're not even close today. (Why are any of us still using remote controls for our TVs? Why can't we just tell them what to do, given the relatively few commands it would require?) I do have a co-worker who swears by Dragon for creating email content, and I also see my son talking to our Xbox on occasion. However, a great example of a real-world failure is the voice system the local cable company has implemented. It simply asks why you're calling today. Naturally this type of system might work if I had a billing question, but I'm typically calling about a more complex technical issue. I don't even know how to respond to the robotic system because it doesn't know how to handle even the simplest instructions.
    • Xbox 1 works well as the TV remote

      So I can use voice commands to change channels, volume, etc. Though I currently need to use the remote for the DVR, the potential is there for future updates to add that capability.
    • already there

      Various smart TVs already accept voice input. There's a mic in the remote of the higher-spec Samsung smart TVs, for example.
  • The key is understanding, not word-for-word transcription

    I can claim deeper experience in speech recognition than the author, having founded and run a speech recognition company in the 1980s and published a paid-subscription newsletter for over 20 years (Speech Strategy News) that covers the area. The author of this piece accurately describes the barrier to using voice to create text that we intend to be read. Such dictation is a skill that takes time to develop, and as the author indicates, even a few errors can be frustrating to correct on a smartphone.

    Since Nuance Communications has some $400 million of its annual revenue based on medical speech-to-text, the technology itself is clearly accurate enough for motivated users on something other than a mobile phone.

    Siri and Google's voice search are speech UNDERSTANDING systems, and this is the future of speech recognition for the general public. While they can be used to dictate a text message or even email, the key functionality is to answer a request in one step, e.g., find a restaurant near you and provide a review, or play a specific song. They can often do so despite small errors in the speech recognition, provided those errors don't change the core meaning of the request.
  • The FUTURE of speech recognition

    Speech recognition CAN work well if you take the time to "teach" the recognition software. However, ubiquitous use across programs requires something that still does not exist:

    1) A uniform "base speech" standard. In other words an industry wide acceptance of "base" oral translation.
    2) A transportable, uniform standard for recording individual speech variations.

    Without these two items, speech recognition will continue to stall. No matter how good an individual product is, today you must "teach" each new application separately.

    Without a "base" standard, and then a way to capture and transport a "variation" standard across applications, there will be little progress. With the cloud and various web services, this could be accomplished by storing the individual speech-variation standards on the web and making them available to all applications. However, this demands that the standards be defined and adhered to. Then, and only then, will all of the effort to "teach" a device our individual speech make any sense.
  • It takes too many electrons

    I met a voice recognition engineer from IBM at a conference a few years ago who had worked on their system for many years. He reported that it takes a lot of electrons (my words, not his) to do the job that the human brain does rather effectively, that is, interpret spoken language. I should note that in the recent Jeopardy series where past leading winners competed against IBM's Watson, the questions (answers) were fed into Watson via direct textual input and not via speech-to-text. I suspect that gives you an idea of how many electrons are apparently needed...
  • The advantage of the written word...

    ...is that it is more precise than the spoken word can be, it's a lot easier to modify than the spoken word can be; and you don't have to be quite so careful about what you say (a concept familiar to dog owners and the parents of small children).
    John L. Ries
  • even with training it couldn't distinguish "test" from "text"

    I first worked with speech recognition in the 80s and like AI it remains an unrealized promise. Maybe in another 20 or 30 years but don't hold your breath. Yep it works, sorta, on a spotty basis but as a mass market thing it still needs a lot of improvements.

    Then there is the guy I worked with whom I could never understand; I'd just tell him to explain it in an email :-) There are huge differences from person to person in the way they speak, some being a lot easier to understand than others.
  • Observation regarding Star Trek

    Most commands given to the computer are database queries (and structured in much the same way as SQL). I do think the Star Trek scenario is more realistic than HAL from 2001, but I've long suspected that terminals still exist in the former world; we just never see them.
    John L. Ries
    • Command processing has improved

      considerably from the earlier days, though you still need to phrase commands exactly as expected: "Xbox snap Xbox Music", for instance, is the needed phrase; "Xbox snap music" doesn't work.

      Sadly, there isn't a solution yet for multiple Xbox Ones in the same room, as they are both listening for "Xbox".