Apple's Siri is sassy, clever and occasionally useful.
But how the hell does it really work?
"Voice recognition" is what Siri does, but those words alone don't reveal how the system actually gets your words right when you say, "Send message to Jason Perlow: Go get a shave, Linux Lover."
Fortunately, a lengthy feature article over at our sister site SmartPlanet has the dirt, step by step:
The sounds of your speech were immediately encoded into a compact digital form that preserves its information.
The signal from your connected phone was relayed wirelessly through a nearby cell tower and through a series of land lines back to your Internet Service Provider where it then communicated with a server in the cloud, loaded with a series of models honed to comprehend language.
Simultaneously, your speech was evaluated locally, on your device. A recognizer installed on your phone communicates with that server in the cloud to gauge whether the command can be best handled locally -- such as if you had asked it to play a song on your phone -- or if it must connect to the network for further assistance. (If the local recognizer deems its model sufficient to process your speech, it tells the server in the cloud that it is no longer needed: "Thanks very much, we're OK here.")
The server compares your speech against a statistical model to estimate, based on the sounds you spoke and the order in which you spoke them, what letters might constitute it. (At the same time, the local recognizer compares your speech to an abridged version of that statistical model.) For both, the highest-probability estimates get the go-ahead.
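To make the "highest-probability estimates get the go-ahead" idea concrete, here is a toy sketch in Python. The scores, the sound labels and the function name `best_sounds` are all invented for illustration; none of this comes from Apple or the SmartPlanet piece. A production recognizer searches over whole sequences rather than picking each slice of audio independently.

```python
# Toy per-frame scores: the estimated probability of each candidate sound
# for three consecutive slices of audio. All values are made up.
frame_scores = [
    {"s": 0.7, "z": 0.2, "f": 0.1},
    {"eh": 0.6, "ih": 0.3, "ah": 0.1},
    {"n": 0.5, "m": 0.4, "d": 0.1},
]


def best_sounds(frames):
    """Return the highest-probability sound label for each frame."""
    return [max(scores, key=scores.get) for scores in frames]


sounds = best_sounds(frame_scores)
print(sounds)
```

The abridged on-device model described above would look the same in this sketch, just with a smaller table of candidate sounds to choose from.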
Based on these opinions, your speech -- now understood as a series of vowels and consonants -- is then run through a language model, which estimates the words that your speech is comprised of. Given a sufficient level of confidence, the computer then creates a candidate list of interpretations for what the sequence of words in your speech might mean.
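A rough way to picture the language-model step: take a few transcriptions that sound alike, weigh each by how plausible its word sequence is, and keep a ranked candidate list. The sketch below assumes a tiny bigram table and made-up acoustic scores; `rank`, `candidates` and `bigram_prob` are hypothetical names, not anything from Siri itself.

```python
# Hypothetical acoustic scores for two transcriptions that sound alike.
candidates = {
    ("send", "message"): 0.55,
    ("sand", "massage"): 0.45,
}

# Hypothetical bigram probabilities: how likely the second word is to
# follow the first in everyday language.
bigram_prob = {
    ("send", "message"): 0.20,
    ("sand", "massage"): 0.001,
}


def rank(candidates, bigram_prob, floor=1e-6):
    """Combine acoustic and language-model scores into a ranked list.

    Returns (confidence, words) pairs, best first, with confidences
    normalized so they sum to 1 across the candidates.
    """
    scored = []
    for words, acoustic in candidates.items():
        lm = 1.0
        # Multiply the probabilities of each adjacent word pair;
        # unseen pairs fall back to a small floor value.
        for pair in zip(words, words[1:]):
            lm *= bigram_prob.get(pair, floor)
        scored.append((acoustic * lm, words))
    total = sum(score for score, _ in scored)
    return sorted(((score / total, words) for score, words in scored), reverse=True)


ranked = rank(candidates, bigram_prob)
best_confidence, best_words = ranked[0]
print(best_words, round(best_confidence, 3))
```

Even though the two candidates score similarly on sound alone, the word-pair probabilities push "send message" far ahead, which is the kind of confidence gap the passage describes.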
If there is enough confidence in this result, and there is -- the computer determines that your intent is to send an SMS, Erica Olssen is your addressee (and therefore her contact information should be pulled from your phone's contact list) and the rest is your actual note to her -- your text message magically appears on screen, no hands necessary. If your speech is too ambiguous at any point during the process, the computers will defer to you, the user: did you mean Erica Olssen, or Erica Schmidt?
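The last step, working out the intent and deferring to the user when a name is ambiguous, can be sketched like this. It is a minimal illustration under heavy assumptions: the contact list, the `interpret` function and the "send message to NAME: BODY" pattern are all hypothetical, and a real assistant would not rely on a single regular expression.

```python
import re

# Hypothetical contact list standing in for the phone's address book.
CONTACTS = ["Erica Olssen", "Erica Schmidt", "Jason Perlow"]


def interpret(utterance, contacts=CONTACTS):
    """Map a transcribed command to an action, or ask the user to choose."""
    match = re.match(
        r"send (?:a )?message to ([^:]+):\s*(.+)", utterance, re.IGNORECASE
    )
    if match is None:
        return {"action": "unknown"}

    spoken_name = match.group(1).strip()
    body = match.group(2).strip()

    # Find every contact whose name contains what the user said.
    matches = [c for c in contacts if spoken_name.lower() in c.lower()]

    if len(matches) == 1:
        # Unambiguous: enough confidence to compose the text message.
        return {"action": "send_sms", "to": matches[0], "body": body}
    # Zero or several matches: defer to the user.
    return {"action": "ask_user", "choices": matches}


print(interpret("Send message to Erica Olssen: Running late"))
print(interpret("Send message to Erica: Running late"))
```

The first call resolves to a single contact and produces a ready-to-send message; the second matches two Ericas, so the sketch hands the choice back to the user, mirroring the "did you mean" prompt described above.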
There's a whole lot more to learn in the article, including a history of the research behind the technology and a look at what Google, Microsoft and others want to do with it. (What are you waiting for? Go read it.)
Voice recognition has been around in some form for years, but it's pretty neat to see exactly what happens when you press that button.