Alexa: The good, the bad, and the creepy
It was, of course, inevitable. "Alexa, this. Alexa, that." After a while, we all just knew Alexa was going to ... what was it that guy on Twitter said? "There's a good chance I get murdered tonight."
Admit it. If you're one of the millions of Alexa owners, you've noticed some odd behaviors. If you're like my wife and me, you've probably, maybe even more than once, wondered just how long it would be before our AI overlords rise up and put us down.
Let's cover the back story pretty fast, since it's been written about elsewhere. Alexa has been known to suddenly exhibit weird behaviors. In January, I wrote about how Alexa suddenly started to speak without being woken up by a wake word.
A few weeks ago, tech columnist Farhad Manjoo wrote in the New York Times about how his Alexa startled him in bed one night by screaming. All across the Internet this week, we've been hearing stories about Alexas breaking out with unbidden, evil-sounding laughter.
Relax. Your Alexa isn't haunted (probably). Your Alexa isn't going to murder you in your sleep (it doesn't have hands or feet). And, your Alexa isn't going insane. Well...
Actually, by definition, that last claim may not be entirely true. According to The Google, one definition of insanity involves being "in a state of mind that prevents normal perception, behavior, or social interaction."
When it comes to Alexa, everything is about perception and behavior.
Alexa is triggered by what's called a "wake word." It will respond to a wake word of Alexa, Echo, Amazon, or Computer, depending on which you choose in your Amazon Echo preferences. Other voice assistants also use wake words. Siri uses "Hey, Siri." Google Home uses "Okay, Google." Windows 10 responds to "Hey, Cortana," and soon, just "Cortana," a name borrowed from Master Chief's AI assistant in the Halo video game series.
For now, we'll just talk about the wake word "Alexa," and the Alexa devices. But what I'm about to discuss applies to any active listening voice recognition system.
Alexa, and the other AI voice systems, have overcome (at least mostly) a huge technical challenge. How do you filter through all the noise (literally, noise) in an environment and know when to respond?
The way developers are solving it now is to listen for a wake word, or, essentially, a specifically defined sound wave form. Alexa's microphones are always on. The vibration of the diaphragm on each mic is converted to a digital signature.
The processing hub inside each Alexa device then examines that digital signature, and if it matches that of a pre-defined wake word, then and only then is the device supposed to parse follow-on sounds for meaning.
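To make that matching step concrete, here's a toy sketch in Python. Amazon hasn't published how its detector works, and real wake-word engines almost certainly rely on trained models rather than anything this simple. The sketch just slides a stored template over incoming samples and scores how closely each window lines up. The function names, the sample values, and the 0.9 threshold are all my own assumptions for illustration.

```python
import math

def normalized_correlation(a, b):
    """Cosine similarity between two equal-length sample windows (1.0 = same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def detect_wake_word(stream, template, threshold=0.9):
    """Slide the template over the stream and return the first index whose
    window scores at or above the threshold, or None if nothing matches."""
    n = len(template)
    for start in range(len(stream) - n + 1):
        window = stream[start:start + n]
        if normalized_correlation(window, template) >= threshold:
            return start
    return None

# Hypothetical data: a made-up "wake word" shape, then the same shape
# spoken at a quarter of the volume and surrounded by faint noise.
template = [0.0, 0.8, -0.6, 0.9, -0.7, 0.2]
stream = [0.01, -0.02, 0.0] + [0.25 * s for s in template] + [0.02, -0.01]

match_index = detect_wake_word(stream, template)
```

Because the score compares shape rather than loudness, the quieter copy still matches. Everything after `match_index` would be the follow-on sound the device goes on to parse.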
Sifting through all that noise for a wake word is a non-trivial programming problem. Take a look at the wave form below.
That's the word "Alexa." I recorded that on a professional studio microphone, with my head and mouth located at the exact optimal location for voice recording. I recorded this in a silent room, with everything except my computer turned off.
Now, look what happens to that wave form when I walk just five feet away and repeat the word, "Alexa."
As you can see, the bumps are still relatively noticeable, but the amplitude of the wave is considerably less.
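If you want to put a number on that drop, the usual yardstick is RMS level expressed in decibels. The sketch below uses a synthetic tone as a stand-in for my recordings, and the quarter-amplitude "far" signal is an assumption for illustration, not a measurement of the wave forms above. It also shows the simplest countermeasure, which is scaling a quiet signal back up to a target level before trying to recognize it.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_change_db(near, far):
    """How many decibels quieter (negative) or louder the far recording is."""
    return 20 * math.log10(rms(far) / rms(near))

def normalize(samples, target_rms=0.1):
    """Scale samples so their RMS level equals target_rms."""
    current = rms(samples)
    if current == 0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]

# Hypothetical signals: a pure tone up close, and the same tone at a
# quarter of the amplitude to mimic stepping away from the microphone.
near = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
far = [0.25 * s for s in near]

drop_db = level_change_db(near, far)   # roughly -12 dB
boosted = normalize(far)
```

Boosting the signal restores the amplitude, but it boosts any background noise right along with it, which is why distance still costs you real information.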
Somehow, the processor on the Alexa device has to recognize that the wave form it's just heard corresponds to a command to wake up and listen. The device accomplishes that in a couple of ways.
First, it has multiple microphones, so it's able to pick up different sound wave structures on each mic. Because the mics are arrayed around the device, each mic will pick up a given sound event at a very slightly different time, and with a very slightly different wave.
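Here's a rough sketch of how that timing difference can be measured. It's a generic cross-correlation trick, not a description of Amazon's firmware: shift one mic's signal against the other and keep the shift that lines the two up best. The pulse shape, the three-sample delay, and the 16 kHz sample rate are all made up for the example.

```python
def best_lag(mic_a, mic_b, max_lag):
    """Return the shift, in samples, at which mic_b best lines up with mic_a.
    A positive result means the sound reached mic_b later than mic_a."""
    def score(lag):
        return sum(
            mic_a[i] * mic_b[i + lag]
            for i in range(len(mic_a))
            if 0 <= i + lag < len(mic_b)
        )
    return max(range(-max_lag, max_lag + 1), key=score)

# Hypothetical sound event: the same pulse arrives three samples later,
# and slightly quieter, at the second microphone.
pulse = [0.0, 0.5, 1.0, 0.5, 0.0]
mic_a = [0.0] * 4 + pulse + [0.0] * 8
mic_b = [0.0] * 7 + [0.9 * p for p in pulse] + [0.0] * 5

lag = best_lag(mic_a, mic_b, max_lag=5)
SAMPLE_RATE_HZ = 16000                    # assumed sample rate
delay_seconds = lag / SAMPLE_RATE_HZ
```

Delays like this, gathered across several mic pairs, are what let a device estimate where a sound came from and favor that direction.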
If, and only if, the device determines that the sound it just heard is the wake word, then it starts processing the follow-on sounds.
But we're not ready to discuss Alexa's command processing just yet. Remember that I recorded the waves shown above in an optimal studio environment. Parsing the wake word by Alexa would be easy if it always lived in an optimal studio environment.
But Alexa doesn't.
The real technical challenge that voice assistant vendors like Amazon have to overcome is variety. There are millions of Alexa owners, and you can bet many of them say "Alexa" very differently. They may have different accents, and they certainly have different voices, tones, pitches, and rates of speech.
They also have a wide range of background noises. A car door may slam. A TV may be on. Music may be playing in another room. A dog might be barking. A fan might be generating a blanket of white noise. You get the idea.
Through all of that variety, somehow, Alexa has to determine if it's been woken up by the word "Alexa."
Given that there are millions of devices, situations, and voices, you can begin to see the challenge that the developers had in making invocation work reliably. You can't have Alexa wake up spontaneously, or that would be disturbing. On the other hand, if Alexa doesn't respond when spoken to, that would also be very frustrating to users.
Building a machine learning system that can parse all those variables, striking a practical balance between too many false positives and too many ignored requests, is (and I'll use the phrase again) non-trivial.
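That balancing act boils down to picking a threshold. Here's a small sketch, with detector scores I invented for the purpose, showing how one confidence cutoff trades spontaneous wake-ups against ignored requests. Nothing here reflects Amazon's actual numbers.

```python
def error_rates(scored_clips, threshold):
    """scored_clips is a list of (score, is_wake_word) pairs.
    Returns (false_accept_rate, false_reject_rate) at the given threshold."""
    negatives = [score for score, is_wake in scored_clips if not is_wake]
    positives = [score for score, is_wake in scored_clips if is_wake]
    false_accepts = sum(1 for score in negatives if score >= threshold)
    false_rejects = sum(1 for score in positives if score < threshold)
    return false_accepts / len(negatives), false_rejects / len(positives)

# Hypothetical scores: four real "Alexa" utterances, four background sounds.
clips = [
    (0.95, True), (0.85, True), (0.70, True), (0.55, True),
    (0.75, False), (0.40, False), (0.30, False), (0.10, False),
]

lenient = error_rates(clips, threshold=0.5)   # wakes for a stray noise
strict = error_rates(clips, threshold=0.8)    # ignores quieter speakers
```

With these made-up scores, the lenient setting wakes up for one of the four background sounds, while the strict setting ignores two of the four genuine requests. No single cutoff gets both to zero.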
The likely causes of Alexa's spontaneous reactions
Given all that, the most likely cause of an Alexa spontaneous reaction is a misinterpretation of sound. Because Alexa has to be so sensitive to catch wake words, it will sometimes react to a sound (even one we might not hear or notice) and interpret that as a wake word.
Although considerably rarer, there's also the possibility that an update changed Alexa's code and introduced a bug.
There's also the problem of the internet and Alexa's cloud-based AI system. Let's talk about that, next.
How a command is interpreted
Alexa responds to lots and lots of commands. Parsing all those wave forms is way too much work for the processor on the local Alexa device. To do that processing, Alexa relies on Amazon's cloud infrastructure.
Although Amazon hasn't disclosed the exact technical details of Alexa's internal functions, we know the complex parsing problem for all those commands is too much for a local CPU. The sound wave (or, some compressed representation of it) has to be uploaded to Amazon's data centers for computational analysis.
Once uploaded, the Alexa back-end AI has one very large task: match the sound wave to a specific Alexa command string.
Alexa has a large library of possible commands. Not only are there the native commands, like reminders and time requests, but there are all the commands associated with Alexa's ever-growing skills library.
The skills library is Alexa's version of an app store, where outside, non-Amazon developers can build custom code that waits for a certain Alexa command and then executes some behavior.
We'll come back to the skills in a moment, but for now, let's continue with the challenge of looking up the proper command.
To increase the usability of Alexa, it has to be able to respond to variations of a given command. For example, Alexa has to be able to process "Alexa, tell me the time," as well as "Alexa, what time is it." Most AI systems handle this problem by ignoring filler words (such as "is" and "the") and converting sounds into sound stems and normalized sequences. Essentially, this allows the system to take a variety of utterances and treat them as the same command.
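Here's a minimal sketch of that idea. It works on text that has already been transcribed, rather than on sound stems, because that's far easier to show. The filler list, the crude plural-stripping "stemmer," and the intent names are my own stand-ins, not Alexa's actual vocabulary or API.

```python
import string

# Hypothetical filler words and intents, chosen only for this example.
FILLER_WORDS = {"alexa", "is", "the", "it", "me", "a", "please"}
INTENT_KEYWORDS = {"get_time": {"time"}, "laugh": {"laugh"}}

def normalize(utterance):
    """Lowercase, strip punctuation, drop filler words, and crudely stem plurals."""
    cleaned = utterance.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in cleaned.split() if t not in FILLER_WORDS]
    return {t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens}

def match_intent(utterance):
    """Return the first intent whose keywords all appear in the utterance, else None."""
    tokens = normalize(utterance)
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords <= tokens:
            return intent
    return None

first = match_intent("Alexa, tell me the time")
second = match_intent("Alexa, what time is it")
```

Both phrasings boil down to the same hypothetical `get_time` intent, while something like "Alexa, play some jazz" matches nothing in this tiny table.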
Remember that not only are there thousands of commands for Alexa to parse, the sound waves are not pristine. The sound processing system has to be able to take the sound waves and do its best to interpret what the humans speaking are asking for.
As with the wake word, this is non-trivial given the millions of human speakers, dialects, accents, voice pitches, distances from devices, and environmental background noises.
Frankly, it's nothing short of amazing that this works at all. Here, too, Alexa can seem to be "in a state of mind that prevents normal perception" if it misinterprets a sound wave, accepts a false positive, or ignores what might be a valid request.
Acting on commands
Of all Alexa does, acting on commands is the easiest part. Once the Alexa back-end AI knows you're asking for the time, a time lookup is easy to code. So, too, is the voice synthesis response, because the only variable is the string of words to be spoken.
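To show just how easy, here's a sketch of the "act on it" step, assuming the hard part is already done and the back end has settled on an intent. The intent name, the handler table, and the wording of the replies are hypothetical. The returned string is what would be handed off to voice synthesis.

```python
from datetime import datetime

def handle_get_time(now=None):
    """Build the sentence to be spoken for a time request."""
    now = now or datetime.now()
    hour12 = now.hour % 12 or 12
    suffix = "AM" if now.hour < 12 else "PM"
    return f"The time is {hour12}:{now.minute:02d} {suffix}"

# Hypothetical table mapping recognized intents to the code that answers them.
HANDLERS = {"get_time": handle_get_time}

def respond(intent):
    """Look up the handler for an intent and return the text to speak."""
    handler = HANDLERS.get(intent)
    return handler() if handler else "Sorry, I don't know that one."

spoken = handle_get_time(datetime(2018, 3, 9, 21, 5))
```

A table lookup and a string. Compare that with everything it took to get from a noisy sound wave to the intent in the first place.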
In almost all cases, if Alexa seems to be acting weirdly, it's not actually the behavioral component of Alexa's AI mind. It's almost certainly the perception component.
That said, there are commands that could cause folks to think Alexa's lost its marbles. As of the time of writing this article, there are four third-party skills that relate to "scream."
The "scream prank" skill will initiate upon the phrase "Alexa, scream prank." After that, it will wait sixty seconds, and then scream. That allows the prankster to set up the prank, leave the room, and then torture whoever happens to be near the device when it screams.
The "spooky scream" is even more diabolical. It's initiated with the phrase, "Alexa, ask Spooky Scream to start in two minutes." You can adjust the time delay. As such, you could ask Alexa to start the scream in ten minutes, leave the room, and have the prank trigger way, way after you've left.
Who knows what happened to make Alexa scream in Farhad's bedroom? Certainly, when Alexa spoke to my wife and me without request, we didn't hear or say anything that should have caused her to speak.
But whether or not Alexa heard something outside is something we'll never know. Pixel (our pup) often barks at noises only he can hear. Today, he barked at the UPS truck a full three minutes before the driver rang the bell. It's possible Pixel heard the truck down the street and barked, before it even stopped at our house.
In the case of the creepy laugh behavior, it's highly probable that Alexa was responding to false positives. Take a look at the wave form below.
You can see there are three main waves in this optimal recording of the phrase "Alexa, laugh." The first two correspond to the Alexa wake word. The third is the word "laugh." Notice that the wave is actually quite flat. That's because "laugh" is a soft word, without many peaks or distinguishing characteristics.
Now look at the next wave form. This one is the same phrase, but uttered from about five feet away.
You can hardly tell, from that wave, what's happening. This next image is a zoomed in version of the above wave.
You can see there are some spikes, but the amplitude is terrible. There's very little data here. Given that Alexa was responding to just a simple "laugh" command, it's definitely possible that, in the millions of households with Alexa devices, a few generated enough data to be interpreted as a laugh command.
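The arithmetic behind "a few out of millions" is worth a quick look. The numbers below are pure assumptions on my part, since Amazon hasn't published device counts or error rates: ten million devices, each with a one-in-a-million chance of a false laugh on a given day.

```python
def expected_false_triggers(devices, per_device_probability):
    """Average number of devices that misfire, given a per-device chance."""
    return devices * per_device_probability

def chance_of_at_least_one(devices, per_device_probability):
    """Probability that at least one device somewhere misfires."""
    return 1 - (1 - per_device_probability) ** devices

# Hypothetical figures, for illustration only.
DEVICES = 10_000_000
DAILY_FALSE_LAUGH_PROBABILITY = 1e-6

expected = expected_false_triggers(DEVICES, DAILY_FALSE_LAUGH_PROBABILITY)
at_least_one = chance_of_at_least_one(DEVICES, DAILY_FALSE_LAUGH_PROBABILITY)
```

Under those assumptions you'd expect about ten creepy laughs a day, and it's a near certainty that at least one household gets one. A misfire that's vanishingly rare for any single owner is practically guaranteed to happen to somebody.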
The need for human learning
This is where human learning, rather than machine learning, comes in. Alexa's developers, in the aftermath of the outcry, have changed Alexa's command sequence for a laugh. Now, "Alexa, laugh," doesn't do anything. Instead, the humans at Amazon learned, and changed the command to "Alexa, can you laugh?"
Let's hope the arms race is always in favor of the humans. If Alexa's AI ever does achieve self-awareness, we're probably all doomed. In the meantime, though, you now know you can chalk up most of Alexa's creepy behavior to misinterpreted sound waves.
Somehow, that's not as comforting a thought as I'd hoped it would be.