Amazon Echo: The four hard problems Amazon had to solve to make it work

Talking to Alexa might be easy, but there's a huge amount of complexity inside Amazon's smart home assistant.
Written by Steve Ranger, Global News Director

Amazon's aim is to have Alexa indistinguishable from a human voice.

Dave Limp, Amazon's SVP of device and services business, is standing in front of an image of the bridge of the starship Enterprise, explaining the inspiration for Amazon's surprise hit Echo device.

"A lot of people wondered what was the inspiration for this vision. It really was this, this cultural icon. It started here with the tap on the lapel to talk to the computer. And later in the Star Trek series you could be anywhere on the starship Enterprise and you could talk to the computer and she would respond quickly with an answer," he says.

The Amazon Echo, a cylindrical, voice-controlled speaker, has been something of a sleeper hit in the US, selling around three million units since it was launched last year and winning some rave reviews along the way. "We wanted to build a computer in the cloud that was completely controlled by your voice," Limp says.

The Echo is activated by someone saying Alexa (or Amazon, or Echo), at which point it begins streaming spoken requests to the cloud, where they are analysed using neural network technology to generate the right response to questions such as "Alexa, will it rain tomorrow?" or "Alexa, how is traffic?", among many others.

It can play music from streaming services such as Amazon Music and Spotify, or play audio books from Audible. The Echo has evolved into a digital home hub and allows users to control - by voice - things like lights, switches, and thermostats, while other companies can also offer 'skills' - like apps - to connect to their own services. As such, the Echo and Alexa have become Amazon's counter to Siri and Google Now.

But while using your voice is a simple way to control a device, building the hardware and software to make that possible involved solving some major problems, says Limp.

"When we started developing the product it turns out that you discover a large number of hard problems. It's often true that when you have a very simple interface ... underneath the covers are a large number of hard problems needing to be solved."

Limp identified four hard problems that the team solved before they could deliver Echo:

1. Far field voice recognition

Voice recognition has been around for decades, but mostly it has been based on near-field recognition, where the microphone is close to the user's mouth, which means a clear signal and less ambient noise. Amazon wanted to design a device which could function in an everyday family kitchen - a much noisier scenario.

Core to solving this is the seven-microphone array in the body of the device, which uses beamforming to identify the microphone closest to the voice and amplify that one while suppressing the others. And when music is playing - a common use of the Echo - the device uses a machine-learning-driven 'echo canceller' to make it easier for the device to hear human voices.
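As a rough illustration of the selection step described above, the sketch below picks the channel with the highest signal energy, keeps it at full gain, and attenuates the rest. It is a toy model, not Amazon's implementation: production beamformers typically combine time-aligned signals from all microphones rather than choosing one, and the function names and the 0.1 suppression gain are assumptions made purely for this example.

```python
from typing import List, Sequence


def channel_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one microphone channel."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def loudest_channel(channels: List[Sequence[float]]) -> int:
    """Index of the channel with the highest energy (hypothetical helper)."""
    if not channels:
        raise ValueError("need at least one channel")
    return max(range(len(channels)), key=lambda i: channel_energy(channels[i]))


def select_and_mix(channels: List[Sequence[float]],
                   suppress_gain: float = 0.1) -> List[float]:
    """Keep the loudest channel at full gain and attenuate the others.

    suppress_gain is an assumed value chosen for illustration only.
    """
    best = loudest_channel(channels)
    length = min(len(ch) for ch in channels)
    mixed = []
    for t in range(length):
        total = 0.0
        for i, ch in enumerate(channels):
            gain = 1.0 if i == best else suppress_gain
            total += gain * ch[t]
        mixed.append(total)
    return mixed
```

In this sketch, energy stands in for "closest to the voice": a speaker near one microphone usually produces a stronger signal there, although loud background noise near another microphone would fool such a simple rule.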

2. Natural language understanding

"The first stage of any voice recognition system is taking that sound file we send up and turning it into text. That's a reasonably solved problem. The hard problem that's been vexing computer scientists for decades is understanding the context of what you say, parsing the words," said Limp.

The service needs to understand what is being said and disambiguate it so that users get the right answer as quickly as possible. Limp pointed to recent advances in machine learning and deep neural networks as providing the breakthrough. "We still see dramatic improvements month on month in accuracy and the amount [Alexa] can learn," he said.

3. Privacy

"With a product that has microphones that are sitting in your home you can't think about privacy as an afterthought. It doesn't work; it has to be built into the foundation of the product itself," said Limp.

The Echo is always listening for its 'wake word', at which point it will start streaming words to the cloud to be analysed and responded to, until the blue light goes out. That data is also used to make the service work better and to help it understand your particular voice better. Some may find that having such a device, always listening in their home, is a step too far. However, Amazon notes that customers can delete any individual utterance or question - or everything they have said - while the mute button cuts power to the microphones.
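The gating behaviour described above can be sketched as a small state machine: nothing leaves the device until a wake word is heard, and streaming stops again once the request ends. This is a simplified model working on words rather than audio; the function name, the `<silence>` end marker, and the choice not to forward the wake word itself are assumptions for illustration, not details confirmed by Amazon.

```python
from typing import Iterable, List

# Wake words named in the article.
WAKE_WORDS = {"alexa", "amazon", "echo"}


def words_to_stream(heard: Iterable[str],
                    end_marker: str = "<silence>",
                    muted: bool = False) -> List[str]:
    """Return only the words that would be sent to the cloud (toy model).

    end_marker is a hypothetical token standing in for the end of a request,
    i.e. the moment the blue light goes out.
    """
    if muted:
        return []  # mute cuts power to the microphones, so nothing is heard
    streamed: List[str] = []
    streaming = False
    for word in heard:
        if not streaming:
            # Idle: discard everything except a wake word.
            if word.lower().strip(",.?!") in WAKE_WORDS:
                streaming = True
            continue
        if word == end_marker:
            streaming = False  # request finished, go back to idle
            continue
        streamed.append(word)
    return streamed
```

For example, given the words "private chat Alexa, will it rain tomorrow?" followed by the end marker and more conversation, only "will it rain tomorrow?" would be forwarded in this model.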

4. Text to speech

The industry has often forgotten about one of the most important bits, which is the output, said Limp. That is, the voice of Alexa, which is a machine-learning-driven voice, regularly updated by Amazon.

"It turns out that with speech-based interfaces, the cadence, the intonation, the rhythm of how I'm talking matters a lot. It makes it feel more natural, it gives the service and device a personality, and a bunch of hard problems go into figuring that out," he said. What customers want is a smooth, human-sounding voice, he said - Amazon's aim is to have Alexa indistinguishable from a human voice.
