I've been having a lot of fun with the "Ok Google" feature of my Android phone.
In fact, I've been adding new commands to it (I'll describe how that's done in future articles). But once I enabled "Ok Google" on my laptop, and my wife enabled it on her Android phone, and then I went out to dinner with some friends who also used the feature, it became clear that "Ok Google" would not be able to scale in real-life situations.
The problem is simple: Because everyone's device answers to "Ok Google" and because there's no voice training required, whenever anyone in range says "Ok Google," all of the electronics decide to connect to the great brain in the sky and start listening.
Worse, an instruction or question one person might give can be interpreted and acted upon by all the devices within listening distance.
It could be a Monty Python skit, it's so ridiculous.
But how serious is this as a problem? In part, that depends on whether or not you think voice commands will become a key aspect of how we interact with our electronic partners.
When I posed this topic to the ZDNet editorial team, my colleague Steven Vaughan-Nichols commented, "I've long wondered how well voice would scale period in a business. I always thought it would prove to be a real problem. I suspect we've finally reached a point where voice-recognition is mainstream enough that we're going to start seeing its problems."
He's right, as I'm about to prove.
I decided to go about this semi-scientifically. I started with an open-ended research question: "Given the ability of devices to respond to the spoken phrase 'Ok Google,' could there actually be a potential problem in terms of cross-contamination of vocal commands?"
See? That's right. I'm a real, live scientist 'n stuff.
I set up an experiment to measure how far away from a device (in this case, my Samsung Galaxy S4) a person could be and still trigger "Ok Google." To make sure the "Ok Google" request was consistent, I recorded it into another device (my iPhone) and played it back at exactly 60 decibels (dB), which is the sound level of normal conversation.
I measured the distance from the sound source (the iPhone saying "Ok Google") to the listening device (my Galaxy S4) using a laser distance meter, the Bosch DLR130K Digital Distance Measurer. I've previously described this as my "magic measuring machine" because it uses lasers rather than sonics to return a near-exact distance measurement.
I then tested the following four environments: a quiet, sound-reflective hall; a quiet, sound-absorbing bedroom; a simulated office; and a simulated active restaurant.
To test the hall and the office, I simply placed the phone as far away as possible, played the "Ok Google" recording, and measured the maximum distance at which "Ok Google" responded.
In the case of the bedroom, I ran out of room without seeing any failure in "Ok Google" response. Ambient sound level in the bedroom was 27 dB and "Ok Google" responded from a distance of 11.3 feet, which was as far away from the phone as I was able to get.
In the case of the hall, not only did "Ok Google" respond at the end of the hall, but I then put the phone at the far end of the room that's at the end of the hall and "Ok Google" still responded. Ambient sound level in the hall was 28 dB and "Ok Google" responded from a distance of 25.22 feet.
Because local offices and restaurants would have been less than thrilled if I barged in and started "Ok Googling" at everyone, I needed to generate a simulated soundscape that emulated the hustle and bustle of an office or a restaurant.
I turned to the very cool Web site, Coffitivity, for help. This site plays ambient sounds of people gathering, ranging from a bistro to a morning coffee shop. I chose University Undertones, because I thought the "scholarly sounds of a campus cafe" would fit perfectly for our test. I then cranked up my entertainment center until it reached the decibel levels that matched the environments I was testing.
According to "Occupational noise exposure and control," an occupational health and safety information page from Australia's Monash University, the typical office environment generates about 50-60 dB of ambient sound.
When I set my simulated sound environment to 55 dB, I was able to trigger "Ok Google" from 17.95 feet away, which again was the point at which I ran out of room in the room.
Finally, I decided to try simulating a relatively busy restaurant. I found a great article in the LA Times by Betty Hallock, who went into some popular restaurants and measured the noise. She reported dB levels ranging from about 80 to 90 dB.
I again set my sound level to the middle of the range, this time 85 dB. Here, finally, "Ok Google" started to have some problems. Note that 85 dB is roughly the sound level of a noisy lawn mower, and I kept my "Ok Google" source sound at the constant 60 dB I used throughout the test.
In this environment, which is noisier than a factory, "Ok Google" did not respond until the sound source was within 2.94 feet of the phone. Even so, that's an important fact, because if you measure the distance between the mouths of two people sitting side-by-side at a table, you'll generally find it is just about three feet.
In other words, even in an environment as loud as a poorly maintained lawn mower, "Ok Google" on your phone will respond when the person sitting next to you tries to talk to his or her phone.
The following chart summarizes my findings.
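The numbers reported above can also be collected in a short Python sketch. The field names are illustrative labels, not anything from the test equipment. The "margin" column is simply the 60 dB playback level minus the ambient level, so it ignores how the source sound falls off with distance. The last flag records only whether the text above reports a failure beyond the listed distance.

```python
SOURCE_DB = 60.0          # playback level of the recorded "Ok Google"
FEET_TO_METERS = 0.3048   # exact conversion factor

# (environment, ambient dB, farthest triggering distance in feet,
#  whether a failure beyond that distance is reported in the article)
RESULTS = [
    ("bedroom", 27.0, 11.3, False),
    ("hall", 28.0, 25.22, False),
    ("simulated office", 55.0, 17.95, False),
    ("simulated restaurant", 85.0, 2.94, True),
]


def summarize(results=RESULTS, source_db=SOURCE_DB):
    """Return one dict per environment with derived columns added."""
    rows = []
    for name, ambient_db, feet, failure_reported in results:
        rows.append({
            "environment": name,
            "ambient_db": ambient_db,
            "distance_ft": feet,
            "distance_m": round(feet * FEET_TO_METERS, 2),
            # Positive: the command is louder than the room; negative: quieter.
            "margin_db": source_db - ambient_db,
            "failure_reported": failure_reported,
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(
            f"{row['environment']:<22}"
            f"{row['ambient_db']:>6.0f} dB ambient  "
            f"{row['distance_ft']:>6.2f} ft  "
            f"margin {row['margin_db']:+.0f} dB"
        )
```

Only the restaurant row has a negative margin, meaning the command was played 25 dB below the background noise, and that is the one case where the trigger distance was limited by the phone rather than by the size of the room.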
Here's the bottom line. If "Ok Google" and voice command remain a novelty, this is not a problem. But, as Steven said, if voice command becomes a mainstream tool, then having one catchphrase for every "Ok Google" user will simply not do.
There will need to be some form of personalization so your phone and only your phone (or PC or tablet) responds when spoken to. In addition, if you own different devices, each device has to know when it's being spoken to, so everything doesn't trigger at once.
This could be a substantially more difficult problem to solve than basic voice recognition. That's because you want the system to be discerning enough to recognize your voice and catchphrase and not trigger a false positive on someone else saying the same thing.
At the same time, you need the system to be reliable enough to always respond to you and not mistakenly ignore your command because it wasn't issued perfectly enough to trigger the sound validation elements in the "Ok Google" system.
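The tension between those two requirements can be illustrated with a toy threshold model in Python. Everything in it is hypothetical: the similarity scores are invented for illustration, and nothing here reflects how "Ok Google" actually validates a voice. The sketch only shows that a stricter match threshold cuts down on other people triggering your phone while making the phone more likely to ignore you.

```python
def error_rates(owner_scores, other_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) for a match threshold.

    A score at or above the threshold counts as "this is the owner."
    """
    false_rejects = sum(1 for s in owner_scores if s < threshold)
    false_accepts = sum(1 for s in other_scores if s >= threshold)
    return (
        false_rejects / len(owner_scores),
        false_accepts / len(other_scores),
    )


# Invented similarity scores (0 = no match, 1 = perfect match).
OWNER_SCORES = [0.91, 0.84, 0.77, 0.69, 0.95, 0.58, 0.88, 0.73]
OTHER_SCORES = [0.32, 0.45, 0.61, 0.28, 0.72, 0.50, 0.39, 0.66]

if __name__ == "__main__":
    for threshold in (0.4, 0.6, 0.8):
        frr, far = error_rates(OWNER_SCORES, OTHER_SCORES, threshold)
        print(
            f"threshold={threshold:.1f}  "
            f"owner ignored={frr:.1%}  "
            f"stranger accepted={far:.1%}"
        )
```

With these made-up numbers, no single threshold drives both error rates to zero, which is the trade-off described above.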
That's some very challenging computer science, right there. But if Google wants to see "Ok Google" scale and not get discarded just when it starts getting interesting because everyone's phone is fighting "Ok Google" battles with everyone else's phone, the company had better get to solving this problem.
With the growth rate of smartphones and the next upgrade cycle nearly upon us, I estimate that Google has a year, at most, to find a solution to the "Ok Google" catchphrase cross-contamination we're all about to experience.