Vista Speech Command exposes remote exploit
Summary: Vista speech command system allows remote exploitation because sound files played from a web browser or any other audio player can interact with the OS. Users should turn off Vista speech command until a patch is available.
[Update 1/31/2007 - Microsoft confirms] Sebastian Krahmer started a discussion on the Dailydave security mailing list about the potential for exploiting Vista's speech recognition feature by hosting malicious sound files on a website that would play back a series of audio commands to try to subvert the operating system. Krahmer didn't actually test any of these theories, but he raised an interesting concern about the safety of Vista's speech command system. I followed up and came up with the actual tests to prove the first Vista remote exploit.
I initially responded to the list explaining that an operating system should filter out the sounds it picks up on the microphone to avoid a nasty feedback problem, but it's still possible for the mic to pick up enough of the voice to trigger a command. Someone else responded that Apple tried similar functionality 15 years ago and quickly realized that they had to guard the feature with a keyword that needed to be spoken because people were playing gags with the "shutdown" command. But I have used speech command and realized that Vista only requires a static command, so I proceeded to test these theories myself.
I recorded a sound file that would engage speech command on Vista, then engaged the start button, and then I asked for the command prompt. When I played back the sound file with the speakers turned up loud, it actually engaged the speech command system and fired up the start menu. I had to try a few more times to get the audio recording quality high enough to get the exact commands I wanted but the shocking thing is that it worked! Anyone who has ever visited MySpace knows how many annoying webpages out there will start blasting loud MP3 music as soon as you enter the page. [Update 4:17PM - Someone asked me how loud I had the speakers. To my surprise, not very loud at all and I was shocked at how well it worked. I didn't even believe it would work at the loudest setting let alone at a moderate sound level.]
There are some mitigating factors but there is no doubt this is still a serious exploit. Most people won't have Vista speech commands configured and enabled but if they do, the speech command control console will automatically load with the operating system and park itself on the top of the desktop waiting for audio commands. The other mitigating factor is that if you visit a webpage and it starts barking out slow and loud Vista speech commands, it will be rather obvious to most people that something is very wrong. But it's still possible that a webpage might delay the sound playback and hope that the user is not around to stop the exploit. Another mitigating factor is that the Vista command prompt doesn't seem to take any speech commands at all, but that doesn't prevent a remote hacker from interacting with your OS in an unauthorized manner.
My recommendation is that Vista users disable the speech command feature from automatically starting up in Vista and only use it in a supervised manner until there is a patch for this. For a long-term fix, Vista speech commands should completely filter out any sound coming out of the computer system to prevent unauthorized speech commands coming from malicious sound files. Microsoft should at least implement a short-term fix by letting the user set a unique pass phrase or series of numbers to activate speech commands rather than allowing a fixed phrase to activate the system.
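To make the short-term fix concrete, here is a minimal sketch of a pass-phrase gate. It assumes a hypothetical recognizer that hands over each utterance as plain text; `ActivationGate`, `handle_utterance`, and the "stop listening" phrase are illustrative names, not part of any Vista or Microsoft API.

```python
class ActivationGate:
    """Hypothetical gate: ignore every command until the user's own pass phrase is heard."""

    def __init__(self, pass_phrase: str):
        self._pass_phrase = self._normalize(pass_phrase)
        self.listening = False

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase and collapse whitespace so minor transcription
        # differences do not matter.
        return " ".join(text.lower().split())

    def handle_utterance(self, text: str):
        """Return the command to run, or None if the utterance is ignored."""
        spoken = self._normalize(text)
        if not self.listening:
            # Only the user-chosen phrase wakes the system; a fixed,
            # publicly known phrase would not.
            if spoken == self._pass_phrase:
                self.listening = True
            return None
        if spoken == "stop listening":  # assumed deactivation phrase
            self.listening = False
            return None
        return spoken
```

A sound file from a website would have to contain the user's own phrase to wake the system, which an attacker normally cannot know in advance. This only raises the bar: once the system is awake, played-back audio is accepted just like live speech, so it is no substitute for filtering the computer's own output.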
[Update 4:55 PM - Someone (who shall remain unnamed until they give me permission to name them) emailed me to argue that this isn't a remote exploit, that I was being "ludicrous," and that this can't bypass UAC. Well, I never claimed this would bypass UAC and secure desktop, nor do I think it needs to in order to do some serious damage. The fact that a website can play a sound file at a moderate level, activate an idle speech command system, and delete user documents with zero user interaction is serious by any stretch of the imagination.]
[Update 2/1/2007] Disagreement over impact of Vista’s analog hole
Talkback
Fooling speech systems is nothing new.
*NOTE: If you never watched the movie [i]Sneakers,[/i] stop reading this, head for your local video store, rent it, and watch it several times.*
You might want to sweep your office for bugs if you plan to use speech command.
New or not, this shouldn't be exploitable
a. too obvious.
b. feedback filtering should prevent it.
Turns out that Vista speech commands are exploitable.
Obvious?
[i]a. too obvious.[/i]
Remember, this is the company that "secured" MSPassport by putting the user's password in the plaintext URL and "hid" the user's profile by refusing to give a link to an otherwise wide-open location.
Microsoft isn't big on review -- the staff there are, after all, the most brilliant that have ever lived on Earth. Mistakes and oversights Just Don't Happen, so there's no need for review.
[i]b. feedback filtering should prevent it.[/i]
Feedback filtering (echo cancellation) is generally applied to the input path (keep the microphone out of the speakers) because that's sufficient to kill the loop gain and [b]much[/b] less DSP-intensive than trying to allow for all of the multiple acoustic paths (each with a different delay) between the speakers and the microphone.
No fault to MS here -- trying to do a full cancellation on the audio return path would suck up most of your Core 2 Duo Extreme's processing power and overheat your laptop in a (pardon the expression) flash.
And you think they're the only ones making stupid mistakes?
"No fault to MS here -- trying to do a full cancellation on the audio return path would suck up most of your Core 2 Duo Extreme's processing power and overheat your laptop in a (pardon the expression) flash"
Windows Messenger does extremely good feedback cancellation with very low CPU on a slow processor for voice chat. Don't know what you're talking about.
The point is...
There are, of course, rational reasons for holding MS and other proprietary developers to a higher standard on this than the Free Software types: MS can afford to pay people to test their software (that's supposedly one of the reasons why MS can expect people to pay for the privilege of using MS software); free developers generally have to rely on themselves and end users.
it's not bad programming...
Lifecycle
Actually, it's bad specification. The usual lifecycle rule is that the cost of a mistake increases by an order of magnitude at each stage of the process:
.../specification/design/code/testing/deployment/use/...
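As a rough illustration of that rule of thumb, the sketch below multiplies the cost by ten for every stage a mistake survives. The stage list is taken from the line above; the function name and the default factor of 10 are assumptions for the example, not measured data.

```python
STAGES = ["specification", "design", "code", "testing", "deployment", "use"]


def relative_fix_cost(introduced: str, found: str, factor: int = 10) -> int:
    """Relative cost of fixing a mistake, growing by `factor` per stage survived."""
    start = STAGES.index(introduced)
    end = STAGES.index(found)
    if end < start:
        raise ValueError("a mistake cannot be found before it is introduced")
    return factor ** (end - start)
```

Under this assumption, a specification mistake caught in testing costs `relative_fix_cost("specification", "testing")`, or 1000 times what it would have cost to fix during specification.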
...and inadequate testing.
Far from it
Touchy, touchy. If you notice, I didn't do any OS comparisons.
[i]This is a nasty mistake, it happens.[/i]
Yup. I find it interesting, however, that [b]you[/b] started off telling us that it's inexcusable -- right up until we agree with you, at which point you go defensive.
[i]Look at the 9 critical flaws in Mozilla in FF2 within 2 months. Let's put things into context here.[/i]
Yup, nobody's perfect. Never mind, Microsoft didn't do anything that everyone else does so it's all right.
Tell me again why you wrote this column?
[i]Windows Messenger does extremely good feedback cancellation with very low CPU on a slow processor for voice chat. Don't know what you're talking about.[/i]
George, it's a difference in phase dispersion. Keeping the input stream out of the speakers is easy because the path from microphone to speaker is electronic and has a [b]very[/b] short delay -- and only one. Problem solved, and that's what you're seeing with MSWinMessenger.
Keeping speaker output out of the input stream, on the other hand, requires that you account for [b]all[/b] of the major echo paths. First, the acoustic path from speaker to microphone takes a lot longer than the mic-to-speaker path, which means a lot more "bookkeeping" if nothing else. Then there's the problem that it's through different materials: your laptop case passes sound faster than air. In air, there are multiple paths too -- including bouncing off of every object in the room. Each of them has a different path length. You'd have to account for at least the major ones.
It's called multipath cancellation, and it's [b]not[/b] a simple problem in adaptive signal processing. With a digital signal processor that has dedicated hardware for doing that kind of thing, it's manageable. With a general-purpose processor (and don't kid yourself about the SSE instructions) it's a beast that eats cycles. [b]Lots[/b] of cycles.
You could probably do it, although I suspect that Microsoft would have had to delay release by quite a bit in the process. However, it would mean that your CPU would never get a break -- and getting a break is what keeps your power manageable.
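For readers who want to see what adaptive cancellation of speaker output looks like, here is a minimal sketch using a normalized LMS (NLMS) filter, one common textbook technique and not necessarily what any shipping product uses. The three-path "room", the tap count, and the step size are made-up values for the simulation. The filter needs one tap per sample of echo delay it must cover, so the work per sample grows linearly with the length of the echo tail.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=32, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the speaker echo from the mic signal."""
    weights = [0.0] * taps   # estimated echo path, one weight per sample of delay
    history = [0.0] * taps   # most recent speaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # what is left after cancellation
        norm = sum(h * h for h in history) + eps   # normalize step by input energy
        step = mu * e / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


def mean_power(samples):
    return sum(v * v for v in samples) / len(samples)


def simulate(n=4000, seed=0):
    """Return (mic power, residual power) over the last 500 samples."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Hypothetical room: three echo paths as {delay in samples: gain}.
    paths = {3: 0.6, 11: 0.3, 25: -0.2}
    mic = [sum(g * far_end[i - d] for d, g in paths.items() if i >= d)
           for i in range(n)]
    residual, _ = nlms_echo_cancel(far_end, mic)
    return mean_power(mic[-500:]), mean_power(residual[-500:])
```

Calling `simulate()` compares the echo power at the microphone with what remains after the filter has adapted. The simulation is idealized: the microphone hears nothing but the echo, the paths never change, and the longest delay fits inside the filter. A real room with a live talker and longer reverberation is a harder case.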
Yagotta B. Right
The multidimensional nature of acoustic echoes requires a great deal of parameter space to capture the frequency response function of each echo source. These functions include a bulk time delay but also a frequency spectrum with phase, as different materials or surfaces have different frequency and phase profiles. Add to this that the problem is not a discrete set of "echo" sources but a continuum of them, and the amount of processing power required is staggering.
Keep in mind that the echo cancellation problem has a metric for how well the system nulls these echoes. A 5 to 7 dB rejection is probably adequate for a human to ignore the noise (I have no idea how sensitive the "recognition" block is). But I am certain that the voice recognition system in any OS has an AGC (automatic gain control) block that precedes the recognition block and is good for at least 10 dB. That means the echo cancellation system must achieve something like 15 to 17 dB just to overcome the AGC block and give cancellation as good as human perception would notice.
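The arithmetic behind that budget can be sketched as follows. Decibel figures add, so the perceptual target and the AGC range simply sum; the conversion to a linear ratio assumes the figures are power ratios. The function names are illustrative.

```python
def required_rejection_db(perceptual_db: float, agc_db: float) -> float:
    """Total echo rejection needed: perceptual target plus what the AGC can win back."""
    return perceptual_db + agc_db


def db_to_power_ratio(db: float) -> float:
    """Convert a decibel figure to a linear power ratio (assumes power, not amplitude)."""
    return 10 ** (db / 10)
```

With the numbers above, `required_rejection_db(5, 10)` and `required_rejection_db(7, 10)` give 15 and 17 dB, which means cutting the echo power by a factor of roughly 32 to 50.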
Commands versus acoustics
To ignore sound from the speakers is probably horribly difficult. To ignore *commands* from the speakers should be pretty easy, and would be a good thing for all speech recognition products to do. My Dragon Dictate often interprets as text the lyrics of songs being played by my system. Be nice if it would ignore anything it produced.
don't get it
Am I missing something? So I don't think any special echo cancelling-eating-CPU is required
Echo cancellation is NOT hard
my point is...
Please see above
Non-Techie Approach
since we are at it
I mean, the point here is to benefit from voice command and a state-of-the-art next-generation multimedia platform experience (TM), without having to worry about when to activate it or deactivate it due to a security hole. If you are gonna click somewhere to activate it, you might as well just press the keyboard shortcut for the same action and get rid of voice commands. Turn off speakers? What about people who like to have system notifications active (like when a new email comes)?
The way I see it, your proposal is not a solution to the problem, it is just a temporary and inconvenient workaround.
No offence intended
Yes, the hole needs to be plugged. In the meantime, turn the speakers off while VC is active, and don't leave VC active unless you are actively using it at that point in time. Not a very elegant or techie temporary workaround, but foolproof.
Ah, but
Seriously.
Brilliant Ideas that were conceived, examined, and discarded in the 60s routinely turn up again (apparently without any awareness of the reasons they were rejected then) in MS products. CS professors get no end of amusement from observing the phenomenon. (It's that, or blow their cool totally. The ones who survive laugh.)
Not just MS