Vista Speech Command exposes remote exploit

Summary: The Vista speech command system allows remote exploitation because sound files played from a web browser or any other audio player can interact with the OS. Users should turn off Vista speech command until a patch is available.

TOPICS: Windows, Microsoft

[Update 1/31/2007 - Microsoft confirms]

Sebastian Krahmer, on the Dailydave security mailing list, started a discussion about the potential for exploiting Vista's speech recognition feature by hosting malicious sound files on a website that would play back a series of audio commands to try to subvert the operating system.  Krahmer didn't actually test any of these theories, but he raised an interesting concern about the safety of Vista's speech command system, so I followed up with actual tests to prove the first Vista remote exploit.

I initially responded to the list explaining that an operating system should filter out the sounds it picks up on the microphone to avoid a nasty feedback problem, but that it's still possible for the mic to pick up enough of the voice for a command to run.  Someone else responded that Apple tried similar functionality 15 years ago and quickly realized that they had to guard the feature with a keyword that needed to be spoken, because people were playing gags with the "shutdown" command.  But I have used speech command and knew that Vista only requires a static command, so I proceeded to check these theories with an actual test.

I recorded a sound file that would engage speech command on Vista, then engaged the start button, and then I asked for the command prompt.  When I played back the sound file with the speakers turned up loud, it actually engaged the speech command system and fired up the start menu.  I had to try a few more times to get the audio recording quality high enough to get the exact commands I wanted, but the shocking thing is that it worked!  Anyone who has ever visited MySpace knows how many annoying webpages out there will start blasting loud MP3 music as soon as you enter the page.  [Update 4:17PM - Someone asked me how loud I had the speakers.  To my surprise, not very loud at all, and I was shocked at how well it worked.  I didn't even believe it would work at the loudest setting, let alone at a moderate sound level.]

There are some mitigating factors but there is no doubt this is still a serious exploit.  Most people won't have Vista speech commands configured and enabled but if they do, the speech command control console will automatically load with the operating system and park itself on the top of the desktop waiting for audio commands.  The other mitigating factor is that if you visit a webpage and it starts barking out slow and loud Vista speech commands, it will be rather obvious to most people that something is very wrong.  But it's still possible that a webpage might delay the sound playback and hope that the user is not around to stop the exploit.  Another mitigating factor is that the Vista command prompt doesn't seem to take any speech commands at all, but that doesn't prevent a remote hacker from interacting with your OS in an unauthorized manner.

My recommendation is that Vista users disable the speech command feature from automatically starting up in Vista and only use it in a supervised manner until there is a patch for this.  As a long term fix, Vista speech commands should completely filter out any sound coming out of the computer system to prevent unauthorized speech commands coming from malicious sound files.  Microsoft should at least implement a short term fix by letting the user set a unique pass phrase or series of numbers to activate speech commands rather than allowing a fixed phrase to activate the system.
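The short term fix described above can be sketched roughly as follows. This is only an illustration, not how Vista works: `SpeechGate`, `hear`, and the example phrases are hypothetical names, and the sketch assumes a recognizer that has already turned audio into text.

```python
import hashlib
import hmac


def normalize(phrase):
    """Lower-case and collapse whitespace so spacing and case don't matter."""
    return " ".join(phrase.lower().split())


class SpeechGate:
    """Hypothetical gate: stays idle until the user's own pass phrase is heard."""

    def __init__(self, passphrase):
        # Keep only a digest of the user-chosen phrase, not the phrase itself.
        self._digest = hashlib.sha256(normalize(passphrase).encode("utf-8")).digest()
        self.listening = False

    def hear(self, recognized_text):
        """Return a command to act on, or None if the text is ignored."""
        if not self.listening:
            candidate = hashlib.sha256(normalize(recognized_text).encode("utf-8")).digest()
            if hmac.compare_digest(candidate, self._digest):
                self.listening = True  # the pass phrase itself is not a command
            return None
        return normalize(recognized_text)


gate = SpeechGate("purple walrus 4729")   # example phrase chosen by the user
first = gate.hear("start listening")      # a fixed, well-known phrase: ignored
second = gate.hear("Purple  Walrus 4729") # unlocks, but returns no command
third = gate.hear("Open Command Prompt")  # now treated as a command
```

A sound file on a website cannot know the phrase in advance, which is the point of the suggestion; it does nothing against someone who has overheard or recorded the phrase.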

[Update 4:55 PM - Someone (who shall remain unnamed until they give me permission to name them) emailed me to say that this isn't a remote exploit, that I was being "ludicrous", and that this can't bypass UAC.  Well, I never claimed this would bypass UAC and secure desktop, nor do I think it needs to in order to do some serious damage.  The fact that a website can play a sound file at a moderate level, activate an idle speech command system, and delete user documents with zero user interaction is serious by any stretch of the imagination.]

[Update 2/1/2007] Disagreement over impact of Vista’s analog hole

  • Fooling speech systems is nothing new.

    [i]My voice is my passport. Verify me.[/i]

    *NOTE: If you never watched the movie [i]Sneakers,[/i] stop reading this, head for your local video store, rent it, and watch it several times.*

    You might want to sweep your office for bugs if you plan to use speech command.
    Mr. Roboto
    • New or not, this shouldn't be exploitable

      I had figured that this would not be exploitable because:

      a. too obvious.
      b. feedback filtering should prevent it.

      Turns out that Vista speech commands are exploitable.
      • Obvious?

        [i]I had figured that this would not be exploitable because:

        a. too obvious.[/i]

        Remember, this is the company that "secured" MSPassport by putting the user's password in the plaintext URL and "hid" the user's profile by refusing to give a link to an otherwise wide-open location.

        Microsoft isn't big on review -- the staff there are, after all, the most brilliant that have ever lived on Earth. Mistakes and oversights Just Don't Happen, so there's no need for review.

        [i]b. feedback filtering should prevent it.[/i]

        Feedback filtering (echo cancellation) is generally applied to the input path (keep the microphone out of the speakers) because that's sufficient to kill the loop gain and [b]much[/b] less DSP-intensive than trying to allow for all of the multiple acoustic paths (each with a different delay) between the speakers and the microphone.

        No fault to MS here -- trying to do a full cancellation on the audio return path would suck up most of your Core 2 Duo Extreme's processing power and overheat your laptop in a (pardon the expression) flash.
        Yagotta B. Kidding
        • And you think they're the only ones making stupid mistakes?

          Why do you need to turn everything in to a "my OS is better than your OS" thread? This is a nasty mistake, it happens. Look at the 9 critical flaws in Mozilla in FF2 within 2 months. Let's put things in to context here.

          "No fault to MS here -- trying to do a full cancellation on the audio return path would suck up most of your Core 2 Duo Extreme's processing power and overheat your laptop in a (pardon the expression) flash"

          Windows Messenger does extremely good feedback cancellation with very low CPU on a slow processor for voice chat. Don't know what you're talking about.
          • The point is...

            ...there's no such thing as error-free programming, no matter how brilliant your developers are. That's why good testing is critical.

            There are, of course, rational reasons for holding MS and other proprietary developers to a higher standard on this than the Free Software types: MS can afford to pay people to test their software (that's supposedly one of the reasons why MS can expect people to pay for the privilege of using MS software); free developers generally have to rely on themselves and end users.
            John L. Ries
          • it's not bad programming...

            it's bad design
            Scott W
          • Lifecycle

            [i]it's not bad programming it's bad design[/i]

            Actually, it's bad specification. The usual lifecycle rule is that the cost of a mistake increases by an order of magnitude at each stage of the process:

            Yagotta B. Kidding
          • ...and inadequate testing.

            But bad design makes the tester's job that much harder.
            John L. Ries
          • Far from it

            [i]Why do you need to turn everything in to a "my OS is better than your OS" thread?[/i]

            Touchy, touchy. If you notice, I didn't do any OS comparisons.

            [i]This is a nasty mistake, it happens.[/i]

            Yup. I find it interesting, however, that [b]you[/b] started off telling us that it's inexcusable -- right up until we agree with you, at which point you go defensive.

            [i]Look at the 9 critical flaws in Mozilla in FF2 within 2 months. Let's put things in to context here.[/i]

            Yup, nobody's perfect. Never mind, Microsoft didn't do anything that everyone else doesn't do, so it's all right.

            Tell me again why you wrote this column?

            [i]Windows Messenger does extremely good feedback cancellation with very low CPU on a slow processor for voice chat. Don't know what you're talking about.[/i]

            George, it's a difference in phase dispersion. Keeping the input stream out of the speakers is easy because the path from microphone to speaker is electronic and has a [b]very[/b] short delay -- and only one. Problem solved, and that's what you're seeing with MSWinMessenger.

            Keeping speaker output out of the input stream, on the other hand, requires that you account for [b]all[/b] of the major echo paths. First, the acoustic path from speaker to microphone takes a lot longer than the mic-to-speaker path, which means a lot more "bookkeeping" if nothing else. Then there's the problem that it's through different materials: your laptop case passes sound faster than air. In air, there are multiple paths too -- including bouncing off of every object in the room. Each of them has a different path length. You'd have to account for at least the major ones.

            It's called multipath cancellation, and it's [b]not[/b] a simple problem in adaptive signal processing. With a digital signal processor that has dedicated hardware for doing that kind of thing, it's manageable. With a general-purpose processor (and don't kid yourself about the SSE instructions) it's a beast that eats cycles. [b]Lots[/b] of cycles.

            You could probably do it, although I suspect that Microsoft would have had to delay the release by quite a bit in the process. However, it would mean that your CPU would never get a break -- and getting a break is what keeps your power manageable.
            Yagotta B. Kidding
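For readers who want to see what "adaptive" cancellation of speaker output means in practice, here is a minimal sketch using a normalized LMS (NLMS) filter, a standard textbook technique. It is not a claim about what Vista or any Microsoft product does. The two echo paths, their delays and gains, the tap count, and the step size are all made-up values for illustration, and the "room" is simulated with random noise rather than real audio.

```python
import math
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from mic.

    Each sample costs work proportional to `taps`; a longer echo tail
    needs more taps and therefore more arithmetic per sample.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent speaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        err = d - estimate  # what is left after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * err * h / norm for w, h in zip(weights, history)]
        residual.append(err)
    return residual


def power_db(samples):
    """Average power of a signal in dB (tiny floor avoids log of zero)."""
    return 10.0 * math.log10(sum(s * s for s in samples) / len(samples) + 1e-20)


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(3000)]  # speaker output
echo_paths = {2: 0.6, 5: 0.3}  # hypothetical: delay in samples -> gain
mic = [
    sum(g * far_end[n - d] for d, g in echo_paths.items() if n - d >= 0)
    for n in range(len(far_end))
]
residual = nlms_echo_cancel(far_end, mic)
# Echo reduction over the last 500 samples, after the filter has adapted.
reduction_db = power_db(mic[-500:]) - power_db(residual[-500:])
```

This toy case has two clean paths, no noise, and no one talking, so it converges easily. A real room has many more reflections and a much longer echo tail, which is where the disagreement in this thread about CPU cost comes in.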
          • Yagotta B. Right

            What YBK said is very true. My company specializes in acoustics and acoustic processing. We make submarine towed acoustic sensor arrays and a trainable system that can locate a sniper in an urban environment with an array of microphones.

            The multidimensional nature of acoustic echoes requires a great deal of parameter space to capture the frequency response function of each echo source. These functions include a bulk time delay but also a frequency spectrum with phase, as different materials or surfaces have different frequency and phase profiles. Add to this that the problem is not a discrete set of "echo" sources but instead a continuum of them, and the amount of processing power required is staggering.

            Keep in mind that the echo cancellation problem has a metric for how well the system nulls these echoes. A 5 to 7 dB rejection is probably adequate for a human to ignore the noise (I have no idea how sensitive the "recognition" block is). But I am certain that the voice recognition system in any OS has an AGC (automatic gain control) block that precedes the recognition block and is good for at least 10 dB. That means the echo cancellation system must achieve something like 15 to 17 dB just to overcome the AGC block and give cancellation as good as human perception would notice.
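The arithmetic in the comment above is just addition of decibel figures. The short sketch below restates it and converts a dB figure into an amplitude ratio. The 5 to 7 dB and 10 dB inputs are the commenter's estimates, not measurements, and the function names are illustrative.

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude (voltage) ratio."""
    return 10.0 ** (db / 20.0)


def required_rejection_db(perceptual_db, agc_gain_db):
    """Gains and losses in dB add, so the AGC boost must be cancelled on top
    of the rejection a human listener would need."""
    return perceptual_db + agc_gain_db


low = required_rejection_db(5.0, 10.0)   # lower estimate from the comment
high = required_rejection_db(7.0, 10.0)  # upper estimate from the comment
low_ratio = db_to_amplitude_ratio(low)
high_ratio = db_to_amplitude_ratio(high)
```

Here `low` and `high` come out to 15 and 17 dB, matching the comment, and the ratios show how much the echo amplitude would have to shrink to reach those figures.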
          • Commands versus acoustics

            Doing acoustic cancellation might be very difficult, but how hard would it be to monitor the sound stream with a parallel recognition process, and if the microphone input and the speaker output resulted in the same command, ignore that. (You'd have to have a time frame, because the speaker output might be giving the user instructions, and the Vista speech training does.)

            To ignore sound from the speakers is probably horribly difficult. To ignore *commands* from the speakers should be pretty easy, and would be a good thing for all speech recognition products to do. My Dragon Dictate often interprets as text the lyrics of songs being played by my system. Be nice if it would ignore anything it produced.
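The idea in the comment above, ignoring a command when the same command was just recognized in the speaker output, could look something like this. It is a sketch under assumptions: `SpeakerCommandFilter` is a hypothetical name, a second recognizer is assumed to be running on the output stream, and timestamps are passed in explicitly to keep the example deterministic.

```python
class SpeakerCommandFilter:
    """Reject microphone commands that match something just played by the PC."""

    def __init__(self, window_seconds=2.0):
        self.window = window_seconds
        self._recent = []  # (timestamp, command) recognized in speaker output

    def heard_on_speaker(self, command, now):
        """Record a command that the parallel recognizer found in the output."""
        self._recent.append((now, command.lower()))

    def accept_from_mic(self, command, now):
        """Return True if the microphone command should be acted on."""
        # Forget speaker commands that are older than the time frame.
        self._recent = [(t, c) for t, c in self._recent if now - t <= self.window]
        return all(c != command.lower() for _, c in self._recent)


f = SpeakerCommandFilter(window_seconds=2.0)
f.heard_on_speaker("Start Listening", now=10.0)
echoed = f.accept_from_mic("start listening", now=10.4)  # same command, in window
spoken = f.accept_from_mic("open notepad", now=10.5)     # different command
later = f.accept_from_mic("start listening", now=13.0)   # window has expired
```

The time frame matters for the reason the commenter gives: the speakers may legitimately say a command, for example during training, shortly before the user repeats it.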
        • don't get it

          Why is it so hard to prevent this? Take the following approach: every time something is played out from any audio output line, the speech commands are disabled until 2 seconds after the playback finishes. I guess this would solve the issue, and I don't think people would like to use voice commands while listening to music (it wouldn't work too well). The exception would be when using a headset, in which case the feedback and the exploit are not an issue. In this case, the user could activate a special exception option to allow the function. In this exception mode, the OS could run a little echo test every minute or so, taking little CPU power, to make sure that the user didn't unplug the headset and open the hole.

          Am I missing something? I don't think any special CPU-eating echo cancelling is required.
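The approach proposed in this comment can be sketched as a small state machine. The class and method names are hypothetical, the 2 second hold-off is the commenter's figure, and the periodic echo test for headset mode is left out.

```python
class PlaybackHoldoff:
    """Disable speech commands during playback and for a short time after."""

    def __init__(self, holdoff_seconds=2.0):
        self.holdoff = holdoff_seconds
        self.playing = False
        self.last_stop = None      # time playback last finished, if ever
        self.headset_mode = False  # user-enabled exception from the comment

    def playback_started(self):
        self.playing = True

    def playback_stopped(self, now):
        self.playing = False
        self.last_stop = now

    def commands_enabled(self, now):
        if self.headset_mode:
            return True   # speakers cannot reach the microphone
        if self.playing:
            return False  # audio is coming out of the PC right now
        return self.last_stop is None or now - self.last_stop >= self.holdoff


gate = PlaybackHoldoff(holdoff_seconds=2.0)
before = gate.commands_enabled(now=0.0)  # nothing has played yet
gate.playback_started()
during = gate.commands_enabled(now=1.0)  # disabled while audio plays
gate.playback_stopped(now=5.0)
just_after = gate.commands_enabled(now=6.0)  # still inside the hold-off
afterwards = gate.commands_enabled(now=7.5)  # hold-off has passed
```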
          • Echo cancellation is NOT hard

            It's already done in Windows Messenger which has excellent echo cancellation with very minimal CPU usage. Polycom Communicator also does echo cancellation in software with minimal CPU.
          • my point is...

            that regardless of whether echo cancellation is hard or not, you don't need to do it. Just disable the speech recognition while playing audio. Simple.
          • Please see above

            Yagotta B. Kidding
          • Non-Techie Approach

            ROTFLOL,,,,,how about just turning Voice Command off when you're not using it?....or even easier,,,,just turn the speakers off or mute them unless you are listening to something? Sometimes simple is better.
          • since we are at it

            why don't we just drop Vista and switch back to DOS?
            I mean, the point here is to benefit from voice command and a state-of-the-art next-generation multimedia platform experience (TM), without having to worry about when to activate it or deactivate it due to a security hole. If you are gonna click somewhere to activate it, you might as well just press the keyboard shortcut for the same action and get rid of voice commands. Turn off speakers? What about people who like to have system notifications active (like when a new email comes)?

            The way I see it, your proposal is not a solution to the problem, it is just a temporary and inconvenient workaround.
          • No offence intended

            Sorry,,didn't mean it quite that way. I just get amused at the typical geek (meant in a nice way) mind. First reaction is always some complicated tech solution. Let's face it, this exploit has a very tiny chance of success. In the first place, the user can hear what's going on. They have to be sitting at the machine when it happens, right? Otherwise how did the machine get on the site to start with. Second, the user has to be a little bit tech smart to be using voice command anyway, so they are going to realise very quickly that something is not right,,,and kill it. Third problem, the site has no way to tell if a computer that accesses it has voice command turned on,,so it's spouting all this stuff to everyone that hits it,,,somebody is going to notice sooner or later,,probably sooner.

            Yes,, the hole needs to be plugged. In the meantime, turn the speakers off while VC is active, and don't leave VC active unless you are actively using it at that point in time. Not a very elegant or techie temporary workaround, but foolproof.
    • Ah, but

      you really need to understand Microsoft's culture. History starts over with them: nothing that has ever happened elsewhere matters.


      Brilliant Ideas that were conceived, examined, and discarded in the 60s routinely turn up again (apparently without any awareness of the reasons they were rejected then) in MS products. CS professors get no end of amusement from observing the phenomenon. (It's that, or blow their cool totally. The ones who survive laugh.)
      Yagotta B. Kidding
      • Not just MS

        The PC revolution spawned a hacker ethos that helped ensure that a large proportion of a whole generation of programmers were openly contemptuous of academic computer scientists (the feeling was mutual). This generation is now middle aged, so it's easy to guess that many of them have worked their way into supervisory positions. It is therefore not surprising that approaches thought up and rejected back in the 60's and 70's still show up in commercial software.
        John L. Ries