Defective McAfee update causes worldwide meltdown of XP PCs

Oops, they did it again. Early this morning, McAfee released an update to its antivirus definitions for corporate customers that mistakenly deleted a crucial Windows XP file, sending systems into a reboot loop and requiring tedious manual repairs. It's not the first strike for the company, either. I've got details.
Written by Ed Bott, Senior Contributing Editor

[Update, April 22. More details in my follow-up post, McAfee admits "inadequate" quality control caused PC meltdown.]

Oops, they did it again.

At 6AM today, McAfee released an update to its antivirus definitions for corporate customers that had a slight problem. And by "slight problem," I mean the kind that renders a PC useless until tech support shows up to repair the damage manually. As I commented on Twitter earlier today, I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today.

Here's how the SANS Internet Storm Center describes the screw-up:

McAfee's "DAT" file version 5958 is causing widespread problems with Windows XP SP3. The affected systems will enter a reboot loop and [lose] all network access. We have individual reports of other versions of Windows being affected as well. However, only particular configurations of these versions appear affected. The bad DAT file may infect individual workstations as well as workstations connected to a domain. The use of "ePolicyOrchestrator", which is used to update virus definitions across a network, appears to have [led] to a faster spread of the bad DAT file. The ePolicyOrchestrator is used to update "DAT" files throughout enterprises. It can not be used to undo this bad signature because affected system will lose network connectivity.

The problem is a false positive which identifies a regular Windows binary, "svchost.exe", as "W32/Wecorl.a", a virus.

McAfee now has its own KnowledgeBase page posted, with details about the problem and the fix. The symptoms are described, tersely, as "Blue screen or DCOM error, followed by shutdown messages after updating to the 5958 DAT on April 21, 2010."

Update: Engadget's Nilay Patel quotes a statement from McAfee downplaying the impact on consumers:

The faulty update has been removed from McAfee download servers for corporate users, preventing any further impact on those customers. We are not aware of significant impact on consumer customers and believe we have effectively limited such occurrence.

That's bad news for McAfee. Corporate customers are likely to tally up the one-day cost of fixing this damage (or multiple days, if Engadget's report of tens of thousands of affected PCs within single companies is accurate), and they're likely to conclude that it's time to find a new supplier of security software. At the very least, McAfee is going to have a lot of explaining to do at contract renewal time.

McAfee says it has already replaced the faulty virus definitions with an updated set, so if you update your definitions using the most recent set you will not encounter this issue. The company's official recommendation for repairing the damage involves manually copying Svchost.exe from a working system to an affected one. The McAfee technical bulletin doesn't include details about how to get to a command prompt on a system that's been temporarily bricked. (Using an XP installation disk allows a tech support professional to boot to a recovery environment and copy the necessary files from a command prompt. The good folks at BleepingComputer.com have published a tutorial that explains the process. Third-party recovery tools also provide access to the file system from command-line environments.) This sort of repair is certainly not a job for end users; it generally requires a skilled support professional.
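
For support staff preparing repair media, the copy step described above can be sketched in Python. This is a minimal illustration, not McAfee's documented procedure: the helper names sha256_of and stage_replacement and the E:\repair destination are hypothetical, and it assumes the script runs on a healthy machine with Python installed, staging a known-good Svchost.exe before it is carried to an affected PC. It confirms that the staged copy is byte-for-byte identical to the original, so a bad copy doesn't compound the problem.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_replacement(known_good: Path, media_dir: Path) -> Path:
    """Copy a known-good file onto repair media and verify the copy.

    Hypothetical helper: raises OSError if the staged copy does not
    match the source file.
    """
    media_dir.mkdir(parents=True, exist_ok=True)
    staged = media_dir / known_good.name
    shutil.copy2(known_good, staged)
    if sha256_of(staged) != sha256_of(known_good):
        raise OSError(f"{staged} does not match {known_good}")
    return staged


# Example call (paths are assumptions; adjust for your own systems):
# stage_replacement(Path(r"C:\WINDOWS\system32\svchost.exe"), Path(r"E:\repair"))
```

The actual copy onto the damaged machine still has to happen from a recovery environment, since the affected PC can't run anything on its own.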

Update 2: An e-mail correspondent from a large U.S. company (see full text at end of this post) says that multiple files in addition to Svchost.exe might be affected and claims that simply replacing Svchost.exe might not be enough to repair the damage. I'm still looking to confirm this report.

Update 3, 22-Apr: McAfee has released a repair tool it calls the SuperDAT Remediation Tool. Details are on this page. Running this tool is still a manual process that requires booting from portable media and running the executable file, in safe mode if necessary. 

Now, it is hard to imagine picking a worse file to torpedo. Svchost.exe is one of the most crucial of all Windows system files. It hosts the services that make just about every OS function possible. As the symptoms described here suggest, Windows simply won't start if Svchost.exe isn't there.

The bigger question is how on earth an update like this ever made it out of the testing lab and onto a production server. This should have been caught at the very beginning of the testing process.

Unfortunately, though, this isn't the first time McAfee has had a screw-up like this. Back in 2009, when the Conficker worm was making the rounds, I took a close look at how McAfee was handling its response to the new threat and was appalled at the sloppy, error-ridden documents they published for consumers and IT professionals. Here's what I wrote at the time:

Security is serious business, and details matter. When a company as large as McAfee is this sloppy with its public response to a high-profile issue, it makes you wonder how tightly the engineering, development, and support sides of the business are being operated.

Now we know.

Ironically, one company that was apparently affected by this issue is Intel, which was identified by the New York Times. It's the second major security headache for Intel in six months, following a widely publicized breach of its systems in China around New Year's. (Intel acknowledged the "recent and sophisticated incident [that] occurred in January 2010" in its 10-K report filed with the SEC earlier this year.)

If you've been affected by this issue, leave a comment in the Talkback section. I'll add further details as I come across them.

Update: I'm beginning to hear directly from people who were affected by this colossal screw-up. One correspondent says he just fixed over 300 PCs: "Looked so much like Blaster from way back. Horrible clean up too as no network access. Moving clients to something with more centralized control ASAP."

A report from a university IT pro says 1200 PCs on his network were knocked out.

Another e-mail from an IT pro at a large U.S. company says that "hundreds of users" in his organization were impacted:

This issue affected a large number of users and is not resolved by simply replacing svchost.exe. You must boot to safe mode, then install the extra.dat, then manually run the vscan console. You then remove the quarantined files. All users had at least two and some had up to 15. Unfortunately, using this method, you have no way to determine if some of the files you are restoring are vital system files or virus files.

I'm still hoping to get confirmation from Intel, where at least one anonymous source says "tens of thousands of PCs" were hit.

A report from Australia says 10% of the cash registers at the country's largest supermarket chain were knocked out, forcing the closure of 14-18 stores.

Via e-mail, I've heard firsthand reports from people who had to manually repair PCs at some very large corporations and arms of the U.S. military.
