Update 23-Apr: Late Thursday night, McAfee posted a FAQ on this issue on its website. The FAQ includes some of the text from the confidential document I received yesterday and is clearly a later version of that document. However, the details of why the problem occurred and the specific steps that the company plans to take to avoid similar problems in the future have been replaced with general statements. I have highlighted the differences in updates below.
As of 6AM Pacific time on 23-Apr, there is still no statement, apology, or clearly labeled link to support resources related to this issue on McAfee's home page.
If your company uses enterprise security products from McAfee, you probably had a bad day yesterday. If you're an IT professional at one of those companies, you're probably still cleaning up the mess caused by a defective virus signature update that disabled systems running Windows XP with the most recent service pack (SP3). The worst part? According to a confidential document from McAfee, the cause was a fundamental breakdown in the most basic of quality-assurance processes.
From an IT perspective, this is a nightmare scenario: an automatic update that wipes out a crucial system file and that can only be repaired manually. I've heard from more than a dozen IT pros and consultants over the past 24 hours who shared their experiences. They are, to put it mildly, unhappy.
What went wrong?
That was the question I asked in my post yesterday, and I formally asked a McAfee spokesperson for an explanation this morning. I was told that an answer would be posted on McAfee's blog later today. As of this writing, that blog post has not been published.
But I found the answer, straight from McAfee itself, in a document forwarded to me by an anonymous source. According to my source, the document was "a confidential communication to enterprise customers" sent via e-mail. In it, the unnamed author acknowledges that the screw-up was thoroughly preventable. The document, titled "McAfee FAQ on bad DAT issue," is written in Q&A format and includes the following exchange:
8. How did this DAT file get through McAfee’s Quality Assurance process?
There are two primary causes for why this DAT file got through our quality processes:
1) Process – Some specific steps of the existing Quality Assurance processes were not followed: Standard Peer Review of the driver was not done, and the Risk Assessment of the driver in question was inadequate. Had it been adequate it would have triggered additional Quality Assurance steps.
2) Product Testing – there was inadequate coverage of Product and Operating System combinations in the test systems used. Specifically, XP SP3 with VSE 8.7 was not included in the test configuration at the time of release.
Update 23-Apr: The details I quoted above have been scrubbed from the FAQ posted at McAfee's website. The corresponding section of the FAQ now reads as follows: "The DAT release was designed to target the W32/Wecorl.a threat that attacks system executables and memory. The problem arose during the testing process for this solution. We had recently made a change to our QA environment. Unfortunately, this change resulted in a faulty DAT making its way out of our test environment."
McAfee has also sanitized the portion of the FAQ that describes its plans to adapt its quality control procedures. Here's the original text of the confidential document sent to enterprise customers:
9. What is McAfee going to do to ensure this does not repeat?
McAfee is currently conducting an exhaustive audit of internal processes associated with DAT creation and Quality Assurance. In the immediate term McAfee will do the following to provide mitigation from false detections:
1) Strict enforcement of rules and processes regarding DAT creation and Quality Assurance.
2) Addition of the missing Operating Systems and Product configurations.
3) Leveraging of cloud based technologies for false remediation.
4) A revision of Risk Assessment criteria is underway.
And here is the corresponding text as it appears in the final FAQ, published overnight:
What is McAfee going to do to prevent this from happening again?
Nearly all of our 7,000 employees have been working around the clock to help customers like you get back to business as usual and to make sure this never happens again. The vast majority of our customers are now back up and running and we remain focused on those that remain affected.
We are implementing additional QA protocols for any releases that directly impact critical system files. We are also rolling out additional capabilities in Artemis that will provide another level of protection against false positives by leveraging an expansive whitelist of critical system files and their associated cryptographic hashes.
The admission in the original document that XP SP3 with VSE 8.7 was missing from the test configuration is mind-boggling. For enterprise customers, Windows XP SP3 is probably the most widely used desktop PC configuration. Leaving it out of a test matrix is about as close as one can get to IT malpractice. Any enterprise customer who received this document has every right to be furious.
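To make the point concrete, here is a minimal sketch in Python of the kind of coverage check that would flag a gap like this before a release goes out. Everything in it is an assumption for illustration: the OS and product lists, the contents of the `tested` set, and the function names `missing_configurations` and `release_allowed` are hypothetical and do not reflect McAfee's actual tooling or test environment.

```python
from itertools import product

# Hypothetical inventory of what customers actually run (illustrative only).
deployed_os = ["Windows XP SP2", "Windows XP SP3", "Windows Vista SP2"]
deployed_products = ["VSE 8.5", "VSE 8.7"]

# Hypothetical test matrix: every OS/product pair that QA actually exercised.
# The XP SP3 + VSE 8.7 pair is deliberately absent, mirroring the gap
# described in the confidential FAQ.
tested = {
    ("Windows XP SP2", "VSE 8.5"),
    ("Windows XP SP2", "VSE 8.7"),
    ("Windows XP SP3", "VSE 8.5"),
    ("Windows Vista SP2", "VSE 8.5"),
    ("Windows Vista SP2", "VSE 8.7"),
}


def missing_configurations(os_builds, products, tested_pairs):
    """Return every deployed OS/product pair that is not in the test matrix."""
    return [pair for pair in product(os_builds, products)
            if pair not in tested_pairs]


def release_allowed(os_builds, products, tested_pairs):
    """Allow a release only when no deployed configuration is untested."""
    return not missing_configurations(os_builds, products, tested_pairs)


gaps = missing_configurations(deployed_os, deployed_products, tested)
print(gaps)  # [('Windows XP SP3', 'VSE 8.7')]
print(release_allowed(deployed_os, deployed_products, tested))  # False
```

A check like this is cheap to run on every release. The hard part is keeping the list of deployed configurations honest, which is exactly where the process described in the document fell down.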
Meanwhile, McAfee's website is almost completely silent on the issue. Affected customers who visit the McAfee U.S. home page see business as usual, with a rotation of large ads trumpeting McAfee's latest products. More than 24 hours after the problem occurred, only a single front-page link is available, and it's blandly headlined, "McAfee Response on Current False Positive Issue." If you go to McAfee's Enterprise home page, there is no mention of the problem and no link to any support resources. An overseas correspondent sent me a screen shot of McAfee's UK home page, which also has no mention of the issue.
That link leads to a blog post by McAfee's Barry McPherson, published yesterday at 4:29PM. McPherson seems more intent on praising McAfee's researchers and minimizing the problem than helping users. He writes: "We believe that this incident has impacted less than one half of one percent of our enterprise accounts globally…" I find it difficult to believe that the company could come up with an accurate estimate at all, much less do so within hours after the problem was identified. It certainly doesn't match up with the reports I'm hearing from the field.
Update 23-Apr: Yesterday afternoon, the McAfee blog post was edited to remove this estimate. The sentence now reads, "We believe that this incident has impacted a small percentage of our enterprise accounts globally and a fraction of our consumer base..."
From a crisis management perspective, McAfee's response has been disastrous. If the company truly cared about its customers, the home page would contain an apology from the CEO and links to detailed support information. Instead, it appears that the company is hoping its customers will just forget about it.
Based on the 100+ comments to McPherson's post, customers who were hit by this error aren't likely to forget about it soon. And when they figure out that a lapse in the most basic of quality control steps caused them to spend thousands of dollars in IT manpower and lost productivity, they're likely to be angrier still.