Putting perspective on the "30% of Vista crashes caused by nVIDIA" reports

A story that's echoing around the blogosphere at the moment relates to how information contained in the 158-page pack of emails released as part of the Vista Capable lawsuit indicates that at some unspecified point during 2007, 30% of Vista crashes seemed to be down to nVIDIA drivers. Back in the real world, things aren't that clear cut.

It was an article written by Joel Hruska of Ars Technica that seeded the discussion pool initially. In this article, Hruska presents the details in a careful, sensible way and he's clear about what can and can't be inferred from the data available.

However, other sites haven't shown the same care and attention to detail. For example, Engadget's Nilay Patel takes the information available and draws some wild conclusions, such as:

... a lot of problems faced by the troubled operating system are actually NVIDIA's fault -- nearly 30% of logged Vista crashes were due to NVIDIA driver problems, according to Microsoft data included in the bundle. That's some 479,326 hung systems, if you're keeping score at home ... [emphasis added]

Whoa, slow down there! Who said anything about hung systems? It might be sensational, but it's far from accurate. I know, because I've had several Vista systems plagued by ATi and nVIDIA driver crashes, so I not only have a lot of first-hand experience with these issues, but I've also spoken at great length to several very clever people at nVIDIA, ATi and Microsoft about them.

The important thing to note here is that an nVIDIA or ATi driver crash doesn't automatically mean a BSOD, hung system or a full-blown crash under Vista. This is because of a mechanism called TDR (Timeout Detection and Recovery). Here's how Microsoft describes TDR:

Windows Vista attempts to detect these problematic hang situations and recover a responsive desktop dynamically. In this process, the Microsoft Windows Display Driver Model (WDDM) driver is reinitialized and the GPU is reset. No reboot is necessary, which greatly enhances the user experience. The only visible artifact from the hang detection to the recovery is a screen flicker, which results from resetting some portions of the graphics stack, causing a screen redraw. Some older Microsoft DirectX applications may render to a black screen at the end of this recovery. The end user would have to restart these applications.

[Image: WDDM timeout]

The thing about these TDR events is that if your system is set to report crashes, information on each one of these events is sent to Microsoft. While each one is certainly classified as an undesirable event, they're not crashes in the XP sense of the word.

Back to Microsoft for more description of TDR events:

Throughout the process of GPU hang detection and recovery, the desktop is unresponsive and thus unavailable to the user. In the final stages of recovery, a brief screen flash occurs that is similar to the one when the screen resolution is changed. After the desktop has been successfully recovered, the following informational message appears to the user.

The message is also logged in the Windows Vista Event Viewer. Diagnosis information is collected in the form of a debug report that is returned to Microsoft through the Online Crash Analysis (OCA) mechanism if the user opts in to provide feedback.
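To make the distinction concrete, here's a toy sketch of the flow Microsoft describes: a GPU command that runs too long triggers a reset and recovery rather than a system crash, and a report is queued only if the user has opted in. Everything in it is hypothetical and for illustration only (the `ToyGpu` class, the `run_command` function and the two-second timeout are my assumptions, not Microsoft's code):

```python
from dataclasses import dataclass, field

# Assumed timeout for this sketch only; not taken from the released documents.
TIMEOUT_SECONDS = 2.0


@dataclass
class ToyGpu:
    """Hypothetical stand-in for a display driver plus GPU."""
    resets: int = 0
    reports: list = field(default_factory=list)


def run_command(gpu: ToyGpu, seconds_taken: float, opted_in: bool = True) -> str:
    """Return 'ok' if the command finished in time, else 'recovered'."""
    if seconds_taken <= TIMEOUT_SECONDS:
        return "ok"
    # Timeout detected: reset the GPU and reinitialize the driver.
    gpu.resets += 1
    # A debug report is queued only when the user has opted in.
    if opted_in:
        gpu.reports.append(f"TDR after {seconds_taken:.1f}s")
    # The desktop comes back; no reboot and no blue screen.
    return "recovered"


gpu = ToyGpu()
results = [run_command(gpu, t) for t in (0.1, 5.0, 0.3)]
print(results, gpu.resets, len(gpu.reports))
```

The point of the sketch is that the slow command produces a report even though the "system" carried on running afterwards.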

I had one system that was so badly affected by this issue that over a three-month period it generated in excess of 3,300 TDR events (most of these happened when the system was displaying a screensaver). But what's interesting is that while all 3,300 events were reported to Microsoft, the system only crashed in the XP sense of the word (in other words, locked up or hit a BSOD) a couple of dozen times during that period.
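For a rough sense of scale, here's the arithmetic on those figures as a quick sketch. I'm taking "a couple of dozen" to mean 24 and rounding the event count down to 3,300; both are assumptions, and the function name is just for illustration:

```python
def hard_crash_ratio(tdr_events: int, hard_crashes: int) -> float:
    """Hard crashes (lock-up or BSOD) per reported TDR event."""
    return hard_crashes / tdr_events


tdr_events = 3300   # "in excess of 3,300", rounded down
hard_crashes = 24   # assumed value for "a couple of dozen"

ratio = hard_crash_ratio(tdr_events, hard_crashes)
print(f"Hard crashes per reported TDR event: {ratio:.2%}")
print(f"Reported TDR events per hard crash: {tdr_events / hard_crashes:.1f}")
```

On those assumed numbers, that's well under one real crash for every hundred events reported from this one machine.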

Note: In case you're wondering, here's how we fixed the problematic systems ... after talking to a few people at Microsoft (and handing them some debug information) we came to the conclusion that the problem came down to a vertical sync issue. Both systems ran dual screens, and these screens were of different sizes and running at different screen resolutions, so I took a stab at solving the problem by fitting identical monitors on the systems. It worked! Problem solved.

Bottom line is that there's just not enough information in the released documents to allow us to come to anything better than broad-brush conclusions.  It's good for a headline, but not much more.

Thoughts?