Why the blue screen of death no longer plagues Windows users

The dreaded blue screen of death is familiar to any long-time Windows PC user, but Microsoft has been developing tools to keep the BSOD at bay.
Written by Nick Heath, Contributor

Remember the blue screen of death, a Windows PC's way of telling you it had suffered an error so catastrophic it couldn't carry on anymore?

The dreaded Windows blue screen of death.

In recent years, sightings of the BSOD have become less common in Windows operating systems, as Microsoft has stamped out some of the rogue code commonly responsible.

At a recent event in Cambridge, Microsoft talked about how it had reduced misbehaving code in its operating system, using automated tools and a huge number of crash reports from Windows XP users.

The main cause of crashes in Windows XP was device drivers, which were responsible for some 85 percent of hiccups in the OS. Drivers are the code that allows an operating system to control a hardware device, such as a video card, by handling commands between the device and the core of the operating system, the kernel.

Drivers can be particularly difficult to debug, as their code is written by many different companies and is generally not open source, so it is opaque to Microsoft. Their interactions can also be rather complex, with drivers commonly interoperating with a stack of other drivers.

"There's an exponentially growing number of device drivers in the ecosystem and they're written typically not by Microsoft but by our partners," said Byron Cook, principal researcher at the Microsoft Research lab in Cambridge and manager of Microsoft's Programming Principles and Tools group.

"There are a number of rules that these systems must adhere to, otherwise the whole system is going to crash."


How Microsoft stamped on driver errors

Teams in Microsoft's Windows division developed algorithms that took in driver-related crash reports from XP users and automatically categorised them by driver vendor and the likely cause.

The goal for Microsoft was to figure out which drivers were causing problems and what the most common fatal mistakes were.
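
As a rough illustration of that kind of categorisation, the sketch below groups crash reports by vendor and likely cause, then ranks the causes by frequency. The report fields, vendor names and cause labels are invented for the example and are not Microsoft's actual report format or algorithm.

```python
from collections import Counter

# Hypothetical crash reports: the field names, vendors and cause labels are
# assumptions made for this sketch, not a real crash-report schema.
REPORTS = [
    {"vendor": "VendorA", "cause": "api_misuse"},
    {"vendor": "VendorA", "cause": "api_misuse"},
    {"vendor": "VendorB", "cause": "memory_corruption"},
    {"vendor": "VendorA", "cause": "infinite_loop"},
    {"vendor": "VendorB", "cause": "api_misuse"},
]


def bucket_reports(reports):
    """Count crash reports for each (vendor, cause) pair."""
    return Counter((r["vendor"], r["cause"]) for r in reports)


def rank_causes(reports):
    """Return (cause, count) pairs, most frequent cause first."""
    return Counter(r["cause"] for r in reports).most_common()


if __name__ == "__main__":
    for (vendor, cause), count in bucket_reports(REPORTS).most_common():
        print(vendor, cause, count)
    print(rank_causes(REPORTS))
```

Ranking the buckets this way is what would point engineers at the vendors and mistakes worth tackling first.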

Microsoft established there were three ways that device drivers commonly tripped up Windows XP.

The first was drivers misusing the Windows APIs that handle communication between the Windows kernel and the driver. An example is a driver calling the Windows kernel API IoCompleteRequest twice for the same request, which caused Windows to crash.
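
A rule such as "complete each request at most once" can be expressed as a simple check. The sketch below is only a toy model: it scans a recorded list of events rather than analysing driver source code, and the event names and request IDs are hypothetical stand-ins, not real Windows data structures.

```python
def first_double_completion(trace):
    """Return the index of the first event that completes a request which
    has already been completed, or None if the trace obeys the rule.

    `trace` is a list of (event, request_id) pairs. The "complete" event is
    a hypothetical stand-in for a call to IoCompleteRequest.
    """
    completed = set()
    for index, (event, request_id) in enumerate(trace):
        if event == "complete":
            if request_id in completed:
                return index  # rule violated: second completion
            completed.add(request_id)
    return None


if __name__ == "__main__":
    good = [("start", 1), ("complete", 1), ("start", 2), ("complete", 2)]
    bad = [("start", 1), ("complete", 1), ("complete", 1)]
    print(first_double_completion(good))
    print(first_double_completion(bad))
```

A static verifier aims to show that no possible execution of the driver can violate the rule, which is a much stronger guarantee than checking one recorded trace.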

The second major cause of errors was memory corruption, where memory is not correctly allocated for data structures needed by the driver. The third was drivers hanging the system after getting caught in an infinite loop.

To reduce the number of buggy device drivers, Microsoft embarked on what it called "data-driven program verification". This is a process whereby "you model a computer program as a mathematical system and the goal is to build tools that find proofs of correctness using mathematics and logic", said Cook.

"The goal is to build tools that automatically find proofs of correctness rather than just enumerating all the possible test cases", thereby accelerating the rate at which bugs can be stamped out.

Microsoft developed three new tools for automatically spotting and squashing software errors. The first was a piece of software called Slam, which checks whether a piece of software obeys the usage rules of the interfaces it calls. Slam was used as the verification engine for the Static Driver Verifier tool, which is now part of the Windows Driver Development Kit.

Another Microsoft tool, Slayer, addressed memory corruption. Slayer analyses data structures associated with a device driver and checks that every memory address the device driver touches has been properly allocated.
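
The property being checked, that every address touched lies inside allocated memory, can be illustrated with a small sketch. This is a plain bounds check over a recorded list of accesses, assuming allocations are given as (start, size) pairs. It is not how Slayer works internally, and all the addresses are made up.

```python
def find_bad_accesses(allocations, accesses):
    """Return the accessed addresses that fall outside every allocation.

    `allocations` is a list of (start, size) pairs and `accesses` is a list
    of integer addresses. Both are hypothetical inputs for illustration.
    """
    bad = []
    for address in accesses:
        inside = any(start <= address < start + size
                     for start, size in allocations)
        if not inside:
            bad.append(address)
    return bad


if __name__ == "__main__":
    allocations = [(100, 16), (200, 8)]      # blocks [100, 116) and [200, 208)
    accesses = [100, 115, 116, 207, 208]     # 116 and 208 are one past the end
    print(find_bad_accesses(allocations, accesses))
```

An off-by-one access just past the end of a block, as in the example input, is a typical way for a driver to corrupt memory it does not own.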

Using these tools, Microsoft found a number of bugs in device drivers written by third parties, and also in the 40 or so sample device drivers Microsoft provided as part of the Windows Driver Development Kit.

"If you're a device driver writer what you do is typically copy and paste that code and then modify it," said Cook.

"So bugs in those samples are pretty bad because they then propagate throughout the infrastructure."

As well as fixing the bugs in the samples, Microsoft has now released tools that device driver writers can use to find bugs in their own code.

Working out whether a device driver would get stuck in an infinite loop was trickier, as Microsoft was faced with the difficulty of addressing the halting problem. The famous mathematician and father of computing Alan Turing proved that no general algorithm can decide, for every possible program and input, whether that program will eventually halt.

But Cook said the nature of device drivers meant there were ways to analyse drivers to see if they would terminate.

"The nice thing about device drivers is they are typically quite small, about 30,000 lines of code. They typically don't have too many nested loops, and there are some other things about them that means you can succeed in this domain where you might not succeed generally," he said.

The team developed a termination prover for Windows device drivers called Terminator, which works on device drivers of up to 35,000 lines of code. Terminator helped uncover a number of bugs in Windows XP. For example, unplugging a mouse while moving it would hang the system, as the OS would get stuck walking the I/O request queue forever.
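
One standard way to argue that a loop terminates is to find a ranking function: a measure that can never go below zero and gets strictly smaller on every iteration. The sketch below checks that idea on sampled runs of a toy queue walk. The `requeue` flag is a hypothetical bug introduced for the example, and the `max_steps` cap is there only so the demonstration itself stops. Checking sample runs is testing, not proof. A termination prover has to establish the decrease for every possible execution.

```python
def ranking_decreases(states, rank):
    """Check that `rank` is never negative and strictly decreases from each
    recorded state to the next."""
    values = [rank(state) for state in states]
    non_negative = all(value >= 0 for value in values)
    decreasing = all(a > b for a, b in zip(values, values[1:]))
    return non_negative and decreasing


def walk_queue(queue, requeue=False, max_steps=10):
    """Walk a queue and record its length before the walk and after each step.

    With requeue=True every item is put back on the queue, modelling a buggy
    walk that would never finish without the max_steps cap.
    """
    pending = list(queue)
    lengths = [len(pending)]
    steps = 0
    while pending and steps < max_steps:
        item = pending.pop(0)
        if requeue:
            pending.append(item)  # hypothetical bug: the queue never shrinks
        lengths.append(len(pending))
        steps += 1
    return lengths


if __name__ == "__main__":
    remaining = lambda length: length  # candidate ranking function
    print(ranking_decreases(walk_queue([1, 2, 3]), remaining))
    print(ranking_decreases(walk_queue([1, 2, 3], requeue=True), remaining))
```

In the healthy walk the number of remaining items shrinks to zero, so the loop must end. In the buggy walk the measure never drops, which is the kind of evidence that sends a tool looking for a non-terminating path.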

Cook said the stability of recent versions of Windows, such as Windows 7 and 8, has benefited from Microsoft's work on stabilising drivers.

"The internal crash data has pointed us towards buggy device drivers we should be focusing on and allowed us to figure out what the common mistakes are. It has helped us clarify with members of the Windows kernel team what rules device drivers should respect, but also what properties we should try and verify in programs," said Cook.
