How to scale to 256 processors and beyond

Written by Ed Burnette, Contributor

On the advice of fellow ZDNet blogger Mary-Jo Foley I recently watched a video interview with Mark Russinovich. In her article, "Windows 7 to scale to 256 processors," she summarized the interview but noted:

Given I’m not a programmer and am trying to channel a very technical Russinovich, it’s probably worth checking out the Channel 9 video interview of him yourself if you care about Windows kernel futures.

Well, I am a programmer, so here's my take. There are two main issues: a simple problem of describing a set of CPUs to run on, and a more complicated problem of lock contention. Let's look at the simple one first.

CPU enumeration

Way back when Windows was first conceived, multi-core chips weren't even a twinkle in Intel's eye. There were some very high end machines with multiple sockets, though, and each socket could accept a chip with a single CPU core. So it wasn't out of the question that in the future these multi-socket machines could become more common and more powerful, especially for server boxes. How many sockets can you put on a motherboard? 2? 4?

Just to be safe, Microsoft allowed for 32, not because they really expected anyone to have 32 sockets but because the word size of the brand-new Pentium chips was 32 bits. In one register (an 'int' in C) you could keep a flag for each CPU. If your program wanted to confine itself to run on certain CPUs (this is called "affinity"), you could make an operating system call and pass a bit mask of those CPUs. For example, if you wanted to run only on the first, second, and fifth CPUs, you would pass the hex number 0x13, or binary 10011.

Time marches on, and today it's pretty cheap to get an 8-core machine. Intel is talking about 80 cores and higher, and manufacturers like IBM and Azul will already sell you 144-core or 768-core machines for a few million dollars. Clearly this is a market Microsoft wants to be in.

The first obvious step was to extend those 32-bit fields to 64 bits. Most enterprise servers run 64-bit-capable hardware anyway, and they should be running a 64-bit version of the operating system that can handle the wider integers. Still, that only doubles the number of CPUs you can support.

So in Windows 7, they're introducing a processor grouping mechanism. You can still specify affinity using a 32- or 64-bit word within a group, but if you have more than 32 or 64 cores then there's a concept of groups overlaid on that. Using the old APIs, you can specify you want to run on the first, second, and fifth CPU but only within your current group. Using processor groups they can support up to 256 cores, maybe more. It's a bit like memory segmentation in the awkward transition between 16 and 32 bits years ago. (If you don't know what I'm talking about, then consider yourself lucky and move on.)

All that bit twiddling doesn't address the real problem though, which is lock contention.

Mitigating contention

Once upon a time, we thought we understood parallel programming. Everything is rosy until two actors (threads of execution, or CPU cores; let's call them Socrates and Voltaire) try to access the same resource at the same time. As long as they play nice and take turns, there is no problem. To enforce turn-taking, they both try to acquire a lock (also called a mutex), and only one of them will get it.

Let's say Socrates gets the lock. He has exclusive access to the resource it protects as long as he holds the lock, so he can modify the resource however he wants. Meanwhile, Voltaire is left twiddling his thumbs. Eventually he'll doze off to sleep.

Finally Socrates is done, releases the lock, and elbows Voltaire in the side to wake him up so he can have his turn. I'm over-simplifying a bit here but that's the basic idea.

Now, there are two problems with this model. The first, deadlock, occurs when there are two locks (a right and a left fork in the traditional Dining Philosophers problem) and they happen to be acquired in an order such that Socrates is waiting on something Voltaire has while, at the same time, Voltaire is waiting on something Socrates has. That's an annoying problem, but a preventable one, so I'm not going to discuss it at length.

The second problem is all that thumb-twiddling and sleeping that Voltaire is doing. That's wasted time that could have been put to better use. If the two actors only keep the lock for a small percentage of the time, say 1%, then it's not a big deal. Voltaire loses 1% of his day, or about 15 minutes. He could probably use the rest anyway, right?

Now imagine that instead of 2 actors in this tale you have 20, all trying to get at the same resource. One will acquire the lock, and the other 19 will wait. All that wasted time really starts to add up, and eventually you reach a tipping point, or "knee in the graph," where you're doing more waiting than work.

This is not a hypothetical problem. In one program I was working on, throughput actually went *down* as we increased the number of CPUs. It's counter-intuitive but true. Windows engineers like Russinovich ran smack dab into this when they tried running Windows programs on 16-way and higher machines. So what's the solution?

The best solution to lock contention is not to use locks at all. A lot of research has been done on lock-free data structures and concurrent programming, but it's still not well understood by mainstream developers, and it requires a paradigm shift in the way you think about programming. It's not practical to retrofit into a large and complex existing system.

The second-best solution, which is the route chosen by Windows 7 engineers, is to turn your big locks into smaller, more granular locks. Suppose you have a big operating system structure containing 100 pieces of mostly unrelated information. In the "old days" when contention wasn't an issue you could protect the whole structure with one lock. Just grab the lock when you need something out of there, and release it when you're done. Safe, simple, resistant to deadlocks, but completely unscalable.

So one thing the Windows 7 people did is replace that one big lock with a bunch of smaller locks. They went through the code line by line and changed it to only acquire the small locks when needed and release them as soon as possible. It sounds simple but it's actually quite tricky because of that deadlock problem I mentioned earlier. It's doable, but hard.

Another thing they did is use more thread local data structures, and make them more effective by preserving thread affinity across system calls. Think of thread local storage as a little notepad you carry with you to a big meeting. There's also a whiteboard at this meeting. You could take notes on either, but there would be chaos if everybody tried to write on the whiteboard at the same time. So you use the notepad and reconcile with the whiteboard later. Everybody has their own notepad, so everyone can write to it at once.


The problems and solutions that Mark refers to in his video are not new. They're not unique to Windows 7, or even to Microsoft. What's new is that programming for multiple CPUs, whether they're on the same chip, in the same box, in the same data center, or spread around the globe, is becoming more and more mainstream every day. I have an 8-core computer on my desk today. If you extrapolate the trends, it won't be long before 100-core, 1000-core, or even million-core computers are the norm, and if you want to be a gainfully employed developer in this new world it's time to sit up and take notice.

