ie8 fix
madison

How to scale to 256 processors and beyond

By | November 7, 2008, 7:42am PST

On the advice of fellow ZDNet blogger Mary-Jo Foley I recently watched a video interview with Mark Russinovich. In her article, Windows 7 to scale to 256 processors, she summarized the interview but noted:

Given I’m not a programmer and am trying to channel a very technical Russinovich, it’s probably worth checking out the Channel 9 video interview of him yourself if you care about Windows kernel futures.

Well, I am a programmer so here’s my take. There are two main issues: a simple problem with describing a set of CPUs to run on, and a more complicated problem with lock contention. Let’s look at the simple one first.

CPU enumeration

Way back when Windows was first conceived, multi-core chips weren’t even a twinkle in Intel’s eye. There were some very high end machines with multiple sockets, though, and each socket could accept a chip with a single CPU core. So it wasn’t out of the question that in the future these multi-socket machines could become more common and more powerful, especially for server boxes. How many sockets can you put on a motherboard? 2? 4?

Just to be safe Microsoft allowed for 32, not because they really expected anyone to have 32 but because the word size of the brand new Pentium chips was 32 bits. In one register (or ‘int’ in C) you could keep a flag for each CPU. If your program wanted to confine itself to run on certain CPUs (this is called “affinity”) then you could make an operating system call and pass a bit mask of all of those CPUs. For example if you wanted to only run on the first, second, and fifth CPU then you could pass the hex number of 0×13, or binary 010011.

Time marches on, and today it’s pretty cheap to get an 8-core machine. Intel is talking about 80-core and higher, and manufacturers like IBM and Azul already will sell you 144-core or 768-core machines for a few $million. Clearly this is a market Microsoft wants to be in.

The first obvious step was to extend those 32-bit fields to 64-bits. Most enterprise servers run 64-bit capable hardware anyway, and they should be running a 64-bit version of the operating system that can handle wider integers. Still, that only gets you a 2-fold increase in the number of CPUs you can support.

So in Windows 7, they’re introducing a processor grouping mechanism. You can still specify affinity using a 32- or 64-bit word within a group, but if you have more than 32 or 64 cores then there’s a concept of groups overlaid on that. Using the old APIs, you can specify you want to run on the first, second, and fifth CPU but only within your current group. Using processor groups they can support up to 256 cores, maybe more. It’s a bit like memory segmentation in the awkward transition between 16 and 32 bits years ago. (If you don’t know what I’m talking about, then consider yourself lucky and move on.)

All that bit twidling doesn’t address the real problem though, which is lock contention.

Dining PhilosophersMitigating contention

Once upon a time, we thought we understood parallel programming. Everything is rosy until two actors or threads of execution or CPU cores (let’s call them Socrates and Voltaire) try to access the same resource at the same time. As long as they play nice and take turns, there is no problem. To enforce this rule, they both try to acquire a lock (also called a mutex), and only one of them will get it.

Let’s say Socrates gets the lock. He has exclusive access to the resource it protects as long as he holds the lock, so he can modify the resource however he wants. Meanwhile, Voltaire is left twiddling his thumbs. Eventually he’ll doze off to sleep.

Finally Socrates is done, releases the lock, and elbows Voltaire in the side to wake him up so he can have his turn. I’m over-simplifying a bit here but that’s the basic idea.

Now, there are two problems with this model. The first problem, deadlock, occurs if there are two locks (a right and left fork in the traditional Dining Philosophers problem) and the locks happen to be acquired in an order that Socrates is waiting on something Voltaire has while at the same time Voltaire is waiting on something Socrates has. That’s an annoying problem, but preventable so I’m not going to discuss it.

The second problem is all that thumb-twiddling and sleeping that Voltaire is doing. That’s wasted time that could have been put to better use. If the two actors only keep the lock for a small percentage of the time, say 1%, then it’s not a big deal. Voltaire loses 1% of his day, or about 15 minutes. He could probably use the rest anyway, right?

Now imagine instead of 2 actors in this tale you have 20, all trying to get to the same resource. One will acquire the lock, and the other 19 will wait.  All that wasted time really starts to add up, and eventually you reach a tipping point, or “knee in the graph”, where you’re doing more waiting than work.

This is not a hypothetical problem. In one program I was working on, throughput actually went *down* as we increased the number of CPUs. It’s counter-intuitive but true. Windows engineers like Russinovich ran smack dab into this when they tried running Windows programs on 16-way and higher machines. So what’s the solution?

The best solution to avoid lock contention is not to use locks at all. A lot of research has been done on lock-less data structures and “concurrent” programming. Unfortunately it’s still not well understood by mainstream developers and requires a paradigm shift in the way you think about programming. It’s not practical to introduce into a large and complex existing system.

The second-best solution, which is the route chosen by Windows 7 engineers, is to turn your big locks into smaller, more granular locks. Suppose you have a big operating system structure containing 100 pieces of mostly unrelated information. In the “old days” when contention wasn’t an issue you could protect the whole structure with one lock. Just grab the lock when you need something out of there, and release it when you’re done. Safe, simple, resistant to deadlocks, but completely unscalable.

So one thing the Windows 7 people did is replace that one big lock with a bunch of smaller locks. They went through the code line by line and changed it to only acquire the small locks when needed and release them as soon as possible. It sounds simple but it’s actually quite tricky because of that deadlock problem I mentioned earlier. It’s doable, but hard.

Another thing they did is use more thread local data structures, and make them more effective by preserving thread affinity across system calls. Think of thread local storage as a little notepad you carry with you to a big meeting. There’s also a whiteboard at this meeting. You could take notes on either, but there would be chaos if everybody tried to write on the whiteboard at the same time. So you use the notepad and reconcile with the whiteboard later. Everybody has their own notepad, so everyone can write to it at once.

Conclusion

The problems and solutions that Mark refers to in his video are not new. They’re not unique to Windows 7, or even to Microsoft. What’s new is that programming for multiple CPUs, whether they’re on the same chip, in the same box, data center, or spread around the globe, is become more and more mainstream every day. I have an 8-core computer on my desk today. If you extrapolate the trends, it won’t be long before 100-core, 1000-core, or even million-core computers are the norm, and if you want to be a gainfully employed developer in this new world it’s time to sit up and take notice.

Related articles:

Kick off your day with ZDNet's daily e-mail newsletter. It's the freshest tech news and opinion, served hot. Get it.

Ed Burnette is a software industry veteran with more than 25 years of experience as a programmer, author, and speaker. He has written numerous technical articles and books, most recently "Hello, Android: Introducing Google's Mobile Development Platform" from the Pragmatic Programmers.

Disclosure

Ed Burnette

Ed Burnette is a Manager of Mobile Development at SAS. However the postings on this site are his own and do not represent the positions, strategies, or opinions of his employer.

Biography

Ed Burnette

Ed Burnette has been hooked on computers ever since he laid eyes on a TRS-80 in the local Radio Shack. Since graduating from NC State University he has programmed everything from serial device drivers and debuggers to web servers. After a delightful break working on commercial video games, Ed reluctantly returned to business software. He currently develops enterprise software for Android phones and tablets.

In his copious spare time, Ed writes and speaks about all kinds of technology and software. His most recent books include the Eclipse IDE Pocket Guide from O'Reilly and Hello, Android: Introducing Google's Mobile Development Platform from the Pragmatic Programmers.

12
Comments

Join the conversation!

Just In

Interesting problem. IBM solved this with the mainframe
arrowrod 10th Nov 2008
This is an old mainframe problem. IBM developed locks, latches, "parallel sysplex", "Compare and swap", Compare double and swap". And lots of code.

Yeah, we noticed a slow down as additional processors were added. The reason for parallel sysplex.

Interesting that all of this is solved in whatever the latest IBM mainframe operating system is called. OS/390? z/OS I think is the latest.
basically a pie in the sky.

Still chasing what Linux can already do...
0 Votes
+ -
In other words
croberts 7th Nov 2008
you've taken an article about software design concepts that apply to computer science generally, and managed to impose your OS-specific agenda.

Well done.
0 Votes
+ -
Grow up
happyharry_z 7th Nov 2008
This is not an OS specific issue....
Those Beowulf clusters are using message passing to run a bunch of small programs, effectively doing what MS is suggesting to do in Win7
0 Votes
+ -
One of the biggest problems for BeOS as a development
environment was lock and race conditions.

I can characterize the problems this way:

Developer X has lots of experience with writing code
on single threaded/single CPU OSes.

He comes to BeOS. He follows the APIs as well as he
understands them, except for that wacky stuff that
nobody'd ever really use, and tests them on his single
CPU system.

The app compiles, it works, there is joy.

He puts his app out on BeBits. App is downloaded and
run on an 8 CPU system. And it crashes - but not in
any kind of predictable or easily diagnosed (or in
many cases, repeatable...) ways, because of lock and
race conditions.

He screams, he froths, he curses this infernal OS and
swears he will never write for it again. Another
developer lost.

Someone who writes and tests, and follows the counter-
intuitive process on a multi-CPU box has a longer
debugging loop, and spends more developer hours - but
it works just fine.

BeOS tried method 1 (or some permutation of it as
existed in '93 or '94), and put a lot of the handles
for it into their APIs and OS calls. It still drove
70% of the coders on the system crazy.

When it DID work, it was a think of beauty.
0 Votes
+ -
some other words
shis-ka-bob 7th Nov 2008
If I was going to talk about scaling up to more than 16 processors, I would be using OpenSolaris and FreeBSD 7 as my open source examples. The scalability also requires application support as well as OS support.

Go look at the PostgreSQL 8.3 & FreeBSD 7.0 co-evolution to see this. I know there are other examples, but open source examples are easier to investigate, since Google will find developer postings that would be hidden in a closed application & OS.
0 Votes
+ -
Mark Russinovich said that the difference in performance is negligible on a Dual/Quad core. You need at least a 32+ cores to see increased performances in Win7. We'll see desktop processors with 32+ cores when Win8 will be out... so Win7 improvements are now useless. I stick with Vista
0 Votes
+ -
Your mistaken
RomW 7th Nov 2008
The kernel for Windows 7 is also the kernel used in Windows Server 2008 R2. During WinHEC 2008 they demo'ed a Win2k8R2 256 core machine acting as a SQL server under full load.

The 256 core machine was built by HP. They also demo'ed a machine by IBM.
the kernel in Vista SP1 is also the kernel used in Windows Server 2008.
0 Votes
+ -
Contributr
Even with dual and quad core...
Ed Burnette 8th Nov 2008
Depending on the workload, I've found Solaris (and to a lesser extent, Linux) to be noticeably better than Windows right now at multithreading efficiency even on dual and quad core machines. But still, lots of customers want to run Windows, so any improvements there are welcome.
0 Votes
+ -
Mark said NO!
qmlscycrajg Updated - 10th Nov 2008
If you see the channel9 interview, the journalist asked a precise question and Mark answered with a precise answer: "the difference in performance is negligible on a Dual/Quad core".
0 Votes
+ -
Contributr
That's the kind of question that can be answered truthfully both Yes and No. It depends entirely on what you're testing and what you consider to be negligible. Besides, I was referring to my own experiences with my own code. YMMV, a lot.

I think it's fair to say that as a rule of thumb, most people won't notice a problem on most workloads until they try it on 8 or more cores. But it's also fair to say there are exceptions.
This is an old mainframe problem. IBM developed locks, latches, "parallel sysplex", "Compare and swap", Compare double and swap". And lots of code.

Yeah, we noticed a slow down as additional processors were added. The reason for parallel sysplex.

Interesting that all of this is solved in whatever the latest IBM mainframe operating system is called. OS/390? z/OS I think is the latest.

Join the conversation!

Formatting +
BB Codes - Note: HTML is not supported in forums
  • [b] Bold [/b]
  • [i] Italic [/i]
  • [u] Underline [/u]
  • [s] Strikethrough [/s]
  • [q] "Quote" [/q]
  • [ol][*] 1. Ordered List [/ol]
  • [ul][*] · Unordered List [/ul]
  • [pre] Preformat [/pre]
  • [quote] "Blockquote" [/quote]
ie8 fix
Click Here
ie8 fix

The best of ZDNet, delivered

ZDNet Newsletters

Get the best of ZDNet delivered straight to your inbox

Facebook Activity

White Papers, Webcasts, & Resources
ie8 fix
ie8 fix