Foundations of parallel programming I

Intel's James Reinders looks into the algorithms that form the heart of Threading Building Blocks, a C++ template library for parallel programming.
Written by ZDNet Staff, Contributor
Hi, I’m James Reinders, and today we’re going to talk about Threading Building Blocks. In fact, this is part one of a look at Threading Building Blocks, which is an interesting template library for C++ for implementing parallelism.

So, a C++ template library for parallelism has to have quite a few features in it to help you program in parallel. The feature I will look at is the algorithms which really form the heart of threading building blocks.

Let’s take a look at a simple “for” loop. This is an extremely simple “for” loop, but it gives you an idea of the kind of loop we could consider dispatching to multiple processors. Written this way in C or C++, it’s a pretty simple loop. Taking it and wrapping “pthreads” or something similar around it to have it execute in parallel would add a lot of lines of code to create threads, to manage threads, to partition the problem, and so on. In reality, what we really want to write is something more like what I have here at the bottom: just “parallel_for”, iterate from 0 to n, do that function, which I’ve called “mydbl” here, and figure out how to partition it automatically.
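The transcript doesn’t show the loop itself, but a minimal sketch of the kind of serial loop being described might look like this; the array names and the doubling operation behind “mydbl” are assumptions, not the speaker’s exact code.

```cpp
#include <cstddef>

// A simple serial loop: double each input element into the output array,
// one iteration at a time. Every iteration is independent, which is what
// makes this loop a candidate for running across multiple processors.
void mydbl(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```

Because no iteration depends on any other, the runtime is free to split the range 0 to n into chunks and run them in any order.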

This is the heart of how we write code using Threading Building Blocks to execute that loop in parallel. Now, we need to put the loop itself into a class (I’ve declared it as a “struct” here) and give that class an operator containing the “for” loop. When we do that, we have to add some parameters: in this case we want an input and an output array, so we need to declare them, and we need to have the range passed in. Other than that, the loop stays basically the same. Since the range is passed in, we use “range.begin()” and “range.end()” to find the beginning and end of the iterations in the loop.

By passing the range in, we allow the Threading Building Blocks runtime system to break up the loop to fit the number of processors you’re running on. You’ll notice nothing in this code actually tries to figure out how many processors there are; that all happens automatically at runtime for us.

So, we have to declare our function as part of a class, but the function body copies over and only a few parameters need to be added around it to make it work. Of course, we’ll have some “includes” at the top to pull in the definition files, but then we basically just call parallel_for. In this case, I’ve created an instance of this class, given it the input and output parameters I want to use, and then used this template. The template expands into very efficient code that automatically takes advantage of the parallelism on the machine.

Threading Building Blocks offers several other algorithms in addition to parallel_for. There’s a “parallel_reduce”, a “parallel_scan”, a “parallel_while”, a “parallel_pipeline” and also a “parallel_sort”. Those give you pretty much all the templates you’ll need to express a great deal of parallelism in a program. In my experience, parallel_for is the most important and parallel_reduce is pretty important as well. The other ones come into play occasionally, but you can get an awful lot done with parallel_for. In fact, the parallel_sort supplied by Threading Building Blocks is actually just written using parallel_for.

So, that’s a look at Threading Building Blocks algorithms. You can learn more by going to www.threadingbuildingblocks.org. You can download a copy from there, look at documentation and so on, and in part two we’ll take a look at the rest of the elements in Threading Building Blocks.
