Quentin Stout and Christiane Jablonowski from the University of Michigan gave a nice introduction to parallel computing on Sunday. They covered everything from architecture to APIs to the politics of sharing a common compute resource.
Hardware designs for parallel computing are as diverse as the problems they are asked to solve. One common categorization is the single/multiple instruction/data taxonomy: a SIMD architecture, for example, operates on multiple data points using a single instruction. The most common architecture right now (according to the Top500 computer list) is "SPMD", or single program, multiple data. In this design each compute core runs the same program on a subset of a larger data set.
Another dimension in parallel architecture is how much is shared across all the processors in the system. In a "shared-everything" system, all the processors, memory, and disk drives sit on some sort of bus that ties everything together into one homogeneous entity. If you have a dual- or quad-core system in a server box, it most likely uses this design: there is one block of memory, say 16GB, that all the processors share. Shared-everything is easy to program but doesn't scale well because of all the possible interconnects between components.
By contrast, in a "shared-nothing" system, there are discrete units of processor, memory, and disk. Each processor has its own private memory, and if it needs to talk to another processor it must do so through some kind of explicit messaging. This is more difficult to program but can scale out to thousands or hundreds of thousands of processors.

To program parallel computers, developers have many options, but in the past few years two in particular have become standard. The first is MPI (the Message Passing Interface). It is primarily targeted at distributed, non-shared-memory systems. Here's a very simple MPI program in C:
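A minimal sketch of such a program (the value each worker sends is just a stand-in for a real computation):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Controller: receive one value from each worker, in rank order. */
        float value;
        for (int i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("got %f from rank %d\n", value, i);
        }
    } else {
        /* Worker: compute something and send it back to rank 0. */
        float value = (float)rank;  /* stand-in for real work */
        MPI_Send(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

You'd run this with something like `mpirun -np 4 ./a.out`, which launches four copies of the same executable.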
All CPUs run this program, and each gets a numeric id known as its rank. The rank 0 process is the controller (sometimes called the manager or master). In the example above, any processor that isn't rank 0 simply sends a floating-point value back to the controller. The controller uses an MPI call to receive the value from the rank 1 processor, then the rank 2 processor, and so on up to the number of processors in the system.
What if the rank 1 processor takes a long time to compute its result? The controller will stall until it finishes. MPI provides more powerful "collective" operations to perform common tasks like this more efficiently. For example:
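A sketch of a sum reduction using the collective `MPI_Reduce` call (the local value is again a stand-in for a real computation):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes one value; MPI combines them with
       MPI_SUM and deposits the result on rank 0. The library is free
       to process contributions in any order as they arrive. */
    float local = (float)rank;  /* stand-in for real work */
    float total = 0.0f;
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across all ranks: %f\n", total);

    MPI_Finalize();
    return 0;
}
```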
The code above processes each value as it comes in (in any order) and adds the results together. There are a number of other collective operations supported by MPI.
The other major API is OpenMP. It's mainly targeted at shared-memory computer systems, and C, C++, and Fortran bindings are available. Here's an example in Fortran which is equivalent to the above MPI reduce call.
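A sketch of that reduction in Fortran (the loop body stands in for real work):

```fortran
program sum_reduce
  implicit none
  integer :: i
  real :: total

  total = 0.0
  ! The reduction clause gives each thread a private copy of total
  ! and combines the copies with + when the loop finishes.
  !$omp parallel do reduction(+:total)
  do i = 1, 100
     total = total + real(i)   ! stand-in for real work
  end do
  !$omp end parallel do

  print *, 'total =', total
end program sum_reduce
```

Note how little changes compared to the serial version: the `!$omp` directives are comments to a compiler that doesn't understand OpenMP, so the same source builds either way.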
Lately people have been combining MPI and OpenMP to handle the increasingly common architecture of clusters of multi-core machines.

Perhaps the hardest problem in parallel computing is load balancing. If you divide a problem up into p pieces and run each piece on a different processor at the same time, then your total time will depend on the piece that takes the longest. Various techniques such as oversubscription, geometric decomposition, and space-filling curves can be used to get "pretty close" to the optimum, but this is still an area of active research.
Another key to getting good performance from a parallel program is profiling. Luckily there are many software packages that help the developer visualize what is going on and identify bottlenecks. Here's a screenshot from one called Vampir:
Here we have 5 processes running. Ideally you'd want to see all green, meaning that 100% of the CPU is being used for application code. In this example you can see that most of the time is actually being taken up by MPI calls. You can drill down into this to see what is causing the problem, such as unnecessary synchronization.
This was a full-day tutorial, so I can't cover everything in a blog entry, but if you'd like to learn more about parallel computing, check out the references on Quentin's web page.