Applications used to get free performance boosts whenever CPU and memory bus clock speeds rose at an exponential rate. For two decades, applications magically doubled in speed every two years without any code redesign, but the era of the "free lunch" performance boost is over. The end began around 2003, when CPU makers hit a thermal wall near 3 GHz. Modest per-core gains - nothing like the old days - have been made since then through execution optimizations in newer CPU micro-architectures, even though clock speeds are lower. The lion's share of progress in the microprocessor industry over the last two years has been the shift to multi-core processors, first dual-core and now quad-core. From this point forward we're only going to see a multiplication of CPU cores at relatively fixed clock speeds, mostly in the 2 to 3 GHz range and perhaps eventually close to 4 GHz on some premium products.
The consequence of this seismic shift in microprocessor development is that traditional single-threaded applications will no longer see significant gains in performance, let alone exponential ones. A typical single-threaded application will probably not be much faster 8 years from now even if CPUs have 16 times as many cores, because on a 32-core CPU a single-threaded application can only use 1 of the 32 cores while the other 31 sit idle. Some people might wonder whether it would be better to keep scaling the clock speed of single-core processors and build something like a 20 GHz processor. That would be the ideal solution, but it simply can't be done short of using insane amounts of power and exotic liquid-nitrogen cooling systems. The entire microprocessor industry was forced to shift to putting more cores into a CPU rather than ramping up the clock speed.
Scaling applications to take advantage of all that extra processing power in the extra cores requires a fundamental shift in the way programs are written. This new programming technique is called "multithreaded programming" or "parallel programming". Here are a few ways to tackle multithreaded programming:
- Use multithread optimized libraries
- Use multithreaded development APIs like OpenMP and pthreads
- Use automated parallelization and vectorization compilers
- Hand threading (manual threading)
Multithread optimized libraries:
One of the easiest ways to do multithreaded programming is to take advantage of multithread optimized libraries. The latest Intel compiler 10.0 includes Intel's MKL (Math Kernel Library) and multimedia processing functions that are optimized to run on multi-core processors with concurrent threads. The hard work has already been done, and the developer merely takes advantage of what was already written. Since math and science functions and multimedia processing have some of the heaviest computation requirements, these libraries and functions are a huge boost to developers.
Multithreaded development APIs:
OpenMP is a multithreaded development API designed to make multi-core optimization easier than manual "hand threading". OpenMP automates multithreaded parallel processing on multi-core processors, and sometimes it even scales better than hand threading. Intel's Director of Marketing James Reinders explained to me that one might see a 400 to 500 percent performance gain over a single-threaded application on an 8-core processor. Considering that 700% scaling is the theoretical maximum gain on an 8-core computer, a 500% gain from automated multithreading is extremely tempting since it saves the programmer from having to manually chop up the workload among multiple CPU cores.
Automated parallelization and vectorization compilers:
The new parallelization optimizations in the latest Intel compiler 10.0 allow applications that weren't coded with any multithreading in mind to get small boosts on multi-core computers. These optimizations typically take loops in a program and try to divide the iterations across multiple CPU cores - not just "for" or "do while" loops but more complex loop structures as well. The typical gains are modest, in the single digits or low tens of percent. While that isn't a lot, it is essentially a free boost from a simple compiler switch on all existing code: you can build with and without it and see whether it makes a difference in your application without modifying the code at all. These parallelization and vectorization techniques have gotten a lot of press lately, but they don't come close to replacing OpenMP or hand threading.
Hand threading (manual threading):
Hand threading is a manual process where the developer decides exactly how to break up a workload across multiple CPU cores, and it can scale perfectly when done right. With enough time and skill at one's disposal, hand-threaded performance should always beat OpenMP performance, but the skills needed for multithreaded programming are a very rare commodity. The demand for skilled multithread programmers is huge, and it isn't something your run-of-the-mill programmer can do. For more on parallel programming, here's a great article by Herb Sutter and James Larus.
What scales and what doesn't:
The most obvious examples of near-perfect multi-core scaling are 3D rendering and multimedia encoding applications, which require a lot of processing time and have the most to gain. Server applications also tend to scale fairly well because, by their very nature, they have many concurrent and independent tasks to handle that can be divided up across multiple CPU cores.
Games, by contrast, are difficult to scale well on multi-core processors. Office productivity applications are another category that generally doesn't scale well, partly because there is very little developer experience with multi-CPU computers on the desktop, and partly because a single user generates the workload, which is much harder to chop up than a server environment where different user sessions can simply be assigned to different CPU cores. Office productivity performance is also less of an issue, since mundane office tasks won't need much more performance until computers start offering more human-friendly interfaces. Voice dictation, for example, is one area where one CPU core could do the actual dictation while the other core handles the rest of the workload on the PC.