IDF 2011: Intel makes the case for more cores

On the final day of the Intel Developer Forum, CTO Justin Rattner made the case for more powerful PCs and servers with tens or even hundreds of processing cores.
Written by John Morris, Contributor

On the final day of the Intel Developer Forum this week, CTO Justin Rattner made the case for more powerful computers, or more specifically for many more cores. During his keynote, Intel demonstrated a range of interesting applications, both consumer and business, that harness the power of multi-core and many-core PCs and servers. Rattner also sought to debunk the widespread belief that you need to be "some kind of freak to actually program these things."


Five years ago at IDF, Rattner introduced the Core microarchitecture and the shift to using more cores running at lower speeds, rather than one or two very fast cores, to improve performance and power efficiency. At the time, he said, no one could have imagined that a few years later we'd be talking about processors with tens or even hundreds of cores. This includes not only CPU cores (what Intel refers to as IA cores) but also other specialized cores such as graphics processing units (GPUs) and accelerators, a concept known as heterogeneous computing.

Intel's main product in this emerging segment is the Knights family of processors based on the company's Many Integrated Core (MIC) architecture. Some customers are already testing Knights Ferry, a development chip, and reporting that they can port existing multi-core applications to the MIC architecture and realize good speed-ups, according to Rattner. Based on these results, Intel will "soon" launch Knights Corner, a processor with more than 50 cores manufactured using the company's most advanced 22nm process. As part of a separate Tera-scale Computing Research Program, Intel Labs recently announced a prototype Single-Chip Cloud Computer (SCC) with 48 cores designed for scale-out cloud applications. Finally, Intel has been busy creating better tools to address the challenge of programming these many-core processors, he said.

Rattner showed results of a series of application tests on systems ranging from one to 64 cores. The tests were not limited to traditional High-Performance Computing (HPC) applications, which often lend themselves to systems with many cores and threads, but also included business and consumer applications such as home video editing. The results looked very good (though the devil is always in the details with benchmarks), in some cases showing speed-ups that were close to linear. In other words, doubling the cores nearly doubles performance. "This has given us a lot of confidence that people are going to be able to put this architecture to work," Rattner said.
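To make "close to linear" concrete, here is a minimal sketch of how speed-up and scaling efficiency are usually computed from run times. The timings are made-up illustration values, not Intel's benchmark results, and the function names are hypothetical.

```python
def speedup(t_one_core: float, t_n_cores: float) -> float:
    """How many times faster the n-core run is than the single-core run."""
    return t_one_core / t_n_cores


def efficiency(t_one_core: float, t_n_cores: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup(t_one_core, t_n_cores) / cores


# Hypothetical wall-clock times in seconds, keyed by core count.
# These are invented for illustration; they are not from the keynote.
timings = {1: 640.0, 2: 330.0, 4: 170.0, 8: 88.0, 16: 46.0, 32: 24.0, 64: 12.5}

for cores, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, cores)
    print(f"{cores:>2} cores: speedup {s:5.1f}x, efficiency {e:.0%}")
```

An efficiency that stays near 100 percent as the core count doubles is what "close to linear" means; a value that falls off quickly would indicate that serial work or communication overhead is starting to dominate.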

Andrzej Nowak from the CERN openlab, a collaboration with companies such as Intel to develop computer technology for the Large Hadron Collider, talked about the group's many-core efforts. The massive collider generates 40 million particle collisions per second, producing 15 to 25 petabytes of data per year (a petabyte is equal to 1,000 terabytes). To analyze all of this data, the openlab uses software that consists of millions of lines of code and 250,000 Intel cores distributed across hundreds of data centers. Nowak said the fact that the same programming tools used for Xeon server processors also work on the MIC architecture makes it easier to port this software. Because the workload is "heavily-vectorized and highly-threaded," it scales almost linearly with the number of cores and threads. "We will take any amount of cores you can throw at us," Nowak said.

To prove that many-core can work on both servers and clients, Rattner highlighted a series of real-world applications. Noting that many Web applications are really a collection of databases accessed by many users concurrently, Rattner said traditional servers were not designed for these sorts of workloads. He demonstrated how a different type of server, with a 48-core processor and an in-memory database, could address this problem by handling about 800,000 transactions per second. Similarly, on the client side, Brendan Eich, the CTO of Mozilla and inventor of JavaScript, said that when he created the scripting language "in 10 days in May 1995" it was not designed for parallel applications. Intel Labs announced Parallel Extensions for JavaScript, code-named River Trail, which uses multi-core and many-core processors to speed up JavaScript applications. In the demo, a 3D N-body simulation in Firefox ran at 3 frames per second on a single core and at 45 frames per second using all of the cores. Intel said these extensions will enable a new class of browser-based apps in areas such as photo and video editing, physics simulation, and 3D gaming.

One of the more intriguing demos was an LTE wireless base station, developed as part of a project with China Mobile, which uses standard PC parts including a second-generation Core i7 processor. Rattner said Intel will be doing field trials with China Mobile and other partners next year, adding that it will try a similar approach with routers and switches. Communications and networking gear generally uses programmable logic devices or specialized ASICs, but Intel believes that it can match the performance with off-the-shelf multi-core CPUs. In the final demo, Intel showed how a PC can use facial recognition to decrypt and display only the correct images from a photo album on the fly. This demonstration used both the IA cores and on-die graphics in Sandy Bridge.

"I hope at this point there is no question in your mind that the time is now, if you haven't already started, to build multi-core or many-core applications and you don't need to be a ninja programmer to do it," Rattner said.

If this isn't ambitious enough, Intel has an even bigger goal in mind: an exascale computer by 2018. An exaflop is one quintillion (10^18) floating-point operations per second. To put that in perspective, Nvidia's Tesla C2070 GPU is capable of 515 gigaflops, or billions of operations per second. The world's fastest supercomputer, the K Computer at the RIKEN Advanced Institute for Computational Science in Kobe, Japan, is capable of 8 petaflops, or 8 quadrillion (10^15) floating-point operations per second.

The real challenge here, though, is power. Today's petascale supercomputers already use seven to 10 megawatts, so simply scaling them up isn't an option. An exascale computer built that way would draw about six gigawatts, which would take several nuclear power stations to supply. The practical limit for a datacenter is around 20 megawatts, which means we will need a 300x reduction in total system power to build an exascale computer. Intel's Shekhar Borkar is leading the company's effort to develop a prototype system by 2018 as part of the DARPA-funded Ubiquitous High Performance Computing project. Three other organizations, Nvidia, MIT and Sandia National Laboratories, are also developing prototype "ExtremeScale" supercomputers.
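The arithmetic behind these figures is easy to check. The short sketch below uses only the numbers quoted in this article (1 exaflop, the K Computer's 8 petaflops, the Tesla C2070's 515 gigaflops, six gigawatts and 20 megawatts); the variable names are just for illustration, and the flops-per-watt figure is derived from those numbers rather than something stated in the keynote.

```python
# SI prefixes as plain multipliers.
GIGA = 10**9
MEGA = 10**6
PETA = 10**15
EXA = 10**18

# Performance figures quoted in the article (floating-point operations per second).
exascale_flops = 1 * EXA
k_computer_flops = 8 * PETA
tesla_c2070_flops = 515 * GIGA

# How many of today's systems would it take to reach one exaflop?
k_computers_per_exaflop = exascale_flops / k_computer_flops   # 125
teslas_per_exaflop = exascale_flops / tesla_c2070_flops       # roughly 1.9 million

# Power figures quoted in the article (watts).
scaled_up_power_w = 6 * GIGA      # exascale built by scaling up today's designs
datacenter_limit_w = 20 * MEGA    # practical datacenter limit

# Required reduction in total system power: 6 GW / 20 MW = 300.
power_reduction = scaled_up_power_w / datacenter_limit_w

# Derived target: energy efficiency needed to fit one exaflop into 20 MW.
required_flops_per_watt = exascale_flops / datacenter_limit_w  # 50 gigaflops per watt

print(f"K Computers per exaflop: {k_computers_per_exaflop:.0f}")
print(f"Tesla C2070s per exaflop: {teslas_per_exaflop:,.0f}")
print(f"Power reduction needed: {power_reduction:.0f}x")
print(f"Required efficiency: {required_flops_per_watt / GIGA:.0f} gigaflops per watt")
```

The 300x figure falls straight out of dividing six gigawatts by 20 megawatts, which is why efficiency rather than raw performance is framed as the main obstacle.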

One way to reduce system power is to make the CPU more efficient. To illustrate this, Rattner demonstrated an experimental Pentium-class chip, code-named Claremont, which is capable of operating close to the threshold voltage of the transistors, the voltage required to switch a transistor on. CEO Paul Otellini had already given a quick preview of this chip running Windows earlier this week, but Rattner showed it running Linux and offered more details. Because Claremont operates within a couple hundred millivolts of the threshold voltage, it sips power and can be run entirely from a solar cell about the size of a postage stamp. Intel got a 5x reduction in power using the older Pentium core, but it could achieve an 8x reduction using a newer core, Borkar said. Intel also showed Claremont's "wide dynamic range," meaning its ability to boost the frequency up to ten times to handle more intensive tasks, by running a Quake demo.

Rattner also talked about the Hybrid Memory Cube (HMC), a concept developed by Micron that consists of a stack of DRAM chips in a compact cube with an efficient, high-performance controller and interface. Intel said the HMC is capable of nearly 1 Tbps of throughput, yet it uses seven times less power than today's DDR3 DRAM. Stacked memory is difficult to manufacture, and therefore still relatively expensive, but the HMC seems like a promising concept for networking equipment and servers.

"We're at a significant point in time where technology is no longer the limiting factor," Rattner concluded.
