768 cores ought to be enough for anybody

While Intel struggles to get their 80-core processor to work, a company you've probably never heard of has been quietly shipping systems with hundreds of cores. Last year at JavaOne I attended a presentation by Cliff Click of Azul Systems on scaling up an application that used to take weeks to run so that it could finish in minutes. After the Intel announcement, I tracked down Cliff to get his thoughts on multicore hardware and software.
Written by Ed Burnette, Contributor

Bill Gates is famously quoted as saying "640K ought to be enough for anybody." He denies it, but at the time (c. 1981) 640K must have seemed like a vast amount of memory. We're in a similar situation today as dual- and quad-core processors are becoming mainstream. How will we possibly use them all? Last week, Intel demonstrated an 80-core prototype, though it has a few kinks to work out (e.g., there's no way to connect it to memory). Meanwhile, a company you've probably never heard of has been quietly shipping systems with 24, 48, up to 384 cores (with 768 on the way). The company: Azul Systems.

Last year I had the pleasure of attending a presentation at JavaOne 2006 given by Cliff Click, Azul's Chief JVM Architect, on scaling up an application on a 384-way machine using a few simple concurrent programming techniques. A program that took weeks to run on a fast Pentium processor was reduced to minutes on the multicore box. After the Intel announcement, I tracked down Cliff to get his thoughts on multicore.

[Ed] Why do we need multicore systems? Couldn't we just make our existing processors faster?

[Cliff] The driving trend toward multicore processors is parallelism in software (the desire to do multiple things at once). This is most prominent in transaction-centric application designs, where the volume of executions can be very, very high. Parallelism has been growing in code for many years, and the hardware response to this trend was, until recently, faster single-core chips. That approach queues up requests and processes them one at a time, even though they want to be acted on all at the same time. It worked, to a degree, because the chip was so fast that it could clear the queue of waiting requests quickly. If a single processor still wasn't fast enough, you simply added another processor, then another (to the degree you could), to get the kind of performance you wanted. This is SMP (symmetric multiprocessing) parallelism.

Continuing with this same hardware approach is reaching a breaking point as the volume of transactions and the amount of parallelized code continue to grow. Faster and faster single-core chips result in hotter, more power-hungry processors, and if you simply add more processors to a single server, the cost and complexity of the server grow (and there are limits to how many chips you can put in a single server, especially with the Intel architecture).

So the better approach is to design the chips themselves for parallel execution - pack many, smaller, more efficient processors onto a single chip. We have seen this technique in telecom products (multicore chips from Broadcom), graphics accelerators (multiple GPUs (graphics processing units) per chip) and in more general computing (Azul's Vega, IBM's Cell and others).
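To make the contrast between "one request at a time" and "all at the same time" concrete, here is a minimal Java sketch. It is an illustration only, not code from Azul or from Cliff's talk: the class name `ParallelRequests` and the `handle` method are hypothetical stand-ins for independent transactions, and it assumes Java 8 or later. It sizes a thread pool to the number of available cores and lets the requests run side by side instead of waiting in a single queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRequests {
    // Hypothetical stand-in for one independent transaction.
    public static long handle(int requestId) {
        long sum = 0;
        for (int i = 1; i <= 1000; i++) {
            sum += (long) requestId * i;
        }
        return sum;
    }

    // Runs every request on a pool with one worker thread per available core.
    public static List<Long> handleAll(int requestCount) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int id = 0; id < requestCount; id++) {
                final int requestId = id;
                tasks.add(() -> handle(requestId));
            }
            // invokeAll blocks until all tasks finish and returns the
            // futures in the same order as the submitted tasks.
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("request processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Long> results = handleAll(8);
        System.out.println(results.size() + " requests handled");
    }
}
```

Because the requests share no state, the same code runs unchanged whether the machine reports 2 cores or several hundred; only the pool size differs.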

[Ed] So multicore isn't just a matter of replicating single-core processors many times over?

[Cliff] Each multicore processor design is a bit different based on its target use model, and packing more processors onto the same chip requires some sacrifices. For example, you can't take a full Xeon with all its cache, floating point units and other subcomponents, shrink it down to a quarter of its size and pack four on a chip. You have to redesign the chip.

In the case of graphics chips, for instance, they don't need all the components of a Xeon, just the pieces needed to render graphics. You get rid of the rest, and you now have room for a second GPU, a third GPU, and so on. And you can do this and deliver greater performance in your specific area of focus because the chip is purpose-built for that use.

This is the case with our Vega processor. It is designed specifically to execute object-oriented code, such as Java. It has specific instruction sets and component architecture needed to execute this code more efficiently. Because of this specialization we are able to make each core very small and pack up to 48 cores on a single chip in our Vega 2 processor.

Some of the unique instructions we can deliver enable us to boost Java performance, such as our instructions for Java garbage collection, which allow a single Java application instance to leverage nearly unlimited amounts of memory without a performance penalty - something you can't do with traditional servers.

[Ed] How does multicore hardware change the way that programmers need to think about their art?

[Cliff] In the case of Azul, it doesn't require any change to software design. In fact it opens up new freedoms for Java developers. Now they can take greater advantage of parallelism, use more memory and do more. In the case of other multicore designs, the answer depends.

Some multicore processors, such as Cell, are non-homogeneous designs, meaning that not all processor cores are the same. These require very complex changes to software design because, for example, you have to schedule the use of certain cores that are in shorter supply than others. This is something game developers have been wrestling with in building content for the Sony PlayStation 3.

[Ed] For years programmers have been taught to use locks and synchronized blocks to support multi-threading. Does having large numbers of cores invalidate that advice?

[Cliff] Multicore, by itself, doesn't change the picture when it comes to locks and synchronized blocks, as these are controls put in place to prevent overwrites of objects stored in shared memory. However, with multicore processing you increase the likelihood that parallel threads will need the same shared objects. This is why we implemented Optimistic Thread Concurrency in our systems design.

This is a technology, first introduced in the database world, that allows us to monitor the use of shared objects in memory and allows parallel threads to bypass locks to read the same shared objects at the same time. If we detect a write attempt to that shared object we can back out the other threads requesting that shared object until the write completes; then we let them all back in again. This is a unique characteristic of our system design, rather than something inherent about multicore processor design.
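The optimistic idea Cliff describes (let readers proceed without taking the lock, and fall back only if a write is detected) has a rough software analogue in the standard `java.util.concurrent.locks.StampedLock` class, available since Java 8. The sketch below is only that analogue, under the assumption that a retry under a read lock is an acceptable stand-in for "backing out". It is not Azul's Optimistic Thread Concurrency, which the interview describes as part of their system design, and the class name `OptimisticPoint` is hypothetical.

```java
import java.util.concurrent.locks.StampedLock;

public class OptimisticPoint {
    private final StampedLock lock = new StampedLock();
    private long x;
    private long y;

    // Writers take the exclusive lock, as with a conventional lock.
    public void move(long dx, long dy) {
        long stamp = lock.writeLock();
        try {
            x += dx;
            y += dy;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    // Readers first try without blocking anyone.
    public long sum() {
        long stamp = lock.tryOptimisticRead();
        long currentX = x;
        long currentY = y;
        // validate() returns false if a write happened since the stamp
        // was issued; in that case discard the values and re-read under
        // a real read lock.
        if (!lock.validate(stamp)) {
            stamp = lock.readLock();
            try {
                currentX = x;
                currentY = y;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return currentX + currentY;
    }

    public static void main(String[] args) {
        OptimisticPoint point = new OptimisticPoint();
        point.move(3, 4);
        System.out.println(point.sum());
    }
}
```

When writes are rare, most calls to `sum()` never acquire a lock at all, which is the property that matters as the number of cores reading the same shared object grows.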

[Ed] Does massively multicore hardware demand new programming languages and paradigms?

[Cliff] In some cases, yes. In the case of Azul, no. Standard Java code runs unmodified on our systems. For programmers who want to write to the specific instruction sets of these new multicore designs, such as IBM Cell and others, yes, they need the specific instructions to do so. Java applications are masked from the specific instructions of the Vega processor by the Azul Virtual Machine. Just write standard Java code, load it into our VM and you are off and running.

[Ed's note: In fact, Azul changed the instruction set between Vega 1 and Vega 2 to help get a 300% boost in speed over the one they showed at JavaOne. But since all application code is in Java, there was no need to recompile it to run on the new machine.] 


With more than twenty-five years' experience developing compilers, Cliff serves as Azul Systems' Chief JVM Architect. Cliff joined Azul in 2002 from Sun Microsystems, where he was the architect and lead developer of the HotSpot Server Compiler, a technology that has delivered dramatic improvements in Java performance since its inception. Previously he was with Motorola, where he helped deliver industry-leading SpecInt2000 scores on PowerPC chips, and before that he researched compiler technology at HP Labs. Cliff has been writing optimizing compilers and JITs for over 15 years. He is invited to speak regularly at industry and academic conferences including JavaOne, JVM'04 and VEE'05 and has published many papers about HotSpot technology. Cliff holds a PhD in Computer Science from Rice University.
