Intel believes future advances in supercomputing lie in looking beyond the power of the CPU.
While servers typically rely on CPUs to process data, heterogeneous supercomputers add many more types of silicon to the mix. Rather than just rely on CPUs, these machines also shunt data through the likes of GPU clusters, Field Programmable Gate Arrays and co-processors.
The architecture of these co-processors can also differ significantly from the standard server CPU. For example, while Intel's Xeon E5-2600 v2 CPU packs four to 12 cores, its x86 Xeon Phi co-processor can include anything up to 61 cores on a single chip.
Mixing processor architectures allows computers to deliver more bang per buck. When it comes to getting the best performance per watt, certain tasks may be better performed in parallel across many smaller, more power-efficient cores in a co-processor or GPU, while others may favour beefier, energy-hungry CPUs with deeper instruction pipelines.
Heterogeneous supercomputers can be built around processor cores with a mix of architectures, whose make-up is suited to the computing tasks the machine will carry out.
China's Milky Way 2, the fastest supercomputer in the world according to the TOP500 list, is based on a heterogeneous architecture. Milky Way 2 has 16,000 nodes, each with two Intel Xeon Ivy Bridge processors and three Xeon Phi co-processors, for a combined total of 3,120,000 computing cores.
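The 3,120,000 figure can be checked with simple arithmetic. The sketch below assumes 12 cores per Xeon and 57 cores per Xeon Phi. Neither per-chip count is given in this article, so both are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Rough check of the Milky Way 2 core count quoted above.
NODES = 16_000
XEONS_PER_NODE = 2
PHIS_PER_NODE = 3
CORES_PER_XEON = 12  # assumption: not stated in the article
CORES_PER_PHI = 57   # assumption: not stated in the article


def total_cores(nodes, xeons, phis, xeon_cores, phi_cores):
    """Return the number of cores across all nodes of a heterogeneous machine."""
    cores_per_node = xeons * xeon_cores + phis * phi_cores
    return nodes * cores_per_node


milky_way_2_cores = total_cores(
    NODES, XEONS_PER_NODE, PHIS_PER_NODE, CORES_PER_XEON, CORES_PER_PHI
)
print(f"{milky_way_2_cores:,} cores")
```

Under those assumptions, the co-processors account for the large majority of the cores in each node, which is the point of the heterogeneous design.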
"We believe the right answer is to use the best of what heterogeneity provides, which is the performance per energy spent and the customisation benefit," said Dr Rajeeb Hazra, Intel's VP of datacenter and connected systems group and general manager of the technical computing group.
One drawback of heterogeneous computing is the added complexity of writing programs that can run on different instruction set architectures. With its Xeon Phi co-processor, Intel claims to be overcoming this by preserving the established programming model for the Intel x86 architecture.
"We can do that without incurring the cost of heterogeneity and retaining the benefits of homogeneity at the programming model level," Hazra said.
An IDC survey found the proportion of high performance computer users planning to use heterogeneous accelerators or co-processors in future systems has increased from below 30 percent a year and a half ago to more than 70 percent today, according to Hazra.
Yesterday, Intel announced its new line-up of Xeon Phi co-processor boards. All of the new Xeon Phi PCI-Express 3 boards are based on the same Knights Corner architecture released last November and the same 22nm Tri-Gate process.
The existing Xeon Phi 3100 family is now available with active and passive cooling options, allowing the boards to make use of a server's or workstation's in-case cooling. This family is designed to offer mid-range performance, capable of more than one teraflops of double precision floating point calculations and with 240GBps of memory bandwidth. On board is 6GB of GDDR5 memory.
The new 7100 series ups the clock speed, increasing performance to more than 1.2 teraflops of double precision floating point calculations and 352GBps of memory bandwidth. On board is 16GB of GDDR5 memory.
The 5100 series is now available in a new "high density" form factor meant to be fitted into high density systems. Again the 5100 delivers more than one teraflops of double precision floating point performance and 300GBps of memory bandwidth. On board is 8GB of GDDR5 memory.
Suggested prices for the boards are $4,129 for the 7120P and 7120X, $2,759 for the 5120D and $1,695 for the 3120P and 3120A.
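To compare the three families, it can help to normalise the quoted figures. The sketch below works out memory bandwidth per floating point operation and suggested price per teraflops, using only the numbers above. Performance figures are quoted as "more than", so the results are approximate; the dictionary layout and function names are illustrative.

```python
# Figures as quoted in the article. Teraflops values are lower bounds
# ("more than"), so the derived ratios are approximate.
BOARDS = {
    "3100": {"teraflops": 1.0, "bandwidth_gbps": 240, "memory_gb": 6, "price_usd": 1695},
    "5100": {"teraflops": 1.0, "bandwidth_gbps": 300, "memory_gb": 8, "price_usd": 2759},
    "7100": {"teraflops": 1.2, "bandwidth_gbps": 352, "memory_gb": 16, "price_usd": 4129},
}


def bytes_per_flop(board):
    """Memory bandwidth (GB/s) divided by peak performance (gigaflops)."""
    return board["bandwidth_gbps"] / (board["teraflops"] * 1000)


def usd_per_teraflop(board):
    """Suggested price divided by peak double precision teraflops."""
    return board["price_usd"] / board["teraflops"]


for name, board in BOARDS.items():
    print(
        f"{name}: {bytes_per_flop(board):.2f} bytes/flop, "
        f"${usd_per_teraflop(board):,.0f} per teraflops"
    )
```

On these figures, every board can move well under one byte from memory for each floating point operation it can perform.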
The next generation of Xeon Phi processors, codenamed Knights Landing, will be based on a 14nm process and available as a standalone CPU that fits into a processor socket, as well as a co-processor sitting on a PCI Express board.
The biggest challenge with the next generation of many-core processors like the Xeon Phi will be feeding data to the many cores on the processor at a fast enough rate, Hazra said. Intel is planning to tackle the problem by integrating memory on package, stacking it on top of the processor die, in the next generation of Xeon Phi processors.
Future high performance clusters will also need to integrate the network fabric controller onto processors in order to shuttle enough data to the network fast enough, he said. Integrating the network fabric controller in this way could deliver more than 100GBps connectivity, compared to 10-20GBps available today, he said.
"If you look at what the machines can do today they are all at the tips of the iceberg in terms of what they need to do tomorrow. The rate and pace of innovation has to continue and accelerate as the world goes from petascale to exascale computing."
An exascale machine would be capable of at least one exaflops, the equivalent of one billion billion operations per second.
Hazra said integrating the "right features into silicon … reduces power, reduces costs by eliminating chips, increases performance by having components closer together, increases scalability by being able to abstract multiple integrated features under one set of libraries and provides unprecedented level of innovation in providing very dense computing solutions".
Increased integration of system components will be needed for the semiconductor industry to deliver on its ambition to provide exascale computing at 20MW by the end of the decade, he said.
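The scale of that ambition is easier to see as an efficiency target. The sketch below divides one exaflops by a 20MW power budget to get the sustained performance needed per watt. It uses only the figures quoted above; the variable names are illustrative.

```python
# Efficiency implied by an exascale machine running within 20MW.
EXAFLOPS = 10**18                 # one billion billion operations per second
PETAFLOPS = 10**15
POWER_BUDGET_WATTS = 20 * 10**6   # 20MW

flops_per_watt = EXAFLOPS / POWER_BUDGET_WATTS
gigaflops_per_watt = flops_per_watt / 1e9
scale_up = EXAFLOPS // PETAFLOPS  # how many petaflops make one exaflops

print(f"Required efficiency: {gigaflops_per_watt:.0f} gigaflops per watt")
print(f"Exascale is {scale_up:,} times petascale")
```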
"We will be able to build petascale class computers in about half a rack today."