Intel takes its next step towards exascale computing

Summary: Intel has revealed more details about the capabilities of its forthcoming Xeon Phi many-core chip, codenamed Knight's Landing, at the International Supercomputing Conference in Leipzig.


Intel is betting on the market for supercomputing growing substantially in the coming years as big data analytics becomes the cornerstone of modern business.

That expectation has prompted Intel to predict substantial demand for supercomputing hardware, forecasting that its HPC revenues will be growing by more than 20 percent a year by 2017.

To serve that burgeoning market the chipmaker will next year release Knight's Landing, its new Xeon Phi many-core processor.

Knight's Landing will deliver up to three trillion double precision floating point operations per second (3 teraflops) in a single processor socket. The processor is capable of three times the operations of the chip it will succeed, Intel's current Knight's Corner Xeon Phi co-processor.

But perhaps the most important change from Knight's Corner is that Knight's Landing will be available as a standalone CPU, as opposed to solely being a co-processor card sitting in a PCI-Express slot. The form factor change means Knight's Landing will fit into a wide range of workstations and supercomputer clusters, opening up the chip for far broader use than its predecessor.

The move may help Intel win over a larger share of the HPC market. Of the top 500 supercomputers in the world only 17 use Intel's Xeon Phi co-processor, compared to the 44 that use Nvidia's Tesla GPU-based co-processor boards. That said, the world's fastest supercomputer, the Tianhe-2, uses Knight's Corner.

The 3 teraflops performance of Knight's Landing is another step towards the computing industry's goal of building an exascale computing system by the end of the decade - a machine capable of 1,000 times the performance of the world's fastest supercomputer in 2008. However, Charlie Wuischpard, general manager of High Performance Computing at Intel, said that as systems edge closer to that exascale goal, pushing the performance envelope becomes increasingly complicated.

"The race to exascale at the end of the decade is one of the goals we've all got our eye on in the HPC market," he said.

"New challenges keep on arising just outside of compute. As we head towards exaflop, issues of power consumption, network bandwidth, I/O, memory, resilience and reliability all become large problems to solve.

"One of the ways this is going to be resolved from a physics perspective is by greater integration [of hardware components such as processors, memory, interconnects]. Greater integration helps reduce latency. We're making investments in the whole stack and not just from a processor perspective."

The increased complexity of system engineering as machines approach the one exaflop performance mark. Image: Intel

Resolving these issues is made more difficult by the need for new HPC architectures, for example systems utilising processors with tens of cores, to maintain compatibility with existing HPC applications.

"Many of the programs in use today were designed for single-core, single-thread performance. We know and have seen massively-parallel environments are going to be the future. 

"While we're developing these next-gen technologies we have to be cognisant of the challenges that exist in the application programming area and ensure we're able to bring those applications forward."

The specs

Knight's Landing will use a more efficient chip architecture than its predecessors, moving to the Silvermont processor core, the low-power core with an out-of-order architecture used in Intel's Atom system on a chip. It will be manufactured using a 14nm process.

Intel has modified the Silvermont core to add what it calls HPC enhancements, including support for the AVX512 instruction set and for four threads per core.

Knight's Landing processors have previously been reported as having up to 72 cores, but Wuischpard only went as far as to say they would have at least 61, the same number as in the Knight's Corner co-processors, connected by a "low latency mesh".

It is by splitting compute tasks between these cores and running them in parallel that the Knight's Landing processor is able to deliver three teraflops of performance per socket. Scale that up to four-socket 1U servers filling a 42U rack and there's the possibility of delivering half a petaflop of performance per rack (a petaflop is one quadrillion operations per second). Rumours have suggested the chip would deliver between 14 and 16 gigaflops per watt, which would compare favourably with the energy efficiency of current supercomputers.
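The half-petaflop figure can be checked with a few lines of arithmetic. The sketch below uses the per-socket and per-server numbers quoted above and assumes, for illustration only, that every unit of a 42U rack holds a four-socket server, with nothing set aside for switches or power. The result is a theoretical peak, not a sustained figure.

```python
# Figures quoted in the article; the rack packing is an assumption.
TFLOPS_PER_SOCKET = 3.0   # peak double precision per Knight's Landing socket
SOCKETS_PER_SERVER = 4    # four-socket 1U server
SERVERS_PER_RACK = 42     # assumes all 42 rack units hold a server


def rack_peak_teraflops(tflops_per_socket, sockets_per_server, servers_per_rack):
    """Return the theoretical peak teraflops of one fully populated rack."""
    return tflops_per_socket * sockets_per_server * servers_per_rack


rack_tflops = rack_peak_teraflops(
    TFLOPS_PER_SOCKET, SOCKETS_PER_SERVER, SERVERS_PER_RACK
)
rack_petaflops = rack_tflops / 1000.0  # 1 petaflop = 1,000 teraflops

print(f"{rack_tflops:.0f} teraflops, or about {rack_petaflops:.2f} petaflops per rack")
```

Under those assumptions the rack comes out at 504 teraflops, which is where the "half a petaflop" estimate comes from.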

The cores also have the advantage of being able to run code that works on Intel Xeon processors, although that code won't have been optimised to run in parallel on Knight's Landing's many-core architecture.

One of the biggest bottlenecks in HPC, according to Wuischpard, is getting data in and out of the processor cores. To alleviate that problem each Knight's Landing processor will have up to 16GB of on-package memory that can transfer data in and out of the cores at up to 500GB/s, which Intel estimates is about five times the bandwidth provided by DDR4 system memory. The on-package memory is based on the low-latency Hybrid Memory Cube stacked DRAM technology, which Intel developed with Micron.

The chip is also rumoured to support up to 384GB of DDR4-2400 system memory via a six channel integrated memory controller.
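As a rough cross-check of the "about five times" estimate, the sketch below works out the theoretical peak bandwidth of the rumoured six-channel DDR4-2400 configuration and compares it with the 500GB/s quoted for the on-package memory. It assumes standard 64-bit (8-byte) DDR4 channels and ignores real-world efficiency losses, and the function name is purely illustrative.

```python
ON_PACKAGE_GB_PER_S = 500.0  # peak on-package bandwidth quoted by Intel


def ddr_peak_gb_per_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for a multi-channel DDR setup.

    bytes_per_transfer=8 assumes a standard 64-bit channel.
    """
    bytes_per_s = mega_transfers_per_s * 1e6 * bytes_per_transfer * channels
    return bytes_per_s / 1e9


# Rumoured configuration: six channels of DDR4-2400 (2,400 mega-transfers/s).
ddr4_peak = ddr_peak_gb_per_s(2400, channels=6)
ratio = ON_PACKAGE_GB_PER_S / ddr4_peak

print(f"DDR4 peak: {ddr4_peak:.1f} GB/s, on-package advantage: {ratio:.1f}x")
```

The theoretical DDR4 peak works out at roughly 115GB/s, giving a ratio of a little over four, which is broadly in line with Intel's estimate once sustained DDR4 bandwidth is assumed to fall short of the theoretical peak.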

Knight's Landing will be available in systems in the second half of 2015. One of the first supercomputers to use the processor will be run by the US Department of Energy’s National Energy Research Scientific Computing (NERSC) Center.

The $70m system will have more than 9,300 Knight's Landing processors, and is expected to deliver 10x the sustained computing capability of NERSC's Hopper system, a Cray XE6 supercomputer. It will be used to address challenges such as developing new energy sources, improving energy efficiency, understanding climate change, developing new materials and analyzing massive data sets from experimental facilities around the world.


Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT-decision makers need to know about, and the latest happenings in the European tech scene.

  • Need a New Language

    I have been working on multi-thread asynchronous problems most of my career. For the last few years it has been just trying to use the 4 cores in an i7. I have found that even the idea of multiple threads on a single core blows most programmers away. Languages today are so targeted to a single thread that it has produced a mindset in most programmers that is hard to break. There are few tools that let you watch multiple asynchronous threads and view how they interact. Of those available I do not think any of them are very good. Even many parallel processing systems focus on a single thread with parts executing on different cores.

    In many ways hardware has outstripped computer science. We are waiting for a real revolution in programming methods that really addresses multiple asynchronous threads.
    • Threads are not massively parallel type jobs

      The idea for using these cores is to create many independent parallel tasks. When the tasks don't signal each other or use common resources then each can be operated autonomously. This makes it much simpler to implement.

      Think like: spawn a new process.

      Note also that you don't have to write completely threaded apps in order to take advantage of threading. The OS helps by taking requests for resources and managing them. You can easily see this in operation when you execute a program you wrote that has, say, three known threads: a main thread, a message thread and perhaps a GUI thread. Run the program and monitor the total number of threads assigned to your application (not in the debugger, but what Windows reports) and you will see dozens of threads. Core threads help to reduce the latency.
  • Not just the language, It is also the type of problem

    Surprisingly few business problems lend themselves to massively parallel processing.
    For it to be effective you need the CPU time to be much greater than the time required to split the problem into pieces & to get each piece+data to/from a CPU. If not, the overhead of running in parallel becomes greater than the savings.

    Today for less than $10K you can build a workstation with four dual-GPU Nvidia video cards. Programming in CUDA you can access ~23K execution units providing ~33 TFlops (more if you overclock it). Awesome, if you have a problem that cleanly breaks up into 23,000 independent compute-bound tasks.

    This is great for Finite Element Analysis, Laplace Transforms, Fourier Analysis & similar algorithms common in Engineering & Scientific problems. It is great for simulations & graphic rendering. Handy for some Financial problems, e.g. Risk Analysis, Modelling, some data mining & prediction.

    You'd think it would be good for big web sites serving thousands of independent queries. But generally it is not as they are IO bound tasks.

    Same for the business problems that most developers work on most of the day. Small scale parallel can sometimes be a big win. But reducing latency via caching, Async IO & efficient indexing is typically where today's business programmer will get most benefit. They can then rely on the threadpooling in their database &/or web servers to spread the load over their CPU Cores.
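
    The break-even condition described in this comment can be illustrated with a very simple cost model. The sketch below is not a benchmark: it assumes the work divides evenly across workers and that splitting and data movement add one fixed overhead, and every number in it is hypothetical.

```python
def parallel_speedup(compute_s, overhead_s, workers):
    """Speedup of a parallel run over a serial one under a simple model.

    Serial time is compute_s. Parallel time is the fixed overhead of
    splitting the problem and moving data, plus the compute time divided
    evenly across the workers. Both are simplifying assumptions.
    """
    parallel_s = overhead_s + compute_s / workers
    return compute_s / parallel_s


# Compute-bound job: 100s of CPU work, 1s of overhead, 10 workers.
compute_bound = parallel_speedup(compute_s=100.0, overhead_s=1.0, workers=10)

# Overhead-bound job: 1s of CPU work, 5s of overhead, 10 workers.
overhead_bound = parallel_speedup(compute_s=1.0, overhead_s=5.0, workers=10)

print(f"compute-bound speedup: {compute_bound:.2f}x")
print(f"overhead-bound speedup: {overhead_bound:.2f}x")
```

    In this model the first job runs roughly nine times faster, while the second takes about five times longer in parallel than it would on a single core, which is the case where the overhead outweighs the savings.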