Chip designers face a daunting task. The tool that they have relied on to make things smaller, faster and cheaper, known as Moore's Law, is increasingly ineffective. At the same time, new applications such as deep learning are demanding more powerful and efficient hardware.
It is now clear that scaling general-purpose CPUs alone won't be sufficient to meet the performance per watt targets of future applications, and much of the heavy lifting is being offloaded to accelerators such as GPUs, FPGAs, DSPs and even custom ASICs such as Google's TPU. The catch is that these complex heterogeneous systems are difficult to design, manufacture and program. One of the key themes at the recent Linley Processor Conference was how the industry is responding to this challenge.
"Architects today are faced with an enormous, almost insurmountable problem," said Anush Mohandass, a marketing vice president at NetSpeed Systems. "You need CPUs, you need GPUs, you need vision processors, and all of these need to work together perfectly."
At the conference, NetSpeed --a private company that specializes in the scalable, coherent network-on-chip technology used to glue together the pieces of a heterogeneous processors --announced Turing, a machine learning algorithm that optimizes chip designs for processors targeted at automotive, cloud computing, mobile and the Internet of Things. Mohandass talked about the how the system often comes up with "non-intuitive recommendations" to meet the design goals not only for power, performance and area, but also the functional safety requirements that are essential in automotive and industrial sectors.
ARM is well positioned to ease this transition because it supplies much of the technology in mobile processors, which already function to some degree as heterogeneous processors. Its latest DynamIQ cluster technology is designed to scale to a much "wider design spectrum" that can meet the needs of new applications from embedded to cloud servers. Each DynamIQ Shared Unit (DSU) can have any combination of up to eight big and little cores, and a CPU can have up to 32 of these DSU clusters, though the practical limit is around 64 large cores. It also has a peripheral port for low latency, tightly-coupled connections to accelerators such as DSPs or neural network engines, and supports the industry-standard CCIX (cache-coherent interconnect) and PCI-Express bus.
In his presentation, Brian Jeff, a marketing director at ARM, talked about the increased performance of the Cortex-A75 and A55 CPU cores, flexible cache and interconnects, and new machine learning features, "We built a product roadmap that is designed to service these changing requirements, even as we push our CPU performance up and up," Jeff said. He showed examples of processors for ADAS (automated driving assistance), network processing and high-density servers that combined these elements.
A 64-core A75 processor will deliver three times the performance of current 32-core A72 server chip making it competitive with Intel's silicon, according to ARM. "We think we can fit this well under 100 watts--and probably in the range of 50 watts--for compute," Jeff said. In a separate presentation on ARM's growing system-level IP, David J. Koenen, a senior product manager, said the A75 pushed them closer to the single-threaded performance of the Xeon E5. But in response to a question, he admitted that they couldn't quite match Intel yet, adding that it would take one or perhaps two more Cortex generations to meet that goal.
Qualcomm's upcoming Centriq 2400 is based on a custom ARMv8 design, known as Falkor, but the 10nm processor with 48 cores running at more than 2GHz should provide a good indication of how well ARM has scaled performance. At the Linley Processor Conference, Qualcomm senior director Barry Wolford disclosed new details on the cache--512K shared L2 cache for each of the 24 Falkor duplexes, for a total of 12MB, and a dozen 5MB pools of last-level cache for a total of 60MB L3--and proprietary, coherent ring bus. Wolford said the Centriq 2400 will deliver competitive single-threaded performance while still meeting the high core counts required for virtualized environments in cloud data centers.
AMD is taking a more practical approach to the problem of increasing core counts at a time when Moore's Law is running out of steam. Rather than trying to build one monolithic processor, the chipmaker took four 14nm Epyc die and packaged them with its Infinity Fabric to create a 32-core server processor. Greg Shippen, an AMD fellow and chief architect, said demand for more cores and greater bandwidth was pushing the die sizes for CPUs and GPUs close to the physical limits of lithography equipment. By splitting it up into four dies, the total area increased about 10% (due to the die-to-die interconnect) but the cost dropped 40% because smaller dies have higher manufacturing yields. Shippen conceded that the multi-chip module (MCM) with separate caches has some impact on performance with code that isn't optimized to scale across nodes, but he said the Coherent Infinity Fabric minimizes the latency hit.
This "chiplets" approach seems to be gaining steam, not only to increase yields and cut cost, but also to mix and match different types of logic, memory and I/O--manufactured on different processes--in the same MCM. DARPA has a program to further this concept known as CHIPS (Common Heterogeneous Integration and Intellectual Property Reuse Strategies) and Intel is developing a MCM that combines a Skylake Xeon CPU with an integrated Arria 10 FPGA, which is scheduled for the first half of 2018. Intel's current solution is a PCI-Express card, the Programmable Acceleration Card, with an Arria 10, that has been validated for Xeon servers. Intel's goal is to standardize FPGA hardware and software so that code runs across the entire family and across multiple generations.
"You can now seamlessly move from one FPGA to the next one without rewriting your Verilog," said David Munday, an Intel software engineering manager. "It means the acceleration is portable--you can be on a discrete implementation and you can move to an integrated implementation."
IBM and the OpenCAPI Consortium have been pushing their own solution for attaching accelerators to a host processor to meet demand for higher performance and greater memory bandwidth in hyperscale data centers, high-performance computing and deep learning. "To get the latency and bandwidth characteristics, we really need a new interface and a new technology," said Jeff Stuecheli, an IBM Power hardware architect.
CAPI started out as an alternative to PCIe for attaching co-processors, but the focus has expanded and the bus now supports standard memory, storage-class memory, and high-performance I/O such as network and storage controllers. Stuecheli said the consortium purposely put most of the complexity in the host controller, so it will be easy for heterogeneous systems designers to attach any type of device. At the conference, IBM was showing a 300mm wafer with Power9 processors, which is nearing commercial release (Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have already received some shipments for future supercomputers).
Heterogeneous systems are not only tough to build, they are also a challenge to optimize and program. UltraSoC is an IP vendor that sells "smart modules" to debug and monitor the entire SoC (ARM, MIPS and others) to identify issues with CPU performance, memory bandwidth, deadlocks and data corruption without impacting system performance. And Silexica has developed an SLX compiler that can take existing code and optimize it to run on heterogeneous hardware for automotive, aerospace and industrial, and 5G wireless base stations.
Brute-force scaling of CPUs isn't going to get us where we need to go, but the industry will continue to come up with new ways to scale power, performance to meet the needs of emerging applications. The key takeaway from the Linley Processor Conference is that this more complex and nuanced approach requires new technology to design, connect, manufacture and program these heterogeneous systems.