For certain classes of problems in high-performance computing, all supercomputers have an unavoidable and fatal bottleneck: memory bandwidth.
That is the argument made this week by one startup company at SC20, a supercomputing conference which takes place in a different city each year, but this year is being held as a virtual event given the COVID-19 pandemic.
The company making that argument is Cerebras Systems, the AI computer maker that contends its machine can achieve speed in solving problems that no existing system can.
"We can solve this problem in an amount of time that no number of GPUs or CPUs can achieve," Cerebras's CEO, Andrew Feldman, told ZDNet in an interview by Zoom.
"This means the CS-1 for this work is the fastest machine ever built, and it's faster than any combination of clustering of other processors," he added.
The argument comes in the form of a formal research paper, presented Tuesday, titled "Fast Stencil-Code Computation on a Wafer-Scale Processor."
The paper was written by Cerebras scientist Kami Rocki and colleagues, in collaboration with scientists at the National Energy Technology Laboratory, one of multiple national laboratories of the U.S. Department of Energy. Researchers at scientific research firm Leidos also participated in the work. The paper was posted last month on the arXiv pre-print server.
The class of problem being solved focuses on systems of partial differential equations. The PDE workloads crop up in many fundamental challenges in physics and other scientific disciplines. They include problems of modeling basic physical processes such as computational fluid dynamics and simulating multiple interacting bodies in astronomical models of the universe.
The high-performance work on fluid dynamics is an interesting departure for Cerebras, which has so far focused on machine learning problems. The work that led to the paper came together over a period of nine months and was the result of a serendipitous encounter between one of NETL's researchers and Cerebras's executive in charge of product development, said Feldman.
The PDE workloads exhibit what is known as weak scaling, Feldman notes, whereby increasing the number of processors in a clustered or multi-processor system provides diminishing returns.
Instead, in the research paper, Rocki and collaborators contend that the problem requires the greater on-chip memory and reduced latency for communication between processing elements that the Cerebras computer offers.
The Cerebras computer, introduced a year ago, is a refrigerator-sized machine that contains the largest computer chip ever made, known as the "wafer-scale engine," or WSE. The chip is a single silicon wafer divided into the equivalent of 84 virtual chips, each with 4,539 individual computing cores, for a total of 381,276 computing cores that can perform mathematical operations in parallel.
Each of the cores has 48 kilobytes of fast SRAM at its disposal, for a total of 18 gigabytes. And each core has a router that connects it to a communications fabric that Cerebras calls Swarm, which runs at 100 petabits per second connecting all the cores together.
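The core and memory totals follow directly from the per-chip figures quoted above. A quick check, assuming decimal units (1 gigabyte = 1,000,000 kilobytes), which the article does not specify:

```python
# Figures quoted in the article.
virtual_chips = 84
cores_per_chip = 4_539
sram_per_core_kb = 48

total_cores = virtual_chips * cores_per_chip

# Assumption: decimal units, 1 GB = 1,000,000 KB.
total_sram_gb = total_cores * sram_per_core_kb / 1_000_000

print(total_cores)              # 381276
print(round(total_sram_gb, 1))  # 18.3
```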
In the PDE program tackled by Rocki and colleagues, systems of linear equations of the form Ax = b are solved using sparse matrix-dense vector multiplication operations, abbreviated as SpMV. The algorithm handling those operations is known as BiCGStab, the biconjugate gradient stabilized method, an iterative approach to solving linear systems first proposed in 1992 by H. A. van der Vorst. Those matrix and vector elements are mapped onto the individual cores of the Cerebras chip.
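To make the structure of the solver concrete, here is a minimal NumPy sketch of unpreconditioned BiCGStab in which the matrix A is never stored and is applied as a stencil instead. It is not the Cerebras implementation: the 7-point operator (6 on the diagonal, -1 for each neighbor), the tiny grid, and the function names `stencil_spmv` and `bicgstab` are all illustrative assumptions. Note that each iteration needs two SpMV calls plus a handful of dot products, which is where communication cost enters on a clustered machine.

```python
import numpy as np

def stencil_spmv(x, n):
    """Apply a 7-point stencil (the matrix A) to vector x on an n*n*n grid.

    Hypothetical operator: 6 on the diagonal, -1 for each of the six
    neighbors, with zero values assumed outside the grid.
    """
    u = np.pad(x.reshape(n, n, n), 1)  # zero halo around the grid
    c = u[1:-1, 1:-1, 1:-1]
    y = (6.0 * c
         - u[:-2, 1:-1, 1:-1] - u[2:, 1:-1, 1:-1]
         - u[1:-1, :-2, 1:-1] - u[1:-1, 2:, 1:-1]
         - u[1:-1, 1:-1, :-2] - u[1:-1, 1:-1, 2:])
    return y.ravel()

def bicgstab(matvec, b, tol=1e-10, max_iter=500):
    """Unpreconditioned BiCGStab for A x = b, with A given as a function."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    r_hat = r.copy()                   # fixed "shadow" residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = matvec(p)                  # first SpMV of the iteration
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * b_norm:
            return x + alpha * p
        t = matvec(s)                  # second SpMV of the iteration
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) <= tol * b_norm:
            return x
        rho = rho_new
    return x

# Usage: solve on a small 8 x 8 x 8 grid with a random right-hand side.
n = 8
b = np.random.default_rng(0).standard_normal(n**3)
x = bicgstab(lambda vec: stencil_spmv(vec, n), b)
residual = np.linalg.norm(b - stencil_spmv(x, n)) / np.linalg.norm(b)
print(residual < 1e-6)
```

The matrix-free form is chosen here because it mirrors the idea of each grid point holding only its own coefficients and exchanging values with its neighbors.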
Each core is able to consolidate all the multiply-add operations that have to happen for each vector, reducing what would otherwise be a lot of back and forth communications to memory registers or between multiple processors splitting the work.
The authors benchmarked the Cerebras machine against a multi-processor supercomputer, the Joule system housed at NETL. Joule, made up of Intel Xeon chips with 20 cores each, for a total of 16,000 cores, took six milliseconds to run the solver operation, the authors write.
In comparison, the Cerebras machine required only 28 microseconds, and was actually able to execute a larger version of the problem in that time than on the Xeon machine.
The Intel machine took 214 times as long to solve the problem, in other words.
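The 214 figure is simply the ratio of the two reported run times, as this quick check shows. It does not adjust for the larger problem size that the Cerebras machine is said to have handled.

```python
joule_time_s = 6e-3   # six milliseconds, as reported for Joule
cs1_time_s = 28e-6    # 28 microseconds, as reported for the CS-1

speedup = joule_time_s / cs1_time_s
print(round(speedup))  # 214
```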
"It is interesting to try to understand why this striking difference arises," the authors write. One reason is that the cores in the Xeons at their peak are only running about 40% of the peak rate of the Cerebras cores. Another reason is that although every Xeon core has much more memory than a Cerebras core, "the Xeon caches seem to be less effective at deriving performance from the available SRAM."
The key, they maintain, is that the cores in the Cerebras machine aren't contending for shared RAM. Instead, each core efficiently manages its own 48 kilobytes.
What's more, because the Cerebras machine moves data in hardware, via a predetermined set of routing choices, it eliminates the overhead that typically arises in clustered computing systems, which have to go through operating system layers with each compute or memory operation.
Given all that, the authors contend that no system in existence could match the results of their machine.
"The achieved performance per Watt (at 20 kW) and for the size of the machine (1/3 rack)," they write, "are beyond what has been reported for conventional machines on comparable problems."
Feldman reiterated that claim to ZDNet.
"The process of trying to solve this problem by tying together lots of little things can't beat where we are, there is nothing in existence that does," said Feldman.
Feldman points out that clustered systems are limited by the metal pin-outs from chip to board, so that scaling multiple processors always brings in the communications bottleneck.
"This fits with our experience in many things in life — the optimal number of engineers on a project, the optimal number of co-authors on a paper, and so on," said Feldman.
In the paper, Rocki and colleagues express the diminishing returns of clustered systems via a graph that shows how the number of floating-point operations in modern systems has exploded as chips have to wait on memory access. They call this "the growing gulf in FLOPs per word," referring to a memory word in a computer.
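The paper's graph concerns the hardware side of that ratio. The software side can be illustrated with a rough count for a stencil kernel. The sketch below is an assumption-laden estimate, not a figure from the paper: it assumes a 7-point stencil, one multiply per stencil point, and that every operand is a separate memory word unless the input vector is already held locally. The helper `flops_per_word` is hypothetical.

```python
def flops_per_word(points, cached_vector=False):
    """Arithmetic intensity of one stencil row: flops per memory word moved."""
    flops = points + (points - 1)  # one multiply per point, then sum them
    words = points                 # matrix coefficients for the row
    words += 1 if cached_vector else points  # input-vector values fetched
    words += 1                     # write the result
    return flops / words

print(round(flops_per_word(7), 2))                      # 0.87
print(round(flops_per_word(7, cached_vector=True), 2))  # 1.44
```

A kernel that performs only about one floating-point operation per word moved will leave the arithmetic units idle on any machine that can compute far more operations per word than its memory system can deliver.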
A way around the bottleneck in future may be something like optical interconnects. But for now, Cerebras has bragging rights.
"We're not making a claim that some stuff in the future can't be invented that's faster," said Feldman. "The claim is there is no combination of existing product that can achieve this performance on this work."
Feldman said Cerebras intends to extend the same kinds of work to similar problems of increasing size. "We need to apply this to problems beyond computational fluid dynamics," he said.
There are interesting clues to future Cerebras hardware in the work. Cerebras has already disclosed, earlier this year, that it is developing the second generation of its WSE chip, with 850,000 individual compute cores. Cerebras intends in future work to see if clusters of Cerebras systems will lead to even greater scaling of workloads, write Rocki and colleagues. That suggests there is a future business for Cerebras in constructing pods of multiple interconnected systems in a data center.
And memory capacity looks set to rise dramatically. In the conclusion section of the paper, the authors write that a move from the current 16-nanometer manufacturing technology for the Cerebras chip to 7-nanometer technology will allow on-chip SRAM to rise to 40 gigabytes. A move to 5-nanometer process technology will further expand that to 50 gigabytes.