ZDNet, which attended a pre-briefing with Intel, asked some of the company executives to dig into the details of artificial intelligence processing.
The big takeaway is that what most needs optimizing is moving data to and from the compute logic, because neural network models continue to scale past what can fit in the on-die memory of any chip.
"One of the things that we have been seeing is the model sizes are exploding," noted Intel's head of architecture, Raja Koduri. "No model fits in one node." He referred to enormous deep learning language models such as the recently released GPT-3 from OpenAI, which has 175 billion parameters, the weights that must be multiplied over each piece of input data.
While there is "a lot of hype on deep learning accelerators," said Koduri, "their utilization is super-low because we are busy moving the parameter data across the network because the 100-billion parameters don't fit."
"Even the teraflops and tera-ops that are on a humble Xeon socket are underutilized for these programs," he added. "Forget a GPU: you put a GPU there, you've got 10x more."
In addition to bandwidth, Intel argues that graphics processing units, where it is ramping up to challenge Nvidia's lock on the data center, have the advantage of a mature software development environment that other kinds of AI processors can't match.
A tantalizing bit left dangling at the end of the day was the question of sparsity, where Intel has work underway that it is not yet ready to disclose fully. The rise of sparsity looms as a potential deep architectural shift in the way chips are designed, Koduri suggested.
On the first point, data bandwidth, ZDNet asked Sailesh Kottapalli, an Intel senior fellow, who runs datacenter processor architecture, a very general question: What are the most important things in different chip architectures that are going to advance the performance for core operations of AI?
Kottapalli replied in two parts, first pointing out the general priorities that all chip vendors have, including Intel.
"The most common thing that's true with what's happening in silicon technologies across the industries is making sure that linear algebra or matrix operations can be done efficiently at the highest level of throughput with the lowest amount of energy." Linear algebra constitutes the bulk of AI compute cycles. It consists of multiplying a vector containing input data by a matrix of the parameters or weights.
Kottapalli noted that all chips, regardless of architecture, are dedicated to accelerating "matrix-matrix" operations and "vector-matrix" operations. "These are the predominant form of compute there."
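The two operations Kottapalli names can be sketched in a few lines of NumPy; the dimensions here are made up for illustration and are tiny compared to real models:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 8, 3        # illustrative sizes only

x = rng.random(d_in)                # one input vector
X = rng.random((batch, d_in))       # a batch of input vectors
W = rng.random((d_in, d_out))       # a matrix of weights (parameters)

y = x @ W    # vector-matrix product, shape (d_out,)
Y = X @ W    # matrix-matrix product, shape (batch, d_out)
```

Virtually all of a model's forward pass reduces to products of this shape, which is why every AI chip dedicates silicon to them.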
The other big trend is the focus on different kinds of precision, meaning, how many bits are used for a given operand, 8-bit, 16-bit, 32-bit, etc.
"Any architecture that actually aspires to do well in AI, which is a new way of doing compute, that will become the state of the art in pretty much any architecture," he said of such reduced-precision support.
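As a rough illustration of the precision trade-off, here is a sketch of symmetric 8-bit quantization; the scheme and sizes are assumptions chosen for illustration, not a description of any Intel hardware:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)   # 32-bit weights

# Symmetric quantization: map the largest magnitude to 127.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)       # 4x smaller storage
w_dequant = w_int8.astype(np.float32) * scale      # approximate recovery

print(w.nbytes, w_int8.nbytes)                     # 4000 1000
max_err = np.abs(w - w_dequant).max()              # bounded by ~scale/2
```

The appeal is that each operand needs a quarter of the bandwidth and energy of a 32-bit value, at the cost of a small, bounded rounding error.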
The next frontier will be all about advances in chip input-output, he said.
"What happens after that is really how you optimize for bandwidth, the caches, to actually optimize the amount of data movement you need to do," said Kottapalli.
Intel's plan to handle that surging demand for interconnects is for the company's connectivity group, run by vice president Hong Hou, to put the pedal to the metal on more bandwidth. "It's a golden age for them," Koduri said of Hou's division.
"We increasingly recognize that I/O could become a very strong bottleneck," said Hou.
One increasingly important direction, noted Hou, will be fiber-optic connections from the computer circuit board to the processor. "We have talked about getting silicon photonics integrated with the chip closer," he said. "We have a little more freedom to design the most power-efficient high-density scale-up strategy to support the AI cluster."
Another element emphasized by Intel is software, and particularly software consistency and support.
ZDNet talked with Intel senior fellow David Blythe and vice president Lisa Pearce, who head up work on the company's graphics processing units. A question for both was what they think of the common critique, from startups such as Cerebras Systems and Graphcore, that GPUs are less than ideal for AI processing.
"There's always the idea of an ideal piece of hardware, but applications don't run alone on an ideal piece of hardware, they need a full ecosystem and software stack," said Blythe. That mature software stack is an advantage of GPUs, he said. That's especially the case when the computer has to support mixed workloads.
"The thing we're trying to do is take advantage of the mature software stack to make it easily programmable."
Blythe hinted at work Intel is doing on sparsity. Sparsity refers to the fact that in vector-matrix operations, many, often most, of the values in a vector are zero. That has led to the critique that GPUs waste energy because they are unable to skip over zero-valued items when batching together many vectors to fit the memory layout of a GPU. Sparsity is a "work in progress," said Blythe.
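The waste Blythe alludes to can be seen in a small sketch: a mostly-zero vector multiplied densely touches every row of the weight matrix, while a sparse formulation touches only the nonzero ones. The 90% sparsity figure here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
x[rng.random(1000) < 0.9] = 0.0     # zero out roughly 90% of entries
W = rng.random((1000, 16))

dense = x @ W                        # multiplies every entry, zeros included

nz = np.nonzero(x)[0]                # indices of the nonzero entries
sparse = x[nz] @ W[nz]               # reads only ~10% of the rows of W

assert np.allclose(dense, sparse)    # same result, far less data moved
```

Hardware that can exploit this structure skips both the multiplications and, more importantly, the memory traffic for the zero entries.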
But another Intel fellow, Rich Uhlig, who heads up the Intel Labs operation, went into greater detail on the matter.
"The neural network models are moving toward more sparse representations from dense, there is an algorithm efficiency you get there," said Uhlig. "And that puts a different pressure on the architecture."
"Some architectures we are exploring are, how do you get good at that hybrid between dense and sparse architecture," added Uhlig. "It's not just about memory, it's also about the interconnect, and how the algorithms exploit that sparseness."
Uhlig noted that Intel is working with DARPA on the agency's "HIVE" program, which is focused on what's called graph analytics. "You can think of graph analytics as exactly this problem, how do you get good at operating over sparse data structures, graphs," said Uhlig.
"You need to bring together a collection of technologies," he said.
"You want to make sure the memory system is optimized," said Uhlig. "So you optimize for things like 8-byte access, as opposed to larger cache-line access, where oftentimes the work gets wasted, because you don't have the same spatial locality as in more traditional workloads. But optimizing for eight bytes means it's not just tuning the memory hierarchy to that size, but also the messages that you send over the fabric to other compute nodes."

"Another thing you look at is pointer tracing, and the dependencies you need to follow. There are a lot of pointer dependencies you have to deal with. And so there are benefits to be had in the architecture to optimizing those linked dependencies more efficiently, and also to doing things like atomics in a more efficient manner. So there are a whole bunch of architectural techniques you can apply to help you do better with these kinds of sparse algorithms."

"As part of our response to this DARPA program, we are building simulators and working toward prototype implementations that will hopefully at some point in the future — this is not a product statement, to be very clear, this is a research investigation — but we are learning about the things that you want to do architecturally to capture these algorithmic trends in deep learning."
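A compressed sparse row (CSR) adjacency structure is one common way to represent the sparse graphs that graph analytics operates over; this minimal sketch (the graph and layout are illustrative, not taken from Intel's or DARPA's work) shows the small, data-dependent lookups that defeat cache-line locality:

```python
# A toy directed graph as an edge list; values are made up.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (3, 3)]
num_nodes = 4

# Build CSR arrays: indptr[v]..indptr[v+1] bounds v's slice of indices.
indptr = [0] * (num_nodes + 1)
for src, _ in edges:
    indptr[src + 1] += 1
for v in range(num_nodes):                 # prefix sum of degree counts
    indptr[v + 1] += indptr[v]

indices = [0] * len(edges)
fill = list(indptr[:-1])
for src, dst in sorted(edges):
    indices[fill[src]] = dst
    fill[src] += 1

def neighbors(v):
    # Each lookup is a small, data-dependent access -- the kind of
    # 8-byte, low-spatial-locality pattern Uhlig describes.
    return indices[indptr[v]:indptr[v + 1]]

print(neighbors(2))   # -> [0, 3]
```

Traversing such a structure means following index after index into scattered locations, which is why a memory hierarchy tuned for wide, contiguous cache lines leaves much of its fetched data unused.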
Summing up all that, Koduri added, "Things that deal with sparse parallelism much more efficiently will give rise to some new architectural ideas that are very different from what we are doing in vector-matrix, which is very mainstream right now."