That puts much of AI beyond the reach of mere mortals, said Li in an interview with ZDNet.
"We've been very successful in adopting AI in the research lab where Stanford PhD students create algorithms recognizing cats and dogs, and cloud service providers do a lot of stuff," said Li.
"That's not sustainable, it's not going to be widely applicable if everybody has to pay ten million dollars to train a trillion-parameter model," said Li. "If you want to go broad — and that's the opportunity here, to have AI everywhere — the bridge is not quite there yet."
The bridge, in Li's view, is software. Software is the ability to take the breakthroughs of the lab and distribute them to organizations large and small, on different kinds of computer hardware, including CPUs, GPUs, and all kinds of new, exotic accelerator chips.
"I view software as a bridge to get to AI everywhere," said Li.
For Li, the current impasse has a very familiar feel. He did his PhD in computer science at Cornell University in the 1990s advancing techniques of compilers, the software that makes programs run on a given piece of hardware. His work then was specifically focused on how to run workloads across many computing resources.
"What's amazing is that stuff I learned then was all about running things fast and handling a large amount of data," Li reflected.
Following Cornell, Li was hired at Intel twenty-five years ago as the company was making its leap from being primarily a PC-chip maker to being also a leading server-chip vendor. His expertise was put to work developing compilers for servers. "I've always been at the cutting edge of Intel," he reflected. "That makes my job fun."
As is often the case in technology, everything old is new again. Suddenly, says Li, everything in deep learning is coming back to the innovations of compilers back in the day.
"Compilers had become irrelevant" in recent years, he said, with the field viewed as a largely settled area of computer science. "But because of deep learning, the compiler is coming back," he said. "We are in the middle of that transition."
In his PhD dissertation at Cornell, Li developed a compiler framework for processing code in very large systems with what is called "non-uniform memory access," or NUMA. His program refashioned code loops to extract the greatest possible amount of parallel processing. But it also did something else particularly important: it decided which code should run depending on which memories the code needed to access at any given time.
Today, says Li, deep learning is approaching the point where those same problems dominate. Deep learning's potential is mostly gated not by how many matrix multiplications can be computed but by how efficiently the program can access memory and use the available bandwidth.
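The distinction between being limited by arithmetic and being limited by memory traffic can be made concrete with a rough "roofline" estimate. The sketch below is illustrative only: the peak compute and bandwidth figures are hypothetical placeholders rather than measurements of any real chip, and the function names are invented for this example. It also assumes, optimistically, that each matrix crosses the memory bus exactly once.

```python
# Roofline-style estimate: is an operation limited by compute or by memory traffic?
# All hardware numbers here are hypothetical placeholders, not real chip specs.

def matmul_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix is read or written exactly once (a best-case idealization).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, peak_bandwidth * intensity)


PEAK_FLOPS = 10e12      # hypothetical: 10 TFLOP/s
PEAK_BANDWIDTH = 100e9  # hypothetical: 100 GB/s

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
    ai = matmul_intensity(*shape)
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    kind = "compute-bound" if bound == PEAK_FLOPS else "memory-bound"
    print(shape, f"intensity={ai:.2f} FLOP/byte", kind)
```

Under these assumed numbers, the large square multiplication does enough arithmetic per byte to keep the processor busy, while the matrix-vector case spends its time waiting on memory, which is the regime Li describes.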
"In the early stages [of deep learning], it was all about these primitives, can you run matrix multiplications fast, can you run convolutions fast," said Li.
"Moving forward, the primitives are only the first step," said Li. The problem becomes a decision problem that's more strategic. "How do you run the graph [of compute operations] on all the heterogeneous machines we have?" he said. "You want to distribute it, you want to parallelize it in a certain way, you want to handle memory hierarchies in a certain way; these are all compiler optimizations."
Some things are harder than others. "Our job is to pick a graph and divide it into these compute units — that's the easy part," said Li. A computer programmer can draw diagrams of the compute hardware, the various nodes, the processors, and think about how to divide execution loops.
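To see why dividing up the compute is the easy part, consider a toy sketch. The operator names, costs, and edges below are invented for illustration, and the greedy balancing heuristic is a stand-in chosen for brevity, not a description of how Intel's compilers work. It balances load across two compute units, then counts how many graph edges end up crossing between units, each of which implies data movement.

```python
# Hypothetical operator graph: names, relative costs, and edges are made up.
ops = {"conv1": 4, "conv2": 4, "pool": 1, "fc1": 3, "fc2": 2, "softmax": 1}
edges = [("conv1", "conv2"), ("conv2", "pool"), ("pool", "fc1"),
         ("fc1", "fc2"), ("fc2", "softmax")]


def greedy_partition(ops, num_units):
    """Assign each op, heaviest first, to the currently least-loaded unit."""
    load = [0] * num_units
    placement = {}
    for name, cost in sorted(ops.items(), key=lambda kv: -kv[1]):
        unit = load.index(min(load))
        placement[name] = unit
        load[unit] += cost
    return placement, load


def cross_unit_edges(edges, placement):
    """Edges whose endpoints sit on different units, i.e. required data transfers."""
    return [(a, b) for a, b in edges if placement[a] != placement[b]]


balanced, load = greedy_partition(ops, num_units=2)
print("loads:", load, "transfers:", len(cross_unit_edges(edges, balanced)))

# A hand-picked contiguous split of the same graph, for contrast.
contiguous = {"conv1": 0, "conv2": 0, "pool": 1, "fc1": 1, "fc2": 1, "softmax": 1}
print("transfers with contiguous split:", len(cross_unit_edges(edges, contiguous)))
```

Both placements balance the arithmetic almost evenly, but the load-only heuristic cuts four of the five edges while the contiguous split cuts one. The compute division is simple to get right; the communication it implies is where the cost hides.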
Much more difficult, and not as easy to diagram, are memory and communications, especially with increasingly large neural net models.
"The hard part is the memory hierarchy, it's much harder than simple parallelization, and that's the key," said Li. "The hard part is the interconnect bandwidth, the communications organization."
"We don't have a good way of modeling the data movement," he noted. "The majority of the work is about managing data — the hard part of compiler optimization is about managing data in an effective way." Specifically, "data locality is super important," said Li, meaning, one operation works on a given piece of data, and, "hopefully, that piece of data still stays in the closest memory layer, or maybe the cache somewhere."
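A small simulation shows what data locality means in practice. The following is a toy model, not a description of any real processor: it assumes a tiny fully associative cache with least-recently-used eviction, and the sizes and the `count_misses` helper are hypothetical choices made for this sketch. It walks the same row-major matrix in two different orders and counts cache misses.

```python
from collections import OrderedDict


def count_misses(addresses, cache_lines=8, line_size=8):
    """Count misses in a tiny fully associative LRU cache (toy model only)."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // line_size  # which cache line this element belongs to
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


N = 64  # N x N matrix stored row-major: element (i, j) lives at address i * N + j
row_order = [i * N + j for i in range(N) for j in range(N)]
col_order = [i * N + j for j in range(N) for i in range(N)]

print("row-by-row misses:", count_misses(row_order))
print("column-by-column misses:", count_misses(col_order))
```

Both traversals touch exactly the same 4,096 elements. Walking along rows reuses each cache line eight times before moving on; walking down columns evicts every line before it can be reused. Choosing the traversal order that keeps data close is the kind of decision a compiler can make on the programmer's behalf.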
The scale of the problem increases with certain types of deep learning, such as graph neural networks, used in recommender systems for, say, a commerce application.
"There's a similarity at the core level, we're talking about matrix multiplication, convolution, but then there's a diversity when you go above this," said Li. In computer vision systems, there may be millions of images to handle, but "when you go to graphs, the amount of data is humongous." Every person connected to every other person, as in a social network, means qualitatively different challenges for processing.
"You think about LinkedIn, how many billions of people — that is billions upon billions of nodes — that's really big data."
How to handle memory and interconnect for really big data "comes back to parallelizing compilers," said Li.
"In reality, we need a lot of compiler people because it all comes back to compilers."
Intel has already taken steps to bring more compiler experts into the fold. In June, the company announced it would acquire startup Codeplay of Edinburgh, Scotland, a supplier of parallel compilers.
Codeplay is a component of Intel's effort to have a broad software stack available for all the optimizations that Li discusses. As one possible solution, Li is a champion of a software toolkit and platform Intel has helped to develop called oneAPI, the source code of which is posted on GitHub. Intel has been shipping its own implementation of the open-source spec since December of 2020.
oneAPI offers libraries, such as oneDNN, for speeding up the matrix multiply primitives. It also includes a programming language, called DPC++, that Intel intends to be a cross-platform parallelizing language. The language is an adaptation of the SYCL language developed by Khronos Group. Codeplay already had its own effort analogous to DPC++, called ComputeCpp.
Asked about the progress of oneAPI, Li said, "the adoption is getting there and these are places where we are getting pretty good adoption from the deep learning side."
"There's still a long way to go," he added. "Establishing a standard in the community and ecosystem is not an easy thing."
Intel can point to some practical successes. One of the oneAPI technologies, oneDNN, an approach to speeding up neural network performance, has become a default setting in TensorFlow and PyTorch. An Intel library called BigDL, which can automatically restructure a neural network developed on a single node to scale to thousands of nodes, has been taken up by large companies such as Mastercard.
The processor licensing giant ARM has implemented oneAPI in its processor designs. "ARM are actually using the same oneAPI, and they're one of the big contributors to our oneDNN," said Li. "They have an implementation of oneDNN on ARM, and that's how they get to their value more quickly."
"Eventually, we will see a lot of performance gains out of the optimization we do," said Li. "We're seeing 10x scale, we're seeing 100x performance gain" from use of oneAPI, he said.
Part of the battle for Li is to continually emphasize the importance of software, something that can be obscured in a world where the latest hardware accelerators from startups such as Cerebras Systems and Graphcore and SambaNova attract most of the attention.
"Even within the industry, there is, you can call it, a disconnect," said Li. "Quite often people talk about how you have data, you have algorithms, you have compute — the notion of software is much, much less understood, but in reality, software plays a key role in making AI happen."
As important as software may be, the discussion of AI acceleration will always come back to Intel's role as the greatest independent manufacturer of processors, and the battle between it and young rivals such as Cerebras, as well as the incumbent AI chip giant, Nvidia.
Intel competes directly with the startups through Habana, the chip maker it purchased in 2019, which continues to deliver competitive performance improvements with successive generations. Intel has the burden of legacy code to support from all its x86 users, something the startups don't have. But, in a sense, says Li, the startups also have to support legacies: those of AI frameworks such as TensorFlow and PyTorch that dominate the creation of neural networks.
"How do you plug in your unique thing to something which is already there?" is how Li phrases the issue. "That's a challenge all of them have and that's the challenge we have also.
"That's why I have a team working on Python, I have a team working on TensorFlow," he said. "That's why our strategy is software ecosystem in many different ways."
Intel splits the market for high-volume AI chips with Nvidia, with Nvidia dominating training, the initial phase where a neural net is developed, and Intel having the bulk of the market for inference, when a trained neural net runs predictions. Intel also has competition at the edge from Qualcomm, which is promoting its own software stack that could be seen as an alternative to oneAPI.
The proliferation of processors in Intel's chip roadmap, including a datacenter GPU code-named Ponte Vecchio, the continual enhancement of the Xeon server CPU with matrix multiplication abilities, and new field-programmable gate arrays (FPGAs), means that there is a very large Venn diagram where Intel competes with Nvidia and AMD and Qualcomm and the startups.
"In CPU today, we have majority of the inference market, and that's where we play, and we are going into the training market because of the other accelerators we are doing here," said Li.
Nvidia is generally regarded as having strong control of the training market by virtue not only of its volume shipment of GPUs, but also because its CUDA programming tools have been around longer than anything else.
"The CUDA question is interesting because it really depends on how low-level will people go," said Li. Only a small audience really cares about writing to CUDA libraries and the CUDA programming language, Li maintains. The vast majority of the people Intel is trying to reach with oneAPI are the masses: the data scientists who work at the level of the frameworks.
"When we talk about AI everywhere, you have millions of developers, but these are the developers mostly at the top level," he said. "They're not going to be at the CUDA programming level, they're going to be at the TensorFlow programming level; those are the people who will get AI up to scale, right?"
Does Intel have a chance of breaking Nvidia's lead among deep learning researchers? Li refrains from making grandiose predictions. Instead, he said, "I think it's good for the overall marketplace because it provides an alternative," meaning oneAPI.
"My goal is that we develop our product well, and let the product speak for itself in the market," he said. "Intel, in my view, is well positioned to lead this journey to get to AI everywhere, particularly on the software side."
There is one other cohort with which Intel both competes and partners: the cloud service providers Amazon, Alphabet, and Microsoft. All three have their own silicon programs. In the case of Google's TPU, or Tensor Processing Unit, the chip family is now in its fourth generation and is getting substantial lift as a dedicated workhorse in Google's cloud. Amazon's "Trainium" chip is less developed as an ecosystem at this point.
While the cloud vendors buy tons of processors from Intel, their home-grown parts can be seen as an obstacle for both Intel and its merchant silicon competitors Nvidia and AMD. In-house chips can take away workloads from the merchant offerings.
On the one hand, Intel works hard to pivot, said Li, to stay in touch with what cloud giants need. "It's interesting: initially, AI was all about computer vision, and cats and dogs," said Li. "But when we started working with the CSPs [cloud service providers] we found out the number one cycle there, the largest amount of compute they're burning, is not computer vision, it's actually recommendation systems."
Startups can't keep up with that pace, insists Li. "You can imagine, five years ago, there's a startup everywhere, and they only designed the machine for AlexNet," a neural net built for the widely used ImageNet image-recognition benchmark. "Today they're no longer in business because they just couldn't handle the trend in the CSPs."
At the same time, "Cloud is a priority for us, but AI everywhere is actually going beyond cloud, to enterprises."
Those enterprise customers "have a different set of people, they have different requirements, and different types of applications." Those enterprises could include Facebook, Li observed, or it could include a retailer.
Applications of IoT and edge computing may demand a different sensibility than cloud operations of AI, Li maintains.
"I'll talk to some big retailer, and their applications are not quite cloud applications," said Li. "They may have some machine learning thing happening in the store that will be slightly different from the cloud side."
"These things are changing in terms of the variety of things being done," Li observed of AI workloads, "and that's why I think Intel is in a good position, because Intel has scale." The portfolio of chips, the volume that Intel ships, means it can handle diversity, he insists.
"Intel is big, we have a portfolio of high quality hardware designed and built today in a broad set of markets," he said, "and, quite often, these are connected," as in the case of edge workloads that connect to cloud data centers.
As for the very large neural networks, Li contends that techniques such as distillation and transfer learning will allow the fruits of those supercomputer neural networks to trickle down to ordinary users.
"I would say, it's going to be a pyramid, where, at the tip of the pyramid are all the models where only a few rich companies can continue to push for these things," observed Li. "And then there will be a breadth of it, the bulk of it, and you will find more economic models — take a big model and find the equivalent small model that does the same thing.
"But at the top, people will always go for that supercomputing type of thing."
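For readers curious how a big model can hand its knowledge to a small one, distillation in one common formulation trains the small "student" network to match the softened output probabilities of the large "teacher." The sketch below is a minimal illustration of that loss, not anything Li or Intel described; the logit values and function names are made up for the example, and the temperature of 2.0 is an arbitrary choice.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Made-up logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
student_near = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]   # disagrees with the teacher

print("near:", distillation_loss(teacher, student_near))
print("far: ", distillation_loss(teacher, student_far))
```

The loss falls toward zero as the student's predictions approach the teacher's, which is the signal a training loop would minimize to produce the "equivalent small model" Li mentions.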