AI chips in the real world: Interoperability, constraints, cost, energy efficiency, and models

The answer to the question of how to make the best of AI hardware may not be solely, or even primarily, related to hardware

How do you make the best out of the proliferating array of emerging custom silicon hardware while not spreading yourself thin to keep up with each and every one of them?

If we were to put a price tag on that question, it would be in multi-billion dollar territory. That is the combined estimated value of the different markets it touches upon. As AI applications explode, so does the specialized hardware that supports them.

For us, interest in so-called AI chips came as an offshoot of our interest in AI, and we've tried to keep up with developments in the field. For Evan Sparks, Determined AI CEO and founder, it goes deeper. We caught up to discuss the interplay between hardware and models in AI.

An interoperability layer for disparate hardware stacks

Before founding Determined AI, Sparks was a researcher at the AMPLab at UC Berkeley. He focused on distributed systems for large-scale machine learning, and this is where he had the opportunity to work with people like David Patterson, a pioneer in computer science, and currently vice-chair of the board of directors of the RISC-V Foundation.

Patterson was, as Sparks put it, banging the drum early on about Moore's Law being dead and custom silicon being the only hope for continued growth in the space. Sparks was influenced by that view, and what he wants to do with Determined AI is build software to help data scientists and machine learning engineers.

The goal is to help data scientists and machine learning engineers accelerate workloads and workflows and build AI applications faster. To do that, Determined AI provides a software infrastructure layer that sits underneath frameworks like TensorFlow or PyTorch and above various chips and accelerators.

Being in the position he is, Sparks's interest lay not so much in dissecting vendor strategies, but rather in walking in the shoes of people developing and deploying machine learning models. As such, a natural place to start was ONNX.

ONNX is an interoperability layer that enables machine learning models trained using different frameworks to be deployed across a range of AI chips.

ONNX is an interoperability layer that enables machine learning models trained using different frameworks to be deployed across a range of AI chips that support ONNX. We've seen how vendors like GreenWaves or Blaize support ONNX.

ONNX came out of Facebook originally, and Sparks noted the reason ONNX was developed was that Facebook had a very disparate training and inference stack for machine learning applications.

Facebook developed using PyTorch internally, while the bulk of deep learning models running in production were computer vision models backed by Caffe. Facebook's mandate was that research could be done in whatever language you want, but production deployment had to be in Caffe.

That led to the need for an intermediate layer that would translate between the model architectures that were output in PyTorch and input into Caffe. Soon enough, people realized this was a good idea with broader applicability. It is not too different, in fact, from things we've seen previously in programming language compilers.

ONNX and TVM: Two ways to solve similar problems

The idea is to utilize an intermediate representation between multiple high-level languages and plug things in with multiple languages at the source and multiple frameworks at the destination. It does sound a lot like compilers, and a good idea. But ONNX is not the end-all in AI chip interoperability.
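To make the compiler analogy concrete, here is a minimal toy sketch of the intermediate-representation idea. It is not ONNX or TVM code: the two "frameworks", the two "targets", and every function name below are hypothetical, invented purely for illustration. The point it tries to show is that each source only needs a translator into the shared representation, and each target only needs a translator out of it.

```python
# Toy illustration only: none of these names come from ONNX or TVM.
from typing import Callable, Dict, List, Tuple

# The shared intermediate representation: an ordered list of (op, constant) steps.
IR = List[Tuple[str, float]]

def frontend_a(model: Dict[str, float]) -> IR:
    """Hypothetical 'framework A' stores a model as {'scale': s, 'shift': b}."""
    return [("mul", model["scale"]), ("add", model["shift"])]

def frontend_b(model: List[float]) -> IR:
    """Hypothetical 'framework B' stores the same kind of model as [shift, scale]."""
    shift, scale = model
    return [("mul", scale), ("add", shift)]

def backend_interpreter(ir: IR) -> Callable[[float], float]:
    """Hypothetical target 1: walk the IR step by step at run time."""
    def run(x: float) -> float:
        for op, c in ir:
            x = x * c if op == "mul" else x + c
        return x
    return run

def backend_fused(ir: IR) -> Callable[[float], float]:
    """Hypothetical target 2: fold the whole IR into one affine function a*x + b."""
    a, b = 1.0, 0.0
    for op, c in ir:
        if op == "mul":
            a, b = a * c, b * c
        else:
            b = b + c
    return lambda x: a * x + b

def translators_needed(n_sources: int, n_targets: int) -> Tuple[int, int]:
    """Return (direct pairwise translators, translators via a shared IR)."""
    return n_sources * n_targets, n_sources + n_targets
```

In this sketch, a model from either frontend can run on either backend without the two ever knowing about each other. With, say, 3 source frameworks and 5 hardware targets, direct translation would need 15 translators, while a shared representation needs 8.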

TVM is the new kid on the block. It started as a research project out of the University of Washington, recently became a top-level Apache open source project, and has a commercial effort behind it in OctoML.

TVM's goals are similar to ONNX's: making it possible to compile deep learning models into what they call minimum deployable modules, and automatically optimizing these models for different pieces of target hardware.

Sparks noted TVM is a relatively new project, but it's got a pretty strong open source community behind it. He went on to add that many people would like to see TVM become a standard: "Hardware vendors not named Nvidia are likely to want more openness and a way to enter into the market. And they're looking for a kind of narrow interface to implement."

There is nuance in pinpointing the differences between ONNX and TVM, and we defer to the conversation with Sparks on that. In a nutshell, TVM is a bit lower level than ONNX, Sparks said, and there are some trade-offs associated with that. He opined TVM has the potential to be perhaps a little bit more general.

Sparks noted, however, that both ONNX and TVM are early in their lifetime, and they will learn from each other over time. For Sparks, they are not immediate competitors, just two ways to solve similar problems.

AI constraints, cost, and energy efficiency

Whether it's ONNX or TVM, however, dealing with this interoperability layer should not be something data scientists and machine learning engineers have to do. Sparks advocates for a separation of concerns between the various stages of model development -- very much in line with the MLOps theme:

"There are many systems out there for preparing your data for training, making it high performance and compact data structures and so on. That is a different stage in the process, different workflow than the experimentation that goes into model training and model development.

As long as you get your data in the right format while you're in model development, it should not matter what upstream data system you're doing. Similarly, as long as you develop in these high-level languages, what training hardware you're running on, whether it's GPUs or CPUs or exotic accelerators should not matter."

Determined AI's stack aims to abstract different underlying hardware architectures

What does matter is how that hardware can satisfy application constraints, as per Sparks. Imagine a medical devices company that has legacy hardware out in the field. They're not going to upgrade just to run slightly more accurate models.

Instead, the problem is almost the inverse: How to get the most accurate model that can run on this particular hardware. So they might start with a huge model and employ techniques like quantization and distillation to fit that hardware.
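As a rough illustration of what quantization means in this setting, the sketch below maps floating-point weights to 8-bit integers with a single shared scale factor, then maps them back. It is a minimal, assumption-laden example of symmetric linear quantization, not the method any particular vendor or Determined AI uses; the function names and the sample weights are made up for illustration.

```python
from typing import List, Tuple

def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantization of a non-empty weight list to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole list; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]

# Illustrative weights only, not taken from any real model.
weights = [0.82, -1.27, 0.003, 0.5, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding step bounds the per-weight error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four or eight, at the cost of a bounded rounding error. Whether that error is acceptable is exactly the accuracy-versus-hardware trade-off described above.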

This refers to deployment/inference, but the same logic can be applied to training as well. The cost of training AI models, both financial and environmental, is hard to ignore. Sparks referred to work from OpenAI, according to which the cost of training went up three hundred thousand times in the last few years.

That was two years ago. As more recent work coming from the former co-lead of Google's ethical AI team shows, this trend has anything but slowed down. The cost to train OpenAI's latest language model, GPT-3, has been estimated at between $7 million and $12 million.

Sparks pointed out the obvious: This is an insane amount of computation, of energy, of money, which most mortals don't have. So we need tools that help reason about this cost and assign quotas; Sparks is busy building those.

Infusing knowledge in models

Determined AI's technology provides a way of specifying a budget, a number of models to be trained to convergence, and the space of models to explore. Training ceases before convergence, and users can explore the models without breaking the bank. This approach is based on active learning, but there are more approaches, like distillation, fine-tuning, or transfer learning:

"You let the big guys, the Facebooks and the Googles of the world do the big training on huge quantities of data with billions of parameters, spending hundreds of GPU years on a problem. Then instead of starting from scratch, you take those models and maybe use them to form embeddings that you're going to use for downstream tasks."

Sparks mentioned NLP and image recognition, with BERT and ResNet-50, as good examples of this approach. He also offered a word of warning, however: This won't always work. Where it gets tricky is when the modality of the data that people are training on is totally different than what's available.
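The budget-constrained search described at the start of this section can be sketched in a few lines. The example below uses successive halving, a well-known early-stopping scheme chosen here purely for illustration; the article does not confirm that this is what Determined AI implements. The loss function is a made-up stand-in for real training, and all names and numbers are assumptions.

```python
import math
from typing import Dict, List, Tuple

def simulated_val_loss(lr: float, epochs: int) -> float:
    """Hypothetical stand-in for real training: loss falls with epochs,
    and falls fastest when the learning rate is near 0.1."""
    quality = math.exp(-abs(math.log10(lr) + 1.0))  # equals 1.0 at lr = 0.1
    return 1.0 / (1.0 + quality * epochs)

def successive_halving(candidates: List[float], budget_epochs: int) -> Tuple[float, int]:
    """Search distinct learning rates using at most budget_epochs in total.

    Returns (best learning rate found, epochs actually used).
    """
    rounds = max(1, math.ceil(math.log2(len(candidates))))
    per_round = budget_epochs // rounds
    if per_round < len(candidates):
        raise ValueError("budget too small to train every candidate at least once")
    trained: Dict[float, int] = {lr: 0 for lr in candidates}
    alive = list(candidates)
    used = 0
    for _ in range(rounds):
        # Split this round's share of the budget evenly among survivors.
        step = per_round // len(alive)
        for lr in alive:
            trained[lr] += step
            used += step
        # Keep the better half; the rest stop training before convergence.
        alive.sort(key=lambda lr: simulated_val_loss(lr, trained[lr]))
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0], used
```

With eight candidate learning rates and a budget of 240 epochs, the surviving configuration ends up with 70 epochs of training. Giving all eight candidates 70 epochs each would have cost 560.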

A hybrid approach to AI, infusing knowledge in machine learning models, may be the best way to minimize training costs

But there may be another way. Whether we call it robust AI, hybrid AI, neuro-symbolic AI, or by any other name, would infusing knowledge in machine learning models help? Sparks's answer was a definite yes:

"In 'commodity' use cases like NLP or vision, there are benchmarks that people agree on and standard data sets. Everyone knows what the problem is, image classification or object detection or language translation. But when you start to specialize more, some of the greatest lift we have seen is where you get a domain expert to infuse their knowledge."

Sparks used physical phenomena as an example. Let's say you set up a feed-forward neural network with 100 parameters and ask it to predict where a flying object is going to be in a second. Given enough examples, the system will probably converge to a reasonably good approximation of the function of interest and will predict with a high degree of accuracy:

"But if you infuse the application with a little bit more knowledge of the physical world, the amount of data is going to go way down, the accuracy is going to go way up, and we're going to see some gravitational constant start to emerge as maybe one feature of the network or some combination of features.

Neural networks are great. They're super powerful, functional approximations. But if I tell the computer a little bit more about what that function is, hopefully, I can save everybody a few million bucks in compute and get models that more accurately represent the world. To abandon that thinking would be irresponsible."
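A small sketch can illustrate the flying-object example. Instead of learning an arbitrary function from many samples, the code below assumes the object follows constant-acceleration (ballistic) motion and recovers the acceleration from just four evenly spaced height readings. The data is synthetic and noise-free, the function names are hypothetical, and this is one simple way to encode physical knowledge, not a description of Sparks's or Determined AI's approach.

```python
from typing import List

G_TRUE = 9.81  # m/s^2, used only to generate the synthetic observations

def simulate_heights(h0: float, v0: float, dt: float, n: int) -> List[float]:
    """Synthetic, noise-free heights of an object sampled every dt seconds."""
    return [h0 + v0 * (i * dt) - 0.5 * G_TRUE * (i * dt) ** 2 for i in range(n)]

def estimate_gravity(heights: List[float], dt: float) -> float:
    """Under constant acceleration, the second difference of evenly spaced
    heights equals -g * dt**2, so g can be read off directly."""
    second_diffs = [
        heights[i + 1] - 2 * heights[i] + heights[i - 1]
        for i in range(1, len(heights) - 1)
    ]
    return -sum(second_diffs) / (len(second_diffs) * dt ** 2)

def predict_height(heights: List[float], dt: float, horizon: float) -> float:
    """Extrapolate horizon seconds past the last sample with the ballistic model."""
    g = estimate_gravity(heights, dt)
    # The backward difference gives the velocity at the interval midpoint;
    # subtract g * dt / 2 to get the velocity at the last sample.
    v_last = (heights[-1] - heights[-2]) / dt - 0.5 * g * dt
    return heights[-1] + v_last * horizon - 0.5 * g * horizon ** 2

observations = simulate_heights(h0=2.0, v0=15.0, dt=0.1, n=4)
g_estimate = estimate_gravity(observations, dt=0.1)
one_second_ahead = predict_height(observations, dt=0.1, horizon=1.0)
```

Here the gravitational constant appears as an explicit, recoverable quantity rather than something spread across network weights. With noisy measurements more samples would be needed, but the model would still have only a handful of physically meaningful unknowns to pin down.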