In the center of that loop sits software technology that converts neural net programs to run on novel hardware. And at the center of that sits a recent open-source project gaining momentum.
Apache TVM is a compiler that operates differently from other compilers. Instead of turning a program into typical chip instructions for a CPU or GPU, it studies the "graph" of compute operations in a neural net and figures out how best to map those operations to hardware based on dependencies between the operations.
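To make the idea of a compute "graph" concrete, here is a minimal sketch in plain Python. It does not use TVM and is not how TVM is implemented; the operation names, the device names, and the placement rule are all hypothetical. It shows only the underlying notion: operations are ordered by their dependencies, and each one is then assigned to a piece of hardware.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy network: each operation maps to the operations it depends on.
DEPENDENCIES = {
    "conv1": {"input"},
    "relu1": {"conv1"},
    "conv2": {"relu1"},
    "skip_add": {"conv2", "input"},
    "softmax": {"skip_add"},
}

# Hypothetical placement rule: which made-up device suits each kind of operation.
DEVICE_FOR_KIND = {
    "input": "cpu",
    "conv": "accelerator",
    "relu": "accelerator",
    "skip_add": "cpu",
    "softmax": "cpu",
}


def kind_of(op_name):
    """Strip a trailing index, so 'conv2' becomes 'conv'."""
    return op_name.rstrip("0123456789")


def plan_placement(dependencies):
    """Return (operation, device) pairs in an order that respects dependencies."""
    order = TopologicalSorter(dependencies).static_order()
    return [(op, DEVICE_FOR_KIND[kind_of(op)]) for op in order]


placement = plan_placement(DEPENDENCIES)
for op, device in placement:
    print(f"{op:>8} -> {device}")
```

A real compiler would weigh far more than the kind of operation, such as tensor shapes, memory movement between devices, and measured performance, but the dependency ordering is the starting point.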
At the heart of that operation sits a two-year-old startup, OctoML, which offers Apache TVM as a service. As explored in March by ZDNet's George Anadiotis, OctoML is in the field of MLOps, working to operationalize AI. The company uses TVM to help companies optimize their neural nets for a wide variety of hardware.
In the latest development in the hardware and research feedback loop, TVM's process of optimization may already be shaping aspects of how AI is developed.
"Already in research, people are running model candidates through our platform, looking at the performance," says OctoML co-founder Luis Ceze, who serves as CEO, in an interview with ZDNet via Zoom. The detailed performance metrics mean that ML developers can "actually evaluate the models and pick the one that has the desired properties."
Today, TVM is used exclusively for inference -- the part of AI where a fully developed neural network is used to make predictions based on new data. But down the road, TVM will expand to training, the process of first developing the neural network.
"Training and architecture search is in our roadmap," says Ceze, referring to the process of designing neural net architectures automatically by letting neural nets search for the optimal network design. "That's a natural extension of our land-and-expand approach" to selling the commercial service of TVM, he explains.
Will neural net developers then use TVM to influence how they train?
"If they aren't yet, I suspect they will start to," says Ceze. He also notes that if you "[come] to us with a training job, we can train the model for you," taking into account how the trained model would perform on hardware.
That expanding role of TVM -- and the OctoML service -- is a consequence of the fact that the technology is a broader platform than what a compiler typically represents.
"You can think of TVM and OctoML by extension as a flexible, ML-based automation layer for acceleration that runs on top of all sorts of different hardware where machine learning models run -- GPUs, CPUs, TPUs, accelerators in the cloud," Ceze tells ZDNet.
"Each of these pieces of hardware -- it doesn't matter which -- have their own way of writing and executing code," he adds. "Writing that code and figuring out how to best utilize this hardware today is done today by hand across the ML developers and the hardware vendors."
The compiler, and the service, replace that hand tuning: today at the inference level, when the model is ready for deployment; tomorrow, perhaps, during development and training itself.
The crux of TVM's appeal is greater performance in terms of throughput and latency, as well as efficiency in terms of power consumption. That is becoming increasingly important for neural nets that keep getting larger and more challenging to run.
"Some of these models use a crazy amount of compute," observes Ceze.
This is especially true of natural language processing models, such as OpenAI's GPT-3, that are scaling to a trillion neural weights or parameters and more. As such models scale up, they come with "extreme cost," and "not just in the training time, but also the serving time" for inference.
"That's the case for all the modern machine learning models," says Ceze.
As a consequence, without optimizing the models "by an order of magnitude," the most complicated models aren't really viable in production; they remain merely research curiosities.
But performing optimization with TVM involves its own complexity. "It's a ton of work to get results the way they need to be," explains Ceze.
OctoML simplifies things by making TVM more of a push-button affair. Ceze characterizes the cloud service as "an optimization platform."
"From the end user's point of view, they upload the model, they compare the models, and optimize the values on a large set of hardware targets," says Ceze. He adds that "the key is that this is automatic -- no sweat and tears from low-level engineers writing code."
OctoML does the development work of making sure the models can be optimized for an increasing constellation of hardware. That means "specializing the machine code to the specific parameters of that specific machine learning model on a specific hardware target." For example, an individual convolution in a typical convolutional neural network may become optimized to suit a particular hardware block of a particular hardware accelerator.
The results are demonstrable. In benchmark tests published in September for the MLPerf test suite for neural net inference, OctoML had a top score for inference performance for the venerable ResNet image recognition algorithm in terms of images processed per second.
The OctoML service has been in a pre-release, early access state since December 2020.
To advance its platform strategy, OctoML earlier this month announced it had received $85 million in a Series C round of funding from hedge fund Tiger Global Management, along with existing investors Addition, Madrona Venture Group, and Amplify Partners. The round of funding brings OctoML's total funding to $132 million.
The funding is part of OctoML's effort to spread the influence of Apache TVM to more and more AI hardware. Also this month, OctoML announced a partnership with ARM Ltd., the UK company in the process of being bought by AI chip powerhouse Nvidia. That follows partnerships announced previously with Advanced Micro Devices and Qualcomm. Nvidia is also working with OctoML.
The ARM partnership is expected to spread the use of OctoML's service to the licensees of the ARM CPU core, which dominates mobile phones, networking and the Internet of Things.
The feedback loop will probably lead to other changes besides the design of neural nets. It may more broadly affect how ML is commercially deployed, which is, after all, the whole point of MLOps.
As optimization via TVM spreads, the technology could dramatically increase portability in ML serving, Ceze predicts. Because the cloud offers trade-offs with all kinds of hardware offerings, being able to optimize on the fly for different hardware targets ultimately means being able to move more nimbly from one target to another.
"Essentially, being able to squeeze more performance out of any hardware target in the cloud is useful because it gives more target flexibility," describes Ceze. "Being able to optimize automatically gives portability, and portability gives choice."
That includes running on any available hardware in a cloud configuration, but also choosing the hardware that happens to be cheaper for the same SLAs, such as latency, throughput, and cost in dollars.
"As long as I hit the SLAs, I want to run it as cheaply as possible," says Ceze.
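The selection Ceze describes, hitting the SLAs and then minimizing cost, can be sketched in a few lines of Python. This is an illustration only, not OctoML's service or API: the `Benchmark` record, the target names, and every number below are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Benchmark:
    """Measured results for one model on one hardware target (hypothetical schema)."""
    target: str
    latency_ms: float
    throughput_per_s: float
    cost_per_hour_usd: float


def cheapest_within_sla(
    benchmarks: Iterable[Benchmark],
    max_latency_ms: float,
    min_throughput_per_s: float,
) -> Optional[Benchmark]:
    """Return the lowest-cost target that meets both SLAs, or None if none does."""
    eligible = [
        b for b in benchmarks
        if b.latency_ms <= max_latency_ms
        and b.throughput_per_s >= min_throughput_per_s
    ]
    return min(eligible, key=lambda b: b.cost_per_hour_usd, default=None)


# Made-up numbers, for illustration only.
BENCHMARKS = [
    Benchmark("target-a", latency_ms=42.0, throughput_per_s=120.0, cost_per_hour_usd=0.20),
    Benchmark("target-b", latency_ms=6.0, throughput_per_s=1500.0, cost_per_hour_usd=3.00),
    Benchmark("target-c", latency_ms=11.0, throughput_per_s=900.0, cost_per_hour_usd=1.10),
]

choice = cheapest_within_sla(BENCHMARKS, max_latency_ms=30.0, min_throughput_per_s=500.0)
print(choice.target if choice else "no target meets the SLAs")
```

In this toy data, the cheapest target misses the latency requirement and the fastest one costs the most, so the middle option is chosen. The wider the set of targets a model can be optimized for, the more such options there are to choose from.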