If artificial intelligence is going to spread to trillions of devices, those devices will have to operate without a human to run them, a Google executive who leads a key part of the search giant's machine learning software effort told a conference of chip designers this week.
"The only way to scale up to the kinds of hundreds of billions or trillions of devices we are expecting to emerge into the world in the next few years is if we take people out of the care and maintenance loop," said Pete Warden, who runs Google's effort to bring deep learning to even the simplest embedded devices.
"You need to have peel-and-stick sensors," said Warden, describing ultra-simple, dirt-cheap devices that require only tiny amounts of power and cost pennies.
"And the only way to do that is to make sure that you don't need to have people going around and doing maintenance."
Warden was the keynote speaker Tuesday at The Linley Fall Processor Conference, a microprocessor conference held virtually and hosted by chip analysts The Linley Group.
Warden offered the audience, mostly chip industry executives, a wish list, as he put it, for the hardware in such devices.
That wish list includes ultra-low-power chips that do away with complex memory access and file access mechanisms, and instead focus on the repetitive arithmetic operations required in machine learning. Machine learning makes heavy use of linear algebra, consisting of vector-matrix and matrix-matrix multiplication operations.
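To make the "repetitive arithmetic" concrete, here is a minimal Python sketch of a matrix-vector product, the core of a fully connected neural network layer, written out as explicit multiply-accumulate steps. The function name `matvec` and the sample values are illustrative assumptions, not code from Warden's talk.

```python
def matvec(weights, activations):
    """Multiply a matrix (a list of rows) by a vector.

    Returns the output vector and the number of multiply-accumulate
    (MAC) steps performed, to show how the arithmetic adds up.
    """
    outputs = []
    macs = 0
    for row in weights:
        if len(row) != len(activations):
            raise ValueError("row length must match vector length")
        acc = 0
        for w, a in zip(row, activations):
            acc += w * a  # one multiply-accumulate
            macs += 1
        outputs.append(acc)
    return outputs, macs


# Illustrative values only: a 2x3 weight matrix and a 3-element input.
weights = [[1, 2, 3], [4, 5, 6]]
activations = [1, 0, -1]
outputs, macs = matvec(weights, activations)
print(outputs, macs)
```

Every output element repeats the same multiply-and-add pattern, which is the kind of work Warden wants chips to spend their power budget on.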
Embedded deep learning needs chips that have "more arithmetic," said Warden.
"ML workloads are usually compute-bound," he told the audience. "We load a few activation and weight values, and do a lot of arithmetic on them in registers."
Warden's vision is that of self-sufficient devices that would run on battery power, perhaps for years, without needing to connect to a wall socket very often, perhaps not ever.
That would exclude the Raspberry Pi, said Warden, along with anything else that requires "mains power," meaning being plugged into a wall, and anything that draws watts of power from a battery, such as a smartphone.
Instead, "We are aiming at the edge of the edge," said Warden, meaning devices that are even more resource-constrained than cell phones, things such as peel-and-stick sensors that can be used in industrial applications.
"We are really looking at running on devices that are less than a dollar, maybe even 50 cents in price, that have a very small form factor."
Such devices might draw a single milliwatt to operate, he said, which "is really important, because that means you have a device that can run on double-A batteries for a year or two years, or even via energy harvesting from solar or vibration."
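The battery-life claim can be checked with a back-of-the-envelope calculation. The Python sketch below assumes each double-A cell supplies 1.5 volts and about 2.5 amp-hours, and ignores self-discharge and conversion losses; those figures are assumptions for illustration, not numbers Warden gave.

```python
HOURS_PER_YEAR = 24 * 365


def battery_life_years(cell_count, cell_volts=1.5, cell_amp_hours=2.5,
                       draw_milliwatts=1.0):
    """Estimate ideal runtime in years for a constant power draw.

    The default cell voltage and capacity are assumed, rough values
    for an alkaline double-A cell; real cells vary.
    """
    energy_watt_hours = cell_count * cell_volts * cell_amp_hours
    draw_watts = draw_milliwatts / 1000.0
    return energy_watt_hours / draw_watts / HOURS_PER_YEAR


for cells in (2, 4):
    print(cells, "cells:", round(battery_life_years(cells), 2), "years")
```

Under these assumptions, two cells last a little under a year at one milliwatt and four cells a little under two years, which is in line with the range Warden described.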
The challenge at present for deep learning forms of machine learning, Warden told the audience, is that many deep learning neural networks can't run at all on embedded devices because of the widely varying requirements of the many microcontroller platforms that exist.
"We interact with a lot of product teams inside Google trying to build very interesting new products, and product teams at companies all over the world, and we often have to say, No, that's not quite possible yet," Warden told the audience.
"Because what's happening is the technology around deep learning, and the kinds of models that you can actually build on the training side that would be useful for product features, they often can't actually be deployed on the kinds of devices that people have in their actual hardware platforms."
If such models could be made to run on those billions of devices, "they would enable a whole bunch of new experiences for users," he said.
Embedded machine learning of the kind Warden discussed is part of a broader movement called TinyML. Today, examples of TinyML are fairly limited, things such as the wake word that activates a phone, such as "Hey, Google," or "Hey, Siri." (Warden confided to the audience, with a chuckle, that he and colleagues have to refer to "Hey, Google" around the office as "Hey, G," in order not to have one another's phones going off constantly.)
Warden has been leading the software effort to make possible the kinds of ultra-lightweight devices he was talking about. That effort is called TensorFlow Lite Micro, or TF Micro.
Warden and colleagues built on the existing TensorFlow Lite framework that exports trained machine learning models to run on embedded devices. While TF Lite removes some of the complexity of TensorFlow to make it feasible in a smaller-footprint device, TF Micro goes even further, to make machine learning able to run in devices with as little as 20 kilobytes of RAM.
TF Micro was introduced this month in a formal research paper by Warden and colleagues. The researchers had to build a framework that would work across numerous chip instruction sets and with low-power microcontrollers, and they had to design it to support a greatly reduced number of operations, excluding functions such as loading files from external locations.
The team also had to handle refinement of machine learning models for low-resource devices, which meant optimizing the quantization of models, representing operands in 8-bit integer form, say, rather than 32-bit floating point.
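As an illustration of what representing operands in 8-bit integer form involves, the Python sketch below applies a common affine quantization scheme: each real value is approximated as `scale * (q - zero_point)`, with `q` stored as a signed 8-bit integer. The function names and sample weights are assumptions for illustration, and this is not presented as TF Micro's actual implementation.

```python
def quantize_params(min_val, max_val, qmin=-128, qmax=127):
    """Pick a scale and zero point mapping [min_val, max_val] onto int8."""
    # Widen the range to include zero so that 0.0 is represented exactly.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers clamped to the int8 range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [scale * (q - zero_point) for q in q_values]


# Illustrative weights only.
weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
scale, zero_point = quantize_params(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
print(q_weights)
print(dequantize(q_weights, scale, zero_point))
```

Each stored value takes one byte instead of the four needed for a 32-bit float, at the cost of a rounding error on the order of one quantization step.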
What Warden and team settled on is an interpreter that runs multiple models simultaneously. Using an interpreter not only makes it possible to run across the plethora of embedded platforms, it also makes it possible to update machine learning models as they improve without having to recompile models for a given device.
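The benefit of an interpreter can be shown with a toy example in Python. In this sketch the model is plain data, a list of operation names and arguments, and a fixed piece of code walks through it. The operation names and the `interpret` function are hypothetical and are not TF Micro's API.

```python
# Kernels compiled into the device once. The names are hypothetical.
OP_TABLE = {
    "scale": lambda values, factor: [v * factor for v in values],
    "offset": lambda values, amount: [v + amount for v in values],
    "relu": lambda values: [max(v, 0) for v in values],
}


def interpret(model, inputs):
    """Run a model described as a list of (op_name, *args) tuples."""
    values = list(inputs)
    for op_name, *args in model:
        values = OP_TABLE[op_name](values, *args)
    return values


# Two versions of a "model". Shipping the second requires only new data,
# not a rebuild of the interpreter or its kernels.
model_v1 = [("scale", 2), ("relu",)]
model_v2 = [("scale", 2), ("offset", -1), ("relu",)]

print(interpret(model_v1, [-1, 1, 2]))
print(interpret(model_v2, [-1, 1, 2]))
```

Because the model is data rather than compiled code, the same interpreter binary can run an updated model, which is the property the article attributes to TF Micro's design.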
Chips to run TF Micro will have to do things that get around the limited nature of the embedded framework, Warden said. While full-blown TensorFlow supports 1,200 operations, TF Micro only supports a small fraction of those.
As a result, chips for running inference have to be able to "fall back to general-purpose code" rather than supporting every last instruction.
"One of the real drawbacks of a lot of hardware accelerators is that they fail to run a lot of the models that people want to run on them," said Warden. "We want custom accelerators to fall back to run general-purpose code without a massive performance penalty."
Summing up his wish list, Warden told the audience, "Really, what I'm looking for is tens or hundreds of billions of operations per second per milliwatt."
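Warden's target can be restated as an energy budget per operation, since one milliwatt is one thousandth of a joule per second. The short Python sketch below does the conversion; the function name is illustrative, and reading "tens or hundreds of billions" as 10 billion and 100 billion is an assumption.

```python
def joules_per_operation(ops_per_second_per_milliwatt):
    """Energy per operation implied by a throughput-per-milliwatt figure."""
    milliwatt_in_watts = 1e-3  # one milliwatt is 0.001 joules per second
    return milliwatt_in_watts / ops_per_second_per_milliwatt


for target in (10e9, 100e9):  # assumed readings of "tens" and "hundreds"
    femtojoules = joules_per_operation(target) * 1e15
    print(f"{target:.0e} ops/s/mW -> about {femtojoules:.0f} fJ per operation")
```

At 10 billion operations per second per milliwatt, each operation can spend roughly 100 femtojoules; at 100 billion, roughly 10 femtojoules.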
Some of the demands may be beyond what's feasible at present, he acknowledged. "I would love to have megabytes of model storage space instead of kilobytes," although, "I understand that's challenging."
"And, of course, I want it cheaper," he said.
The Linley conference, now in its fifteenth year, has more than 1,000 attendees this year, conference organizer Linley Gwennap told ZDNet, more than three times as many as in prior years, when the event was held in hotel ballrooms in the Silicon Valley area.
The conference continues through today.