And yet, Toon, who was in town last week from Graphcore's headquarters in Bristol, England, told ZDNet that software is at the heart of the very large challenge of increasingly large AI problems, whereas the hardware, though by no means trivial, is, in a sense, secondary.
"You can build all kinds of exotic hardware, but if you can't actually build the software that can translate from a person's ability to describe at a very simple level into hardware, you're not really producing a solution," said Toon over lunch at The Grey Dog cafe in Manhattan's Union Square neighborhood.
It was one of many trips Toon has made since pandemic lockdowns faded, and he is relishing getting back to meeting customers face to face. "It's good to be traveling again," he said.
Among the points emphasized on his swing through the U.S. is the software factor: specifically, the capability of Graphcore's Poplar software, which translates programs written in AI frameworks such as PyTorch and TensorFlow into efficient machine code.
It is, in fact, the act of translation that is key to AI, Toon argues. No matter what hardware you build, the challenge is how to translate from what the PyTorch or TensorFlow programmer is doing to whatever transistors are available.
A common conception is that AI hardware is all about speeding up matrix multiplications, the building block of neural-net weight updates. Fundamentally, Toon argues, it is not.
"Is it just matrix multiplication, and is it just convolutions that we need, or are there other operations that we need?" asked Toon, rhetorically.
In fact, he said, "it's much more about the complexity of the data."
A large neural net, said Toon, such as GPT-3, is "really an associative memory," so that connections between the data are what are essential, and movement of things in and out of memory becomes the bottleneck for computing.
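Toon's point about memory as the bottleneck can be quantified with arithmetic intensity, the ratio of arithmetic operations to bytes moved. The sketch below uses illustrative hardware numbers, not the specs of any real chip:

```python
# Arithmetic intensity (FLOPs per byte moved) decides whether a workload
# is compute-bound or memory-bound -- the bottleneck Toon points at.
# The hardware numbers below are illustrative assumptions, not any real chip.

peak_flops = 250e12      # 250 TFLOP/s of compute (assumed)
mem_bw     = 1.0e12      # 1 TB/s of memory bandwidth (assumed)
balance    = peak_flops / mem_bw   # FLOPs per byte needed to keep the chip busy

# An associative lookup does roughly 1 FLOP per 4-byte value it reads:
lookup_intensity = 1 / 4
print(balance, lookup_intensity)   # 250.0 vs 0.25: memory-bound by 1000x
```

On these assumed numbers, memory-heavy lookups deliver a thousandth of the intensity needed to keep the arithmetic units fed, which is why data movement, not multiplication, dominates.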
Toon is intimately familiar with such a connection problem. He spent fourteen years, he recalled, at programmable chip maker Altera, later bought by Intel. A programmable logic chip, known as an "FPGA," operates by having its compute blocks, called cells, linked for each task by burning a fuse between them.
"All of the software" of an FPGA "is about how you take the graph, which is your net-list or your RTL, and translate it into the interconnect inside the FPGA," he explained.
Such software tasks become very complex.
"You build a hierarchy of interconnects inside the chip to try to make that work, but, from a software point of view, it's an NP-hard problem to map a graph to the interconnects," he said, referring to a class of computational problems for which no efficient general solution is known.
Because it is about translating complexity of associations into transistors, "it's really a graph problem, that's why we named the company Graphcore," said Toon. In general terms, a graph is the totality of interdependencies between different computing tasks in a given program.
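The mapping problem Toon describes can be sketched in miniature: represent a program as a graph of operations and dependencies, then assign each operation to a hardware tile. Because exact placement is NP-hard, real tools use heuristics; the greedy pass below is only an illustration, not Poplar's actual algorithm, and the operation names are hypothetical:

```python
import math

# Toy compute graph: each node is an operation, each edge a data dependency.
GRAPH = {
    "input":   [],
    "matmul1": ["input"],
    "relu":    ["matmul1"],
    "matmul2": ["relu"],
    "softmax": ["matmul2"],
}

def greedy_place(graph, num_tiles):
    """Greedily assign each op to the tile holding most of its producers,
    subject to a per-tile capacity, to reduce cross-tile traffic."""
    cap = math.ceil(len(graph) / num_tiles)
    placement, load = {}, [0] * num_tiles
    for node in graph:  # dict insertion order is already topological here
        counts = [0] * num_tiles
        for dep in graph[node]:
            counts[placement[dep]] += 1
        candidates = [t for t in range(num_tiles) if load[t] < cap]
        best = max(candidates, key=lambda t: counts[t])
        placement[node] = best
        load[best] += 1
    return placement

print(greedy_place(GRAPH, num_tiles=2))
```

Even on this five-node chain, the heuristic is forced to cut one edge between tiles; at the scale of real neural nets, minimizing such cuts is exactly the hard graph problem Toon is describing.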
"You have to start from a computer science approach that says, It's going to be graphs, you need to build a processor to work on graphs, to do highly parallel graph processing."
"We build the software up from that, we build the processor down," he said.
That means the hardware is just serving the software. "The computer follows the data structure," Toon contends. "It's a software problem."
That is Toon's chance to riff on Nvidia's CUDA software, which holds tremendous power in the AI world.
"It's interesting: many people say that CUDA is somehow this ecosystem that makes it impossible for anybody else to compete," observed Toon. "But what you misunderstand is nobody programs in CUDA, nobody wants to program in CUDA, people want to program in TensorFlow and now PyTorch, and next, JAX — they want a high-level construct," he said, referring to the various open-source development libraries built by Meta and Google and others.
"All of those frameworks are graph frameworks," he observed, "you're describing quite an abstract graph, with large operators at the core of each of the elements inside the graph."
Nvidia, noted Toon, "have built an amazing set of libraries to translate from that high-level abstraction that the programmer is comfortable with — that's the piece that Nvidia have done, not CUDA necessarily."
Enter Graphcore's competing offering, Poplar, the thing that translates from PyTorch and the rest to what he contends is Graphcore's more-efficient hardware. Poplar takes apart the compute graph and translates it to whatever gates are in Graphcore hardware today, and whatever will replace those gates tomorrow.
There is, however, skepticism about Graphcore, and about the many other young hopefuls, such as Cerebras Systems and SambaNova, competing with Nvidia on those libraries to which Toon refers. In an editorial in April, editor Linley Gwennap of the prestigious Microprocessor Report wrote that "software, not hardware," is still the issue. Gwennap argues the window for Graphcore and others to close the gap is narrowing because Nvidia keeps getting better with hardware improvements such as Hopper.
Are the skeptics such as Gwennap not appreciating the progress of Poplar software?
"It's a journey," said Toon. "If you were to engage with Poplar two years ago, you would have said, It's not good enough; if you engage with Poplar now, you'd say, it's actually pretty good.
"And in two years time, people will say, Wow, this allows me to do things I can't do on a GPU."
The software is already, Toon asserts, its own expanding ecosystem. "Look at the ecosystem that we've created around Poplar, things like PyTorch Lightning, PyTorch Geometric," two extensions to PyTorch that are ported to Poplar and the Graphcore IPU chips.
"It's not just TensorFlow, it's a whole suite," he said. "TensorFlow is fine for an AI researcher, but it's not an environment in which a data scientist, an individual, or a very large enterprise can come and just play with."
Practitioners, versus scientists, need accessible tools. "We work with Hugging Face, Weights and Biases," among other machine learning tools, he noted. "There's many other libraries that are coming out, there are companies who are building services on top of IPUs," and, "there's MLOps that have been ported to work with Poplar."
Graphcore is "much farther ahead in terms of building out that software ecosystem to create that ease of use, that way people can come in," he said, compared to Cerebras and other competitors.
It is, in fact, coming down to a software duopoly, Toon maintains. "You look at anybody else, even the big companies, there's nobody else that's got that apart from us and Nvidia, that range of ecosystem."
Meantime, he claims, Nvidia's hardware advancements are not all they're cracked up to be, because Nvidia's freedom of design is constrained by its own success. "What's Nvidia doing? They've added Tensor cores, and now they've added Transformer cores — they can't change the fundamental core of the processor because if they did, all the libraries would have to be thrown away."
"On some models, like a graph neural network, for example, we're seeing five or ten times the performance" as compared to Nvidia-based machines, he said, "because the data structure, the underlying architecture that we built inside the IPU, is aligned much more so to that kind of sparse, graph type of computation."
The Poplar software has also achieved two to three times speed-up in running Transformer models by finding clever ways to parallelize elements of the graph, he said.
The argument that software is a battleground, and that Nvidia can face real competition, rests on the premise that AI itself is still very much evolving. There is lots of runway for AI programs to get bigger, Toon maintains, straining the capabilities of compute.
And the fundamental problem of cracking the code of human cognition is still remote.
On the first score, programs are, indeed, getting larger all the time.
Today's biggest AI models, such as Nvidia and Microsoft's Megatron-Turing NLG, a natural-language generative model derived from the Transformer innovation of 2017, have half a trillion parameters, or weights, the elements in a neural network that get tuned and that are the analog of synapses in real human neurons.
"As you increase the number of parameters, and you increase the amount of data," observed Toon, "you're increasing the compute by a multiplicand of those two pieces, and that's why there are these massive GPU farms that are evolving."
There is little controversy about the trend toward bigger and bigger, given that Graphcore, Nvidia, and everyone else are building more and more powerful machines to run such models.
Toon is interested in the second point, however, the computer science question of whether anything useful gets done with all that, and whether it can approach human intelligence.
"The challenge around that is, you know, if you had a model that was one-hundred trillion parameters, would that make it as clever as a person?" said Toon.
It is a problem of not merely throwing transistors at the matter, but one of designing a system.
"You know, do we actually know how to train it?" meaning, train the neural net once it has 100 trillion weights. "Do we know how to give it the right information? Do we know how to construct that model in a way that it would actually match the intelligence of a human, or would it be so inefficient despite many more parameters?"
In other words, "Could we actually know how to build a machine that would match the intelligence of the brain?"
One answer, he offers, is specialization. A model of a hundred trillion parameters could be really good at something defined narrowly. "In a system like [DeepMind's algorithm for playing] Atari, you've got enough constraints that you can understand that world," said Toon.
Similarly, "Maybe we could build enough understanding, for example, of how cells work, and how DNA converts to RNA through to proteins, that then you could have a reinforcement learning system that would use that understanding to work out, for example, Okay, so how can I fold the proteins in such a way that if it bonds to this cell, and I could communicate with the cell — and let's say the cell was a cancerous cell — I could attach a drug to that protein, and it could fix the cancer," Toon mused.
"It would be a bit like the Atari game that DeepMind built that became superhuman, a system superhuman at killing cancers — it would be specialized."
Another approach, he suggests, is a "more general understanding of the world" that would be akin to how human infants learn, by "being exposed to huge amounts of data about the world." The hundred-trillion synapse problem would then become one of constructing "hierarchies," said Toon.
"Humans build a hierarchy of understanding of the world," he said, and then they "interpolate" by filling in the blanks. "You use the things you do know to extrapolate, and imagine," he said.
"Humans are very bad at extrapolating; what we're much better at is interpolation, you know, there's some piece that's missing — I know this, I know this, and this is somewhere in the middle."
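Toon's distinction can be seen in a toy numeric example, not from the interview: fit a straight line to samples of a curved function, then compare the error inside the sampled range (interpolation) against the error outside it (extrapolation):

```python
# Toy illustration of interpolation vs. extrapolation: fit a line to
# samples of a curved "world," then predict inside and outside the data.

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

f = lambda x: x ** 2                      # the "world" being modeled
xs = [0, 1, 2, 3, 4]
a, b = fit_line(xs, [f(x) for x in xs])

interp_err = abs((a * 2.5 + b) - f(2.5))  # inside the data: small
extrap_err = abs((a * 8 + b) - f(8))      # outside the data: large
print(interp_err, extrap_err)
```

Between known points the model's guess is close; far outside them the error balloons, which is the "somewhere in the middle" advantage Toon describes.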
Toon's thinking about hierarchy echoes some theorists in the field, including Meta's AI chief scientist, Yann LeCun, who has spoken of composing hierarchies of understanding in neural nets. Toon indicated that he is in agreement with aspects of LeCun's thinking.
Framed from that standpoint, said Toon, the AI challenge becomes one of "how do you build a big enough understanding of the world that you are doing a lot more interpolation than extrapolation?"
And that challenge would be one, he believes, of highly "sparse" data, updates to the parameters from small, follow-on pieces of data, rather than large amounts of re-training on all data.
"Even within a specific thing you're updating the world about, you might have to touch different points of your understanding of the world," explained Toon. "It might not all be neatly together in one spot, the data is very messy and very sparse."
From a computing standpoint, "You would end up with many different pieces of parallel operation," he said, "all of which are very highly sparse, because they're working on different pieces of the data."
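The pattern Toon sketches, where new information touches a few scattered parameters rather than retraining everything, looks something like the following. A plain dictionary stands in for a parameter store, and the indices and values are hypothetical:

```python
# Sketch of the sparse-update pattern Toon describes: a "messy, sparse"
# piece of new information touches a handful of far-apart parameters
# rather than re-training the whole model.  Indices are hypothetical.

params = {i: 0.0 for i in range(1_000_000)}  # dense parameter store

def sparse_update(params, grads, lr=0.1):
    """Apply gradients that touch only a few scattered indices."""
    for idx, g in grads.items():
        params[idx] -= lr * g
    return len(grads)  # how many parameters were actually touched

# One sparse piece of new information: three scattered updates.
touched = sparse_update(params, {17: 2.0, 404_321: -1.0, 999_999: 0.5})
print(touched, params[17], params[404_321])
```

Of a million parameters, only three change; parallelizing many such scattered, independent touches across chips is the workload Toon says aligns with graph-style hardware.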
Both ideas, the interpolation, and the more-specific cancer-killer model, are in keeping with ideas laid out by Toon's co-founder, Graphcore's CTO, Simon Knowles, who has talked about "distilling" a more-general, very large neural net down to something specific.
They are also both ideas which would seem to play to the notion of the Poplar software as serving a key function. If new pieces of data are sparse, filling in gaps, and have to be drawn from many places in an associative memory, and across many graph operations, then Poplar has an important role as a sort of traffic cop to distribute such data and tasks in parallel amongst the IPU chips.
Despite laying out that scenario, Toon is by no means ideological. He is mindful that no one yet has the answer for sure. "I think there are different philosophies and different ideas of how that works, but nobody quite knows," he said of the field. "And that's what people are exploring."
When do all the deep questions get answered? Probably not soon.
"The amazing thing about AI is, we're ten years in from AlexNet, and we still feel like we're exploring," he said, referring to the neural network that famously excelled at the ImageNet competition in 2012, bringing deep learning forms of AI to the fore.
"I always use the analogy of computer gaming," said Toon, who wrote a Star Wars game for his first computer, a 6502 kit from a "long-forgotten" computer maker. "We're probably still at the stage of Pac Man, we haven't yet gotten to three-dimensional games," as far as AI's evolution, he said.
Along the way to 3-D games, "I don't think there'll be an AI winter," opined Toon, referring to the many times over the decades that funding has dried up and cratered the industry.
"The difference today is, it works, this is real," he said. In prior eras, with AI companies such as Thinking Machines in the 1980s, "it just didn't work, we didn't have enough data, we didn't have enough compute.
"Now, it's clear it works, there are clear proof points, it's bringing massive value," said Toon. "People are building whole businesses on the back of it," he said. "I mean, ByteDance and TikTok is fundamentally an AI-driven company, and it's just a matter of how fast it's permeating across the whole tech space, and into the enterprises."
Battles between tech giants, such as TikTok versus Meta's Instagram, can be seen as battles of AI, an arms race to have the best algorithms.