More than any of the many headline achievements of artificial intelligence -- winning at chess, predicting the folding of proteins, labeling cats and dogs -- the form of AI known as generative AI has captivated the global imagination.
In January, ChatGPT became the fastest-growing software program in history, reaching a hundred million users less than two months after its public debut. It spawned numerous rivals, both proprietary programs such as Google's Bard, and open-source alternatives such as the University of California at Berkeley's Koala. The rush of excitement has prompted an arms race between tech giants Microsoft and Google and their peers, and a surge in the business of AI chip maker Nvidia.
All of this fervent activity has its roots in a simple fact: unlike past AI programs, which mostly produced a numeric score -- a "1" for a cat picture, a "0" for a dog picture -- ChatGPT and image programs such as Stability AI's Stable Diffusion and OpenAI's DALL-E reproduce something of the world.
By outputting a paragraph, a picture, or even the skeleton of a computer program, such programs are mirroring society's creations.
The mirroring aspect is going to increase dramatically in a very short span of time.
Today's generative programs will seem primitive in comparison to the powers of programs that will be prevalent at the end of this year as they output many more kinds of things.
Moving to multiple modalities
What computer scientists call mixed modalities, or "multi-modality," will take center stage, as programs fuse text, images, "point clouds" of physical space, sounds, video, and entire computer functions as smart applications.
The mixed modality will make possible far more capable programs and will contribute to a long-held goal of continuous learning. It may even advance the goal of "embodied AI" by giving a lift to robotics.
"ChatGPT was made for entertainment, and it does a lot of things really well, but it's, sort-of, a demo," said Naveen Rao, founder of AI startup MosaicML, in an interview with ZDNET. "Now we have to start thinking about, well, if I'm using this for a purpose, how do I make that better?"
Rao, whose company was acquired by Databricks for its expertise in running AI programs, now serves as vice president of generative AI at Databricks.
Part of that improvement will be making generative AI more than just a personal "Copilot," like Microsoft's GitHub Copilot, which assists a single individual typing in a chat prompt. The programs will instead become collaborative, for teams, said Emad Mostaque, founder and CEO of Stability AI, in an interview with ZDNET.
"A lot of AI is just used as a one-to-one thing, or it's an autonomous agent," said Mostaque. "It's at iPhone 2G phase now, where it's just a single mode and you cut and paste, whereas I think the most exciting thing is how we can collaborate better and tell better stories with it, and that's not a solitary endeavor."
One of the things that is "fundamentally missing," said Databricks's Rao, "is the multi-modal-ness of the world," given that "large language models are very one-dimensional in that they only see the world through text."
Modalities refer to the nature of the input and the output, such as text, image, or video. A variety of modalities are possible and have been explored with increasing diversity, because the same basic concepts that drive ChatGPT can be applied to any type of input.
"Multi-modality is the way, definitely," said Mostaque. "You'll need models of every type, and maybe if you bring them together, it'll be amazing."
"The language-only stuff got a lot of traction and excitement, and so the media focuses on that, but people are working seriously on other things," said Jim Keller, a renowned computer-chip designer who is CEO of AI chip startup Tenstorrent, in an interview with ZDNET. Keller is betting his company on the prospect that handling mixed modalities will be one of the big AI demands going forward.
A machine for any kind of data
In a large language model, which is the heart of ChatGPT's technology, text is turned into tokens, quantified mathematical representations. The machine then has to predict what is missing, either masked parts of an entire phrase or the latter part of a phrase. It is this act of re-creation that brings about the paragraphs that ChatGPT spits out.
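To make the idea of predicting "the latter part of a phrase" concrete, here is a minimal sketch. It is not how ChatGPT is built: a simple word-pair frequency table stands in for the neural network, and the toy corpus, the word-level tokenizer, and the function name `predict_next` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# A toy corpus (hypothetical); real models train on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish ."

# Toy tokenizer: map each distinct word to an integer id.
words = corpus.split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
token_ids = [vocab[w] for w in words]

# Count which word follows which. This table stands in for a neural network.
following = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    following[current][nxt] += 1


def predict_next(word):
    """Return the continuation seen most often after `word` in the corpus."""
    return following[word].most_common(1)[0][0]


print(token_ids)
print(predict_next("the"))
```

Given "the," the sketch fills in "cat," because that pairing appears most often in the corpus. A real language model does the same kind of gap-filling, but with learned numerical weights rather than raw counts.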
Likewise, in the case of images, the widely used diffusion process -- popularized by Stability AI's Stable Diffusion -- corrupts images with noise, and the act of recreating the original image trains a neural network to generate high-fidelity images.
The same processes of recovering what's missing or corrupted are spreading rapidly to numerous modalities, or types of data. For example, in a recent issue of the journal Nature, University of Washington biologist David Baker and team corrupted the amino acid sequences of proteins via a process they call RFdiffusion. That process trains a neural network to produce, in simulation, a novel synthetic protein that has desired properties.
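The corrupt-then-recover idea can be sketched in a few lines. The noise-blending formula below follows the commonly published form of diffusion models and is an assumption here, not something taken from Stability AI; the function names, the five-value "image," and the `alpha_bar` setting are hypothetical. No network is trained in this sketch: the true noise is handed back to show what a perfect prediction would achieve.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Blend clean pixel values with Gaussian noise.

    alpha_bar near 1.0 keeps the image almost intact;
    alpha_bar near 0.0 leaves mostly noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise, given an estimate of the noise that was mixed in."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]  # a tiny one-dimensional "image"
noisy, true_noise = add_noise(image, alpha_bar=0.5, rng=rng)

# A trained network would have to predict the noise. Passing in the true
# noise shows that a perfect prediction recovers the original image.
restored = remove_noise(noisy, true_noise, alpha_bar=0.5)
```

Training amounts to making the network's noise estimate as close as possible to `true_noise`, so that the recovery step works on images the network has never seen.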
"We have labs for every modality," said Stability AI's Mostaque, who claims his company and OpenAI are "the only two independent multi-modal companies," outside of the tech giants such as Google. That multiple modality includes a lab at Stability AI just for audio, he said, a lab just for code generation, even a lab for biology that works on things such as re-creating fMRI images using the Stable Diffusion technology.
The magic, however, happens when more modalities are combined. The "breakthrough," said Mostaque, came in work last year by Katherine Crowson and several other researchers who trained an image-generating neural network to keep refining its output until the output satisfied a text-based prompt. They found that re-working images to match the "semantic" content of the text improved image quality. Crowson is now at Stability AI, noted Mostaque.
That image-text work has been proceeding swiftly at numerous institutions. The AI researchers at Meta have proposed a combination of text and image machines called CM3Leon that excels at not merely outputting text or outputting images, but carrying out tasks that involve both at the same time, such as identifying objects in a given image or generating captions from a given image.
A richer picture of the world
The combination of multiple modalities starts to build a richer picture of the world for the neural network. Databricks's Rao cites the neuroscience concept of "stereognosis," which means to know the world by sense of touch. If someone asks how much change you have in your pocket, you can feel the coins and tell by size and weight without seeing them. "I have a representation of the world and objects that are actually represented in multiple modalities," he said. "If I can learn concepts that span modalities, then we've done something interesting."
The idea that different senses flesh out understanding is echoed in the multi-modal experiments being carried out. Research is active into how to make so-called "backbone" neural networks that can mix and match a dizzying array of modalities, and they show intriguing performance benefits.
Scholars at Carnegie Mellon University recently offered what they call a "High-Modality Multimodal Transformer," which combines not just text, image, video, and speech but also database table information and time series data. Lead author Paul Pu Liang and colleagues report that they observed "a crucial scaling behavior" of the 10-mode neural network. "Performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks."
Scholars Yiyuan Zhang and colleagues at the Multimedia Lab of The Chinese University of Hong Kong boosted the number of modalities to a dozen in their Meta-Transformer. Its point clouds model 3D vision, while its hyper-spectral sensing data captures electromagnetic energy reflected back from the ground in fly-over images of landscapes.
Making a storybook from multiple modes
The immediate payoff of multi-modality will simply be to enrich the output of a thing such as ChatGPT in ways that go far beyond the "demo" mode. A children's storybook, a book with text passages combined with pictures illustrating the text, is one immediate example. By combining the language and image attributes, the kinds of pictures created by the diffusion process can be more subtly controlled from picture to picture.
As explained by scientists at Google and lead author Wan-Duo Kurt Ma of Victoria University of Wellington in New Zealand, a process known as directed diffusion can move a cat -- or a castle, or a bird -- through various scenes, creating a series of images that afford not only greater control but transitions as in a narrative.
Similarly, Hyeonho Jeong of Korea's Sungkyunkwan University, along with scholars at the Korea Advanced Institute of Science & Technology, came up with yet another twist on diffusion -- latent diffusion -- which they detailed in a recent paper. They claim it gives access to many more details in an image at a low level of granularity.
The result is the ability to generate storybooks where a character moves through different scenarios image by image, like adding knobs to the text prompt to dial in different scenarios. The consistency of the object across images is what they call "Iterative Coherent Identity Injection."
Just as with the protein synthesis at the Baker Lab, the applications of mixed modality can become pretty wild. Another recent paper by Chenyu Tang and colleagues at Cambridge University's Department of Engineering proposes constructing a "digital twin," a computer simulation of the human body, with all the organs and tissues rendered, and the flows of blood and such depicted, by combining data from multiple medical instruments in the same process as Stable Diffusion.
"Both movement sensors (such as accelerometers, EMG sensors, etc.) and biochemical sensors (for detecting disease-corresponding biomarkers, such as saliva sensors, sweat sensors, etc.) can produce specific outputs for the patient," the authors wrote. "Although these outputs have distinct patterns, they all correspond to the same disease."
Special modal masters
How the modalities get put together will be as important as which ones, said Stability AI's Mostaque. "The final bit will be composition, as these building blocks that we build are put into proper software that is AI-first, that reimagines all of this creation, consumption, and these process flows with these cool new tools," he said.
While some massive models such as Google's PaLM LLM or GPT-4 may be called in, a lot of mixed modality will happen as an orchestration of components, he said. "How do you bring together models in really interesting ways, and have many different models working together to achieve the outcomes that you want to really augment that?"
While PaLM and GPT-4 can be powerful, he said, there's ample evidence that "a lot more specialized models can outperform" the biggest programs. As a result, "We're gonna have a lot of specialist models, I think, across the modalities," he said, a process of "de-constructing" the technology into its appropriate roles, "and then some multi-modal models that can do everything, and they're called at the appropriate time for the appropriate thing."
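The orchestration Mostaque describes, with specialists called "at the appropriate time for the appropriate thing," can be sketched as a simple dispatcher. Everything here is hypothetical: the handler functions are placeholders that only label their input, and a real system would call actual models and would likely pick among them with far more nuance than a lookup by modality.

```python
def image_specialist(payload):
    # Placeholder for a model tuned only for images.
    return f"[image specialist] {payload}"


def audio_specialist(payload):
    # Placeholder for a model tuned only for audio.
    return f"[audio specialist] {payload}"


def code_specialist(payload):
    # Placeholder for a model tuned only for code generation.
    return f"[code specialist] {payload}"


def generalist(payload):
    # Placeholder for a large multi-modal model that "can do everything."
    return f"[generalist] {payload}"


# One specialist per modality (hypothetical registry).
SPECIALISTS = {
    "image": image_specialist,
    "audio": audio_specialist,
    "code": code_specialist,
}


def route(modality, payload):
    """Send the request to a specialist if one exists, else to the generalist."""
    handler = SPECIALISTS.get(modality, generalist)
    return handler(payload)


print(route("image", "cat.png"))
print(route("video", "clip.mp4"))
```

The design choice is that specialists are tried first and the generalist is the fallback, which mirrors the claim that specialized models can outperform the biggest programs on their own turf.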
Robotics is the next AI frontier
The mixing of modalities is noteworthy for the realm of embodied AI -- in the form of robotics.
Sergey Levine, associate professor in the electrical engineering department at the University of California at Berkeley, told ZDNET that robotics has a significant role to play in generative AI.
"The multi-modal stuff is quite exciting," added Levine, a member of the University's Berkeley Artificial Intelligence Research facility who also works with teams at Google.
By processing images and text, a multi-modal neural network is already able to produce "high-level robot commands," he said. The code that a roboticist would ordinarily write to instruct a robot can be "fully automated, essentially," said Levine.
"What we want is the ability to quickly and easily command the robots to do stuff," said Levine. "Bridging that gap is something that language models are gonna be great at."
Levine helped oversee an early demonstration at Google that was published recently, called PaLM-E, which the Google researchers call "An Embodied Multimodal Language Model." The robot is able to follow a series of instructions such as "bring me the rice chips from drawer," which the language model breaks down into atomic instructions, such as "go to the drawer," "open the drawer," "pick the green rice chip bag," etc.
A subsequent work, by Google's DeepMind unit, called RT-2, builds upon PaLM-E by adding the ability to generate spatial coordinates for the robot. Levine calls that work "a significant advance."
As with the concept of stereognosis, Levine argues that increasing modalities may bring an enriched model of the world and thereby bring some basic reasoning abilities.
If large language models and diffusion models can integrate the process of "taking previous images and predicting [text] descriptions, and taking previous descriptions and predicting images," said Levine, "now they might start, kind-of, drilling further down in terms of how they understand the world."
A primitive example of world knowledge is a robot bartender that Levine has worked on, which checks people's I.D. "You can actually tell the language model, write me some code for a robot bartender, and it generates some logic to do that, and if someone orders a cup of water, that's not an alcoholic beverage," and therefore doesn't require an I.D. check.
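The kind of logic Levine describes might look something like the following. This is an illustrative guess, not the code his language model produced: the list of alcoholic drinks, the legal age of 21, and the function names are all assumptions.

```python
# Hypothetical menu knowledge; a language model would draw on what it has
# learned about the world rather than a hard-coded list.
ALCOHOLIC_DRINKS = {"beer", "wine", "whiskey"}
LEGAL_AGE = 21  # assumption: varies by jurisdiction


def needs_id_check(drink):
    """Only alcoholic beverages require checking I.D."""
    return drink.lower() in ALCOHOLIC_DRINKS


def serve(drink, age=None):
    """Decide what the robot bartender should do for one order."""
    if not needs_id_check(drink):
        return f"serve {drink}"
    if age is None:
        return "ask for ID"
    if age >= LEGAL_AGE:
        return f"serve {drink}"
    return "refuse service"


print(serve("water"))      # no I.D. needed
print(serve("beer"))       # I.D. not yet shown
print(serve("beer", 30))   # I.D. shown, old enough
```

The world knowledge sits in the first branch: a cup of water is not alcoholic, so the I.D. check is skipped entirely.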
We're going to need a lot more memory
The combination of robotics and multi-modality has more profound implications because it expands the appetite for data dramatically. Today's generative AI such as ChatGPT has no explicit memory. It only works on the last bunch of stuff you typed at the prompt, and after a while, it forgets things from long ago.
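That forgetting can be pictured as a fixed-size window that slides over the conversation. The sketch below is a simplification under stated assumptions: it counts whole words rather than real tokens, the window of six is absurdly small for illustration, and the class name `FixedContext` is hypothetical.

```python
from collections import deque


class FixedContext:
    """Keep only the most recent `max_tokens` words of a conversation."""

    def __init__(self, max_tokens):
        # A deque with maxlen silently drops the oldest items when full.
        self.window = deque(maxlen=max_tokens)

    def add(self, text):
        self.window.extend(text.split())

    def visible(self):
        """Return what the model can still 'see'."""
        return " ".join(self.window)


ctx = FixedContext(max_tokens=6)
ctx.add("my keys are on the kitchen table")
ctx.add("what time is it")
print(ctx.visible())
```

After the second message, the word "keys" has already slid out of the window, so a program working only from this context could no longer answer a question about where the keys are.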
Using mixed modality that includes many more data samples will force generative AI to develop something like a real memory of data. "When we start moving to multi-modal models, now that starts being much more demanding on context," said Levine, "because the current prototype of that model takes in one image, but maybe you want to give it a thousand images.
"Maybe you want to show it a tour of your house so that it knows where everything in your house is, so that when you ask it to bring you the car keys, it can sort of examine its memory and figure out where the car keys are -- now that requires a much longer context."
Video data can be equally if not more critical for letting a robot build a portrait of the world. Those videos, coupled with text and point clouds and other modalities, become a simulator by which a robot can build a model of the world, said Levine. "If these models essentially provide a way to learn very high fidelity simulators, that could have a very significant impact in the future."
Expanding to thousands of images and possibly hours of video, perhaps gigabytes of point-cloud, 3D data, to train multi-modal programs, means ChatGPT and the rest will have to dramatically expand their access to data via a so-called memory bank.
Many efforts are underway to "augment" language models with what's called retrieval from a database. That can be seen in Meta's CM3Leon program, which lets the software dip into a database and find relevant images.
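In outline, retrieval means looking up the most relevant stored item and placing it in the prompt ahead of the question. The sketch below is a stand-in, not Meta's method: it ranks plain-text notes by shared words, where real systems typically compare learned numerical representations, and the database contents and function names are hypothetical.

```python
# A tiny hypothetical "memory bank" of facts.
DATABASE = [
    "The car keys are kept in the bowl by the front door.",
    "The rice chips are in the top kitchen drawer.",
    "The robot charges in the hallway closet.",
]


def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents):
    """Place the retrieved text ahead of the question, as augmented input."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("Where are the car keys?", DATABASE))
```

The language model itself is unchanged; it simply receives a prompt that already contains the fact it needs.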
Efforts such as the Hyena technology at Stanford University and Canada's MILA institute attempt to dramatically expand what can be fed into a program's prompt so that any amount of data can be input, of any modality.
That means that along with mixed modality, the successors to ChatGPT will be able to juggle far greater context -- whole books, series of articles, movies, and records of physical structures in three dimensions. It also means that the context for any task can become much more tailored to an individual or a group's acquired knowledge. Mostaque said such models will not only bring the generalized knowledge of GPT-4, but also specific knowledge, as well as the knowledge of your team, your company, and beyond.
"I think that's the big unlock, when it goes enterprise next year," said Mostaque, referring to the imminent popular adoption of generative AI in corporate settings.
Continuous learning attainable
As multi-modality expands to video and audio and point clouds and all the rest, Keller, the CEO of AI chip company Tenstorrent, believes that more advanced generative models, especially those coming from the open-source software community, will lead to a profound change in the field's distinction between training and inference.
Training is when a neural net is first developed. It is an extremely costly scientific process, with hundreds or even thousands of GPUs used. Inference is when the finished network is used to make predictions for end users, a much less demanding process that is widely deployed as a cloud service.
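The difference between the two phases shows up even in a model with a single weight. The sketch below is a deliberately tiny stand-in for a neural network: the data, learning rate, and function names are assumptions. Training runs a forward pass, a backward pass, and a weight update over and over; inference is a single forward pass with the weight held fixed.

```python
def forward(w, x):
    """Forward pass: the model's prediction for input x."""
    return w * x


def train(data, w=0.0, lr=0.05, epochs=200):
    """Fit w by gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            error = forward(w, x) - y   # forward pass
            grad = 2.0 * error * x      # backward pass: d(error**2)/dw
            w -= lr * grad              # weight update
    return w


# Hypothetical training data following y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = train(data)                # training: many passes, weights change
prediction = forward(w, 4.0)   # inference: one pass, weights fixed
print(round(w, 4), round(prediction, 4))
```

The cost gap in real systems comes from the same asymmetry, multiplied across billions of weights and enormous datasets.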
But "the generative models actually use quite a few features from training in inference," said Keller. A program such as Stability AI's Stable Diffusion, for generating images, updates its neural network during inference, he said. "It is multi-pass: it has a back pass" as well as the typical forward process of predictions, so that "it looks like it's in training mode."
For that reason, "I think the AI engine of the future … will have a fairly diverse set of capabilities that won't look like inference versus training," but more like a fusion of the two.
If Keller is right, the future generative models could be the start of a long-held goal of continuous learning for machine learning, also sometimes called online learning, whereby a generative neural network is not fixed once trained but evolves continually as people use it more.
"I think this is going to be the case," agreed Stability AI's Mostaque. "Continuous learning will be key, because the way we do it now, teaching [the model] the same thing over and over, is not appropriate."
Already, said Mostaque, things such as Stability AI's "Dream Booth," which lets one build a customized version of an image, are moving beyond the rigid notion of re-training a language-image model to something more fluid. He said these become personal avatars and, over the next few months, a kind of hyper-Dream Booth that allows for the personalization of all your images in real time.
"That's why continuous learning will be so important: to enable that continuous process so that it evolves."