The race is on to create one neural network that can process multiple kinds of data -- a more-general artificial intelligence that doesn't discriminate about types of data but instead can crunch them all within the same basic structure.
Multi-modality, as this genre of neural network is called, is seeing a flurry of activity in which different kinds of data, such as image, text, and speech audio, are passed through the same algorithm to produce a score on different tests such as image recognition, natural language understanding, or speech detection.
And these ambidextrous networks are racking up scores on benchmark tests of AI. The latest achievement is what's called "data2vec," developed by researchers at the AI division of Meta (parent of Facebook, Instagram, and WhatsApp).
The point, as Meta researchers Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli reveal in a blog post, is to approach something more like the general learning ability that the human mind seems to encompass.
"While people appear to learn in a similar way regardless of how they get information -- whether they use sight or sound, for example -- there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities," the blog post states.
The main point is that "AI should be able to learn to do many different tasks, including those that are entirely unfamiliar."
Meta's CEO, Mark Zuckerberg, offered a quote about the work and its ties to a future Metaverse:
People experience the world through a combination of sight, sound, and words, and systems like this could one day understand the world the way we do. This will all eventually get built into AR glasses with an AI assistant so, for example, it could help you cook dinner, noticing if you miss an ingredient, prompting you to turn down the heat, or more complex tasks.
The name data2vec is a play on the name of a program for language "embedding" developed at Google in 2013 called "word2vec." That program predicted how words cluster together, and so word2vec is representative of a neural network designed for a specific type of data, in that case text.
In the case of data2vec, however, Baevski and colleagues are taking a standard version of what's called a Transformer, developed by Ashish Vaswani and colleagues at Google in 2017, and extending it to be used for multiple data types.
The Transformer neural network was originally developed for language tasks, but it has been widely adapted in the years since for many kinds of data. Baevski et al. show that the Transformer can be used to process multiple kinds of data without being altered, and the trained neural network that results can perform on multiple different tasks.
In the formal paper, "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language," Baevski et al. train the Transformer for image data, speech audio waveforms, and text language representations.
The very general Transformer serves as what is called pre-training, which can then be applied to specific neural networks in order to perform specific tasks. For example, the authors use data2vec as pre-training to equip what's called "ViT," the "vision Transformer," a neural network specifically designed for vision tasks that was introduced last year by Alexey Dosovitskiy and colleagues at Google.
When used on ViT to try to solve the standard ImageNet test of image recognition, their results come in at the top of the pack, with accuracy of 84.1%. That's better than the score of 83.2% received by a team at Microsoft, led by Hangbo Bao, that pre-trained ViT last year.
And the same data2vec Transformer outputs results that are state-of-the-art for speech recognition and that are competitive, if not the best, for natural language understanding:
Experimental results show data2vec to be effective in all three modalities, setting a new state of the art for ViT-B and ViT-L on ImageNet-1K, improving over the best prior work in speech processing on speech recognition and performing on par to RoBERTa on the GLUE natural language understanding benchmark.
The crux is that this happens without any modification of the neural network to make it specific to images, and the same goes for speech and text. Instead, every input type goes into the same network and completes the same very general task. That task is one that many Transformer networks use, known as "masked prediction."
The way that data2vec performs masked prediction, however, is an approach known as "self-supervised" learning. In a self-supervised setting, a neural network is trained or developed by having to pass through multiple stages.
First, the network constructs a representation of the joint probability of data input, be it images or speech or text. Then, a second version of the network has some of those input data items "masked out," left unrevealed. It has to reconstruct the joint probability that the first version of the network had constructed, which forces it to create increasingly better representations of the data by essentially filling in the blanks.
The two networks, the one with the full pattern of the joint probability, and the one with the incomplete version that it is trying to complete, are called, sensibly enough, "Teacher" and "Student." The Student network tries to develop its sense of the data, if you will, by reconstructing what the Teacher has already achieved.
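The Teacher and Student arrangement described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Meta's implementation: `toy_encoder`, `choose_mask`, `apply_mask`, and `masked_regression_loss` are hypothetical names, the "network" is a single neighbor-averaging function rather than a Transformer, masked inputs are simply zeroed out, and plain mean squared error stands in for whatever loss the authors use.

```python
import random

MASK_VALUE = 0.0  # hypothetical stand-in for a learned "mask" embedding


def toy_encoder(sequence, weight):
    """Hypothetical one-parameter 'network': each output mixes a position
    with its two neighbors, so context can help fill in a gap."""
    n = len(sequence)
    outputs = []
    for i in range(n):
        left = sequence[i - 1] if i > 0 else 0.0
        right = sequence[i + 1] if i < n - 1 else 0.0
        outputs.append(weight * (left + sequence[i] + right) / 3.0)
    return outputs


def choose_mask(length, mask_ratio, rng):
    """Pick the positions that will be hidden from the Student."""
    count = max(1, int(length * mask_ratio))
    return sorted(rng.sample(range(length), count))


def apply_mask(sequence, masked_positions):
    """Replace the hidden positions with the mask value."""
    hidden = set(masked_positions)
    return [MASK_VALUE if i in hidden else x for i, x in enumerate(sequence)]


def masked_regression_loss(student_out, teacher_out, masked_positions):
    """Mean squared error, counted only at the masked positions."""
    errors = [(student_out[i] - teacher_out[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


rng = random.Random(0)
inputs = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5, 0.8, 0.3]
masked_positions = choose_mask(len(inputs), mask_ratio=0.25, rng=rng)

# The Teacher sees the full input; the Student sees the input with gaps.
teacher_targets = toy_encoder(inputs, weight=1.0)
student_guess = toy_encoder(apply_mask(inputs, masked_positions), weight=1.0)

loss = masked_regression_loss(student_guess, teacher_targets, masked_positions)
print("masked positions:", masked_positions, "loss:", round(loss, 4))
```

In real training, the Student's weights would be adjusted to shrink that loss; the sketch stops at computing it, which is enough to show that the Student is scored only on the blanks it had to fill in.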
How is the neural network performing Teacher and Student for three very different types of data? The key is that the "target" of joint probability in all three data cases is not a specific output data type, as is the case in versions of the Transformer for a specific data type, such as Google's BERT or OpenAI's GPT-3.
Rather, data2vec grabs a set of layers from inside the neural network, somewhere in the middle, that represent the data before it is produced as a final output.
As the researchers write, "One of the main differences of our method […] other than performing masked prediction, is the use of targets which are based on averaging multiple layers from the teacher network." Specifically, "we regress multiple neural network layer representations instead of just the top layer," so that "data2vec predicts the latent representations of the input data."
They add, "We generally use the output of the FFN [feed-forward network] prior to the last residual connection in each block as target," where a "block" is the Transformer equivalent of a neural network layer.
The point is that every data type that goes in becomes the same challenge for the Student network of reconstructing something inside the neural network that the Teacher had composed.
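To make the idea of "averaging multiple layers from the teacher network" concrete, here is a minimal Python sketch. It rests on assumptions: `average_top_layers` and `top_k` are hypothetical names, the choice to average the last few layers is made for illustration, the numbers are invented, and any normalization a real system might apply to each layer is omitted.

```python
def average_top_layers(layer_outputs, top_k):
    """Average the last `top_k` layers, position by position.

    layer_outputs[l][t] is the feature vector (a list of floats) that
    layer l of the Teacher produced for position t of the input.
    """
    chosen = layer_outputs[-top_k:]
    num_positions = len(chosen[0])
    dim = len(chosen[0][0])
    targets = []
    for t in range(num_positions):
        targets.append([
            sum(layer[t][d] for layer in chosen) / top_k
            for d in range(dim)
        ])
    return targets


# Three hypothetical Teacher layers, two input positions, two features each.
teacher_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [4.0, 0.0]],
    [[4.0, 6.0], [0.0, 2.0]],
]

targets = average_top_layers(teacher_layers, top_k=2)
print(targets)
```

The Student is then asked to predict these averaged vectors at the masked positions, which is why the target looks the same whether the input started out as pixels, audio, or words.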
This averaging is different from other recent approaches to building One Network To Crunch All Data. For example, last summer, Google's DeepMind unit offered up what it calls "Perceiver," its own multi-modal version of the Transformer. The training of the Perceiver neural network is the more-standard process of producing an output that is the answer to a labeled, supervised task such as ImageNet. In the self-supervised approach, data2vec isn't using those labels; it's just trying to reconstruct the network's internal representation of the data.
Even more ambitious efforts wait in the wings. Jeff Dean, head of Google's AI efforts, in October teased "Pathways," calling it a "next generation AI architecture" for multi-modal data processing.
Mind you, data2vec's very general approach to a single neural net for multiple modalities still has a lot of information about the different data types. Image, speech, and text are all prepared by pre-processing of the data. In that way, the multi-modal aspect of the network still relies on clues about the data, what the team refer to as "small modality-specific input encoders."
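A rough Python sketch of what "small modality-specific input encoders" might look like follows. Everything in it is a hypothetical simplification: `encode_text`, `encode_audio`, `encode_image`, and `shared_trunk` are invented names, the encoders only chop the data into fixed-width vectors, and the trunk is a placeholder for the shared Transformer rather than a model of it. The only point is that each modality gets its own front end while the rest of the pipeline sees the same kind of input.

```python
DIM = 4  # hypothetical shared embedding width


def pad_or_trim(values, dim=DIM):
    """Force a list of numbers to exactly `dim` entries."""
    values = list(values)[:dim]
    return values + [0.0] * (dim - len(values))


def encode_text(text):
    """Hypothetical text encoder: one vector per whitespace-separated word."""
    return [pad_or_trim(ord(ch) / 128.0 for ch in word) for word in text.split()]


def encode_audio(samples, frame=4):
    """Hypothetical audio encoder: one vector per fixed-size frame of samples."""
    return [pad_or_trim(samples[i:i + frame]) for i in range(0, len(samples), frame)]


def encode_image(pixels, patch=2):
    """Hypothetical image encoder: one vector per patch x patch tile."""
    rows, cols = len(pixels), len(pixels[0])
    sequence = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            tile = [
                pixels[r + dr][c + dc]
                for dr in range(patch)
                for dc in range(patch)
                if r + dr < rows and c + dc < cols
            ]
            sequence.append(pad_or_trim(tile))
    return sequence


def shared_trunk(sequence):
    """Placeholder for the shared network: it only needs vectors of width DIM."""
    if any(len(vector) != DIM for vector in sequence):
        raise ValueError("every input vector must have width DIM")
    return [[sum(vector) / DIM] * DIM for vector in sequence]


text_seq = encode_text("turn down the heat")
audio_seq = encode_audio([0.1 * i for i in range(10)])
image_seq = encode_image([[0.1 * (r * 4 + c) for c in range(4)] for r in range(4)])

for name, seq in [("text", text_seq), ("audio", audio_seq), ("image", image_seq)]:
    print(name, len(seq), "vectors ->", len(shared_trunk(seq)), "outputs")
```

Once the data has passed through its own encoder, nothing downstream needs to know which modality it came from, which is the sense in which the network is general while still relying on clues about the data.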
We are not yet at a world where a neural net is trained with no sense whatsoever of the input data types. We are also not at a point in time when the neural network can construct one representation that combines all the different data types, so that the neural net is learning things in combination.
That fact is made clear from an exchange between ZDNet and the researchers. ZDNet reached out to Baevski and team and asked, "Are the latent representations that serve as targets a combined encoding of all three modalities at any given time step, or are they usually just one of the modalities?"
Baevski and team responded that it is the latter case, and their reply is interesting enough to quote at length:
The latent variables are not a combined encoding for the three modalities. We train separate models for each modality but the process through which the models learn is identical. This is the main innovation of our project since before there were large differences in how models are trained in different modalities. Neuroscientists also believe that humans learn in similar ways about sounds and the visual world. Our project shows that self-supervised learning can also work the same way for different modalities.
Given data2vec's modality-specific limitations, a neural network that might truly be One Network To Rule Them All remains the technology of the future.