Meta's Data2vec 2.0: Second time around is faster

Meta's generalist network for speech, text, and images returns, but this time with a few tricks to speed up how fast it learns.
Written by Tiernan Ray, Senior Contributing Writer

Meta's Data2vec is an example of a generalist neural network that can use the same exact code to crunch examples of data in different modalities -- in this case, speech, text, and images -- and make predictions about that data.

Baevski et al.

What do you do when you've proven your point in neural networks?

Do it faster is one answer. 

On Tuesday, Meta, the owner of Facebook, Instagram, and WhatsApp, unveiled Data2vec 2.0, a revamp of a neural network introduced earlier this year that behaves as a sort of generalist, performing on tasks that involve text, image, and speech data with the same basic approach to all three. 

The second time around, Meta's scientists made the program faster and, in a few cases, more accurate on benchmark tests of machine learning tasks.

"Data2vec 2.0 shows that the training speed of self-supervised learning can be substantially improved with no loss in downstream task accuracy," write authors Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli, four of the authors of the original Data2vec paper, in this new work, "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language," posted on arXiv.

The singular accomplishment of this second Data2vec is to reduce the time it takes to train Data2vec. Training a neural net is typically measured in terms of "epochs," meaning the number of times the neural net is given the training examples. It can also be measured by the wall clock time, the literal hours, minutes, and days counted from start to finish.

"Experiments show that Data2vec 2.0 can reach the same accuracy as many popular existing algorithms in 2-16x the training speed," they write.

The name Data2vec is a play on the name of a program for language "embedding" developed at Google in 2013 called Word2vec. That program predicted how words cluster together, and so Word2vec is representative of a neural network designed for a specific type of data, in that case text. 

In the case of Data2vec, however, Baevski and colleagues are taking a neural network called a Transformer, developed by Ashish Vaswani and colleagues at Google in 2017, and extending it to be used for multiple data types. The same structure of the neural network can serve to train all three -- image, speech, and text -- without being altered to suit the particularities of any of those, making it a generalist program. 

Baevski and colleagues extend the Transformer to what's called "self-supervised" learning. In a self-supervised setting, a neural network is trained by having to pass through multiple stages whose results are compared to each other.

First, the network compresses a data sample, what's known as constructing a representation of input data. Then, a second version of the network has some of those input data items "masked out," left unrevealed. It has to reconstruct the representation that the first version of the network had constructed, which forces the second network to build a better model of how data fit together by essentially filling in the blanks.

The two networks -- the one with the compressed representation of the full, unmasked input data, and the one with the incomplete version that it is trying to complete -- are called, sensibly enough, Teacher and Student, respectively. The Student network tries to develop its sense of the data, if you will, by reconstructing what the Teacher has already achieved despite the masking.
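The Teacher-and-Student arrangement can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Meta's code: the encode function is a hypothetical stand-in for a Transformer (it simply averages each value with its visible neighbors), nothing is actually trained, and how the real Teacher's weights relate to the Student's is not modeled. It shows only the shape of the objective: the Teacher sees everything, the Student sees a masked copy, and the error is measured at the masked positions.

```python
def encode(values):
    """Hypothetical toy encoder: each position's representation is the mean
    of itself and its immediate neighbors, skipping masked (None) entries."""
    reps = []
    for i in range(len(values)):
        window = [v for v in values[max(0, i - 1): i + 2] if v is not None]
        reps.append(sum(window) / len(window) if window else 0.0)
    return reps


def mask(values, masked_positions):
    """Hide the chosen positions by replacing them with None."""
    return [None if i in masked_positions else v for i, v in enumerate(values)]


def student_loss(sample, masked_positions):
    """Mean squared error between Student and Teacher at masked positions.
    masked_positions must be non-empty."""
    teacher_targets = encode(sample)                         # Teacher: full input
    student_preds = encode(mask(sample, masked_positions))   # Student: masked input
    errors = [(student_preds[i] - teacher_targets[i]) ** 2
              for i in sorted(masked_positions)]
    return sum(errors) / len(errors)


sample = [0.1, 0.4, 0.3, 0.9, 0.7, 0.2]
loss = student_loss(sample, masked_positions={1, 4})
print(f"loss at masked positions: {loss:.4f}")
```

In a real system the Student's parameters would be adjusted to shrink this loss; here the number only measures how far the masked view falls from the Teacher's full view.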

The authors this time made two key changes to Data2vec to make it faster: using "convolutions," and "amortizing" the compressed representations of the teacher network.

On the first score, the student network that has to predict the representations of the teacher is no longer using the part of the Transformer called a decoder to do so. 

Using a decoder is the standard approach, to de-compress, in a sense, the compressed representations of the teacher network. Instead, the authors use what are called convolutional neural networks, a bedrock tool in neural nets to represent data samples in compressed form, and a tool that is much older than the Transformer. It's a good example of how older technology can stick around in programming.

"Instead of using a Transformer-based decoder, we use a smaller convolutional decoder, which we find to be easier and faster to train," they write. 

For the second change, instead of repeatedly creating a compressed representation in the teacher network, the new Data2vec creates the representation only once. It then reuses that as the target, the thing to be guessed, for each of the masked data points.

As the authors put it, "In order to amortize the cost of the teacher model computation, we reuse the teacher representation for multiple masked versions of the training sample.

"Concretely, we consider M different masked versions of the training sample and compute the loss with respect to the same target representation."

The architecture of Data2vec 2.0. Meta this time has replaced the second part of the program, what had been a Transformer-based decoder, with a decoder that is based on convolutional neural networks, an older technology. They also reused the compressed representations of the "teacher" network as a single target for multiple masked instances of the "student" network's data.

Baevski et al 2022

In the results section of the paper, Baevski and team relate how they both cut training time and improved accuracy across all three domains of image recognition, speech recognition, and natural language processing. 

For image processing, the authors used Data2vec as the basis for fine-tuning what's called "ViT," the "vision Transformer," a neural network specifically designed for vision tasks that was introduced last year by Alexey Dosovitskiy and colleagues at Google. The Data2vec program is the pre-trained foundation, and ViT is fine-tuned on top of it, in the terms of the literature.

Compared with January's results, the Data2vec-backed ViT once again topped other neural nets used as a basis for ViT in terms of accuracy on ImageNet, the classic test of assigning labels to images, and it topped the prior version of Data2vec as well.

But in addition to accuracy, the new Data2vec took far fewer training epochs. The prior Data2vec took 800 epochs; this time, that was reduced to 150 epochs. And next to a competing self-supervised network, masked auto-encoders, or MAE, another Meta creation, the training is cut from 1,600 epochs to 100, even as accuracy of the new Data2vec topped MAE. The faster training regimen results in a big reduction in absolute time to train, just 66 hours for Data2vec 2.0 versus 113.6 hours for MAE.
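A quick bit of arithmetic on the figures above shows why epochs and wall-clock time tell different stories. The Python below uses only the numbers quoted in this article; the per-epoch costs rest on the assumption, not stated explicitly here, that the 66 hours and 113.6 hours correspond to the 100-epoch and 1,600-epoch runs respectively.

```python
# Figures quoted in the article
epochs_v1, epochs_v2 = 800, 150            # original Data2vec vs. Data2vec 2.0
epochs_mae, epochs_v2_vs_mae = 1600, 100   # MAE vs. Data2vec 2.0
hours_mae, hours_v2 = 113.6, 66.0          # wall-clock training time

epoch_reduction_vs_v1 = epochs_v1 / epochs_v2
epoch_reduction_vs_mae = epochs_mae / epochs_v2_vs_mae
wall_clock_speedup_vs_mae = hours_mae / hours_v2

# Assumption: the quoted hours belong to the quoted epoch counts.
hours_per_epoch_v2 = hours_v2 / epochs_v2_vs_mae
hours_per_epoch_mae = hours_mae / epochs_mae

print(f"epochs vs. original Data2vec: {epoch_reduction_vs_v1:.1f}x fewer")    # 5.3x
print(f"epochs vs. MAE:               {epoch_reduction_vs_mae:.1f}x fewer")   # 16.0x
print(f"wall clock vs. MAE:           {wall_clock_speedup_vs_mae:.2f}x faster")  # 1.72x
print(f"hours per epoch: {hours_per_epoch_v2:.2f} (2.0) vs. {hours_per_epoch_mae:.3f} (MAE)")
```

Under that assumption, each Data2vec 2.0 epoch costs roughly nine times as much wall-clock time as an MAE epoch, which is why a sixteen-fold cut in epochs shows up as a speedup of about 1.7 times on the clock.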

In speech recognition, the task is to fill in the missing parts of a snippet of an audio file of a spoken phrase. The new Data2vec went up against multiple competing neural nets for speech, including the original Data2vec, and programs called Wav2vec, HuBERT, and WavLM. In no case did Data2vec 2.0 beat those networks, but it "obtains higher accuracy than other models at faster training time." For example, 43 hours of training Data2vec 2.0 reaches accuracy that requires 57 hours for the original Data2vec.

In the third arena, natural language processing, Data2vec 2.0 was tested across a spectrum of challenges comprising the General Language Understanding Evaluation framework, known as GLUE, developed by NYU's Courant Institute of Mathematical Sciences in 2019.

In one test, the network has to predict whether a sentence follows from another -- logical entailment -- while another representative task challenges the network to label a phrase grammatically correct or not.

The 2.0 version of Data2vec went up against the original Data2vec plus two Transformer-based programs: Google's BERT and a revised version called RoBERTa, introduced in 2019 by the Paul Allen School of Computer Science at the University of Washington and Meta. It scores handsomely across the GLUE results while being faster to train.

The total average accuracy score across all the GLUE tasks for this new version is 82.6, just a fraction below the original Data2vec's 82.7, but higher than BERT's 81.2 and higher than RoBERTa's 82.5. But Data2vec 2.0 takes only 28.2 hours to reach that level, less than half the 69 hours it took for the original Data2vec, and much less than the 50.5 hours it takes for RoBERTa.

Baevski and team write that they will extend Data2vec in future to other forms of data beyond speech, image, and text, raising the prospect it can be even more of a generalist. 

One limitation seems likely to stay in place. As with the original Data2vec, the 2.0 version still handles each data type differently when it is first input to the network during training. That means Data2vec hasn't yet developed a completely generic way to handle the data types.

Image, speech, and text are all prepared by pre-processing of the data. In that way, the multi-modal aspect of the network still relies on clues about the data, what the team refers to as "small modality-specific input encoders."

Moreover, each of the compressed encodings from the teacher network is created separately for the three data types. There isn't yet an ability to create a kind of "super-encoding" that will combine all the data types at once into one representation. 

And so, as with Data2vec 1.0, a neural network that might truly be One Network to Rule Them All remains the technology of the future.

As with the original Data2vec, Meta has posted the code on GitHub.
