
Meta's 'pruning' of Llama 2 model shows path to slimmer AI

Half a neural network can be ripped away without affecting performance, thereby saving on memory needs. But there's bad news, too.
Written by Tiernan Ray, Senior Contributing Writer

Like rows of a marching band that aren't heard, layers of a neural network can be silenced and have little effect on the accuracy of the net's predictions. 

Tiernan Ray/ZDNET

One of the seminal insights of artificial intelligence work in the past decade is that very large AI programs contain smaller sections within them that can do the work of the total program with less memory and fewer operations, thereby speeding up performance and reducing energy use.

That insight, most commonly referred to as the "lottery ticket hypothesis," after a famous 2019 paper by scholars Jonathan Frankle and Michael Carbin (then at MIT, currently at database company Databricks), is now being put to increasingly practical use as companies find ways to shrink AI down to fit on fewer GPU chips, with less memory and bandwidth required.

Also: Move over Gemini, open-source AI has video tricks of its own

In a paper introduced last week, a team of scholars -- from Meta's AI lab, MIT, Cisco Systems, and start-up Zyphra -- showed that removing as much as half of Meta's open-source Llama 2 large language model cut the amount of memory needed by three quarters, with the result that the program could be run on a consumer-grade Nvidia or AMD GPU rather than a huge rack of servers.

"We can remove a substantial fraction of the deepest layers from models with minimal degradation in downstream performance, write Andrey Gromov and colleagues in the paper, somewhat mysteriously titled "The Unreasonable Ineffectiveness of the Deeper Layers" and posted on the arXiv pre-print server

For Llama 2, the authors write, "we can eliminate up to roughly half of the layers before the performance collapses."

The reference to "deep layers" refers to the latter parts of a neural network. Imagine a neural network as ranks of musicians in a marching band. The direction of marching is the way the whole enterprise flows through the data, if you will. At the front of the band might be smaller brass instruments such as trumpets; at the middle of the pack, trombones and tubas; and at the back, the "deep" part, might be percussion instruments such as drums of various sizes and symbols. 

What Gromov and team are seeing is that the drums and cymbals, and perhaps even some tubas, are making no discernible contribution to the sound. They're there but ineffectual; all the output that matters is in the smaller brass and maybe some of the tubas. It's as if you could remove a good chunk of the musicians -- just do without them -- and have a more efficient band.

Also: Generative AI fails in this very common ability of human thought

In actual neural networks, including generative AI programs such as OpenAI's GPT-4, instead of rows of musicians, you have successive layers of neural network "parameters" or "weights" -- mathematical values that successively transform the input data by multiplying and summing it up, and then producing the output, i.e., the prediction.
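In rough sketch form, and using a tiny toy network rather than anything the size of GPT-4, the idea looks something like the following PyTorch snippet. All of the sizes and names here are illustrative, not taken from any real model:

import torch
import torch.nn as nn

hidden_size = 16     # toy size; real language models use thousands
num_layers = 8       # real models stack dozens of such layers

# Each "layer" holds a matrix of weights that multiplies and sums the incoming data.
layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])

x = torch.randn(1, hidden_size)      # a stand-in for the embedded input tokens
for layer in layers:
    x = torch.relu(layer(x))         # transform the data and pass it to the next layer
prediction = x                        # the final layer's output becomes the prediction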

The experimental approach taken by Gromov and team is to "prune" layers of the network to see what removing them does. 

They start by building on insights from other scholars who have tried to take apart OpenAI's GPT to see what's making it tick. For example, a 2022 study by Kevin Meng and team at MIT's Computer Science and Artificial Intelligence Laboratory used a variety of techniques to find out which GPT layers seem to contain information of a factual nature. By following the "information flow," Meng and colleagues deduced that facts are usually stored in the "middle" layers of a deep neural network. 

Also: The best AI chatbots: ChatGPT isn't the only one worth trying

Building on that insight, Gromov and team hypothesize that removing the deep layers -- the percussion and some tubas -- should have little effect on benchmark tests of AI skill that large language models use, such as question answering. They go about that in two steps. 

First, they try a sophisticated approach, which involves measuring which layers are most similar, and dropping ones that seem to add little. It's as if you asked one of two rows of trumpeters to leave. With each pruning step, they continuously test how the modified network performs on tests such as question answering and a basic test of "predicting the next token" that's common for generative AI. 
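A rough sketch of that similarity measure is shown below, assuming the activations entering each layer have already been collected for a batch of text. The function names and the cosine-based distance here are illustrative stand-ins for the paper's method, not its exact code:

import torch
import torch.nn.functional as F

def angular_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Cosine-based distance between two sets of activations, averaged over the batch.
    cos = F.cosine_similarity(a.flatten(1), b.flatten(1), dim=-1)
    return torch.arccos(cos.clamp(-1.0, 1.0)).mean() / torch.pi

def most_redundant_block(hidden_states: list[torch.Tensor], n: int) -> int:
    # Find the start of the n consecutive layers whose removal would change the
    # representations the least: compare the input of layer i with the input of
    # layer i + n, and pick the block where the two are most similar.
    distances = [
        angular_distance(hidden_states[i], hidden_states[i + n])
        for i in range(len(hidden_states) - n)
    ]
    return int(torch.stack(distances).argmin())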


Blocks of a Transformer-based language model contain successive layers. The Meta team tested whether removing layers, starting with the final, or deepest, layers of the network, would affect performance. 

Meta

Then they try an even simpler approach: successively removing layers starting from the back of the neural net. It turns out that in the second case, the simpler case, all they need to do is apply a little re-training of the remaining layers, via what's called fine-tuning, to maintain performance at a relatively constant level. 
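In practice, dropping the deepest blocks of an open-weight model is a short exercise. The sketch below uses the Hugging Face transformers library and the 7-billion-parameter Llama 2 purely as an illustration; the model name, the 25% fraction, and the exact choice of which blocks to keep are simplifications, not the paper's setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumes you have access to the weights
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layers = model.model.layers               # the stack of transformer blocks
keep = int(len(layers) * 0.75)            # e.g., prune the deepest 25% of blocks
model.model.layers = torch.nn.ModuleList(layers[:keep])
model.config.num_hidden_layers = keep

# A short round of fine-tuning on the remaining layers -- the paper uses
# parameter-efficient methods -- would then "heal" the pruned model before evaluation.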


Up to about half the layers of a neural net can be removed, as shown in the blue and black lines, while accuracy, at left, remains about the same as the baseline, the normal, untouched neural net. Past about forty-five percent of layers removed, accuracy plunges.

Meta

Gromov and team find that their pruned neural nets score just as well as the original version. That implies that "the essential knowledge required to achieve a model's top score isn't removed by significant layer removal – even though the fraction can be quite large(!) – until eventually that knowledge is lost at a critical model-dependent threshold."

The findings of Gromov and team deliver good news and bad news.

Also: 2024 may be the year AI learns in the palm of your hand

On the one hand, their findings mean that large language models can dramatically shrink down in the computing they need. "In particular, the released version of Llama-2-70B spans 140 GB of memory and consumes approximately 3 × 10^10 FLOPs [floating-point operations per token]," write the authors. 

"With 4-bit quantization [a reduction in the precision of the numbers to save space], and a layer-pruning fraction of 50%, the model fits in approximately 17.5 GB of memory and requires roughly 1.5 × 1010 FLOPs per token. These memory and compute requirements enable open-weight state-of-the-art models to be run and even fine-tuned efficiently on consumer-level GPUs without any CPU off-loading and with only minor performance trade-offs."

Also: How LangChain turns GenAI into a genuinely useful assistant

That's a nice efficiency boost, but, here's the bad news: The fact that so much can be pared away with such a pruning implies there could be a lot in a neural network that's being underutilized. Gromov and team are left with the open question of whether "current pre-training methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge." 

Answering that question will require more research, with more extensive benchmark tests, to see whether other kinds of tasks break down differently than basic question answering.
