Google DeepMind’s ‘Sideways’ takes a page from computer architecture

To get greater efficiency, Google DeepMind's researchers did what chip designers have long done: they built a pipeline, so that backpropagation, the standard learning rule in machine learning, gets more work done at every moment.
Written by Tiernan Ray, Senior Contributing Writer

Increasingly, machine learning forms of artificial intelligence are contending with the limits of computing hardware, and it's causing scientists to rethink how they design neural networks.  

That was clear in last week's research offering from Google, called Reformer, which aimed to stuff a natural language program into a single graphics processing chip instead of eight. 

And this week brought another offering from Google focused on efficiency, something called Sideways. With this invention, scientists have borrowed a page from computer architecture, creating a pipeline that gets more work done at every moment.   

What is Sideways? Most machine learning neural nets during their training phase use a forward pass, a transmission of a signal through layers of the network, followed by backpropagation, a backward pass through the same layers, only in reverse, to gradually modify the weights of a neural network till they're just right. Sideways is an alternative to that traditional rule for performing the forward and backward passes. 
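For readers who want to see the traditional rule concretely, here is a minimal sketch of one forward pass followed by one backward pass. It assumes a toy two-layer network with scalar weights, a tanh hidden unit, and a squared-error loss; the function names and the learning rate are illustrative choices and do not come from the DeepMind paper.

```python
import math

def forward(w1, w2, x):
    """Forward pass: push the input through both layers, keeping activations."""
    h = math.tanh(w1 * x)   # hidden activation, needed again in the backward pass
    y = w2 * h              # network output
    return y, {"x": x, "h": h}

def backward(w2, cache, y, target):
    """Backward pass: visit the layers in reverse, reusing stored activations."""
    dy = y - target                      # d(loss)/dy for loss = 0.5 * (y - target)^2
    grad_w2 = dy * cache["h"]            # uses the stored hidden activation
    dh = dy * w2                         # error signal sent back to the first layer
    grad_w1 = dh * (1.0 - cache["h"] ** 2) * cache["x"]  # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2

def loss(w1, w2, x, target):
    y, _ = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def train_step(w1, w2, x, target, lr=0.1):
    """One full forward-then-backward cycle followed by a weight update."""
    y, cache = forward(w1, w2, x)
    g1, g2 = backward(w2, cache, y, target)
    return w1 - lr * g1, w2 - lr * g2
```

Calling `train_step` repeatedly on the same example should nudge the two weights until the loss shrinks. Note that `backward` cannot run until `forward` has finished and handed over its cache of activations.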

The authors of Sideways, Mateusz Malinowski, Grzegorz Świrszcz, João Carreira, and Viorica Pătrăucean, all with the DeepMind unit of Google, noticed that a deep learning neural net is doing less than it could be doing at every moment in time during the forward pass and the backward pass. 

The neural net proceeds with samples of data in batches, a single batch at a time, leaving all other batches of data waiting until both the forward pass and backward pass have been computed. The reason is that the activations of layers of the network that are triggered during the forward pass need to be held onto by the computer so it can use those activations when it goes to compute backpropagation. 

Think of it as an assembly line where only one car gets built at a time so that the people at the front of the assembly line stand around waiting until the people at the very back have done their part. 


It should be possible, they reasoned, to fill those moments in time with other batches of signals, to get more work done. So they made what's called a pipeline. In computer chips, the pipeline has been a way to get more work done for decades. Intel famously developed very deep pipelines to move lots of instructions through its microprocessors simultaneously rather than just one at a time. By breaking down the operations into more and more stages, more pieces of code could be running in the processor at any one moment. 
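The arithmetic behind that idea can be sketched with a toy timing model. The model is an assumption made for illustration, not the paper's accounting: each layer's forward computation and each layer's backward computation counts as one pipeline stage, and every stage takes exactly one time step.

```python
def sequential_steps(num_layers, num_batches):
    """One batch at a time: each needs a forward and a backward step per layer."""
    return num_batches * 2 * num_layers

def pipelined_steps(num_layers, num_batches):
    """A new batch enters every time step; the last one still needs 2 * num_layers steps."""
    return (num_batches - 1) + 2 * num_layers

def forward_schedule(num_layers, num_batches):
    """For each time step, list which batch occupies each forward stage (None = idle)."""
    schedule = []
    for t in range(num_batches + num_layers - 1):
        row = []
        for layer in range(num_layers):
            batch = t - layer  # batch b reaches layer l at time b + l
            row.append(batch if 0 <= batch < num_batches else None)
        schedule.append(row)
    return schedule

if __name__ == "__main__":
    print(sequential_steps(4, 10))  # 80 time steps
    print(pipelined_steps(4, 10))   # 17 time steps
    for row in forward_schedule(3, 4):
        print(row)
```

In this idealized model, ten batches through a four-layer network take 80 steps one at a time but only 17 when pipelined. Real speed-ups depend on the hardware and on how evenly the work divides across layers.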


An illustration of the typical forward and backward computations of a neural net, left, and the revised form, "Sideways," developed by Mateusz Malinowski and colleagues at DeepMind. In the old way, a single batch of data has to traverse the entire neural net and back again before anything else can be done, like an assembly line doing one unit at a time. With Sideways, batches of data enter a pipeline at each moment so that the various stages of the network are kept full at all times. 

Malinowski et al.

With Sideways, at every moment in time, each layer of the neural net takes a new batch of data as the last batch moves on to the next layer. No parts of the network are left idle. The authors call the resulting weight update a pseudo-gradient because it is assembled differently from the gradient that standard backpropagation, the forward and backward pass combination described earlier, would compute. 

To do that, of course, something's got to give, because those stored activations will be over-written, which was the thing to avoid in the first place. The great insight at the heart of the paper is that some kinds of data can afford to be over-written without causing a problem. If the batches of data are close to one another in some respect, then it should be just fine to over-write one batch's activations with another's, because doing so doesn't change much. 


Sideways applied to an encoder-decoder task, a somewhat more complex requirement. Again, multiple signals enter the network one after another before any one signal has finished being computed, keeping compute resources fully occupied.

Malinowski et al.

And that's what they found in video. Video, in the form of frames of image data, contains a lot of redundant objects from one frame to another, since most of the scene doesn't usually change. 


As Malinowski and colleagues put it, "The smoothness of the input space is the key underlying assumption behind the Sideways algorithm."
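A toy calculation suggests why smoothness matters. The sketch below is not the paper's update rule. It assumes a two-layer scalar network with fixed, made-up weights, and it pairs the error signal from one frame with the hidden activation of a later frame that has since over-written it. All names and constants are hypothetical.

```python
import math

W1, W2 = 0.5, -0.3  # hypothetical fixed weights, chosen only for illustration

def hidden(x):
    """Hidden activation of the toy network for input x."""
    return math.tanh(W1 * x)

def exact_grad_w2(x, target):
    """Backprop gradient of 0.5 * (y - target)^2 w.r.t. W2, using this frame's activation."""
    h = hidden(x)
    return (W2 * h - target) * h

def pseudo_grad_w2(x_old, target_old, x_new):
    """Error signal from an older frame times the activation that over-wrote it."""
    error_old = W2 * hidden(x_old) - target_old
    return error_old * hidden(x_new)

def mean_gap(frames, targets, delay=1):
    """Average absolute difference between the exact and the over-written gradient."""
    gaps = []
    for t in range(len(frames) - delay):
        exact = exact_grad_w2(frames[t], targets[t])
        pseudo = pseudo_grad_w2(frames[t], targets[t], frames[t + delay])
        gaps.append(abs(exact - pseudo))
    return sum(gaps) / len(gaps)

smooth = [1.0 + 0.01 * t for t in range(50)]               # slowly changing input
jumpy = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]   # abrupt frame-to-frame changes
targets = [0.7] * 50

if __name__ == "__main__":
    print(mean_gap(smooth, targets))  # small gap: neighbouring frames look alike
    print(mean_gap(jumpy, targets))   # much larger gap: over-writing loses information
```

On the slowly changing sequence, the over-written gradient stays close to the exact one. On the abruptly changing sequence, the two diverge, which is the situation the smoothness assumption rules out.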

The authors found that when they tried to classify actions in video, a classic machine learning task, using a typical convolutional neural network, their test results were as good as those obtained with traditional backpropagation. Nothing was undone by writing over those activations. "Sideways training achieves competitive accuracy to BP [backpropagation]," they write.

In some cases, the accuracy was even better with Sideways, which they attribute to the possibility that over-writing activations serves the additional purpose of regularization, which can be desirable. 

They also tried an encoder-decoder task, where the neural net has to faithfully reconstruct frames of video. Here, they actually found much better results than traditional backpropagation. They write that backpropagation typically drops frames of video to keep up with the pace of compute, whereas the pipeline of Sideways has enough processing to handle all the frames of video. 

The payoff of all this is a substantial speed-up in compute. Training the convolutional neural network in the classifier case is five times faster, they claim, than with backpropagation.

They also hypothesize they can use computer memory more efficiently: "Placing different Sideways modules in different GPUs will also significantly reduce memory requirements for training large neural networks."

That last point raises an interesting question not addressed in the work. Numerous companies are producing new kinds of chips dedicated to improving the performance of deep learning, companies such as Cerebras Systems. Where is the dividing line between pipeline approaches of the sort Sideways represents and what these chipmakers are doing? Will the neural net designers and the chip makers collectively converge on something altogether new via their respective efforts? The future looks very interesting. 
