Facebook’s PyTorch 1.1 does the heavy lifting for increasingly gigantic neural networks

As neural networks scale to dozens of layers and billions of parameters, Facebook offers greater parallelism for models with PyTorch 1.1.
Written by Tiernan Ray, Senior Contributing Writer

On the second day of Facebook's "F8" conference at the San Jose convention center, the company announced an updated version of its PyTorch development framework, Version 1.1, with a raft of new features. Perhaps the most interesting is the ability to split up the parts of a neural network across multiple graphics processing units, or GPUs, when training A.I. systems.

In case you hadn't heard, neural networks are getting big, really big. 

"Models keep getting bigger and bigger, they are really, really big, and really expensive to train," said Joe Spisak, PyTorch product manager, in an interview with ZDNet. "They are exceeding the memory of a 16-gigabyte GPU in many cases," he observed. 

"Our latest models are exceeding 10 gigabytes," Spisak said of the neural networks Facebook develops internally, "and parameter counts are approaching and in some cases even exceeding one billion parameters."

Plenty of examples can be found, notes Spisak, in published research. For example, the "large" version of the popular BERT natural language network has 24 layers, 1,024 hidden units, 340 million parameters, and 16 attention "heads" that determine how the network weighs different parts of its input.
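
To put those parameter counts in memory terms, here is a rough back-of-the-envelope sketch. It assumes 32-bit floats (4 bytes per parameter) and an Adam-style optimizer that keeps a gradient and two moment estimates for every weight. Neither assumption comes from Spisak, the function names are hypothetical, and activation memory, which can be substantial, is left out.

```python
BYTES_PER_FP32 = 4  # assumption: weights are stored as 32-bit floats


def weights_gb(n_params: int, bytes_per_param: int = BYTES_PER_FP32) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def training_state_gb(n_params: int, copies: int = 4) -> float:
    """Weights plus per-weight training state, in decimal gigabytes.

    copies=4 assumes weights + gradients + two Adam moment buffers.
    Activations are not counted.
    """
    return copies * weights_gb(n_params)


if __name__ == "__main__":
    models = [("BERT-large", 340_000_000), ("1B-parameter model", 1_000_000_000)]
    for name, n in models:
        print(f"{name}: {weights_gb(n):.2f} GB weights, "
              f"{training_state_gb(n):.2f} GB with training state")
```

Under these assumptions, a one-billion-parameter model needs roughly 16 GB before any activations are stored, which is consistent with Spisak's remark about models outgrowing a 16-gigabyte GPU.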

"The sky's the limit," according to Spisak. "These models can get as big as we allow them too."

To handle that, PyTorch 1.1 adds the ability to split networks across GPUs, known as "sharding" the model. Previously, PyTorch allowed developers to split the training data across processors, known in the parallel computing world as "data parallelism." The splitting of networks, generally called "model parallelism," makes possible "instruction parallelism." Hence, networks can now achieve what's known as "MIMD," or "multiple instruction, multiple data."
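
As a toy illustration of the data-parallel approach, the sketch below uses plain Python rather than PyTorch. It assumes a one-weight linear model with a mean-squared-error loss and equally sized data shards, and all function names are hypothetical. Each simulated worker keeps a full copy of the weight, computes a gradient on its own slice of the data, and the gradients are then averaged.

```python
from typing import List, Sequence


def mse_grad(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(
    w: float, xs: Sequence[float], ys: Sequence[float], n_workers: int
) -> float:
    """Average per-worker gradients; every worker holds the same w."""
    if len(xs) % n_workers != 0:
        raise ValueError("this sketch assumes equally sized shards")
    size = len(xs) // n_workers
    grads: List[float] = []
    for k in range(n_workers):
        lo, hi = k * size, (k + 1) * size
        # Each worker sees only its own slice of the training data.
        grads.append(mse_grad(w, xs[lo:hi], ys[lo:hi]))
    return sum(grads) / n_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_grad(1.5, xs, ys)
sharded = data_parallel_grad(1.5, xs, ys, n_workers=2)
```

With equal shards, the averaged gradient matches the full-batch gradient. That is why replicating the model and splitting the data works, but only as long as the whole model fits on each device.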

"Traditionally, these models sit within a GPU, and you can deal with distributed parallel data, meaning, you shard your data set, and you replicate the model over the system," Spisak explained. 

"Once you get to these larger models, the model itself has to be sharded. You put certain model layers, or certain sub-graphs, on one node, and then carve off another sub-graph onto another piece of compute."

After the sharding, an algorithm in PyTorch then combines the pieces during training.
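
The layer-wise split Spisak describes can be sketched in plain Python. This is not PyTorch's implementation: the `Layer` class, the greedy packing rule, and the parameter budget are all assumptions made for illustration. Consecutive layers are packed into shards that each fit a per-device budget, and a forward pass hands the activation from one shard to the next.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Layer:
    name: str
    n_params: int
    fn: Callable[[float], float]  # stand-in for the layer's computation


def shard_layers(layers: List[Layer], max_params_per_device: int) -> List[List[Layer]]:
    """Greedily pack consecutive layers into shards under a parameter budget."""
    shards: List[List[Layer]] = [[]]
    used = 0
    for layer in layers:
        if layer.n_params > max_params_per_device:
            raise ValueError(f"{layer.name} does not fit on a single device")
        if used + layer.n_params > max_params_per_device:
            shards.append([])  # carve off the next sub-graph
            used = 0
        shards[-1].append(layer)
        used += layer.n_params
    return shards


def forward(shards: List[List[Layer]], x: float) -> float:
    """Run the shards in order, passing the activation along."""
    for shard in shards:
        # In a real system the activation would be copied to this
        # shard's device before its layers run.
        for layer in shard:
            x = layer.fn(x)
    return x


layers = [
    Layer("embed", 300, lambda x: x + 1.0),
    Layer("block1", 300, lambda x: x * 2.0),
    Layer("block2", 300, lambda x: x - 3.0),
    Layer("head", 300, lambda x: x * x),
]
shards = shard_layers(layers, max_params_per_device=700)
placement = [[layer.name for layer in shard] for shard in shards]
output = forward(shards, 1.0)
```

In PyTorch itself, the same idea is usually expressed by moving submodules to different devices with calls such as `.to("cuda:0")` and `.to("cuda:1")`, and moving the activation tensor between them inside `forward`.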

The problem exists for both training of neural nets and inference, Spisak said, but it is less serious in the case of inference because Intel's CPUs, which dominate inference in the data center, tend to support much more memory and so are not taxed as much on that front.
