IBM Research has achieved a new milestone in distributed deep learning (DDL), building software that scales up DDL across hundreds of GPUs at near-ideal efficiency.
The research tackles one of the major challenges of deploying deep learning: Large neural networks and large datasets help deep learning thrive but also lead to longer training times. Training large-scale, deep learning-based AI models can take days or weeks.
Training takes so long because, as the number of GPUs scales up, they must constantly communicate with one another — and the problem has actually worsened as GPUs have gotten faster. Faster GPUs learn faster, but with conventional software their communication with each other can't keep up.
"Basically, smarter and faster learners (the GPUs) need a better means of communicating, or they get out of sync and spend the majority of time waiting for each other's results," IBM's Hillery Hunter wrote in a blog post. "So, you get no speedup, and potentially even degraded performance, from using more, faster-learning GPUs."
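The dynamic Hunter describes can be sketched with a simple cost model (illustrative only, not IBM's code): in synchronous data-parallel training, each step costs compute time plus gradient-synchronization time, so as GPUs speed up the compute portion while communication cost stays fixed, waiting eats a growing share of every step.

```python
def step_time(compute_s: float, comm_s: float) -> float:
    """Total wall time for one synchronous training step."""
    return compute_s + comm_s

def comm_fraction(compute_s: float, comm_s: float) -> float:
    """Fraction of each step spent communicating rather than learning."""
    return comm_s / step_time(compute_s, comm_s)

# A faster GPU halves compute time, but with conventional software the
# gradient-exchange cost is unchanged -- so communication's share grows.
# (The 0.20 s / 0.10 s / 0.05 s figures are made up for illustration.)
old = comm_fraction(compute_s=0.20, comm_s=0.05)  # 20% of the step
new = comm_fraction(compute_s=0.10, comm_s=0.05)  # ~33% of the step
```

Doubling GPU speed here yields well under a 2x end-to-end speedup, which is why better communication software, not just faster hardware, was the bottleneck to attack.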
The new DDL software addresses that, and it should make it possible to run popular open source frameworks like TensorFlow, Caffe, Torch, and Chainer over massive neural networks and data sets with very high performance and accuracy.
IBM Research demonstrated how it achieved record-low communication overhead and 95 percent scaling efficiency on the Caffe deep learning framework over 256 GPUs in 64 IBM Power systems. The previous scaling record was set by Facebook AI Research, which achieved close to 90 percent efficiency for a training run on Caffe2, at higher communication overhead.
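Scaling efficiency is conventionally the measured speedup divided by the ideal linear speedup. A minimal sketch (the throughput numbers below are hypothetical, chosen only to reproduce the article's 95 percent figure):

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Measured speedup over one GPU, divided by the ideal linear speedup."""
    return (throughput_n / throughput_1) / n_gpus

# Hypothetical: 256 GPUs deliver 243.2x the single-GPU throughput.
eff = scaling_efficiency(throughput_n=243.2, throughput_1=1.0, n_gpus=256)
# eff == 0.95, i.e. 95 percent scaling efficiency over 256 GPUs
```

At this scale even a few points of efficiency matter: 95 percent of 256 GPUs is roughly 13 more GPUs' worth of useful work than 90 percent.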
Additionally, with this new software, IBM Research achieved a new image recognition accuracy record of 33.8 percent for a neural network trained on a very large data set (7.5 million images from the ImageNet-22k dataset), in just seven hours. Microsoft held the previous record, demonstrating 29.8 percent accuracy in 10 days.
"A 4% increase in accuracy is a big leap forward; typical improvements in the past have been less than 1%," Hunter wrote.
IBM Research was able to achieve those fast and accurate results, Hunter explained, by leveraging the power of dozens of servers hosting hundreds of GPUs in total.
"Most popular deep learning frameworks scale to multiple GPUs in a server, but not to multiple servers with GPUs," Hunter explained. "Specifically, our team (Minsik Cho, Uli Finkler, David Kung, and their collaborators) wrote software and algorithms that automate and optimize the parallelization of this very large and complex computing task across hundreds of GPU accelerators attached to dozens of servers."
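Hunter's post does not describe IBM's exact algorithm, but a common way to synchronize gradients across many GPUs without a central bottleneck is a ring all-reduce, in which each worker exchanges data only with its ring neighbor. A toy simulation of the idea (an assumed, generic technique, not IBM's implementation):

```python
from typing import List

def ring_allreduce(grads: List[List[float]]) -> List[float]:
    """Toy ring all-reduce over n workers, each holding a gradient vector
    split into n chunks (one chunk per worker). Returns the elementwise
    sum as every worker would see it, using only neighbor-to-neighbor
    sends over 2*(n-1) steps."""
    n = len(grads)
    state = [list(g) for g in grads]  # state[w][c]: worker w's copy of chunk c
    # Reduce-scatter: in step s, worker w sends chunk (w - s) % n to worker
    # (w + 1) % n, which accumulates it. Sends are computed before applying,
    # since in a real cluster all workers exchange simultaneously.
    for s in range(n - 1):
        sends = [(w, (w - s) % n, state[w][(w - s) % n]) for w in range(n)]
        for w, c, val in sends:
            state[(w + 1) % n][c] += val
    # Now worker w holds the complete sum for chunk (w + 1) % n.
    # All-gather: circulate the finished chunks so every worker has them all.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n, state[w][(w + 1 - s) % n]) for w in range(n)]
        for w, c, val in sends:
            state[(w + 1) % n][c] = val
    return state[0]
```

The appeal of ring-style schemes at this scale is that per-worker traffic stays roughly constant as workers are added, rather than funneling every gradient through one parameter server.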
With these improvements in deep learning training, IBM expects we could see advances in a range of AI use cases, such as more accurate medical image analyses or better speech recognition technologies. IBM is making a technical preview of the software available now in version 4 of PowerAI, its deep learning software distribution package.