The team have scaled the Microsoft Cognitive Toolkit -- an open-source suite that trains deep learning algorithms -- to more than 1,000 Nvidia Tesla P100 GPU accelerators on the Swiss centre's Cray XC50 supercomputer, which is nicknamed Piz Daint.
The project could allow researchers to run larger, more complex, and multi-layered deep learning workloads at scale on the supercomputers, Cray said.
Deep learning is an emerging branch of machine learning that uses multiple processing layers to work on complicated problems. And while researchers want to run larger deep-learning models, conventional systems and architectures limit the problems that can be addressed, as models take too long to train.
But by accelerating the training process, data scientists can obtain results within hours or even minutes instead of waiting weeks or months. This could help researchers tackle new computing problems, such as stepping up from image recognition to video recognition, or from speech recognition to natural language processing with context.
Cray said that deep learning problems share algorithmic similarities with applications traditionally run on a massively parallel supercomputer, and that by optimizing inter-node communication, each training job can leverage significantly more compute resources, reducing the time required to train an individual model.
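The announcement does not describe how the communication was optimized. As a rough illustration of why inter-node communication matters, the sketch below simulates data-parallel training in a single process: each "worker" computes a gradient on its own shard of the data, and an averaging step (a stand-in for the exchange between nodes) combines them. The model, data, and function names are all hypothetical and are not taken from the Cognitive Toolkit or from Cray's software.

```python
# Toy illustration only: simulated workers in one process, no real GPUs or network.
# All names and data here are hypothetical, not part of any real toolkit's API.

def gradient(w, shard):
    """Mean-squared-error gradient of the model y = w * x over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for the inter-node step: every worker ends up with the average."""
    return sum(values) / len(values)

def data_parallel_step(w, shards, lr=0.01):
    """One training step: local gradients per worker, then a shared average."""
    local_grads = [gradient(w, shard) for shard in shards]  # parallel on real hardware
    return w - lr * allreduce_mean(local_grads)

# Made-up data following y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

With equal-sized shards, the averaged gradient matches the gradient over the full data set, so adding workers divides the compute per step without changing the result. The averaging step runs once per training step, which is why its cost grows in importance as more nodes are added.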
Professor Thomas Schulthess, director of the Swiss supercomputing centre, said the work means researchers and scientists will be able to use their existing Cray XC supercomputer to take on a new class of deep learning problems "that were previously infeasible".