Microsoft builds a supercomputer for OpenAI for training massive AI models

Microsoft is building a supercomputer for and with OpenAI and is using it to train massive distributed AI models, which it is counting on to improve the AI capabilities in its own software and services.

Microsoft Build 2020: All developers need to know

read this

Everything you need to know about AI

An executive guide to artificial intelligence, from machine learning and general AI to neural networks.

Read More

Microsoft invested $1 billion last July in OpenAI, a research organization co-founded by former Y Combinator president Sam Altman and Tesla CEO Elon Musk. At that time, Microsoft and OpenAI said they'd engage in an exclusive, multi-year partnership to build new Azure AI supercomputing technologies. At Build 2020 on May 19, Microsoft went public with more details on the supercomputer work that's been happening.

Microsoft officials said they've built the fifth most powerful publicly recorded supercomputer (as ranked on the TOP500 supercomputers list) in collaboration with and exclusively for OpenAI. This supercomputer is specifically for training massive distributed AI models. AI researchers believe that single, massive models will perform better than the smaller, separate AI models of the past.

Microsoft has built its own family of large AI models, which it calls the Microsoft Turing models. These models have been used to improve language understanding across Bing, Office, Dynamics, and other products. Microsoft has made publicly available what is believed to be the largest publicly available AI language model in the world: The Turning model for natural language generation.

Officials said at Build that they are going to begin open-sourcing the Microsoft Turing models "soon," as well as recipes for training them using Azure Machine Learning. Microsoft also is adding support for distributed training to its ONNX Runtime, an open-source library for making models portable across hardware and operating systems.

While the AI supercomputer Microsoft has built is exclusively for OpenAI, Microsoft is planning to make its large AI models and training optimization tools available through Azure AI services and GitHub, they have said. Microsoft also makes various accelerators and services available under its "Azure AI" banner to customers who don't need a dedicated supercomputer.

Microsoft said the supercomputer built for OpenAI is a single system with more than 285,000 CPU cores; 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. The supercomputer is hosted in Azure and has access to Azure services.

I believe the supercomputer work Microsoft has been doing may be codenamed "Odyssey." I found some references to Odyssey in a Microsoft job listing recently for an individual to be the central point of contact in Azure for work with Open AI, and who would work with "a vast array of vendors and partners (including Cray, HPE, Mellanox and NVIDIA)." 

The Azure Storage team also recently advertised an opportunity to work on "a massively parallel supercomputer comprised of tens of thousands of commodity PCs" -- the equivalent in power and storage of 20,000, 30,000, or 100,000 machines -- working on solving problems that directly affect Microsoft's search, ads, and portal businesses. I believe this could be connected with the Turing/Odyssey work, as well.

Turing is part of Microsoft's broader initiative called "AI at Scale." The core premise of AI at Scale is people can train really large neural networks on high powered infrastructure once, and then reuse that same model across many scenarios to significantly improve AI inside various products. Microsoft trained one language understanding model, called Turing NLR, and is reusing that same model adapted for various scenarios across multiple products in Microsoft Bing, Word, SharePoint, and Outlook.

While Microsoft customers can't directly use the OpenAI supercomputer, they can use the company's upgraded Azure compute infrastructure; its open-sourced DeepSeed software to train large-scale models; and its ONNX runtime to deploy and run the models faster and more cheaply, officials said. Those who aren't able or willing to train their own models can reuse Microsoft's Turing model, and likely, at some point, the Turing NLR model.