By mid-2022, Meta expects to control what it believes will be the world's fastest AI supercomputer. Dubbed the AI Research SuperCluster (RSC), the system is already up and running, and is already among the world's fastest AI supercomputers, the company said in a blog post on Monday.
Development of the RSC is ongoing, but once the second phase is completed by the second half of this year, the system will deliver nearly 5 exaflops of mixed-precision compute.
Meta, formerly known as Facebook, is already using the supercomputer to train large models in natural language processing (NLP) and computer vision for research. The company uses large-scale AI models for ongoing priorities, such as detecting harmful content on its social platforms. Ultimately, though, it wants to train models with trillions of parameters to help it power the metaverse -- the virtual world that Meta intends to support with its platforms and products.
"The experiences we're building for the metaverse require enormous compute power (quintillions of operations/second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more," Meta CEO Mark Zuckerberg said in a statement.
Currently, RSC comprises 760 Nvidia DGX A100 systems as its compute nodes, for a total of 6,080 GPUs. The GPUs communicate over an Nvidia Quantum 200 Gb/s InfiniBand two-level Clos fabric with no oversubscription. RSC's storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
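A back-of-envelope calculation gives a sense of the fabric's scale. Assuming each of the 6,080 GPUs gets its own dedicated 200 Gb/s InfiniBand port (an assumption consistent with the no-oversubscription Clos design, not a figure Meta has published), the aggregate injection bandwidth works out as follows:

```python
# Back-of-envelope aggregate network bandwidth for RSC, assuming one
# dedicated 200 Gb/s InfiniBand port per GPU (our assumption).

GPUS = 760 * 8          # 760 DGX A100 nodes x 8 GPUs per node
LINK_GBPS = 200         # per-port rate of the Quantum InfiniBand fabric

aggregate_tbps = GPUS * LINK_GBPS / 1_000   # terabits per second
aggregate_tbytes = aggregate_tbps / 8       # terabytes per second

print(f"{GPUS} GPUs")                            # 6080 GPUs
print(f"{aggregate_tbps:.0f} Tb/s aggregate")    # 1216 Tb/s
print(f"{aggregate_tbytes:.0f} TB/s aggregate")  # 152 TB/s
```

Roughly 150 TB/s of aggregate injection bandwidth, an order of magnitude above the 16 TB/s the storage tier is designed to serve.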
By comparison, the US Energy Department's Perlmutter AI supercomputer, unveiled last summer as the world's fastest AI supercomputer, delivers nearly 4 exaflops of mixed-precision performance with 6,159 Nvidia A100 Tensor Core GPUs.
By the time Meta's RSC is complete, the InfiniBand network fabric will connect 16,000 GPUs as endpoints, making it one of the largest such networks deployed to date. Additionally, Meta designed a caching and storage system that can serve 16 TB/s of training data. The company plans to scale it up to 1 exabyte -- that's equivalent to 36,000 years of high-quality video.
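The video comparison checks out arithmetically. Solving for the bitrate implied by "1 exabyte equals 36,000 years of video" (the bitrate is our derived figure, not one Meta stated):

```python
# Sanity-check: what video bitrate makes 1 exabyte last 36,000 years?
# All values below are our assumptions, not figures given by Meta.

EXABYTE_BYTES = 1e18
YEARS = 36_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

seconds = YEARS * SECONDS_PER_YEAR
implied_mbps = EXABYTE_BYTES * 8 / seconds / 1e6  # megabits per second

print(f"Implied bitrate: {implied_mbps:.1f} Mb/s")  # about 7 Mb/s
```

Around 7 Mb/s is a typical HD streaming rate, so "high-quality video" is a fair characterization.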
RSC is replacing Meta's legacy infrastructure, designed in 2017, which has 22,000 Nvidia V100 Tensor Core GPUs in a single cluster and performs 35,000 training jobs a day. Early benchmarks on RSC suggest it runs computer vision workflows up to 20x faster than the old system, runs the Nvidia Collective Communication Library (NCCL) more than 9x faster and trains large-scale NLP models 3x faster.
While Meta's previous AI research infrastructure used only open source and other publicly available data sets, RSC includes privacy and security controls that will allow it to train models with real-world data from Meta's production systems.
RSC leverages encrypted user-generated data that is decrypted right before training. The system has no direct inbound or outbound connections to the internet, and traffic can flow only from Meta's production data centers. The entire data path from Meta's storage systems to the GPUs is end-to-end encrypted. Before data is imported to RSC, it must go through a privacy review process to confirm it has been correctly anonymized. The data is then encrypted before it can be used to train AI models, and decryption keys are deleted regularly to ensure older data is not still accessible.
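The key-deletion mechanism described above can be sketched in a few lines. This is an illustrative toy, not Meta's implementation: the XOR "cipher" stands in for a real scheme such as AES-GCM, and the class and method names are hypothetical. The point it demonstrates is that once a key is deleted, every blob encrypted under it becomes permanently unreadable, regardless of whether the ciphertext still exists.

```python
# Toy sketch of encrypt-at-rest with key retirement (NOT Meta's system).
# Deleting a key renders all data encrypted under it inaccessible.
import secrets

class KeyedStore:
    def __init__(self):
        self._keys = {}    # key_id -> key bytes; deleted keys are gone
        self._blobs = {}   # blob_id -> (key_id, ciphertext)

    def _xor(self, data: bytes, key: bytes) -> bytes:
        # Stand-in for a real cipher; XOR keeps the sketch self-contained.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    def put(self, blob_id: str, plaintext: bytes, key_id: str) -> None:
        key = self._keys.setdefault(key_id, secrets.token_bytes(32))
        self._blobs[blob_id] = (key_id, self._xor(plaintext, key))

    def read_for_training(self, blob_id: str) -> bytes:
        # Decryption happens only at training time, and only while
        # the blob's key still exists.
        key_id, ciphertext = self._blobs[blob_id]
        if key_id not in self._keys:
            raise PermissionError("key deleted; data no longer accessible")
        return self._xor(ciphertext, self._keys[key_id])

    def expire_key(self, key_id: str) -> None:
        self._keys.pop(key_id, None)  # regular deletion retires old data

store = KeyedStore()
store.put("sample-1", b"anonymized training record", key_id="2022-Q1")
assert store.read_for_training("sample-1") == b"anonymized training record"
store.expire_key("2022-Q1")
# read_for_training("sample-1") now raises PermissionError
```

Rotating and deleting keys on a schedule gives a hard upper bound on how long any imported data remains usable, which complements the up-front anonymization review.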