Supercomputers are becoming another cloud service. Here's what it means

Designing for the usual cloud workloads isn't the same as designing for high performance computing: Azure is trying to achieve both.
Written by Mary Branscombe, Contributor

These days supercomputers aren't necessarily esoteric, specialised hardware; they're made up of high-end servers that are densely interconnected and managed by software that deploys high performance computing (HPC) workloads across that hardware. Those servers can be in a data centre – but they could also be in the cloud.

When it comes to large simulations – like the computational fluid dynamics used to simulate a wind tunnel – processing the millions of data points needs the power of a distributed system, and the software that schedules these workloads is designed for HPC systems. If you want to simulate 500 million data points, and you want to do that 7,000 or 8,000 times to look at a variety of different conditions, that's going to generate about half a petabyte of data. Even if a cloud virtual machine (VM) could cope with that amount of data, the computation would take millions of hours, so you need to distribute it – and the tools to do that efficiently need something that looks like a supercomputer, even if it lives in a cloud data centre.
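To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. The data point count, run count and half-petabyte total come from the figures above. The core-hour total and the cluster size are illustrative assumptions, not figures from Azure or any real workload, and the function names are hypothetical.

```python
PETABYTE = 10**15  # decimal petabyte, in bytes


def implied_bytes_per_point(total_bytes, points_per_run, runs):
    """Average output size per data point per run implied by the totals."""
    return total_bytes / (points_per_run * runs)


def wall_clock_hours(total_core_hours, cores):
    """Elapsed time if the work divides perfectly across `cores` (an idealisation)."""
    return total_core_hours / cores


points_per_run = 500_000_000   # from the article
total_bytes = 0.5 * PETABYTE   # "about half a petabyte", from the article

for runs in (7_000, 8_000):    # the article's range of runs
    per_point = implied_bytes_per_point(total_bytes, points_per_run, runs)
    print(f"{runs} runs: roughly {per_point:.0f} bytes per data point per run")

# ASSUMPTION: 2 million core-hours stands in for "millions of hours",
# and 10,000 cores is a made-up cluster size chosen for illustration.
assumed_core_hours = 2_000_000
assumed_cores = 10_000
hours = wall_clock_hours(assumed_core_hours, assumed_cores)
print(f"{assumed_core_hours} core-hours on {assumed_cores} cores: about {hours:.0f} hours")
```

Real jobs never scale perfectly, which is where the interconnect and the scheduler come in, but the arithmetic shows why a single VM is not an option.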

When the latest Top 500 list came out this summer, Azure had four supercomputers in the top 30; for comparison, AWS had one entry on the list, in 41st place. 

HPC users on Azure run computational fluid dynamics, weather forecasting, geoscience simulation, machine learning, financial risk analysis, modelling for silicon chip design (a popular enough workload that Azure has FX-series VMs with an architecture specifically for electronic design automation), medical research, genomics, biomedical simulations and physics simulations, as well as workloads like rendering.

They do some of that on traditional HPC hardware; Azure offers Cray XC and CS supercomputers, and the UK's Met Office is getting four Cray EX systems on Azure for its new weather-forecasting supercomputer. But you can also put together a supercomputer from H and N-Series VMs (using hardware like Nvidia A100 Tensor Core GPUs and Xilinx FPGAs as well as the latest Epyc 7003 CPUs) with HPC images.

One reason the Met Office picked a cloud supercomputer was the flexibility to choose whatever the best solution is in 2027. As Richard Lawrence, the Met Office IT Fellow for supercomputing, put it at the recent HPC Forum, they wanted "to spend less time buying supercomputers and more time utilizing them".

But how does Microsoft build Azure to support HPC well when the requirements can be somewhat different? "There are things that cloud generically needs that HPC doesn't, and vice versa," Andrew Jones from Microsoft's HPC team told us.

Everyone needs fast networks, everybody needs fast storage, fast processors and more memory bandwidth, but the focus on how all that is integrated together is clearly different, he says.

HPC applications need to perform at scale, which cloud is ideal for, but they need to be deployed differently in cloud infrastructure from typical cloud applications.

If you're deploying a whole series of independent VMs, it makes sense to spread them out across the data centre so that they are relatively independent of each other and a single failure doesn't affect them all. In the HPC world, by contrast, you want to pack all your VMs as close together as possible, so they have the tightest possible network connections between each other to get the best performance, he explains.
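The trade-off Jones describes can be sketched with a toy model in Python. This is only an illustration under stated assumptions: racks are numbered integers, every pair of VMs is assumed to communicate, and the function names are hypothetical. It is not how Azure's scheduler works.

```python
from itertools import combinations


def place_spread(num_vms, num_racks):
    """Typical cloud placement: round-robin VMs across racks for resilience."""
    return [vm % num_racks for vm in range(num_vms)]


def place_packed(num_vms, rack_capacity):
    """HPC-style placement: fill one rack before starting the next."""
    return [vm // rack_capacity for vm in range(num_vms)]


def cross_rack_pairs(placement):
    """Number of VM pairs whose traffic has to leave the rack."""
    return sum(1 for a, b in combinations(placement, 2) if a != b)


def largest_rack_share(placement):
    """VMs lost if the single busiest rack fails."""
    return max(placement.count(rack) for rack in set(placement))


# ASSUMPTION: 8 VMs, 4 racks, 4 VMs per rack; numbers chosen for illustration.
spread = place_spread(num_vms=8, num_racks=4)
packed = place_packed(num_vms=8, rack_capacity=4)

for name, placement in (("spread", spread), ("packed", packed)):
    print(name, placement,
          "cross-rack pairs:", cross_rack_pairs(placement),
          "VMs lost with busiest rack:", largest_rack_share(placement))
```

In this toy model, packing cuts the number of VM pairs that talk across racks, at the cost of losing more VMs if one rack fails. That is a reasonable trade for a tightly coupled HPC job, and a poor one for a fleet of independent services.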

Some HPC infrastructure proves very useful elsewhere. "The idea of high-performance interconnects that really drive scalable application performance and latency is a supercomputing and HPC thing," Jones notes. "It turns out it also works really well for other things like AI and some aspects of gaming and things like that."

High-speed interconnects are enabling disaggregation in the hyperscale data centre, where memory and compute are split into different hardware and you allocate as much of each as you need. More flexibility in allocating memory would be helpful for HPC, because memory is expensive and not every job uses all the memory allocated to a cluster. Even so, disaggregation may not be useful for HPC.

"In the HPC world we are desperately trying to drag every bit of performance out of the interconnect we can and distributing stuff all over the data centre is probably not the right path to take for performance reasons. In HPC, we're normally stringing together large numbers of things that we mostly want to be as identical as possible to each other, in which case you don't get those benefits of disaggregation," he says.

Cloudy HPC

What will cloud HPC look like in the future? 

"HPC is a big enough player that we can influence the overall hardware architectures, so we can make sure that there are things like high memory bandwidth considerations, things like considerations for higher power processes and, therefore, cooling constraints and so on are built into those architectures," he points out.

The HPC world has tended to be fairly conservative, but that might be changing, Jones notes, which is good timing for cloud. "HPC has been relatively static in technology terms over the last however many years; all this diversity and processor choice has really only been common in the last couple of years," he says. GPUs have taken a decade to become common in HPC.

The people involved in HPC have often been in the field for a while. But new people are coming into HPC who have different backgrounds; they're not all from the traditional scientific computing background.

"I think that diversity of perspectives and viewpoints coming into both the user side, and the design side will change some of the assumptions we'd always made about what was a reasonable amount of effort to focus on to get performance out of something or the willingness to try new technologies or the risk reward payoff for trying new technologies," Jones predicts.

So just as HPC means some changes for cloud infrastructure, cloud may mean big changes for HPC.
