At the end of last year, Nvidia unveiled plans to start building Cambridge-1, a £40 million ($51.7 million) machine that would become the UK's fastest supercomputer. But with a global health crisis still in full swing, Nvidia's team faced a host of potential challenges: remotely managing the installation of a supercomputer on the other side of the Atlantic was bound to come with its fair share of unforeseen complications.
Yet only 20 weeks after it was first announced, Cambridge-1 has already entered its first stages of operation – a timeline that would be impressive in normal circumstances, let alone during a pandemic. By comparison, most of the supercomputers currently on the Top500 list took, on average, a couple of years to go from concept planning to final build.
Now sitting in one of data center provider Kao Data's buildings in Cambridge, the supercomputer is undergoing final tests before scientists can start running projects with the device – in this case, with a focus on healthcare research.
Cambridge-1 is designed specifically for machine-learning applications: the supercomputer is powered by 80 of Nvidia's DGX A100 systems, which are built to run AI software at a large scale. Just 20 DGX A100 systems provide the equivalent of hundreds of CPUs, enabling Cambridge-1 to pack a total of 400 petaflops of AI performance, and effectively making the system the fastest computer in the country.
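As a back-of-the-envelope check on those figures, the short Python sketch below divides the quoted total by the number of systems. It assumes the 400 petaflops of AI performance is spread evenly across all 80 DGX A100 systems, which is a simplification made here for illustration rather than something stated by Nvidia in this article; the function and variable names are hypothetical.

```python
# Figures quoted in the article.
TOTAL_AI_PETAFLOPS = 400   # total AI performance of Cambridge-1
DGX_A100_SYSTEMS = 80      # number of DGX A100 systems in the machine


def petaflops_per_system(total_petaflops: float, systems: int) -> float:
    """Average AI performance per system, assuming an even split (an assumption)."""
    if systems <= 0:
        raise ValueError("systems must be a positive integer")
    return total_petaflops / systems


per_system = petaflops_per_system(TOTAL_AI_PETAFLOPS, DGX_A100_SYSTEMS)

# Under the same even-split assumption, the share contributed by 20 systems.
twenty_systems = per_system * 20

print(f"Per DGX A100 system: {per_system} petaflops of AI performance")
print(f"20 systems combined: {twenty_systems} petaflops of AI performance")
```

On that assumption, each system contributes about 5 petaflops of AI performance, so 20 of them account for roughly a quarter of the machine's total.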
For Spencer Lamb, vice president of sales and marketing at Kao Data, deploying a device of this scale in such a short period of time is nothing short of "extraordinary".
"It was a challenge," Lamb tells ZDNet. "Nvidia's team are West Coast-based, and normally they would have come to the facility to have a look. What they had to do was to manage that installation remotely, without physically being in the building."
When Cambridge-1 was first announced, Kao Data was already more than half a year into implementing strict operational measures to safeguard the company's data centers against the spread of COVID-19. Under those measures, no one aside from essential staff is allowed on site – and even working on the UK's fastest supercomputer was not enough to get Nvidia's team into the building.
Typically, explains Lamb, when customers buy data center space, a big crowd is sent on-site to wander around and take a good look at the building. "All of that was done in this medium, as we are speaking today," he continues – over a Zoom call.
"The reality that we found out, is that the less humans you have there who are not strictly necessary, the better you can get on with doing the job more effectively. The eyes on the ground, working in partnership with the remote Nvidia team, achieved the outcome that was required, without the need to physically send Nvidia individuals all the way to Cambridge," adds Lamb.
For the team in Santa Clara, of course, Zoom calls didn't always cut it. Nvidia's engineers used a method called computational fluid dynamics to precisely model the space at their disposal in Kao Data's building, and decide where they wanted to place the servers and computer racks that constitute the building blocks of the supercomputer.
Based on previous models of supercomputers built by Nvidia, Cambridge-1 was designed to extend across three rooms in the building, all fitted with separate power and air conditioning systems. Each room is equipped with two rows of 12 refrigerator-sized racks, and thousands of fiber optic cables connecting the systems, set up like horizontal ladders on top of the racks.
To let US-based engineers, quite literally, keep an eye on what was going on in the building, Nvidia also brought in a small mobile robot, which the company's vice president of solutions architecture and engineering Marc Hamilton describes as "a little thing on two wheels that looks like a tablet on a stick".
The robot had previously been deployed for the construction of another one of Nvidia's supercomputers, Selene. Selene has a similar configuration to Cambridge-1, extending across several rooms; the difference is that Selene is located one block away from Nvidia's headquarters in California. There was therefore always an Nvidia employee on site to open the right door to one of the supercomputer rooms, should a remote engineer wish to send in the robot.
"With Kao, of course, we didn't have any employees on site in the building," Hamilton tells ZDNet. "So one of the small enhancements our engineers did is that they put sliding glass doors that automatically open. That's such a trivial thing – it's definitely not supercomputing – but I had never seen a supercomputer in a data center with sliding doors."
Despite all that can be achieved remotely thanks to technology, however, Cambridge-1 still had to be physically put together, and in this case by a team that was being instructed at a distance. Given the complexity of a supercomputer, there were many points at which things could have gone wrong; but according to Hamilton, Nvidia's experience in developing machines of this type kept bad surprises from happening.
The refrigerator-sized racks that make up a supercomputer are all made of smaller computers, each of which has ten fiber optic cables sticking out. "That's a lot of manual assembly if you have to connect all those thousands of cables inside the data center," says Hamilton.
"At first, to go from one supercomputer to the next, we would re-do and re-cable all those thousands of cables. That's when we said: 'We want to make building a supercomputer as easy as building Lego blocks.' So we designed a supercomputer that's modular, and as much as possible, pre-built at the factory."
Nvidia started implementing this new approach in 2018, for the third generation of supercomputers designed by the company, and the same principles were applied to Cambridge-1. Bundles of hundreds of fiber optic cables were connected and pre-packaged, and then shipped to the data center, where engineers were left only with the task of plugging one end into the servers and the other into the network switches.
For both Hamilton and Lamb, the method was key to building Cambridge-1 at pace. "As far as our schedule is concerned, things have been very boring, and there haven't really been any surprises," says Hamilton. "Now we are mostly testing and fine-tuning things, before we go on to think about where we'll next put our supercomputers."
Hamilton expects that by mid-April, when Nvidia holds its annual GPU Technology Conference, the first research results from some of the early projects run on Cambridge-1 will be announced. The Santa Clara company has already announced partnerships with four healthcare organizations, which are set to be granted access to the device for medical research: pharmaceutical companies AstraZeneca and GSK, as well as King's College London and Guy's and St Thomas' NHS Foundation Trust.
With the extra computational power provided by Cambridge-1, scientists will be able to tackle data-intensive problems that were previously difficult to solve, such as better diagnosing patients and identifying appropriate treatments; but they are also confident that the device could lead to breakthroughs in medical research, for example with new drug discoveries.