GPU computing: Accelerating the deep learning curve

To build and train deep neural networks you need serious amounts of multi-core computing power. We examine leading GPU-based solutions from Nvidia and Boston Limited.
Written by Alan Stevens, Contributor


Artificial intelligence (AI) may be what everyone's talking about, but getting involved isn't straightforward. You'll need a more than decent grasp of maths and theoretical data science, plus an understanding of neural networks and deep learning fundamentals -- not to mention a good working knowledge of the tools required to turn those theories into practical models and applications.


You'll also need an abundance of processing power -- beyond that required by even the most demanding of standard applications. One way to get this is via the cloud but, because deep learning models can take days or even weeks to come up with the goods, that can be hugely expensive. In this article, therefore, we'll look at on-premises alternatives and why the once-humble graphics controller is now the must-have accessory for the would-be AI developer.

Enter the GPU

If you're reading this it's safe to assume you know what a CPU (Central Processing Unit) is and just how powerful the latest Intel and AMD chips are. But if you're an AI developer, CPUs alone are not enough. They can do the processing, but the sheer volume of unstructured data that needs to be analysed to build and train deep learning models can leave them maxed out for weeks on end. Even multi-core CPUs struggle with deep learning, which is where the GPU (Graphics Processing Unit) comes in.

Again, you're probably well aware of GPUs. But just to recap, we're talking about specialised processors developed originally to handle complex image processing -- for example, to enable us to watch movies in high definition or participate in 3D multiplayer games or enjoy virtual reality simulations.

GPUs are particularly adept at processing matrices -- something CPUs handle far less efficiently -- and it's this that also suits them to specialised applications like deep learning. Also, far more of these specialised cores can be crammed onto a GPU die than general-purpose cores onto a CPU. For example, whereas with an Intel Xeon you might currently expect to get up to 28 cores per socket, a GPU can have thousands -- all able to process AI data simultaneously.

Because all those cores are highly specialised, they can't run an operating system or handle core application logic, so you still need one or more CPUs as well. What these systems can do, however, is massively accelerate processes such as deep learning training, by offloading the processing involved from CPUs to all those cores in the GPU subsystem.
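To see why matrix work suits thousands of simple cores, here is a minimal Python sketch of a matrix multiplication written so that each output cell is a separate job. It is purely illustrative: it runs on the CPU, and the function names (output_cell, matmul) are our own rather than part of any GPU library. The point is that no cell depends on any other, so on a GPU each one could be handed to a different core.

```python
# Illustrative only: a pure-Python matrix multiply in which every
# output cell is an independent piece of work. Names are hypothetical.

def output_cell(a, b, row, col):
    """Dot product of one row of a with one column of b."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))

def matmul(a, b):
    rows, cols = len(a), len(b[0])
    # No cell depends on any other cell, so a GPU could give each
    # (row, col) pair to a different core and compute them together.
    # Here they are simply computed one after another on the CPU.
    return [[output_cell(a, b, r, c) for c in range(cols)] for r in range(rows)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(a, b)

# Number of independent jobs for this tiny example: one per output cell.
independent_tasks = len(product) * len(product[0])
print(product, independent_tasks)
```

With two 2x2 matrices there are only four independent jobs; the matrices used in deep learning training are vastly larger, which is where having thousands of cores pays off.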

The GPU in practice

So much for the theory. When it comes to the practice, there are a number of GPU vendors with products aimed at everything from gaming to the specialist HPC (High Performance Computing) market and AI. This market was pioneered by Nvidia with its Pascal GPU architecture, which has long been the role model for others to aim at.

In terms of actual products, you can get into AI for very little outlay using a low-cost gaming GPU. An Nvidia GeForce GTX 1060, for example, can be had for just £270 (inc. VAT), and delivers 1,280 CUDA cores -- the Nvidia GPU core technology. That sounds like a big deal, but in reality it's nowhere near enough to satisfy the needs of serious AI developers.

For professional AI use, therefore, Nvidia has much more powerful and scalable GPUs based both on its Pascal technology and a newer architecture, Volta, which integrates CUDA cores with Nvidia's new Tensor core technology specifically to cater for deep learning. Tensor cores can deliver up to 12 times the peak teraflops (TFLOPS) performance of CUDA equivalents for deep learning training and 6 times the throughput for inference -- the stage at which trained deep learning models are actually put to use.

The first product to be based on Volta is the Tesla V100, which features 640 of the new AI-specific Tensor cores in addition to 5,120 general HPC CUDA cores, all supported by either 16GB or 32GB of second-generation High Bandwidth Memory (HBM2).


As well as a PCIe adapter, the Tesla V100 is available as an SXM module to plug into Nvidia's high-speed NVLink bus.

Image: Nvidia

The V100 is available either as a standard plug-in PCIe adapter (these start at around £7,500) or as a smaller SXM module designed to fit into a special motherboard socket which, as well as PCIe connectivity, enables V100s to be connected together using Nvidia's own high-speed NVLink bus technology. Originally developed to support first-generation (Pascal-based) Tesla GPU products, NVLink has since been enhanced to support up to six links per GPU with a combined bandwidth of 300GB/sec. NVLink is also available for use with a new Quadro adapter and others based on the Volta architecture. Such is the pace of change in this market that there's also now a switched interconnect -- NVSwitch -- enabling up to 16 GPUs to be linked with a bandwidth of 2.4TB/sec.
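For a rough feel for those interconnect numbers, the short Python sketch below simply divides the quoted figures. It assumes the 300GB/sec is shared evenly across the six links and that 1TB equals 1,000GB. How Nvidia arrives at the 2.4TB/sec NVSwitch figure isn't covered here, so the last calculation is a comparison of scale only, not a description of how the switch works.

```python
# Back-of-envelope arithmetic on the quoted NVLink and NVSwitch figures.
# Assumptions: bandwidth is split evenly across links; 1TB = 1,000GB.

NVLINK_LINKS_PER_GPU = 6
NVLINK_COMBINED_GB_PER_SEC = 300   # quoted combined bandwidth per GPU
NVSWITCH_GB_PER_SEC = 2400         # quoted 2.4TB/sec, in GB/sec

# Bandwidth per individual NVLink link, if shared evenly.
per_link_gb_per_sec = NVLINK_COMBINED_GB_PER_SEC / NVLINK_LINKS_PER_GPU

# How many GPUs' worth of full NVLink bandwidth the NVSwitch figure represents.
gpus_worth = NVSWITCH_GB_PER_SEC / NVLINK_COMBINED_GB_PER_SEC

print(per_link_gb_per_sec, gpus_worth)
```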

Off-the-shelf AI

Of course, GPUs by themselves aren't of much use, and when it comes to serious AI and other HPC applications there are a number of ways to put them to work. One is to buy the individual GPUs plus all the other components required to build a complete system and assemble it yourself. However, few business buyers will be happy to go down the DIY route, with most preferring to get a ready-made -- and, more importantly, vendor-supported -- solution either from Nvidia or one of its partners.

These ready-made solutions, of course, all use the same GPU technology but deployed in different ways. So, to get an idea of what's on offer we took a look at what Nvidia is selling and a Supermicro-based alternative from Boston Limited.


Take your AI pick: Nvidia (bottom) and Boston (top) deep learning servers together in the same rack.

Image: Alan Stevens/ZDNet

The Nvidia AI family

Nvidia is keen to be known as the 'AI Computing Company' and under its DGX brand sells a pair of servers (the DGX-1 and the newer, more powerful DGX-2) plus an AI workstation (the DGX Station), all built around Tesla V100 GPUs.


The sleek Nvidia DGX family of ready-to-use AI platforms is powered throughout by Tesla V100 GPUs.

Image: Nvidia

Delivered in distinctive gold crackle-finish cases, DGX servers and workstations are ready-to-go solutions comprising both a standard hardware configuration and an integrated DGX Software Stack -- a pre-loaded Ubuntu Linux OS plus a mix of leading frameworks and development tools required to build AI models.

We looked first at the DGX-1 (recommended price $149,000) which comes in a 3U rack-mount chassis. Unfortunately the one in the lab at Boston was busy building real models so, apart from an outside shot, we couldn't take any photos of our own. From others we've seen, however, we know that the DGX-1 is a fairly standard rack-mount server with four redundant power supplies. It's standard on the inside too, with a conventional dual-socket server motherboard equipped with a pair of 20-core Intel Xeon E5-2698 v4 processors plus 512GB of DDR4 RAM.

A 480GB SSD is used to accommodate the operating system and DGX Software Stack, with a storage array comprising four 1.92TB SSDs for data. Additional storage can be added if needed, while network connectivity is handled by four Mellanox InfiniBand EDR adapters plus a pair of 10GbE NICs. There's also a dedicated Gigabit Ethernet interface for IPMI remote management.


We couldn't open up the DGX-1 as it was busy training, but here it is hard at work in Boston Limited's Labs.

Image: Alan Stevens/ZDNet

The all-important GPUs have a home of their own, on an NVLink board with eight sockets fully populated with Tesla V100 SXM2 modules. The first release only had 16GB of dedicated HBM, but the DGX-1 can now be specified with 32GB modules.

Whatever the memory configuration, with eight GPUs at its disposal the DGX-1 boasts a massive 40,960 CUDA cores for conventional HPC work plus 5,120 of the AI-specific Tensor cores. According to Nvidia that equates to 960 teraflops of AI computing power which, it claims, makes the DGX-1 the equivalent of 25 racks of conventional servers equipped with CPUs alone.
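Those totals follow directly from the per-GPU figures quoted earlier for the Tesla V100 (5,120 CUDA cores and 640 Tensor cores). The Python sketch below just repeats the arithmetic. The per-GPU teraflops value is our own division of Nvidia's claimed 960 teraflops by eight, not a figure taken from a datasheet.

```python
# Reproducing the DGX-1 totals from the per-GPU Tesla V100 figures.

GPUS_IN_DGX1 = 8
CUDA_CORES_PER_V100 = 5120
TENSOR_CORES_PER_V100 = 640
CLAIMED_DGX1_TFLOPS = 960  # Nvidia's claimed AI performance for the system

total_cuda = GPUS_IN_DGX1 * CUDA_CORES_PER_V100
total_tensor = GPUS_IN_DGX1 * TENSOR_CORES_PER_V100

# Implied share of the claimed figure per GPU (our own division).
per_gpu_tflops = CLAIMED_DGX1_TFLOPS / GPUS_IN_DGX1

print(total_cuda, total_tensor, per_gpu_tflops)
```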

It's also worth noting that the leading deep learning frameworks all support Nvidia GPU technologies. Moreover, when running on Tesla V100 GPUs, these frameworks are up to 3 times faster than on Pascal-based P100 products with CUDA cores alone.


Buyers of the DGX-1 can also benefit from 24/7 support, updates and on-site maintenance direct from Nvidia, although this is a little pricey at $23,300 for a year or $66,500 for three years. Still, given the complex requirements of AI, many will see this as good value and in the UK customers should expect to pay around £123,000 (ex. VAT) to get a fully equipped DGX-1 with a year's support.

AI gets personal


The sleek DGX Station on a bench in Boston Limited's Labs.

Image: Alan Stevens/ZDNet

Unfortunately the newer DGX-2 with 16 GPUs and the new NVSwitch didn't ship in time for our review, but we did get to look at the DGX Station, which is designed to provide a more affordable platform for developing, testing and iterating deep neural networks. This HPC workstation will also appeal to companies looking for a platform for AI development prior to scaling up to on-premises DGX servers or the cloud.

Housed in a floor-standing tower chassis, the DGX Station is based on an Asus motherboard with a single 20-core Xeon E5-2698 v4 rather than two as on the DGX-1 server. System memory is also halved, to 256GB, and instead of eight GPUs, the DGX Station has four Tesla V100 modules implemented as PCIe adapters but with a full NVLink interconnect linking them together.

Storage is split between a 1.92TB system SSD and an array of three similar drives for data. Dual 10GbE ports provide the necessary network connectivity and there are three DisplayPort interfaces for local displays at up to 4K resolution. Water cooling comes as standard and the end result is a very quiet as well as hugely impressive-looking workstation.


We did get to see inside the smart-looking DGX Station where there's just one Xeon processor, 256GB of RAM, four Tesla V100 GPUs and a lot of piping for the water cooling.

Image: Alan Stevens/ZDNet

With half the complement of GPUs, the DGX Station delivers a claimed 480 teraflops of AI computing power. Unsurprisingly that's half what you get with the DGX-1 server, but still a lot more than using CPUs alone. It's also a lot more affordable, with a list price of $69,000 plus $10,800 for a year's 24/7 support or $30,800 for three years.

UK buyers will have to find around £59,000 (ex. VAT) for the hardware from an Nvidia partner with a one-year support contract, although we have seen a number of promotions -- including a 'buy four get one free' offer! -- which are worth looking out for. Educational discounts are also available.

Boston Anna Volta XL

The third product we looked at was the recently launched Anna Volta XL from Boston. This is effectively the equivalent of the Nvidia DGX-1 and is similarly powered by dual Xeons plus eight Tesla V100 SXM2 modules. These are all configured inside a Supermicro rack-mount server with a lot more customisation options compared to the DGX-1.


The Anna Volta XL from Boston features dual Xeon processors and eight Tesla V100 GPUs in a customisable Supermicro server platform.

Image: Supermicro

A little bigger than the Nvidia server, the Anna Volta XL is a 4U platform with redundant (2+2) power supplies and separate pull-out trays for the conventional CPU server and its GPU subsystem. Any Xeon with a TDP of 205W or less can be specified -- including the latest Skylake processors, which Nvidia has yet to offer on its DGX-1 product.


The CPU tray on the Anna Volta can accommodate two Xeons and up to 3TB of DDR4 RAM.

Image: Alan Stevens/ZDNet

There are 24 DIMM slots available alongside the Xeons to take up to 3TB of DDR4 system memory and, for storage, sixteen 2.5-inch drive bays able to accommodate either 16 SATA/SAS or 8 NVMe drives. Network attachment is via dual 10GbE network ports with a dedicated port for IPMI remote management. You also get six PCIe slots (four in the GPU tray and two in the CPU tray) so there's the option of adding InfiniBand or Omni-Path connectivity if required.

The GPU tray is fairly spartan, filled by a Supermicro NVLink motherboard with sockets for the Tesla V100 SXM2 modules, each with a large heatsink on top. GPU performance is, naturally, the same as for the DGX-1 although overall system throughput will depend on the Xeon CPU/RAM configuration.


The all-important Tesla V100 modules are mounted on an NVLink card in the top of the Boston Anna Volta server (one of the heatsinks has been removed for the photo).

Image: Alan Stevens/ZDNet

The Anna Volta is priced a lot lower than the Nvidia server: Boston quotes $119,000 for a similar specification to the DGX-1 (a saving of $30,000 on list price). For UK buyers that translates to around £91,000 (ex. VAT). The AI software stack isn't included in the Boston price, but most of what's required is open source; Boston also offers a number of competitive maintenance and support services.
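One crude way to compare the three systems is list price per claimed teraflop, sketched in Python below. Treat it with caution: it uses US list prices only, ignores support contracts and the bundled DGX software, and assumes the Anna Volta XL matches the DGX-1's claimed 960 teraflops on the basis that its GPU complement is the same. The dictionary layout and names are ours.

```python
# Rough list-price-per-teraflop comparison using figures from this review.
# Assumption: the Anna Volta XL is credited with the DGX-1's claimed
# 960 teraflops because it has the same eight Tesla V100 GPUs.

systems = {
    "Nvidia DGX-1": {"price_usd": 149_000, "tflops": 960},
    "Nvidia DGX Station": {"price_usd": 69_000, "tflops": 480},
    "Boston Anna Volta XL": {"price_usd": 119_000, "tflops": 960},
}

# Dollars of list price per claimed teraflop, rounded to the nearest cent.
cost_per_tflops = {
    name: round(spec["price_usd"] / spec["tflops"], 2)
    for name, spec in systems.items()
}

# List-price saving of the Boston server over the DGX-1.
saving_usd = (systems["Nvidia DGX-1"]["price_usd"]
              - systems["Boston Anna Volta XL"]["price_usd"])

for name, cost in cost_per_tflops.items():
    print(f"{name}: ${cost} per claimed teraflop")
print(f"Saving on list price: ${saving_usd}")
```

On these assumptions the Boston server comes out cheapest per teraflop, though a fair comparison would also need to price in the software stack and support that Nvidia includes or offers.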

And that's about it in this rapidly emerging market. In terms of the GPU hardware there's really no difference between the products we looked at, so it's all down to preference and budget. And with other vendors preparing to join the fray, prices are already starting to drop as demand for these specialist AI platforms grows.
