Nvidia is no longer just a graphics card company. Its advances in graphics processing unit — or GPU — computing with the Cuda parallel architecture in its Tesla and Fermi-based GeForce graphics cards have made massively parallel computing mainstream, using everyday hardware rather than research workstations.
Slightly disappointing second-quarter results and a continuing legal battle with Intel have not daunted chief executive Jen-Hsun Huang, who boasts of leading the list of Top500 supercomputers worldwide with the Nvidia-based Nebulae.
When the company announced its Cuda roadmap at the recent GPU Technology Conference in San Jose, California, ZDNet UK sat down with Huang to find out why parallel computing is proving so popular now and where he believes Nvidia can make a difference.
Q: We've had parallel GPU computing for several years, but why is it really taking off now?
A: It showed up at a time when performance was having a very difficult time scaling, because of power and architectural challenges for processors designed around instruction-level parallelism.
It wasn't until recently that parallel computing — inspired a lot by Cuda — made people realise that there are whole areas in computing science that we can tackle. We have the architecture, the compilers, the software and the programming tools — now how do we scale out into tens of thousands of processor clusters? This whole area is now vibrant again. It's fun, it's sexy again. When you can do something 10 or 100 times faster, something magical happens and you can do something completely different.
The way we design chips today and the way we designed chips 15 years ago are completely different. We now assume that computational resource is infinite, and we just keep building out a server room so that we can explore the entire design space simultaneously and pretty close to exhaustively. If we were to do that in car, tennis shoe or golf club design, we'd be able to build much more amazing things.
Q: Will Cuda persist alongside open-standard OpenCL and other approaches, or will everyone switch to a common parallel computing standard?
A: It's hard to say exactly when it's going to get to that. We're one of the most enthusiastic supporters of OpenCL. One of our executives at Nvidia is the president of [not-for-profit open-standards consortium] The Khronos Group.
We put a lot of energy into developing OpenCL. We're the first to market with every release of OpenCL and are known to have the best implementation. We believe in that particular standard.
But the challenge with GPU computing is that it's changing so fast. We're improving performance by a factor of four every other year, and we're adding features quickly, because parallel computing still has enormous feature deficiencies that we all know about. This is an area of development that has gone on for more than three decades now.
We're not going to come up with a standard tomorrow morning. On the one hand we have a standard, in OpenCL, and that's evolving. On the other hand we have revolutionary changes in GPU computing and we're just going to have to juggle these two things until we settle down a bit.
We focus all our evangelism on Cuda because Cuda requires us to do it; OpenCL does not. OpenCL has the benefit of IBM and AMD and Intel. There are a lot of people to carry that water, so we don't have to carry it all ourselves. And Cuda's programming approach is a little different: instead of a programming API, it's really a language extension, so you're programming in C — you're not calling APIs.
On the other hand, the reason why Cuda is more widely adopted than OpenCL is simply because it's more advanced. We've invested in Cuda for much longer. The quality of the compiler is much better. The robustness of the programming environment is better. The tools around it are better. There are more people programming it.
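Huang's point about Cuda being a language extension rather than an API can be seen in a minimal kernel sketch. This is an illustration, not code from the interview: the SAXPY kernel, sizes and launch configuration are invented for the example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// A Cuda kernel is ordinary C with a __global__ qualifier; each of the
// n threads computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;                      // device (GPU) copies of the arrays
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // The triple-angle-bracket launch is the language extension Huang
    // describes: it reads like a C function call, not an API sequence.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);        // 2*1 + 2 = 4
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```

The contrast with an API-style model is in the launch line: the kernel is invoked directly from C source, and the compiler, not a runtime call sequence, handles the dispatch.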
Q: What does it take for parallel GPU computing to be truly pervasive? Do we need operating systems and standard apps to take more advantage of the GPU? Do we need to make development easier?
A: It turns out that parallel computing as we've implemented it is successful because we didn't try to take over the CPU. We didn't wake up and say we've invented something that's going to disrupt the CPU and the hundred trillion dollars of R&D budget that's already gone into making that thing better.
However, there is a class of problems where we could add something to the CPU and turbo-charge it. The CPU does some things really well, but many things not so well. You want to use the right tool for the job. The GPU is like a hammer: it's not a very elegant tool, it's not architecturally sophisticated, but like a jackhammer it really powers through stuff.
You don't have to make the operating system or Excel run on the GPU; you just run them on the CPU. If you're on Wall Street running a massive Monte Carlo simulation in a spreadsheet, you can get a plug-in that runs the simulation on our GPU while the spreadsheet still runs on the CPU — so you get the best of both worlds.
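The Wall Street example describes a division of labour: the application logic stays on the CPU and only the numerically heavy trials move to the GPU. A rough sketch of that split, using Nvidia's cuRAND library for on-device random numbers — the simulation here (estimating pi) and all sizes are invented stand-ins, not anything from the interview:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread runs its own batch of Monte Carlo trials (sampling points
// in the unit square) and writes a partial hit count.
__global__ void mc_trials(unsigned long long seed, int trials_per_thread,
                          unsigned int *hits)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);     // independent per-thread random stream

    unsigned int inside = 0;
    for (int t = 0; t < trials_per_thread; ++t) {
        float x = curand_uniform(&rng);
        float y = curand_uniform(&rng);
        if (x * x + y * y <= 1.0f)
            ++inside;
    }
    hits[tid] = inside;
}

int main(void)
{
    const int threads = 64 * 256, per_thread = 4096;
    unsigned int *d_hits;
    unsigned int *h_hits = (unsigned int *)malloc(threads * sizeof(unsigned int));
    cudaMalloc(&d_hits, threads * sizeof(unsigned int));

    mc_trials<<<64, 256>>>(1234ULL, per_thread, d_hits);   // GPU runs the trials
    cudaMemcpy(h_hits, d_hits, threads * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // The "spreadsheet" side of the split: the CPU only aggregates results.
    double total = 0;
    for (int i = 0; i < threads; ++i) total += h_hits[i];
    printf("pi ~= %f\n", 4.0 * total / ((double)threads * per_thread));

    cudaFree(d_hits); free(h_hits);
    return 0;
}
```

The design choice mirrors Huang's point: the embarrassingly parallel inner loop goes to the GPU, while the serial bookkeeping stays where the CPU is strongest.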
I believe heterogeneous computing is the way to go. You have a CPU that's becoming more and more vectorised, and you have a GPU that's very, very parallel and able to deal with more and more complex types of parallel tasks. They'll meet in the middle, and some day all apps will simply run incredibly fast.
Virtual memory lets us make it easier for software developers to do their jobs. That's an ease-of-programming issue. We will add more and more features for GPUs to address ease of programmability. Memory coherence is another example. It would be nice for all apps if the first version just works. Not very fast but it just works. Then you can tear it apart and get more performance.
Right now Cuda is interesting in the sense that the app might not work at all: it doesn't work, and then, boom, it's infinitely fast. It would be better if it works but is only three times as fast; then you can work towards very fast.
Q: How do you take the power of GPUs beyond workstations and supercomputers and make it more generally available, through the cloud or in the datacentre?
A: Today the GPU is better at running one application at a time. One reason is that we are so stateful. The depth of our pipeline and the amount of data streaming through our GPUs don't compare with a CPU. We just have so much state inside our processors. We are running a million threads today, and there are a lot of threads inside these processors that have to be kept coherent.
On our roadmap we have pre-emption and virtualisation of memory. Those techniques are vital to the era where you have multiple applications on one GPU. Today we have one large app on many GPUs. In the future we'll go the other way. You'll be able to do both — you'll be able to mix and match.
You should be able to have an enterprise server with Tesla inside, and that one Tesla could simultaneously serve up a GeForce session for a gamer, a Quadro session for a car designer and a Tesla session for someone doing high-performance computing — and any combination of that mixture.
That's the future server architecture we imagine; something that's not only able to do computing but able to do visualisation and parallel computing, all in the private cloud, and serve up a compressed high-quality image to your desktop or your tablet or your phone.
Q: Whether it's in the cloud or inside the computer, how are you going to deal with the bandwidth issues to keep GPU computing efficient as you scale up?
A: There is an enormous challenge in computing, which is just moving data around. It's an enormous challenge for us because we are crunching through data so fast. This is a classic computer graphics problem. Moving the data is just evil — the answer is: don't. So you need to figure out a way to move data as little as possible.
In computer graphics, the traditional APIs that failed are the ones that moved data back and forth. They're all dead. We want the parallel computing environment to stream the data to the right place, so that the processors can all access that large memory space, and to move it around as little as we can. Conceptually, that's what we need to do.
Some of the things we're already working on, with InfiniBand for example, feed directly into our GPU, or DMA into our GPU, so that you don't copy into system memory and then copy back out of system memory. You want to figure out a way to move data as little as possible, and once you've moved it as little as you can, you need to move it as fast as you can. There is just no replacement for terabytes per second.
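The copy-avoidance Huang describes can be approximated in Cuda today with pinned host memory and asynchronous copies, which let the GPU's DMA engine read host RAM directly instead of staging through a pageable-memory copy. A sketch only; the buffer size and stream setup are invented for the example:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;       // 64 MB payload (arbitrary)
    float *h_buf, *d_buf;

    // Pinned (page-locked) host memory: the GPU can DMA it directly,
    // avoiding the extra staging copy through pageable system memory.
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous DMA: the transfer overlaps with whatever the CPU
    // does next, so data movement hides behind useful work.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);       // block only when the data is needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

This is the host-side half of the idea; feeding a network adapter's buffers straight into the GPU, as Huang describes with InfiniBand, extends the same principle past system memory entirely.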