... not so well. You want to use the right tools for the job. The GPU is like the hammer. It's not a very elegant tool. It's not architecturally sophisticated, but it's like a jackhammer — it really powers through stuff.
You don't have to make the operating system or Excel run on the GPU; you just run them on the CPU. If you are on Wall Street running a massive Monte Carlo simulation on a spreadsheet, you can get a plug-in that runs the simulation on our GPU while the spreadsheet still runs on the CPU, so you get the best of both worlds.
I believe heterogeneous computing is the way to go. You have a CPU that's becoming more and more vectorised, and you have a GPU that's very, very parallel and able to deal with more and more complex types of parallel tasks. They'll meet in the middle, and some day all apps will simply run incredibly fast.
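The division of labour he describes, serial control flow on the CPU and data-parallel work on the GPU, is roughly what a typical CUDA program looks like. A minimal sketch using the standard CUDA runtime API (the kernel and sizes are illustrative, not anything from the interview):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// GPU side: one lightweight thread per element, massively parallel.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // y = a*x + y
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // CPU side: allocation, setup, I/O. The serial "operating system
    // and Excel" part of the workload stays here.
    float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // GPU side: the data-parallel inner loop, spread across ~a million
    // threads, one per array element.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```

The host code is ordinary sequential C; only the embarrassingly parallel loop is handed to the GPU.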
Virtual memory makes it easier for software developers to do their jobs. That's an ease-of-programming issue, and we will add more and more features to GPUs to address ease of programmability. Memory coherence is another example. It would be nice if, for every app, the first version just works: not very fast, but it works. Then you can tear it apart and get more performance.
Right now CUDA is interesting in the sense that the app might not work at all. It doesn't work, and then, boom, it's infinitely fast. It would be better if it worked but was only three times as fast. Then you can work towards very fast.
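The "first it just works, then you make it fast" path he is asking for is roughly what CUDA's Unified Memory later provided: a single allocation visible to both CPU and GPU, with no explicit copies in version one. A hedged sketch (cudaMallocManaged is a real runtime call; the kernel is illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    float *x;

    // Version one "just works": one managed allocation that both the
    // CPU and the GPU can touch; the driver migrates pages on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;     // written on the CPU

    scale<<<(n + 255) / 256, 256>>>(n, 3.0f, x); // updated on the GPU
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);   // correct, just not yet fast
    cudaFree(x);
    return 0;
}
```

Tearing it apart for performance later means replacing the managed allocation with explicit cudaMalloc/cudaMemcpyAsync and prefetching, once the program is known to be correct.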
How do you move more of the power of GPUs from workstations and supercomputers and make it more generally available, through the cloud or in the datacentre?
The GPU is better suited to running one application at a time. One reason is that we are so stateful. The amount of data streaming through our pipeline doesn't compare with a CPU. We just have so much state inside our processors. We are running a million threads today, and all of those threads inside these processors have to be kept coherent.
On our roadmap we have pre-emption and virtualisation of memory. Those techniques are vital to the era where you have multiple applications on one GPU. Today we have one large app on many GPUs. In the future we'll go the other way. You'll be able to do both — you'll be able to mix and match.
You should be able to have an enterprise server with Tesla inside, and that one Tesla could simultaneously serve up a GeForce session for a gamer, a Quadro session for a car designer and a Tesla session for someone who's doing high-performance computing, and any combination of that mixture.
That's the future server architecture we imagine; something that's not only able to do computing but able to do visualisation and parallel computing, all in the private cloud, and serve up a compressed high-quality image to your desktop or your tablet or your phone.
Whether it's in the cloud or inside the computer, how are you going to deal with the bandwidth issues to keep GPU computing efficient as you scale up?
There is an enormous challenge in computing, which is just moving data around. It's an enormous challenge for us because we are crunching through data so fast. This is a classic computer graphics problem. Moving the data is just evil — the answer is: don't. So you need to figure out a way to move data as little as possible.
In computer graphics, the traditional APIs of the past, the ones that all failed, are the ones that moved data back and forth. They're all dead. We want the parallel computing environment that streams the data to the right place, so that the processors can all access that large memory space, and moves it around as little as we can. Conceptually that's what we need to do.
Some of the things we are already working on address this. With InfiniBand, say, we want to feed data directly into our GPU, to DMA into the GPU, so that you don't copy into system memory and then copy back out of system memory. So you figure out a way to move data as little as possible, and once you've moved it as little as you can, you move it as fast as you can. There is just no replacement for terabytes per second.
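One concrete form of avoiding the extra trip through system memory, on the host side, is pinned (page-locked) memory: a copy from ordinary pageable memory goes through an internal staging buffer, while a pinned buffer can be DMA'd directly. A sketch, assuming only standard CUDA runtime calls (the sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 64 << 20;   // 64 MB, illustrative
    float *h, *d;

    // Page-locked host allocation: the GPU's DMA engine can read it
    // directly, with no hidden bounce through a driver staging buffer.
    cudaHostAlloc((void**)&h, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Asynchronous DMA straight from pinned host memory to the GPU,
    // overlapping with whatever the CPU does next.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(s);
    return 0;
}
```

The InfiniBand work he mentions (what NVIDIA later shipped as GPUDirect) takes the same idea a step further: the network adapter DMAs into GPU memory itself, so the data never lands in system memory at all.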