Tera-Scale: What Would We Do with All These Cores and How Would We Feed Them?


Last week’s Tera-scale announcement at the International Solid State Circuits Conference (ISSCC) certainly created a lot of buzz in the press and on the Web. I have to admit being somewhat surprised by how extensively the story was picked up, not just in the technical press, but the popular press as well. From the many interviews I did, it was quite clear that people have an insatiable desire to know what their future computing devices will do and how soon they will do it. Fortunately, researchers at Intel and elsewhere have spent several years, not just thinking about the question, but actually building prototypes of those next-decade applications. Believe me when I say it’s much more credible to talk about a specific example than just blow some smoke and promise that whatever those applications are, they will be really cool.

Back to Recognition, Mining, and Synthesis

I first addressed why now is the time to create these applications in my post Cool Codes, in which I introduced the RMS categories. The important point is that there is an entirely new breed of applications waiting to be invented that doesn’t simply benefit from Tera-scale performance; it requires it. Let me refresh you on RMS by talking about real-time motion capture and rendering and a few other examples that illustrate the idea.

Today, producing a Pixar-quality image takes about 6 hours of computing on a current-generation, dual-processor rack-mount server. That's to render just one frame out of the 144,000 frames required for a feature-length animated movie. How cool would it be if you could bring that quality of image rendering to your desktop in real time? Imagine playing the Cars video game with imagery comparable to what you see in the theater. To create that user experience, we have to go from 6 hours per frame to 1/24th of a second per frame, but at least it’s a very well-characterized computational improvement. It will take a combination of teraFLOPS of computing power and huge advances in the algorithms that render the image. Note that synthesis is the “S” in RMS, and this is but one example.
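To put rough numbers on that gap (a back-of-envelope sketch assuming the standard 24 frames per second of film; the figures are illustrative, not measured):

```python
# Back-of-envelope: how much faster rendering must get to go from
# offline, film-quality frames to real-time frames on the desktop.
HOURS_PER_FRAME_TODAY = 6        # ~6 hours per frame on a dual-processor server
TARGET_FPS = 24                  # standard film frame rate (assumed)

seconds_per_frame_today = HOURS_PER_FRAME_TODAY * 3600    # 21,600 seconds
seconds_per_frame_target = 1.0 / TARGET_FPS                # ~0.042 seconds

speedup = seconds_per_frame_today / seconds_per_frame_target
print(f"Required speedup: ~{speedup:,.0f}x")               # roughly 500,000x

# Sanity check on the 144,000-frame figure: at 24 fps that is a ~100-minute feature.
print(f"Feature length: {144_000 / TARGET_FPS / 60:.0f} minutes")
```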

By the way, synthesis is not just about making pictures. It's about making sounds and making things move and interact with one another in physically accurate ways. When an animated character speaks in these future desktop animations, its facial muscles will move exactly as a real person's do. It does raise the question of whether we’ll actually need actors at some point, but that’s a topic for another blog.

Here’s another example: Today in our labs we can data-mine the imagery found in recorded multi-camera video of an individual moving within a defined 3D space. The goal of this video-stream mining is to extract the person's full body motion. We can’t quite do it in real time at this point, but we are pretty close, and there’s no need for markers or lights on the clothing or a background blue screen to do it. By the way, mining is the “M” in RMS.

Once we have the body motion information, we use it to animate a skeletal model of a human. It’s the skeletal model that makes sure we have the kinematics right and the motion is consistent with how people move. At that point, we can put the “skin on the bones” to create a fully synthetic person moving identically to the real one. Adding lights, shadows, and reflections to our little virtual world gives us a synthetic figure moving naturally and accurately within it.
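For a flavor of what the skeletal model contributes, here is a minimal sketch (my own two-dimensional simplification, not the code from our labs): given the joint angles recovered from the video, forward kinematics chains the bones together so each segment's position stays consistent with its parent's.

```python
import math

# Minimal 2D forward-kinematics sketch: each bone has a length and a joint angle
# relative to its parent; accumulating them down the chain gives joint positions
# that are always consistent with how the linkage can actually move.
def forward_kinematics(bones, origin=(0.0, 0.0)):
    """bones: list of (length_in_meters, relative_angle_in_radians)."""
    x, y = origin
    angle = 0.0
    positions = [(x, y)]
    for length, rel_angle in bones:
        angle += rel_angle                 # orientation accumulates along the chain
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        positions.append((x, y))
    return positions

# Hypothetical arm: shoulder -> elbow -> wrist, with made-up lengths and angles.
arm = [(0.30, math.radians(40)), (0.28, math.radians(-60)), (0.18, math.radians(10))]
for i, (px, py) in enumerate(forward_kinematics(arm)):
    print(f"joint {i}: ({px:.2f}, {py:.2f})")
```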

If you started to think how the above technology could replace the Wii handheld remote controllers, you’ve got the idea. Future video entertainment will use full-body motion capture to put your virtual self in the game, dance instruction, or Tai Chi lesson.

Take out the Noise, Take out the Shake

Most of us have cassettes full of VHS-quality (or worse) home video. When we put it up on our new 50-inch HD displays, it simply looks awful. Adding video cameras to cell phones has further exacerbated the problem. Fortunately, there is a way to rescue these old videos. The technique is called super-resolution, and it takes advantage of the tremendous amount of redundancy in a video stream. Using statistical techniques, we can dramatically reduce camera shake, improve resolution, and fix a variety of other visual problems by exploiting all the extra information each frame provides. Imagine being able to bring all your cell phone videos up to standard-definition quality and reprocess those “obsolete” DVDs to high-definition quality. It’s a Tera-scale problem for sure, and the reconnaissance satellite folks have been doing it for years. It’s time to make it safe for home use.
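Here is a deliberately tiny sketch of why that redundancy helps (my own toy example, nothing like a production super-resolution pipeline): align neighboring frames to a reference and combine them, and the random noise averages away while the real detail reinforces itself.

```python
import numpy as np

def fuse_frames(frames, shifts):
    """Align each frame by undoing its known (dy, dx) shift, then average.
    Real super-resolution estimates sub-pixel shifts and reconstructs onto a
    finer grid; this only illustrates noise suppression from redundancy."""
    aligned = [frames[0].astype(np.float64)]
    for frame, (dy, dx) in zip(frames[1:], shifts[1:]):
        aligned.append(np.roll(frame.astype(np.float64), (-dy, -dx), axis=(0, 1)))
    return np.mean(aligned, axis=0)        # noise shrinks roughly as 1/sqrt(N)

# Hypothetical input: the same scene captured four times with shake and sensor noise.
rng = np.random.default_rng(0)
scene = rng.random((120, 160))
shifts = [(0, 0), (1, -2), (-1, 3), (2, 1)]
frames = [np.roll(scene, s, axis=(0, 1)) + rng.normal(0, 0.2, scene.shape)
          for s in shifts]

print("single-frame error:", round(float(np.abs(frames[0] - scene).mean()), 3))
print("fused-frame error: ", round(float(np.abs(fuse_frames(frames, shifts) - scene).mean()), 3))
```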

How Is It Possible to Feed Such a Beast?

Silent E was right in pointing out that memory capacity and bandwidth have to match or the cores will “starve” and users will not see the performance benefits. It’s relatively easy to pack a lot of processing power on a single chip. It’s much, much harder to provision the memory and I/O bandwidth to keep those processors productive. Fortunately, there are several approaches that promise to meet those future needs. Let me briefly mention two of them.
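Before getting to those, a rough illustration of why feeding the cores is the hard part (a sketch with assumed, purely illustrative numbers, not Intel measurements): even a fraction of a byte of memory traffic per floating-point operation adds up to bandwidth far beyond a conventional external DRAM channel.

```python
# Back-of-envelope memory-bandwidth demand for a teraFLOPS-class processor.
SUSTAINED_FLOPS = 1.0e12      # one teraFLOPS, sustained (assumed)
BYTES_PER_FLOP = 0.25         # assumed average off-chip traffic per operation

required_bytes_per_second = SUSTAINED_FLOPS * BYTES_PER_FLOP
print(f"Required bandwidth: {required_bytes_per_second / 1e9:.0f} GB/s")   # 250 GB/s

# Compared with an assumed ~10 GB/s commodity DRAM interface of the day:
print(f"Shortfall: {required_bytes_per_second / 10e9:.0f}x")               # ~25x
```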

First, we need to bring more memory closer to the processors, and three approaches do this with varying degrees of bandwidth and capacity. The first is to use system-in-package (SIP) technology to place memory chips in the same package as the processor. Microsoft uses this approach in the Xbox 360. The next approach is to stack a memory chip underneath the processor, which is what we have planned as a future experiment with the Tera-scale Research Processor. Finally, there is embedding DRAM on the processor itself, as IBM described last week at ISSCC. Much work is required to decide which approach is best in a given situation, but the point is that there is more than one solution.

Getting data on and off the chip is also a challenge. While we continue to push electrical signaling to higher and higher speeds, optical signaling is an increasingly attractive option. Costs are coming down and may decline even further when we move to silicon-based photonic solutions. If we can approach electrical costs, but still provide the flexibility and interference advantages of optical, we might just go optical. Once you make that transition, things look good out to about 10 terabits per second per fiber, which should keep us going for a little while to say the least.
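To give a sense of what 10 terabits per second per fiber buys (a quick sketch with an assumed uncompressed frame size):

```python
# How many uncompressed HD frames per second fit in a 10 Tb/s optical link?
LINK_BITS_PER_SECOND = 10e12
BITS_PER_HD_FRAME = 1920 * 1080 * 24      # assumed 1080p frame at 24 bits/pixel

frames_per_second = LINK_BITS_PER_SECOND / BITS_PER_HD_FRAME
print(f"Uncompressed 1080p frames per second: ~{frames_per_second:,.0f}")  # ~200,000
```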

Tera-scale keeps sounding more and more fun. Stay tuned as I continue to paint the complete picture. The blog is long overdue for a discussion of the programming challenges ahead.

Talkback

  • A few notes . . .

    "To create that user experience, we have to go from 6 hours per frame to 124th of a second per frame"

    Why 1/124th? 30 frames/sec is a decent framerate, last I heard, and it's also the framerate of full-motion video. 124 frames/sec is just plain overkill.

    And video games are a lot more than motion, physics, and a fancy body motion detector - you forgot about displaying the game world. You didn't mention shaders or triangles - if we ever want to see Pixar quality games, [b]vast[/b] improvements have to still be made in these areas as well.

    You started out getting us all excited about Pixar quality pictures, but you go into detail about everything else. Geeze.

    Somebody please remind the tech experts [i]again[/i] that business and scientific computing is [b]not[/b] the same as gaming computing, and just because you know all of this fancy stuff does not a gaming expert you make.
    CobraA1
    • 1/124th of a second is compute time

      the rest of the 1/30 of a second is used to render and plot the pixel, transmit it to the graphics interface and then shove it out the digital or analog video interface. I'm sure I left something out.
      BTW I think he's probably got a tighter grip on the technology needed for gaming than you appreciate.
      Xwindowsjunkie
    • 1/124th of a second

      is about 8.1 milliseconds. To paraphrase Data, "to an android it's an eternity." It also means that the CPU is running at least 2.4 GHz, assuming that the cycle time for each instruction is 4 T intervals or clock cycles, since x86 architecture uses a minimum of 4 clock cycles per instruction. It also ignores how much processing can be off-loaded to the graphics processor.

      Also FYI 30 frames a second is the INTERLACED frame rate which in reality is 2 fields of half-resolution video.

      If you want to run at full-speed NON-interlaced HDTV rates, you'll need a lot faster processing for 60-frame-per-second non-interlaced video because it's 4 times as much data to manipulate. If you are running three monitors for a wide or tall desktop, your CPU needs to run at 3 times the normal speed to display full-motion video on all 3 screens, hence 1/120th or 1/124th of a second. No, it's not overkill. It's barely enough compute power.
      Xwindowsjunkie
  • Bitgrid?

    I've long held the opinion that we need to take this grid idea to its logical conclusion... a grid of processors working on 4 bits of input and creating 4 bits of output, 1 to/from each neighbor in a Cartesian grid. I've written it up at bitgrid.blogspot.com

    What do you think?
    --Mike--
    m.warot@...
    • Algorithms

      Trouble is, most algorithms don't lend themselves to being deployed on such massively parallel communicating processors. It's been tried before for HPC (which probably has the problems best suited for such implementations). The HPC crowd may well go back to such architectures when multicore with interconnects becomes the norm. But general-purpose 'desktop' applications will be hard pressed to utilize such an architecture.
      NetNet
  • 124th... probably meant 1/24th

    It probably got chopped up in editing from 1/24th of a second to 124th... but I could be wrong.
    --Mike--
    m.warot@...
  • THINK OF COMPUTER ANIMATION AS THOUGH IT WERE A COMPUTER GAME

    In a computer game the actions are carried out in real time. In computer animation the art and the art movements are programmed and viewed. I click "START" and I see the animation play on my computer display. Rendering is the process of constructing the frame-by-frame video file. I render the animation and wait for the computer to finish the file rendering. This finished file is the file that the DVD is made from. The audio must also be added. When I view the animation, before rendering, I am seeing the program run, frame by frame, in the computer's memory. (Somehow the video card's memory is used here.) I am not certain as to why it takes so long to render. For high definition, the display's screen resolution has to be chosen carefully: 800 x 600 or 1280 x 1024 or even higher. Resolution is a selection in the animation software program itself.
    BALTHOR
    • Rendering

      When a computer renders, it needs to calculate all the different angles, etc. throughout each frame until the end of the composition. It's the same as if you took a sticky-note pad and created each character, object, and motion as they moved through time on each sheet. You refer to the previous sheets when deciding where to place something, etc. The same goes for a computer.

      This is why you won't see movie animation quality in computer or console video games in the near future. There's just too much to calculate in a timely manner.
      THEE WOLF
  • Hi

    I agree with you on this matter...
    brianpippen12@...
  • Polaris is not what it seems

    My reading of the ISSCC paper on Polaris (http://forums.techgage.com/attachment.php?attachmentid=155&d=1171236277) is that this is purely an exercise in technology development. Borkar's group published a paper at ISSCC a few years ago on a fast FP MAC, out of which Polaris has been developed.
    Despite Intel's claims, Polaris in this form is not usable for HPC applications, as it only supports single-precision arithmetic.
    The FP unit uses deferred normalisation, so any loop of code which stores results back to memory will have to perform an additional normalisation step, which will degrade the top-line performance.
    Furthermore, the on-chip data memory per node is only 2kB (512 x 32-bit words), and the 3kB instruction memory will only hold 256 x 96-bit instructions.
    The small memory and reliance on the NoC to supply data will mean that the performance and power costs will be very high compared with the IBM Cell, which has 256kB/node of local data/program storage.
    Most tellingly of all, the instruction set only supports FP MACs (no divides, square roots, etc.), so it is really only of use as a headline-grabbing marketing exercise.
    In terms of graphics and ray tracing, single precision will suffice (as in the Cell/SPE), but there is no support for sqrt and reciprocals, which is a big problem; the lack of these functions would also make game physics impossible, as collision detection requires heavy use of trigonometric operations.
    moloned@...
  • Fast Ray Tracing using all these Cores

    Computer games will look more realistic if ray tracing replaces current GPU-based rendering. The main complaint is that ray tracing is slow and no specialized hardware is available. Now that Intel's tera chip answers the hardware needs, the next step is to develop fast ray-tracing software for further speed enhancements. The most time-consuming part of ray tracing is the ray-surface intersection. A fast explicit intersection algorithm is available for second-order surfaces. To this end I have developed a fast explicit intersection algorithm for cubic and quartic Bezier triangles; with it, we can revolutionize the technology. I am wondering if Intel would be interested in making use of this fast algorithm, which uses fewer elements to model and is also fast to ray trace.
    jchinniah
  • Multi-Core To The Masses

    Mr. Rattner,
    I am writing a paper on multi-core processing and was wondering how I can obtain access to your article "Multi-Core to the Masses." Any help would be greatly appreciated, thanks.

    E-mail: aim54pheonix@hotmail.com
    aim54pheonix