The many-core performance wall

Summary: Many-core chips are the great hope for more performance but Sandia National Lab simulations show they are about to hit a memory wall. How bad is it?

Some background
Memory bandwidth is the limiting performance factor in CPUs. If you can't feed the beast, it stops working, simple as that.

John von Neumann - your PC/Mac is a von Neumann architecture machine - made the point in his very technical First Draft of a Report on the EDVAC (pdf):

. . . the main bottleneck, of an automatic very high speed computing device lies: At the memory.

Here are the ugly results of Sandia's simulations:

Simulated performance of many-core chips with 2 different memory implementations [graph courtesy Sandia National Lab]

Performance roughly doubles from 2 cores to 4 (yay!), is nearly flat from 4 to 8 (boo!), and then falls (hiss!).

Did Pink Floyd forecast this? Chip packages support just so many pins and so much bandwidth. Transistors per chip double every couple of years - but the number of pins doesn't.
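To see why the curve bends, here's a back-of-the-envelope model - my own illustration with assumed numbers, not Sandia's simulation. Cores multiply, but they all drink through the same fixed set of pins:

```python
# Toy model of a bandwidth-bound workload. Every number here is an
# assumption chosen for illustration; the shape, not the values, is the point.

PIN_BANDWIDTH = 40e9    # bytes/sec through the package pins (assumed)
CORE_FLOPS = 2e9        # flop/sec a single core can sustain (assumed)
BYTES_PER_FLOP = 4      # memory traffic generated per flop (assumed)

def chip_throughput(cores):
    """Aggregate flop/sec: the lesser of the compute and memory limits."""
    compute_limit = cores * CORE_FLOPS
    memory_limit = PIN_BANDWIDTH / BYTES_PER_FLOP
    return min(compute_limit, memory_limit)

for cores in (1, 2, 4, 8, 16, 32):
    print(f"{cores:2d} cores: {chip_throughput(cores) / 1e9:.1f} Gflop/s")

# Throughput doubles out to 4 cores, then the pins cap it at 10 Gflop/s.
# Add contention and coherence overhead and the curve turns down - the
# shape in the Sandia graph.
```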

Professors William Wulf and Sally McKee named it "the memory wall" in their 1994 paper Hitting the Memory Wall: Implications of the Obvious (pdf), saying:

We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed - each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.
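Their argument boils down to the paper's expression for average access time, t_avg = p * t_cache + (1 - p) * t_dram. A quick sketch - the improvement rates and latencies below are assumptions in the spirit of the paper, not its exact figures - shows how the divergence compounds:

```python
# Average access time per Wulf & McKee: t_avg = p*t_cache + (1-p)*t_dram.
# Improvement rates and latencies below are illustrative assumptions.

P_HIT = 0.98                      # cache hit rate, held constant (assumed)
CPU_RATE, DRAM_RATE = 1.5, 1.07   # yearly speed improvement factors (assumed)
DRAM_CYCLES_Y0 = 70.0             # DRAM latency in CPU cycles at year 0 (assumed)

for year in range(0, 11, 2):
    # DRAM latency measured in CPU cycles grows as the two rates diverge.
    dram_cycles = DRAM_CYCLES_Y0 * (CPU_RATE / DRAM_RATE) ** year
    t_avg = P_HIT * 1.0 + (1 - P_HIT) * dram_cycles
    print(f"year {year:2d}: DRAM = {dram_cycles:7.0f} cycles, "
          f"average access = {t_avg:6.1f} cycles")

# Even at a 98% hit rate, average access time climbs relentlessly -
# the CPU ends up waiting on memory no matter how fast its own logic gets.
```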

According to an article in IEEE Spectrum, that time is almost upon us. With cores per processor doubling every 2-3 years - and graphics chips moving faster - we don't have long to wait.

The memory wall's impact is greatest on so-called informatics applications, where massive amounts of data must be processed. Like sifting through petabytes of remote sensing data to find bad guys with nukes.

Can this be fixed? Sandia is investigating stacked memory architectures, popular in cell phones for space reasons, to get more memory bandwidth. But as the simulation shows, more bandwidth alone doesn't keep performance scaling.

Rambus is working on a Terabyte Bandwidth Initiative that may help. The goal: 64 DRAMs at 16 GB/sec each - a terabyte per second in aggregate - over differential data channels feeding a system-on-a-chip memory controller.

Intel needs to pick up the pace. Nehalem processors are Intel's first with an on-chip memory controller and the new QuickPath Interconnect (QPI). But server-class Nehalems are limited to 2 QPI links, for a total theoretical bandwidth of only 50 GB/sec. Faster, pussycat, faster!
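How far does 50 GB/sec stretch as cores multiply? A rough per-core budget, taking the 50 GB/sec figure above and assuming core counts and a 3 GHz clock of my own choosing:

```python
# Per-core memory budget at Nehalem-class bandwidth. The 50 GB/sec total
# is the figure above; the core counts and 3 GHz clock are assumptions.

TOTAL_BW = 50e9    # bytes/sec across 2 QPI links
CLOCK = 3e9        # cycles/sec per core (assumed)

for cores in (4, 8, 16, 32):
    bytes_per_core_cycle = TOTAL_BW / (cores * CLOCK)
    print(f"{cores:2d} cores: {bytes_per_core_cycle:.2f} bytes per core per cycle")

# By 16 cores each core gets roughly one byte per cycle from memory -
# thin rations for bandwidth-hungry informatics workloads.
```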

The Storage Bits take
Many-core is the future for computer performance. Memory bandwidth is one big problem. Software support for efficient many-core use is another. Either could bring the performance expected from Moore's Law to a dead stop.

The industry is making big investments in solving both problems. If it is a problem for Sandia today, it will be a problem for consumers in 10 years.

What if one doesn't get solved? Then the Moore's Law rocket we've been riding will sputter and die. Life on the glidepath won't be nearly so much fun.

Comments welcome, of course. If anyone wants to make the case that von Neumann was wrong, I'm all ears.



Talkback

37 comments
  • Ultimately, software will need to change

    The problem with parallel computing has always been
    contention over shared resources.

    For computational problems that can be solved using a
    divide-and-conquer strategy, there is a solution: give
    each processor core its own private bank of memory
    (whether it is on the chip or off is not essential to
    the solution). If you remember the Transputer in the
    late 1980s, it used this architecture.

    The problem of course is that you have to be able to
    divide-and-conquer and not all problems can be solved
    this way. Even worse, it requires a different way of
    thinking and a different way of writing software.

    Long live OCCAM. :-)
    timboldt
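A minimal Python sketch of the divide-and-conquer, private-memory pattern the comment above describes - illustrative only, not the commenter's code or the Transputer's OCCAM. Each worker process gets its own chunk in its own address space, so there is no contention over shared memory; only small partial results are combined:

```python
# Each worker gets a private chunk of the data in its own address space;
# there is no shared state to fight over, only a cheap combine at the end.
from multiprocessing import Pool

def partial_sum(chunk):
    """All work happens in the worker's private memory."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    step = (len(data) + workers - 1) // workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))   # combine step

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000_000))))
```

The catch, as the comment notes, is that not every problem decomposes this cleanly.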
    • Parallel computing

      It seems to me that since our hardware, with its multiple cores, is running in parallel, the software equivalent has to be developed. For too long we have relied on hardware solutions to our need for speed. Now we need to start relying on software to catch up to the hardware architectures we have in place.
      ArrowQuick
  • RE: The many-core performance wall

    Well, it seems obvious that sooner or later this would become a problem.
    I have always been fascinated by the fact that companies like AMD and especially Intel were so focused on theoretical CPU performance (more MHz then, more cores now) at the expense of the overall practical performance that could be gained from significant improvements in memory architecture and performance.
    The exponential increase in CPU power mainly benefits applications that require a lot of processing power and not so much fast, efficient memory.
    The funny thing is that this kind of job could be done significantly better by specialized CPUs or processing units rather than the generic CPU used in today's PCs, as shown by the Cell-based coprocessor in the Toshiba G50, which is several times more powerful for video editing than the laptop's main CPU despite a significantly lower clock frequency.
    timiteh
    • only innovation will save us

      "The funny thing is that this kind of job could be
      significantly better done by specific CPU or
      processing units rather than the generic CPU used in
      today PC, as proven by the cell based coprocessor used
      in the Toshiba G50"

      Interesting to note that this solution, the Cell and
      its brethren, use the first generation of the RAMbus
      technology noted in the original article. According
      to Sony/Toshiba/IBM, 96% of the Cell's interconnects use
      RAMbus technology. It's too bad so many folks are
      focused on driving this company of clever engineers
      (400 engineers, 10 lawyers, not the other way around)
      out of business because they are so ill-informed as to
      who is doing the innovating and who is just trying to
      free-ride.
      jsm88
      • We've got a solution. AMD nailed it a long time ago: 64-bit

        But at the time, Intel and Microsoft were buddy-buddies. Now that Intel finally has a functional 64-bit processor, suddenly Microsoft is pushing 64-bit. Had they been pushing it this aggressively back when AMD came out with its 64-bit processor, we'd have a complete ecosystem by now. But the monopoly slowed Moore's Law by a few years.
        Breetai
  • RE: The many-core performance wall

    Well, unless they're prepared to start from scratch and build the equivalent of the Connection Machine on a chip, progress will flatline.

    On the Connection Machine the memory 'is' the ALU. It had 65,536 bit-slice CPUs. On current CPUs the memory is idle 'most of the time', even though it still requires power to sit idle.
    V@...
  • time to separate "performance" from "scalability"

    "Performance" is about going faster, and in a strict sense, more cores will always impose a performance penalty.

    "Scalability" is about being able to add more users and more tasks with minimal loss of performance as workload increases. Scalable systems always underperform against systems tuned for performance. But scalable systems will keep performing when high-performance systems hit the wall.

    So, which are we talking about here?

    And forget cores. Where does cloud computing start to figure in here, since clouds are all about scalability?
    diane wilson
    • Re: Scaleability

      I think in this case you're off the mark somewhat. :)

      This issue is about removing a problem that has existed since the late 1940s and '50s: a bottleneck on memory throughput, because the original model of computation was strictly sequential. This is limiting the speed of N-core chips due to the nature of the CPU-memory link.
      V@...
    • Great point, but the results of this test are even less worrisome

      Great point, but the results of this test are even
      less worrisome. That's because it's specifically
      talking about certain types of HPC applications that
      are memory bandwidth hungry, and not all HPC apps are
      that memory hungry. HPC does not represent the
      majority of server loads and it's the reason server
      CPU pricing is related to SPECint performance and not
      SPECfp performance. Furthermore, Intel has plenty of
      time to solve the problem and current memory bandwidth
      (especially Nehalem) is more than sufficient for the
      current CPU engines.

      This story has gotten way out of hand and it almost
      seems like a way to bash x86 and Intel architecture.
      The fact of the matter is, if x86/x64 can't keep up
      with HPC requirements, the HPC community can always go
      and use boutique processors if they think they can get
      a better deal.
      georgeou
      • Still Waiting for the CPU & GPU to Merge

        Then we'll know what direction General Purpose computing takes.

        Hoping AMD/ATi can survive the downturn to keep the market competitive. (ie affordable)
        V@...
  • RE: The many-core performance wall

    I've seen similar plots in simulations I used to do. The performance hits I saw were due to the time required to load the L1 caches (in the CPUs) whenever a cache miss occurred - the first access to the main memory is very slow and this can easily stall the processor. L2 caches help, but there's a limit to the help.
    bbodnar1@...
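The effect this comment describes can be put in rough numbers with a simple two-level cache model; every hit rate and latency below is an assumption chosen for illustration:

```python
# Two-level cache model: average access looks cheap, but any single miss
# to main memory stalls the core for hundreds of cycles.

L1_HIT, L1_LAT = 0.95, 4    # L1 hit rate and latency in cycles (assumed)
L2_HIT, L2_LAT = 0.80, 12   # L2 hit rate among L1 misses, latency (assumed)
MEM_LAT = 200               # main-memory latency in cycles (assumed)

l1_miss = 1 - L1_HIT
l2_miss = l1_miss * (1 - L2_HIT)
avg = L1_HIT * L1_LAT + l1_miss * L2_HIT * L2_LAT + l2_miss * MEM_LAT

print(f"average access: {avg:.1f} cycles, but a cold miss costs {MEM_LAT}")
# 3.8 + 0.48 + 2.0 ~= 6.3 cycles on average; L2 helps, "but there's a
# limit to the help" - one miss still idles the core for 200 cycles.
```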
  • RE: The many-core performance wall

    If you think in terms of pipe size, the smallest pipe needs to be as big as the rest or it becomes the point of resistance.
    It's obvious that the CPU is not the problem anymore. It has not been for some time. Even graphics has hit a brick wall. Unless we come up with new technology, we are at the end of what we have now.
    jscott418-22447200638980614791982928182376
  • Annual - "The end is near"

    Every year about this time someone comes along to tell us why Moore's Law is at an end, and of course right after that the manufacturers tell us how they have solved it for another year. Yawn...
    No_Ax_to_Grind
    • Nearer than you think.

      Ya canny change the laws of physics.
      V@...
      • Well, your laws of physics...

        Let's step back for a moment and realize that we only know physics from a very basic standpoint, and as time goes by we become acutely aware of the flaws in our perception. Newtonian physics is also vastly different from high energy physics.

        The laws of conservation of mass come into play right up until you introduce fusion into the equation, then they go out the window. The Voyager spacecraft left the Oort cloud and slowed down for no apparent reason, as it's supposedly empty space, and its telemetry reported nothing at all.

        We also have new materials we didn't have 5 years ago that allow us to construct at progressively smaller levels, granted at a rudimentary level, but it's not impossible for us to produce sophisticated micronized objects.

        To put it simply, we're far closer to the limits of manufacturing than we are to the limits of physics.
        Spiritusindomit@...
      • Yup, someone makes that claim every year.

        Yawn...
        No_Ax_to_Grind
      • Maybe not so near

        The "laws" of physics can and do change. The 20th century saw the greatest change of all so far.

        If you don't think so, then just recall these names: Einstein, Heisenberg, Hamilton, Feynman, Hawking, etc.
        ron.cleaver@...
        • Ugh?

          No, it's the concepts that get revised. Physics will be the same no matter what happens.

          Assuming that electrical conduction remains the method of transmission.
          Q1. What is the smallest 'discernible' voltage that can be reliably used 'above noise floor'?

          Q2. How many atoms thick does a conductor need to be to handle the X Amps passing through it, without being vaporised?

          AFAIK the laws of physics still apply no matter what exotic material appears in the near future, as every element in the periodic table has practical limits. Cooling fans and water-cooling keep the current silicon and aluminium/copper within their physical 'useful' limits.

          IBM's 350 GHz speeds at 4.5 Kelvin are extremely unlikely for the rest of us.
          V@...
          • Quantum computing, photonic CPU's, etc...

            nt
            T1Oracle
          • Yeah, Yeah - Still Waiting for Serious Commitment

            I can recall Intel's fanfare about photonics ages ago. They sure don't seem in a hurry to get it into the marketplace. I believe it's only currently used for the endpoints of fibre-optic links.

            BTW, why are your posts so damned vague?
            V@...