The many-core performance wall
Summary: Many-core chips are the great hope for more performance but Sandia National Lab simulations show they are about to hit a memory wall. How bad is it?
Some background: Memory bandwidth is the limiting performance factor in CPUs. If you can't feed the beast it stops working, simple as that.
John von Neumann - your PC/Mac is a von Neumann architecture machine - made the point in his very technical First Draft of a Report on the EDVAC (pdf):
. . . the main bottleneck, of an automatic very high speed computing device lies: At the memory.
Here are the ugly results of Sandia's simulations:
[graph courtesy Sandia National Lab]
Performance roughly doubles from 2 cores to 4 (yay!), stays nearly flat to 8 (boo!) and then falls (hiss!).
Did Pink Floyd forecast this? Chip packages support just so many pins and so much bandwidth. Transistors per chip double every couple of years - but the number of pins doesn't.
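The pin-limit argument can be made concrete with a toy model. This is a sketch, not Sandia's simulation: it assumes each core retires a fixed number of operations per second, each operation needs a fixed number of bytes from memory, and the package delivers a fixed total bandwidth. The function name and all three numbers are made up for illustration.

```python
def sustained_ops_per_sec(cores, ops_per_core, bytes_per_op, package_bandwidth):
    """Throughput of a toy chip: limited by compute or by memory bandwidth.

    All parameters are illustrative assumptions, not measured values.
    """
    compute_limit = cores * ops_per_core              # what the cores could do
    memory_limit = package_bandwidth / bytes_per_op   # what the pins can feed
    return min(compute_limit, memory_limit)


# Hypothetical chip: 1e9 ops/s per core, 8 bytes per op, 32e9 bytes/s of pins.
for cores in (2, 4, 8, 16):
    ops = sustained_ops_per_sec(cores, 1e9, 8, 32e9)
    print(f"{cores:2d} cores -> {ops / 1e9:.0f} Gops/s")
```

With these made-up numbers, throughput doubles from 2 cores to 4 and then stays pinned at the memory limit. A fixed cap only explains the plateau; the decline past 8 cores in Sandia's graph needs something more, presumably contention overhead, which this sketch does not model.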
Professors William Wulf and Sally McKee named it "the memory wall" in their 1994 paper Hitting the Memory Wall: Implications of the Obvious (pdf), saying:
We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed - each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.
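The "diverging exponentials" point is easy to check numerically. The growth rates below are assumptions picked for illustration and are not taken from the quoted passage: processor speed improving 60% per year and DRAM speed 7% per year, starting from parity.

```python
def speed_gap(years, cpu_growth=0.60, dram_growth=0.07):
    """Ratio of processor speed to DRAM speed after `years`, starting equal.

    The default growth rates are illustrative assumptions.
    """
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


for years in (0, 5, 10, 15):
    print(f"after {years:2d} years the gap is {speed_gap(years):7.1f}x")
```

The ratio of two exponentials is itself an exponential, so whatever rates you plug in, the gap compounds every year as long as the processor rate is the larger one.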
According to an article in IEEE Spectrum that time is almost upon us. With cores per processor doubling every 2-3 years - and graphics chips moving faster - we don't have long to wait.
The memory wall's impact is greatest on so-called informatics applications, where massive amounts of data must be processed. Like sifting through petabytes of remote sensing data to find bad guys with nukes.
Can this be fixed? Sandia is investigating stacked memory architectures, popular in cell phones for space reasons, to get more memory bandwidth. But as the simulation shows, that doesn't improve performance.
Rambus is working on a Terabyte Bandwidth Initiative that may help. Their goal: 64 DRAMs at 16 GB/sec each, with differential data channels, feeding a system-on-a-chip memory controller.
Intel needs to pick up the pace. Nehalem processors are Intel's first with an on-chip memory controller and the new QuickPath Interconnect (QPI). But server-class Nehalems are now limited to 2 QPI links for a total theoretical bandwidth of only 50 GB/sec. Faster, pussycat, faster!
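To put the two figures side by side, here is the arithmetic, taking the article's numbers at face value. The variable names are just for illustration, and the comparison is rough, since the two figures describe different kinds of links.

```python
# Figures as quoted in the article; units are GB/sec.
rambus_drams = 64
rambus_per_dram = 16
qpi_total = 50          # 2 QPI links, total theoretical bandwidth

rambus_total = rambus_drams * rambus_per_dram
print(f"Rambus target: {rambus_total} GB/sec")              # 1024, about 1 TB/sec
print(f"Ratio to 2-link QPI: {rambus_total / qpi_total:.1f}x")
```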
The Storage Bits take: Many-core is the future for computer performance. Memory bandwidth is one big problem. Software support for efficient many-core use is another. Either could bring the performance expected from Moore's Law to a dead stop.
The industry is making big investments in both problems. If it is a problem for Sandia today it will be a problem for consumers in 10 years.
What if one doesn't get solved? Then the Moore's Law rocket we've been riding will sputter and die. Life on the glidepath won't be nearly so much fun.
Comments welcome, of course. If anyone wants to make the case that von Neumann was wrong, I'm all ears.
Talkback
Ultimately, software will need to change
contention over shared resources.
For computational problems that can be solved using a
divide-and-conquer strategy, there is a solution: give
each processor core its own private bank of memory
(whether it is on the chip or off is not essential to
the solution). If you remember the Transputer in the
late 1980s, it used this architecture.
The problem of course is that you have to be able to
divide-and-conquer and not all problems can be solved
this way. Even worse, it requires a different way of
thinking and a different way of writing software.
Long live OCCAM. :-)
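The private-memory idea in this comment can be sketched in a few lines. This is only an illustration of the program structure: the helper names `split` and `parallel_sum` are hypothetical, and Python threads share one address space, so the "private banks" here are just separate list copies, not separate hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut `data` into `parts` contiguous chunks, one private chunk per worker."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])  # slicing copies: no sharing between workers
        start = end
    return chunks


def parallel_sum(data, workers=4):
    """Sum `data` by giving each worker its own chunk, then combining results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))  # each worker touches only its chunk
    return sum(partials)                        # the only communication step


print(parallel_sum(list(range(1, 101))))  # 5050
```

The workers never read each other's data, so the only traffic between them is the handful of partial results at the end. Problems that cannot be cut up this way are exactly the ones the comment warns about.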
Parallel computing
RE: The many-core performance wall
I have always been fascinated by the fact that companies like AMD and especially Intel were so focused on theoretical CPU performance (first more MHz, now more cores) at the expense of the overall practical performance that a significant improvement in memory architecture could bring.
The exponential increase in CPU power mainly benefits applications that require a lot of processing power and not so much efficient, fast memory.
The funny thing is that this kind of job could be done significantly better by specialized CPUs or processing units than by the generic CPU used in today's PCs, as proven by the Cell-based coprocessor used in the Toshiba G50, which is several times more powerful for video editing than the main CPU of this laptop despite a significantly lower frequency.
only innovation will save us
significantly better done by specific CPU or
processing units rather than the generic CPU used in
today PC, as proven by the cell based coprocessor used
in the Toshiba G50"
Interesting to note that this solution, the Cell and
its brethren, use the first generation of the Rambus
technology noted in the original article. According
to Sony/Toshiba/IBM, 96% of the Cell's interconnects use
Rambus technology. It's too bad so many folks are
focused on driving this company of clever engineers
(400 engineers, 10 lawyers, not the other way around)
out of business because they are so ill-informed as to
who is doing the innovating and who is just trying to
free-ride.
We've got a Solution. AMD had nailed it a long time ago 64bit
RE: The many-core performance wall
On the Connection Machine the memory 'is' the ALU. It had 65,536 bit-slice CPUs. On current CPUs the memory is idle 'most of the time', even though it still requires power to sit idle.
time to separate "performance" from "scalability"
"Scalability" is about being able to add more users and more tasks with minimal loss of performance as workload increases. Scalable systems always underperform against systems tuned for performance. But scalable systems will keep performing when high-performance systems hit the wall.
So, which are we talking about here?
And forget cores. Where does cloud computing start to figure in here, since clouds are all about scalability?
Re: Scalability
This issue is about removing a problem that has existed since the late 1940s and 50s: a bottleneck in memory throughput that stems from the strictly sequential original model of computation. It limits the speed of N-core chips because of the nature of the CPU-memory link.
Great point, but the results of this test are even less worrisome
less worrisome. That's because it's specifically
talking about certain types of HPC applications that
are memory bandwidth hungry, and not all HPC apps are
that memory hungry. HPC does not represent the
majority of server loads and it's the reason server
CPU pricing is related to SPECint performance and not
SPECfp performance. Furthermore, Intel has plenty of
time to solve the problem and current memory bandwidth
(especially Nehalem) is more than sufficient for the
current CPU engines.
This story has gotten way out of hand and it almost
seems like a way to bash x86 and Intel architecture.
The fact of the matter is, if x86/x64 can't keep up
with HPC requirements, the HPC community can always go
and use boutique processors if they think they can get
a better deal.
Still Waiting for the CPU & GPU to Merge
Hoping AMD/ATi can survive the downturn to keep the market competitive (i.e., affordable).
RE: The many-core performance wall
RE: The many-core performance wall
It's obvious that the CPU is not the problem anymore. It has not been for some time. Even graphics has hit a brick wall too. Unless we come up with new technology we are at the end for what we have now.
Annual - "The end is near"
Nearer than you think.
Well, your laws of physics...
The laws of conservation of mass come into play right up until you introduce fusion into the equation; then they go out the window. The Voyager spacecraft left the Oort cloud and slowed down for no apparent reason, as it's supposedly empty space, and its telemetry reported nothing at all.
We also have new materials we didn't have 5 years ago that allow us to construct at progressively smaller levels, granted at a rudimentary level, but it's not impossible for us to produce sophisticated micronized objects.
To put it simply, we're far closer to the limits of manufacturing than we are to the limits of physics.
Yup, someone makes that claim every year.
Maybe not so near
If you don't think so, then just recall these names: Einstein, Heisenberg, Hamilton, Feynman, Hawking, etc.
Ugh?
Assuming that electrical conduction remains the method of transmission.
Q1. What is the smallest 'discernable' voltage that can be reliably used 'above noise floor'?
Q2. How many atoms thick does a conductor need to be to handle the X Amps passing through it, without being vaporised?
AFAIK the laws of physics still apply no matter what exotic material appears in the near future, as every element in the periodic table has practical limits. Cooling fans and water-cooling keep the current silicon and aluminium/copper within their physical 'useful' limits.
IBM's 350 GHz speeds at 4.5 kelvin are extremely unlikely for the rest of us.
Quantum computing, photonic CPUs, etc...
Yeah, Yeah - Still Waiting for Serious Commitment
BTW, why are your posts so damned vague?