AMD: Is closing the quad-core deficit enough?

Written by George Ou, Contributor

AMD kicked off the "premier" of their latest microprocessor offering with a launch party at the Herbst International Exhibit Hall Monday night.  Partners like VMware, HP, Dell, Sun, Oracle, IBM, Microsoft and others were on hand or were there by video link to celebrate the launch of AMD's single-die quad-core milestone processor.  Barcelona is critical for AMD since the company was getting battered by Intel's ten-month head start in quad-core processors, which use a cheaper-to-manufacture dual-die process.

AMD argues that its single-die "native quad-core" process with its lower latency is architecturally superior to Intel's dual-die process.  But the challenges of manufacturing a massive 283mm squared die with high yields and high clock speeds are daunting, and the fact that Barcelona is 6 months late and 600 MHz short makes this painfully clear.  AMD’s executive VP Mario Rivas admitted back in March that he wished AMD had “immediately done a MCM - two dual cores and call it a quad-core” if he could do it all over again.  Intel takes the easy manufacturing route of combining two 143mm squared dies, which allows Intel to mix and match the best combinations.  Intel's soon-to-launch 45nm chip takes the die size down to an even more manageable 107mm squared.

Earlier this year, AMD told several news organizations such as ZDNet and TGDaily that Barcelona would outperform Intel's Clovertown 2.66 GHz quad-core processor by margins of 20 and 50 percent on integer and floating point respectively.  It was initially implied that AMD was comparing a 2.6 GHz Barcelona processor, but this wasn't confirmed until AMD distributed 2.6 GHz benchmarks in July with similar performance claims.  The actual launch speed for Barcelona was 2.0 GHz and it fell short of AMD's original claims by a significant margin.  The actual benchmarks, which were leaked to me last Friday and are now confirmed by published SPEC.org results, indicate a 24% deficit on SPECint_rate2006 and a 15.4% lead on SPECfp_rate2006 against Intel's best two-socket processors, which is a far cry from the 20 and 50 percent leads claimed by AMD earlier in the year.


SPEC CPU 2006: Intel Clovertown, Tigerton and AMD Barcelona

The charts below were compiled from official SPEC.org published results as of September 12, 2007, with the lone exception of the IBM results.  They are virtually identical to the Barcelona results I extrapolated from the leaked slides sent to me last Friday.  AMD has implied that their 2350 score may now be a few tenths of a point higher on SPECint_rate2006, but I don't know exactly how much and it's such a small delta that I'll wait until the official results get published.



Pros and cons of AMD and Intel architecture

While it's true that AMD's single-die process and superior memory subsystem allow AMD to scale at a near perfect trajectory as it increases clock speed and socket count, AMD sacrificed raw execution speed, which puts it at a lower starting point compared to Intel (see note below).  This is primarily due to the 4-issue execution engine in Intel’s Core Micro-architecture versus AMD’s 3-issue execution engine.  Intel, on the other hand, sacrificed the integrated memory controller but implemented a faster execution engine and a massive 8 MB level 2 cache that mitigates the effects of slower memory.

Note: From IBM’s latest data, Intel still holds more than a 31.2% advantage on SPECint_2006 performance and more than an 18.2% advantage on SPECfp_2006 at 1.9 GHz on single threaded performance when comparing Intel Clovertown to AMD Barcelona.  This is calculated by looking at the Barcelona 1.9 scores of 11.3 and 11.2 on SPECint_2006 and SPECfp_2006 versus an Intel E5335 2.0 at 15.6 and 14 adjusted down by a ratio of 1.9/2 which is conservative performance for Intel at a theoretical 1.9 GHz.
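The arithmetic in the note above can be reproduced with a short sketch.  It assumes, as the note does, that the Intel score scales linearly with clock speed when adjusted from 2.0 GHz down to 1.9 GHz; the function name is illustrative only.  With the scores as quoted, the integer lead works out to about 31.2% and the floating point lead to about 18.75%, which is consistent with the "more than 18.2%" wording.

```python
def clock_adjusted_lead_pct(intel_score, intel_ghz, amd_score, target_ghz):
    """Percent lead of Intel over AMD after scaling the Intel score
    linearly down to target_ghz (the article's conservative assumption)."""
    adjusted_intel = intel_score * (target_ghz / intel_ghz)
    return (adjusted_intel / amd_score - 1.0) * 100.0


# Scores quoted in the note: Intel E5335 at 2.0 GHz vs Barcelona at 1.9 GHz.
int_lead = clock_adjusted_lead_pct(15.6, 2.0, 11.3, 1.9)  # SPECint_2006
fp_lead = clock_adjusted_lead_pct(14.0, 2.0, 11.2, 1.9)   # SPECfp_2006

print(f"SPECint_2006 single threaded lead: {int_lead:.2f}%")
print(f"SPECfp_2006 single threaded lead: {fp_lead:.2f}%")
```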

Single threaded performance plays an essential role in current applications that aren't multithreaded well, and it will continue to play a role in certain tasks that fundamentally don't thread well.  On servers, single threaded performance allows a busy thread to borrow memory bandwidth from idle threads whenever the system isn't fully throttled.

Simply put, Intel starts fast but scales slower while AMD starts slow and scales faster as the core count and clock speed go up.  This is why a Barcelona 2.0 GHz processor loses to an Intel Clovertown 2.0 or Tigerton 1.86 GHz processor on SPECint_rate2006, but once you get to ~2.5 GHz, the clock-for-clock performance advantage on SPECint_rate2006 swings over to AMD.  So in order to beat Intel's 3 GHz Clovertown processor on general purpose benchmarks like SPECint_rate2006, AMD needs to get to the high 2s on GHz.  Of course Intel isn't going to sit idly by and watch their lead evaporate.

[Update 9/30/2007 - Fixed and clarified numbers using 9/30/2007 SPEC numbers - To quantify the scaling of AMD and Intel CPUs, Intel starts with a huge 31.2% SPEC CPU 2006 integer advantage at 1.9 GHz over Barcelona and even more against Opteron K8 when we look at single threaded performance.  However, Intel can only scale 1 to 8 cores at 64.7% efficiency at 2 GHz and 53.6% efficiency at 3 GHz.  AMD Opteron K8 and Barcelona scale 1 to 8 cores at 87% efficiency at 2 GHz and 87.4% efficiency at 3 GHz.  In a two-socket 8-core platform, Intel Clovertown scales the clock from 2 to 2.33 GHz at 58.7% efficiency and drops down to 52% efficiency by the time you scale 2 GHz to 3 GHz.  AMD Opteron K8 dual-core, on the other hand, scales the clock from 2 to 3 GHz at 77.3% efficiency in an 8-core configuration.

Barcelona, however, seems to be scaling poorly for SPEC CPU 2006 floating point from 2 GHz to 2.5 GHz, with an efficiency of 46.6%.  That seems to be due to the fact that AMD can't use a fractional multiplier for the memory clock like they can for the CPU, so they're forced to run the memory at 312.5 MHz instead of 333 MHz.  If AMD can switch to DDR2-800, then the scaling will probably be a lot better, but most of the Barcelona-class servers announced only go up to DDR2-667, except for Sun which supports DDR2-800.]

My definition of scaling efficiency: If a processor makes a 50% increase in clock speed but only realizes a 40% increase in performance, I call that 80% clock-scaling-efficiency.  If a processor or a computer goes from 1 socket to 4 sockets but only sees a 3 fold increase in speed, I call that 75% core-scaling-efficiency.

I really doubt I'm the first to think of this method of expressing scaling efficiency, but I haven't seen anyone explain it this way so I'm calling it my definition for now until I know otherwise.
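The two definitions above can be written down as a small sketch.  The function names are illustrative, and the formulas follow the two worked examples as stated: clock scaling compares the percentage gains, while core scaling compares the observed speedup to the increase in sockets or cores.

```python
def clock_scaling_efficiency(clock_gain_pct, perf_gain_pct):
    """Performance gain as a percentage of the clock speed gain."""
    return perf_gain_pct / clock_gain_pct * 100.0


def core_scaling_efficiency(unit_factor, speedup):
    """Observed speedup as a percentage of the increase in sockets or cores."""
    return speedup / unit_factor * 100.0


# 50% more clock yielding 40% more performance.
clock_eff = clock_scaling_efficiency(50.0, 40.0)

# 1 socket to 4 sockets yielding a 3 fold increase in speed.
core_eff = core_scaling_efficiency(4.0, 3.0)

print(f"clock-scaling-efficiency: {clock_eff:.0f}%")
print(f"core-scaling-efficiency: {core_eff:.0f}%")
```

Note that the two measures are not computed the same way: applying the gain-over-gain formula to the socket example (a 200% gain from a 300% increase) would give a lower figure than 75%.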

Scaling inefficiencies can be more brutal towards Intel on applications that require even more memory bandwidth, or they can be more forgiving on application benchmarks like SPECjbb2005, where Intel's lead is even larger than on SPECint_rate2006.  SPECfp_rate2006, for example, is a floating point benchmark that represents HPC (High Performance Computing) scientific and engineering workloads, and it requires massive amounts of memory bandwidth.  The memory requirements of HPC applications play so well to AMD's architecture that even a 2 GHz Barcelona can beat a 3 GHz Clovertown on SPECfp_rate2006 by a margin of 15.4%.

Unfortunately for AMD, benchmarks that are important to the IT world like SPECjbb2005, SPECweb2005, TPC-C, and SAP were conspicuously missing at the Barcelona launch.  Intel by contrast featured a plethora of record-breaking benchmarks at last week's Tigerton launch featuring all of the above metrics.

Both Intel and AMD are well aware of their own architectural shortcomings.  Neither camp is eager to advertise them from a marketing standpoint, but their roadmaps tell the story.  Intel will move to a memory architecture called CSI (Quick Path) in late 2008 that is similar to AMD's memory architecture, and AMD announced at their July analyst meeting that their next generation platform called "Bulldozer", coming in late 2009, will feature an improved execution engine that addresses single threaded performance.


AMD versus Intel on price, performance, and power efficiency

There was some good news for AMD at last night's Barcelona launch as they upgraded their projections for year-end Barcelona parts from 2.3 to 2.5 GHz.  The 2.5 GHz quad-core processor will make AMD a lot more competitive in the four-socket server segment and it will help in the mainstream two-socket segment.  While 2.5 GHz may not deliver the performance crown, it will be a huge improvement over AMD's current situation, and it indicates a faster ramp up than previously expected if AMD can deliver on 2.5 GHz this year.

AMD's 2 GHz Barcelona is already priced very competitively and there is little question it will sell well; the problem is that Intel will very likely batter AMD's average selling price in the 2 GHz value market segment.  For example, the 2 GHz Barcelona Opteron 8350 is priced below Intel's 2.13 GHz half-cache Tigerton CPU yet it offers better SPECint_rate2006 performance.  While that's a good deal for the customer, it probably isn't such a great deal for AMD's margins.  Once Barcelona gets to 2.5 GHz, AMD should be able to sell those chips for double the price yet remain price competitive against Intel, and improve its margins.

Intel's ten-month lead on price, performance, and performance/watt (when comparing Intel's quad-core to AMD's dual-core Opteron servers) battered AMD, but Barcelona closes the quad-core deficit.  While the current Barcelona 2.0 GHz launch part won't reclaim general purpose performance crowns like SPECint_rate2006, it will allow AMD to reclaim performance/watt leadership on many workloads, primarily due to Intel's FBDIMM (Fully Buffered memory) power-consumption liability.  Taking the performance/watt lead would not have been possible so long as Intel retained an exclusive quad-core advantage.

Scott Wasson of TechReport ran a series of detailed tests that showed a dual 95W TDP (Thermal Design Power) AMD Opteron 2350 8-core server beating a dual 50W TDP Intel L5335 server.  Based on the CPU differences, you would expect to see a ~80 watt advantage for Intel (processors don't actually hit TDP in real world applications, which is why you shouldn't expect a 90W delta).  However, the memory controller probably adds an extra 25W and the eight FBDIMMs probably cost the Intel server an extra 60W (AnandTech measured a 60W difference on 8 FBDIMMs).  The difference in the memory subsystem explains how an 80W advantage on the CPUs can turn into a 5W deficit for Intel.  If we use more reasonably priced E5335 80W parts, we're probably looking at a ~55W deficit for the Intel server.
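The power estimate above is simple bookkeeping, sketched below with the article's rough figures.  The function and variable names are illustrative, and the 30W CPU advantage for the E5335 case is not stated in the text; it is merely the value implied by the ~55W deficit.

```python
def net_intel_advantage_w(cpu_advantage_w, memory_controller_w, fbdimm_w):
    """Net power advantage for the Intel server in watts.
    A negative result means the Intel server draws more than the AMD server."""
    return cpu_advantage_w - (memory_controller_w + fbdimm_w)


MEMORY_CONTROLLER_W = 25  # estimated extra draw of Intel's external memory controller
FBDIMM_W = 60             # estimated extra draw of eight FBDIMMs

# Low-voltage L5335 parts: ~80W expected CPU advantage for Intel.
l5335_net = net_intel_advantage_w(80, MEMORY_CONTROLLER_W, FBDIMM_W)

# E5335 parts: back out the CPU advantage implied by a ~55W deficit.
implied_e5335_cpu_advantage = -55 + (MEMORY_CONTROLLER_W + FBDIMM_W)

print(f"L5335 server net advantage: {l5335_net} W")
print(f"Implied E5335 CPU advantage: {implied_e5335_cpu_advantage} W")
```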

Note on SPECjbb2005: While Scott Wasson does a lot of good work, his unofficial SPECjbb2005 results in the same article should be disregarded.  Wasson's results for the Intel X5365 are off by 61% from the record breaking published SPECjbb2005 results.  SPECjbb2005 isn't memory bandwidth heavy so it favors Intel's architecture.  In Wasson's defense, AMD sent him and other reviewers these parts on the Friday before launch and he probably didn't sleep much over the weekend getting that massive review ready for Monday morning, so I don't want to be too hard on him for this.  So far, there are no published SPECjbb2005 scores for any AMD Barcelona processors, and Barcelona wouldn't take the crown even if it doubled the score of an Opteron 2-socket 2222SE server.

AMD's new-found edge on performance/watt may be short-lived in the two-socket space because of Intel's jump to 45nm Penryn this November.  According to leaked OEM roadmaps, Intel will be able to launch a mainstream part at 3 GHz within the 80 watt TDP envelope.  Penryn coupled with FSB1600 and 50% more 24-way associative cache along with other improvements will probably take Intel into uncharted territory on performance and performance/watt even with the FBDIMM power liability.  In the low-power segment, Intel's new "San Clemente" DDR2 chipset (which Intel will not comment on) coupled with 2.66 GHz 50 watt low-voltage Penryns will undoubtedly be interesting given DDR2's known power advantages over the FBDIMM architecture that Intel currently relies on.  AMD will have a little less pressure in the four-socket market if they can quickly ramp up to 2.5 GHz, because Intel's jump to 45nm "Dunnington", the successor to Tigerton, probably won't arrive until the second half of 2008.

