AMD's 'Shanghai' processors are the company's first chips to exploit the improved performance and efficiency of 45nm technology. ZDNet's tests show that they have made up important ground on Intel's Xeons.
AMD's 45nm chips have arrived almost exactly one year after the first Intel processors to use the same feature size — currently the most advanced process used in mainstream processor production. Codenamed 'Shanghai', the new AMD processors are arriving first in quad-core Opterons for two-, four- and eight-processor server platforms, enabling up to 32 cores per server. Phenom variants for desktops, codenamed Deneb, are due in the first quarter of 2009.
For the most part, AMD is sticking with its previous 65nm processor design, Barcelona. But by investing in miniaturisation, AMD has created the space on the chip to increase the L3 cache, which was Barcelona's weak spot and which promises a good performance return on the design investment.
All Shanghai models offer up to 6MB of Level 3 (L3) cache, a three-fold increase on Barcelona's 2MB. The per-core caches — 512KB of L2 cache and 64KB of L1 data and instruction cache — remain unchanged.
On top of that, AMD now supports DDR2 RAM at up to 800MHz, an improvement on Barcelona CPUs' maximum of 667MHz. DDR3 RAM is not supported by the first Shanghai models. Nor are they yet equipped with the HyperTransport 3.0 communications bus, which can run at up to 17GB/s, so early adopters will have to be content with 8GB/s. HyperTransport 3.0 and DDR3 support will be available in the second quarter of 2009.
The instruction set inherited from Barcelona remains unchanged, and Intel's SSE2 and SSE3 instructions are supported, as well as the standard x86 instructions. But although AMD's SSE4a corresponds to the functionality of the Insert and Extract instructions in Intel's SSE4.1, it is incompatible with them.
Although there are big price differences between Shanghai CPUs for 2-processor and 4- or 8-processor servers, technically there are almost none.
Locking down the instruction set for virtualisation compatibility
With Shanghai, it's now possible to lock down the instruction set to a subset of the full complement. This feature is important in the live migration of virtual machines (VMs), using, for example, VMware's VMotion. If a VM is moved from one piece of hardware to another in a live environment, it could crash if certain instructions are suddenly no longer available.
If the instruction set is limited to SSE2 when starting a VM, then the VM can be moved to any server that uses at least a Pentium 4. Since Intel's newest processors, such as the 6-core 'Dunnington' Xeon, also offer instruction set lock-down, it should now be possible to move VMs between live AMD and Intel systems, something that would have been unthinkable until recently.
Of course, that instruction set lock-down means accepting the lowest common denominator. But in practice less than five percent of all standard software uses instructions outside SSE2, so a performance hit is unlikely.
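The lowest-common-denominator idea can be sketched as a set intersection. The Python snippet below is purely illustrative: the host names and feature lists are hypothetical simplifications, and it does not represent how VMware or any other hypervisor actually implements instruction set masking.

```python
# Hypothetical, simplified feature sets per host type (illustrative assumption,
# not an authoritative list of what each CPU supports).
HOSTS = {
    "amd_shanghai": {"sse", "sse2", "sse3", "sse4a"},
    "intel_dunnington": {"sse", "sse2", "sse3", "ssse3", "sse4.1"},
    "intel_pentium4": {"sse", "sse2"},
}


def common_baseline(hosts):
    """Return the instruction-set extensions available on every host."""
    feature_sets = list(hosts.values())
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def can_migrate(vm_features, target_features):
    """A VM can move to a target only if the target offers every feature the VM saw at start."""
    return vm_features <= target_features


baseline = common_baseline(HOSTS)
print(sorted(baseline))
```

A VM started with only the baseline exposed can move to any host in the pool, whereas a VM that was allowed to see SSE4a could not safely land on an Intel host in this model.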
The new Shanghai Opterons all consume 75W of Average CPU Power (ACP) — which corresponds to about 95W of Thermal Design Power (TDP) — and are available with clock speeds ranging from 2.3GHz to 2.7GHz. AMD will be releasing 55W models and a 105W chip running at 2.8GHz in the first quarter of 2009.
Prices for the CPUs for two-socket motherboards start at US$377 for 2.3GHz, rising to US$989 for 2.7GHz. The almost identical CPUs for four- and eight-socket boards are significantly more expensive: US$1,165 for the 2.4GHz model and US$2,149 for 2.7GHz.
Expanded L3 cache
In Shanghai, AMD has stuck with the three-level cache architecture established with Barcelona. But while Barcelona's small L3 caches were unimpressive, Shanghai's 6MB shared L3 cache is a big step in the right direction: a cache miss in L2 cache is now much more likely to be compensated for by a hit in L3, avoiding a slow and expensive trip to main memory.
Shanghai CPUs have about the same amount of cache as Intel's first Nehalem (Core i7) CPUs. But while AMD allocates 512KB of exclusive L2 cache per core, Intel's Nehalem chips have only 256KB per core. On the other hand, Nehalem features an 8MB L3 cache compared to Shanghai's 6MB.
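As a rough check on the "about the same amount of cache" comparison, the figures above can be added up. This is a minimal sketch that assumes four cores per chip for both designs, ignores the L1 caches, and simply sums capacities without accounting for any data duplicated between cache levels.

```python
KB_PER_MB = 1024


def total_cache_mb(cores, l2_per_core_kb, l3_shared_mb):
    """Sum of per-core L2 and shared L3 capacity in MB (L1 ignored)."""
    return cores * l2_per_core_kb / KB_PER_MB + l3_shared_mb


# Assumption: both chips are quad-core parts.
shanghai_mb = total_cache_mb(cores=4, l2_per_core_kb=512, l3_shared_mb=6)
nehalem_mb = total_cache_mb(cores=4, l2_per_core_kb=256, l3_shared_mb=8)

print(f"Shanghai: {shanghai_mb} MB, Nehalem: {nehalem_mb} MB")
```

Under these assumptions the totals come to 8MB for Shanghai and 9MB for Nehalem.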
Intel's Xeon 5400-series server CPUs have 12MB of cache, and these processors can be viewed as direct competitors to the Shanghai two-processor models. Intel's four-processor models in the Xeon 7400 series have caches of between 14MB and 25MB, but they must access main system memory through a single external quad-channel DDR2 controller. The Nehalem architecture, which like Shanghai has an integrated memory controller, is not yet available for servers.
The lack of support for DDR3 RAM is particularly significant for main memory. If one sets aside marketing statements and examines throughput as expressed by bits per cycle x clock frequency x the number of memory channels, it's clear the crucial factor is the number of memory channels. With AMD's processors, unlike Intel's, the number of channels grows with the number of CPUs, because each processor brings its own memory controller.
A two-processor system with eight cores can use DDR2-800 modules at an effective memory speed of 3.2GHz. That's identical to the performance offered by the quad-channel FB-DIMM controller that Intel uses with its Xeon 5000 processors. But with current Intel server platforms each bit must run through the frontside bus to the northbridge, which also handles the PCI-Express bus.
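The throughput formula (bits per cycle x clock frequency x number of memory channels) can be worked through for this two-processor case. The sketch below assumes a 64-bit data path per DDR2 channel and two memory channels per Opteron socket. Neither figure is stated in the text above, although the 3.2GHz figure is consistent with four channels of DDR2-800.

```python
def effective_rate_mhz(transfer_rate_mhz, channels):
    """Aggregate 'effective' memory speed across all channels, in MHz."""
    return transfer_rate_mhz * channels


def peak_bandwidth_mb_s(bus_width_bits, transfer_rate_mhz, channels):
    """Theoretical peak: bits per transfer x transfer rate x channels, in MB/s."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * transfer_rate_mhz * channels


# Assumptions: 2 sockets, 2 channels per socket, 64-bit DDR2-800 channels.
SOCKETS = 2
CHANNELS_PER_SOCKET = 2
channels = SOCKETS * CHANNELS_PER_SOCKET

print(effective_rate_mhz(800, channels))       # 3200 MHz, i.e. 3.2GHz
print(peak_bandwidth_mb_s(64, 800, channels))  # 25600 MB/s theoretical peak
```

Doubling the socket count in this model doubles the channel count and therefore the theoretical peak, which is the scaling argument made for AMD's design. Real-world throughput will be lower than the theoretical figure.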
There's no question that AMD delivers more throughput, especially if the hypervisor, operating system and applications all support a genuine NUMA (Non-Uniform Memory Access) architecture. Intel does not yet offer a server platform with an integrated memory controller, and buyers will have to wait until next year for a two-processor system. Four-processor systems will only become available towards the end of 2009.