Hot Chips embraces cool ARM

Now in its 26th year, Hot Chips has always been about the big, power-hungry chips that power the world’s fastest servers. But this year ARM crashed the party.
Written by John Morris, Contributor

Hot Chips is not the first place you would expect to hear about ARM. The annual conference, now in its 26th year, is, as the name implies, known for coverage of the power-hungry “big iron” behind the world’s fastest servers. ARM, by contrast, provides the technology for the tiny, low-power chips in nearly all phones and tablets.

So it was interesting to see the degree to which ARM infiltrated this year’s show, which took place earlier this week. Both keynotes were related to ARM — one from ARM CTO Mike Muller on the power constraints in everything from sensors to servers and the other from Qualcomm’s Rob Chandhok on the Internet of Things — and the two-day agenda included another five talks on ARM-based chips.

In part, this is simply a reflection of the growing importance of mobile computing. The rapid growth of smartphones and tablets has had a profound impact on all consumer and business technology. But it's also a sign that ARM technology is “moving up the stack” and beginning to challenge x86 and other platforms in new areas such as servers, and networking and communications gear.

Perhaps the biggest ARM news at Hot Chips was Nvidia’s presentation on an upcoming version of its Tegra K1 mobile processor with a custom core known as Denver. The current 32-bit Tegra K1, which is used in the Shield tablet, Xiaomi MiPad and Acer Chromebook 13, has four Cortex A15 CPU cores. The second version will have two Denver CPU cores based on the ARMv8 64-bit instruction set. Both use the same Kepler GPU with 192 CUDA cores.

Darrell Boggs, Nvidia’s Director of CPU Architecture and Principal Architect, said the Denver-based Tegra K1 will rival the performance of an entry-level Haswell Core processor (Celeron 2955U) and easily outperform the fastest mobile chips such as Qualcomm’s Snapdragon 801, the Apple A7 and Intel’s Bay Trail Atoms. (I wrote a separate post on Denver with more details.)

In a related talk, Nvidia discussed how Tegra K1’s “desktop-class” graphics and dual image processors bring new capabilities to mobile devices in areas such as gaming and video processing, as well as computer vision for advanced driver assistance systems in cars. Michael Ditty, Nvidia’s Tegra Architecture Manager, showed test results suggesting the Xiaomi MiPad with the 32-bit Tegra K1 was 1.2 to 1.6 times better than the “fastest competitor” on mobile benchmarks.

Competitor AMD has also embraced ARM, but not for mobile. At Hot Chips, AMD provided new details on the Opteron A1100, code-named Seattle, which is currently sampling and should be available in servers, including AMD’s SeaMicro line, around the end of this year. Manufactured on GlobalFoundries’ 28nm process, Seattle has eight 64-bit Cortex A57 CPU cores; 4MB of L2 cache and 8MB of L3 cache; two memory channels for up to 128GB of DDR3 or DDR4 memory with error correction; lots of integrated I/O (8 lanes each of PCIe Gen3 and 6Gbps SATA, and two 10Gbps Ethernet ports); a Cortex A5 “system control processor” for secure boot; and an accelerator for speeding up encryption and decryption.

AMD engineer Sean White said the company would offer semi-custom versions of Seattle optimized for specific customer applications — something others like Intel and IBM have been talking about as well. Unfortunately AMD did not provide any details on the Opteron A1100’s frequency, power or performance because, White said, “we’re still a short while away from the actual product launch.”

AppliedMicro claims to have beaten AMD and pretty much everyone else to the punch with the first 64-bit ARM server chip, but for now it’s somewhat academic since X-Gene has yet to notch any major, public design wins (HP kicked the tires but its ProLiant Moonshot servers still rely on AMD Opterons or Intel Atoms).

At Hot Chips, AppliedMicro for the first time provided a detailed roadmap for how it plans to boost the performance and capabilities of X-Gene. The short answer: lots more cores and a faster interconnect. Gaurav Singh, AppliedMicro’s Vice President of Technical Strategy, said there are many scale-out datacenter applications better suited to systems with thousands of efficient cores connected over a high-bandwidth, low-latency network, including web services hosting, web searches, NoSQL databases and analytics, and media serving.

The X-Gene 1 (Storm), which is currently in production, is manufactured by TSMC on a 40nm process and has eight 2.4GHz ARMv8 cores, four DDR3 memory controllers, PCIe Gen3 and 6Gbps SATA, and 10Gbps Ethernet. The X-Gene 2 (Shadowcat), which is manufactured on a more advanced 28nm process and has an enhanced core design, is currently sampling. It will be available with 8 or 16 cores, running at speeds of 2.4 to 2.8GHz. But the big change here is the addition of a RoCE Host Channel Adapter. RoCE, or RDMA over Converged Ethernet, delivers the lower latency of InfiniBand’s RDMA protocol over standard Ethernet hardware and software. This is a key building block for clusters of hundreds or thousands of microservers.

AppliedMicro is targeting X-Gene 2 to fill a gap between competing microservers (Cavium ThunderX, Intel Atom C2000 “Avoton” and AMD Opteron A1100 “Seattle”), which it believes will deliver lower performance but cost less, and the Xeon E5-2600 v2, which delivers higher performance but at a higher price. (Of course, Intel also has its own Xeon E3 family to fill this gap and AMD has an ambitious roadmap for its ARM servers.) In comparison to X-Gene 1, an 8-core X-Gene 2 will deliver about 60 percent better integer performance (SPECint 2006), twice the performance on Memcached, and about 25 percent better performance on Apache Web serving, according to the company.

A single server rack (42U) of X-Gene 2 servers will have up to 6,480 threads and 50TB of memory (with a peak bandwidth of 48TBps). Because of the RoCE interconnect, Singh said, the servers will be able to share a single pool of storage that appears like local storage.

The third-generation X-Gene will be the first to use 16nm 3D transistors, known as FinFETs (currently only Intel is using this type of transistor, but competing foundries plan to shift to FinFETs starting sometime next year). The X-Gene 3 (Skylark) will have 16-64 third-generation ARMv8 CPU cores running at up to 3GHz, a second-generation RoCE interconnect and a new rack interconnect. It will be sampling in 2015.

Finally, ARM and Avago gave a joint presentation on the advantages of ARM’s Cortex CPU cores and CoreLink interconnects for networking. Avago’s Axxia 5500, a line of 28nm communications processors with 16 Cortex A15 CPU cores connected with the CoreLink CCN-504, is designed for cellular base stations and other networking and communications gear. The ARM-based Axxia line was developed by LSI, which Avago acquired in May for $6.6 billion. While it’s interesting that ARM is moving into new areas, the story became even more interesting later in the week when Intel announced it plans to buy the Axxia business for $650 million.
