One of the highlights of the first day at this year’s Hot Chips, an annual conference on chip technology, was Nvidia’s unveiling of Denver, a custom CPU used in an upcoming version of the Tegra K1 processor.
There are two versions of the Tegra K1. The 32-bit version has four Cortex-A15 CPU cores running at up to 2.3GHz, 32KB L1 instruction and data caches, and 2MB of L2 cache. This chip is available now and is used in a handful of devices including the Shield tablet, the Xiaomi MiPad sold in China and the . The second version has two Denver custom CPU cores based on the ARMv8 64-bit instruction set running at up to 2.5GHz, 128KB of L1 instruction cache, 64KB of L1 data cache, and 2MB of L2 cache. Both use the same Kepler GPU with 192 CUDA cores. The two chips are pin-compatible, which should make it easy to design devices that work with either one.
The idea behind Denver, according to Darrell Boggs, Nvidia’s Director of CPU Architecture and Principal Architect, is to deliver PC-class performance in mobile devices while remaining compatible with the massive ARM hardware and software ecosystem. To deliver this level of performance, Denver has a 7-wide superscalar architecture, meaning it can execute up to seven instructions per clock cycle compared with three instructions per clock for the A15, and it uses “aggressive” hardware prefetching, a commonly used technique that moves data closer to the CPU before it is needed to speed things up.
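As a rough back-of-the-envelope comparison, the issue widths and peak clock speeds quoted above set a theoretical per-core ceiling on instruction throughput. The Python sketch below simply multiplies those figures; the function name is illustrative, and real workloads rarely come close to these peaks, so this should not be read as a performance claim.

```python
def peak_instructions_per_second(issue_width, clock_hz):
    """Theoretical per-core ceiling: instructions per clock times clocks per second."""
    return issue_width * clock_hz


# Figures taken from the article: 7-wide at up to 2.5GHz versus 3-wide at up to 2.3GHz.
denver_peak = peak_instructions_per_second(7, 2.5e9)
a15_peak = peak_instructions_per_second(3, 2.3e9)
ratio = denver_peak / a15_peak

print(f"Denver core peak: {denver_peak / 1e9:.1f} billion instructions/s")
print(f"A15 core peak:    {a15_peak / 1e9:.1f} billion instructions/s")
print(f"Per-core ratio:   {ratio:.2f}x")
```

Note that this is a per-core figure: the A15-based chip has four cores and the Denver-based chip has two, so the comparison says nothing about total chip throughput.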
The more surprising wrinkle is that Nvidia is using what it calls dynamic code optimization to further boost performance. The processor takes frequently reused ARM code, converts it into optimized microcode and holds it in a dedicated 128MB cache carved out of main system memory, with the aim of delivering the performance of out-of-order execution without the power penalty associated with it. Nvidia claims that converting ARM code to optimized code can double the performance of the base-level hardware. Others have tried binary translation before and failed, most notably Transmeta, which attempted to challenge Intel with its Crusoe mobile chip, but Boggs said that “many of these things we’ve fixed in our implementation.”
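The general idea of caching optimized versions of frequently run code can be illustrated with a toy model. The Python sketch below is not Nvidia’s implementation: the class name, the `hot_threshold` parameter and the block address are all hypothetical, and it only tracks which path a block of code would take, without translating anything.

```python
from collections import Counter


class HotCodeCache:
    """Toy model: blocks that run often enough get an 'optimized' cached version."""

    def __init__(self, hot_threshold=3):
        # Hypothetical tuning knob: runs needed before a block counts as hot.
        self.hot_threshold = hot_threshold
        self.run_counts = Counter()
        self.optimized = set()  # addresses that have an optimized version cached

    def run(self, block_address):
        """Return which path this execution of the block takes."""
        if block_address in self.optimized:
            return "optimized"
        # Slow path: decode as usual and count how often the block runs.
        self.run_counts[block_address] += 1
        if self.run_counts[block_address] >= self.hot_threshold:
            # The block is now hot, so later runs use the cached optimized version.
            self.optimized.add(block_address)
        return "decoded"


cache = HotCodeCache(hot_threshold=3)
paths = [cache.run(0x1000) for _ in range(5)]
print(paths)
```

In this model the first three runs of a block take the normal decode path and only later runs hit the optimized copy, which is why the approach pays off mainly for code that is reused many times.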
One of the other surprises is that this version of the K1, unlike other Tegra processors, does not have a low-power “companion core” to save power on light or background workloads. Instead Nvidia has added a new power state, called CC4, that retains the core’s state while dropping the voltage below the minimum operating voltage. The system can enter and exit the CC4 state very quickly, as with clock gating, but it approximates the energy savings of power gating, which is slower.
Nvidia showed the first public benchmark results for the Denver-based Tegra K1 versus current high-end mobile processors including Qualcomm’s Snapdragon (MSM8974), Apple’s A7 Cyclone in the iPhone 5s, and an Atom Bay Trail SoC (the Celeron N2910), as well as the Haswell Core-based Celeron 2955U used in many Chromebooks. The results were normalized to the current A15-based Tegra K1. The Denver processor “significantly outperformed” the ARM-based and Bay Trail processors on all of the benchmarks, and it delivered a similar level of performance to the 1.4GHz dual-core Haswell processor: Denver is faster on some benchmarks and the Core processor has the edge on others.
The Denver version of Tegra K1 is scheduled to ship later this year. At one time Nvidia was hoping to extend Denver all the way up into servers and supercomputers, but in June the company said it was sticking with mobile for now.