Intel's new Nehalem architecture features an integrated memory controller and runs two threads per CPU core. Our extensive benchmark tests reveal how well the new quad-core processors perform in practice.
Five years after AMD, Intel has produced its first CPU with an integrated memory controller. The AMD design was ahead of the game in a number of areas, and market leader Intel has integrated ideas from its competitor into the new Nehalem architecture. Until now, Intel has manufactured its quad-core processors from two dual-core dies. AMD always maintained that there was only one company that could build real quad cores — a distinction that Intel pooh-poohed. Now even that distinction has been lost: Nehalem (Core i7) CPUs consist of a single chip.
But that's not the end of the story. AMD processors communicate between themselves and with peripherals using AMD's Hypertransport, a point-to-point switched interconnect that maintains high bandwidth through ad-hoc independent channels. That technology contrasts with Intel's approach, in which chips use the frontside bus not only to address memory but also to connect to other system components, sharing that one channel between devices. That's no real disadvantage with single-core systems, and Intel has maintained performance in dual-core and quad-core systems by using large amounts of cache.
However, this old-fashioned way of communicating is a bottleneck for servers with multiple sockets. In the long term, even the 64MB on-chip cache with snoop filtering that Intel offers in its Xeon 7300 chipset or the 16MB Level 3 cache recently introduced into the six-core Dunnington could not help the chip giant remain competitive with AMD in the server field.
Intel's answer is to provide the Nehalem architecture with a technology called Quick Path Interconnect (QPI) that is comparable with Hypertransport. QPI is in the Nehalem desktop variants, codenamed Bloomfield, that are available later this month. The server variant, Gainestown, for two-socket systems is to follow in the first quarter of 2009, according to Intel boss Paul Otellini. Intel plans on introducing Nehalem chips for multi-processor systems in the second half of 2009, and QPI will also be part of Tukwila, the next generation Itanium processor, due at the end of this year.
Intel has also cribbed a few virtualisation ideas from AMD for the Nehalem architecture. With the introduction of the Barcelona processor, AMD offered Rapid Virtualisation Indexing (RVI) to allow virtual machines direct memory access. Virtualisation specialist VMware enthusiastically backed the AMD technology. The equivalent technology in Intel's Nehalem is called Extended Page Table (EPT).
On top of the ideas borrowed from AMD, Nehalem chips offer a number of additional features. For example, each of the four processor cores can work on two threads at the same time, a refinement of the Pentium 4's well-known Hyperthreading technology. As well as the four physical cores, a further four logical cores are therefore available to the operating system.
Unlike the AMD equivalent chips, which only support dual-channel DDR2/1066 memory, the Core i7 processors, officially available from 17 November, offer three DDR3/1066 channels. Thus the chips have a theoretical memory bandwidth of 25.5GB/s, compared with the AMD chips' maximum of 16GB/s. Individual Nehalem processors are differentiated by the speed of the QPI interface. On the top model — the Core i7 Extreme 965 — QPI runs at 3.2GHz, but only reaches 2.4GHz on the smaller models.
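The theoretical bandwidth figures can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is only illustrative: it assumes each DDR2 or DDR3 channel is 64 bits (8 bytes) wide and that "1066" means about 1066 million transfers per second. The function name is made up for this example. Under those assumptions the triple-channel result comes out at about 25.6GB/s, close to the quoted 25.5GB/s, while the dual-channel result is about 17.1GB/s, somewhat above the 16GB/s quoted for the AMD chips, which may rest on different rounding or assumptions.

```python
# Theoretical peak bandwidth = channels x bytes per transfer x transfers per second.
# Assumptions (not from the article): 64-bit (8-byte) channels, and "1066"
# taken as roughly 1066 million transfers per second.

BYTES_PER_TRANSFER = 8  # one 64-bit memory channel


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float) -> float:
    """Return the theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_second = channels * BYTES_PER_TRANSFER * mega_transfers_per_s * 1e6
    return bytes_per_second / 1e9


core_i7 = peak_bandwidth_gb_s(channels=3, mega_transfers_per_s=1066)
amd_ddr2 = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=1066)

print(f"Core i7, three DDR3/1066 channels: {core_i7:.1f} GB/s")
print(f"AMD, two DDR2/1066 channels:       {amd_ddr2:.1f} GB/s")
```

These are upper bounds only; the measured Everest results later in the article show what the systems actually achieve.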
According to Intel, the new Nehalem processors are specified up to a memory speed of DDR3/1066, while the current Core 2 architecture can be operated with DDR3/1600 memory. But according to the benchmark tool Everest 4.60, the internal memory controller supports up to 1333MHz. It could be that the system would not work stably in all situations at that frequency, so Intel opted for the more conservative specification. For optimal performance no more than three memory modules should be used. If four DIMMs are used, memory performance falls because the important memory timing parameter Command Rate then has to be set to two wait states.
Nehalem processors offer a built-in overclocking feature called Turbo Mode. If a piece of software does not fully load all the cores, the chip's internal logic runs the cores that are in use at a higher clock speed. Last but not least, the Nehalem processors come equipped with SSE4.2, an instruction set extension that might be particularly useful for accelerating string processing in search engines. Programs such as browsers, email clients and text processing programs could also benefit from the faster processing offered by SSE4.2.
In terms of power consumption, the system with the Nehalem Core i7 Extreme 965 processor ranks about the same as the one with Intel's previous best-performing chip, the Core 2 Extreme QX9775, although the Nehalem processor, with 731 million transistors, clearly has fewer electronic circuits than the QX9775 with 820 million. Because hyperthreading keeps the execution units busier than single-threaded cores do, the new chips draw about the same power overall as the more complex earlier designs despite having fewer transistors.
Power consumption (Watts): shorter bars are better.
Everest 4.60: memory performance
The memory tests show how quickly the processors communicate with their environment. Besides the pure bandwidth, what's interesting here are the access times. The fewer clock cycles it takes to access a memory cell (a measure known as latency), the faster the cell can be read. With large database applications a low latency can have a positive impact on overall performance.
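Latency is described here in clock cycles, while the latency chart reports nanoseconds. The two are related through the clock frequency, as the minimal sketch below shows. The cycle count used is a hypothetical value chosen for illustration, not a measurement from these tests, and the function name is an assumption of this example.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    At a clock of f GHz, one cycle lasts 1/f nanoseconds.
    """
    return cycles / clock_ghz


# Hypothetical example: the same cycle count on two differently clocked CPUs.
HYPOTHETICAL_CYCLES = 160  # illustrative only, not a measured value

for clock_ghz in (3.2, 2.66):
    latency_ns = cycles_to_ns(HYPOTHETICAL_CYCLES, clock_ghz)
    print(f"{HYPOTHETICAL_CYCLES} cycles at {clock_ghz}GHz = {latency_ns:.1f} ns")
```

The same number of cycles therefore costs less wall-clock time on a faster-clocked chip, which is why comparisons in nanoseconds are the more direct measure across processors.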
Whereas AMD processors, with their integrated memory controller, could match Intel chips of the Core-2 era for memory access, and even offer advantages, things have changed with the arrival of the Nehalem architecture. These new chips, with their outstanding memory transfer performance and memory access, are clearly the top performers.
Memory performance (GB/s): longer bars are better.
Memory latency (nanoseconds): shorter bars are better.
Everest 4.60: CPU & FPU performance
In the synthetic Everest benchmark tests, Intel's new Nehalem architecture emerges impressively as the top performer. In some tests the 2.66GHz Core i7 920, thanks to its hyperthreading technology, even beats the 3.2GHz Core 2 Extreme QX9775. Nehalem's lead is particularly apparent in the floating-point benchmark SinJulia, which makes full use of hyperthreading.
CPU performance: longer bars are better.
Floating-point performance: longer bars are better.
VMware Workstation 6.5: performance in virtualised environments
Virtual desktops are becoming increasingly common in enterprises. Consequently, tests with VMware Workstation 6.5 and the application-based Winstone benchmarks are useful in providing an insight into the efficiency of virtualised IT environments. Even though the Winstone test is somewhat long in the tooth, it's still relevant because what's being tested here is the efficiency of the processors involved in VMware virtualisation rather than application performance.
In the test, two virtual machines (VMs) running Windows XP were tested using Content Creation Winstone (CCWS). In each case, the VMs have two CPU cores at their disposal. A test using Cinebench R10 was also conducted in the virtualised environment. Both Intel's EPT and AMD's RVI direct memory access technologies are supported. However, neither the new Nehalem processors nor the AMD Phenom work faster in this mode of operation. According to these tests, the fastest chip for virtualisation is the Core 2 Extreme QX9775, which only supports Intel VT.
It's possible that VMware Workstation is not optimised for processors that offer direct memory access for VMs. On the other hand, it's also possible that the tests we conducted do not make the most of this technology. Further testing will be required to clarify the use of direct memory access.
VMware/Cinebench tests: longer bars are better.
VMware/Content Creation Winstone tests: longer bars are better.
Image editing: Paint .NET, Autopano Pro, Jalbum
Image editing programs use advanced parallelism to capitalise on the power of multi-core processors. We used three programs to test Core i7 performance in this area. The freeware tool Paint .NET is an efficient image editor based on Microsoft's .NET framework, and its accompanying benchmark, pdnbench, puts a full workload on the processors during typical image operations. Additional tests are provided by Autopano Pro, which produces panoramic images, and Jalbum, which generates HTML photo galleries.
Jalbum and Paint .NET make the most of the new Nehalem quad cores' hyperthreading features. In both tests, the 2.66GHz Core i7 920 delivers better results than the 3.2GHz Core 2 Extreme QX9775. Autopano Pro's ability to make use of eight processors seems to produce no advantage, while the 64-bit versions of Paint .NET and Autopano Pro are clearly faster than their 32-bit equivalents.
Image editing tests (seconds): shorter bars are better.
Video and sound encoding
The video and sound encoding tests show that applications in this area are far from optimised for multi-core processors. For example, when turning raw audio data into MP3 files, the Windows version of iTunes uses only two threads, so quad-core CPUs offer no speed advantage over dual-core alternatives. The Mac version, by contrast, uses four arithmetic and logic units.
The story is very different when it comes to the video encoding tool Cyberlink PowerProducer. Because this software supports the Nehalem architecture's two threads per core, the 2.66GHz Core i7 920 delivers better performance than the faster-clocked 3.2GHz Core 2 Extreme QX9775, which also has four cores but only runs one thread per core.
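Why thread count matters in these two cases can be shown with a deliberately simple model: an application can keep busy at most as many hardware threads as it has software threads. The sketch below is an illustration only. The function name is made up, and the figure of eight encoder threads is an assumption, since the article does not say how many threads PowerProducer starts. The model counts occupied hardware threads; it does not predict speed, because a second thread on the same core adds much less than a full extra core.

```python
def busy_hardware_threads(software_threads: int, cores: int,
                          threads_per_core: int = 1) -> int:
    """Return how many hardware threads an application can keep occupied."""
    hardware_threads = cores * threads_per_core
    return min(software_threads, hardware_threads)


# A two-thread MP3 encoder gains nothing from two extra cores.
print(busy_hardware_threads(2, cores=2))  # dual core
print(busy_hardware_threads(2, cores=4))  # quad core

# An encoder assumed to start eight threads (hypothetical figure):
print(busy_hardware_threads(8, cores=4))                      # four cores, one thread each
print(busy_hardware_threads(8, cores=4, threads_per_core=2))  # four cores, two threads each
```

In this model the two-thread encoder occupies two hardware threads on either chip, while the eight-thread encoder can occupy twice as many on the hyperthreaded quad core.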
Video/sound encoding tests (seconds): shorter bars are better.
Rendering performance
In the rendering tests, the Core i7 processors deliver particularly impressive results with Povray. Here, even the Core i7 920 clocked at 2.66GHz performs better than the 3.2GHz quad-core QX9775 without hyperthreading. With the 32-bit version of Cinebench R10, there's little difference between the two chips, but the Core i7 920 edges ahead when running the 64-bit version.
Rendering performance tests: longer bars are better.
Most 3D games are still not optimised for multi-core chips, which means that graphics cards remain the chief factor affecting game performance. However, the CPU test in the 3DMark Vantage benchmark does exploit several cores and reveals big differences between the processors.
Even so, the overall 3DMark score does not reveal any significant advantages for the new Nehalem processors. To a large extent, this result is confirmed by the tests with real games, including Far Cry 2, Crysis, F.E.A.R. and Call of Juarez.
3DMark Vantage tests: longer bars are better.
3D gaming tests (frames per second): longer bars are better.
It's clear that Intel has now implemented a great many of the features long offered by AMD processors. It's equally clear that Intel has taken those features and improved them. For example, the integrated memory controller in the new Nehalem processors is an impressive demonstration of what's possible with this technology. The re-emergence of the hyperthreading technology that originated with the Pentium 4 is also extremely successful.
In numerous tests, the 2.66GHz Core i7 920 is a better proposition at AU$2800 than Intel's previous fastest processor, the Core 2 Extreme QX9775, at around AU$2600. However, both are painfully expensive in Australia.
Direct comparisons between the two 3.2GHz chips, the older Penryn Core 2 Extreme QX9775 and the new Nehalem Core i7 Extreme 965, show the latest processor to be well over 50 percent faster. That advantage is not confined to professional rendering applications; it also holds true for image editing with software such as Jalbum and Paint .NET, which fully exploit the features of the new architecture. That performance improvement should ensure Nehalem is a success.
Intel's Nehalem processors don't just make the competition look outdated: even its own Core 2 chips can hardly keep up with the new architecture. The first Nehalem processors are priced from AU$600 to AU$2900, but the Lynnfield chips for LGA1160 sockets, due in early 2009, should be cheaper. These desktop chips have only a dual-channel DDR3 memory interface and offer no QPI. But both types of processor should work without problems in desktops.
It is in Intel's interest to make a swift transition to the new architecture. With about 90 million fewer transistors, the Nehalem chip surface is smaller than that of Penryn quad cores. So the potential profit margins on Nehalem processors should be greater, assuming the yield for single-die quad core can approach that for dual-die quad core. Dual-core variants code-named Havendale that use the Nehalem technology are expected in the second quarter of 2009. These processors will be followed by the two-core Auburndale and the four-core Clarksfield mobile versions.
At the moment, AMD can only keep up with Intel chips in the lower part of the desktop range. With Nehalem, Intel has again opened up a large lead for high-end desktops. AMD should strengthen its position in dual- and four-processor servers with the Shanghai chip, due this month. According to existing plans, Intel's Nehalem architecture will only become available for four-processor servers in the second half of 2009.
* VMware Workstation 6.5 may not be optimised for the Nehalem architecture. Nehalem CPUs should deliver better virtualisation performance thanks to direct memory access via EPT (Extended Page Table).
Translation by Toby Wolpe