For most people, Intel's Itanium flagship processor remains a mystery. In the five years since the architecture was launched, what was promised as a general purpose server — even workstation — chip has retreated steadily into a high-end niche. It does well there, but arguably not well enough to support Intel's research and development budget, and it remains hidden from most people's daily experience. The core question — how well does it actually work? — is buried beneath salesmanship, positioning and industry FUD, not all of it Intel's.
But now some light has been shed on the chip's actual workings. In "Itanium — A System Implementor's Tale", a paper to be presented at the Usenix '05 conference next week, four researchers from the University of New South Wales and one from HP's Palo Alto labs report on their experience of making Itanium fly. They report favourably on some well-known features of Itanium's design, most notably that it excels at floating-point number crunching, but then explore the reasons why it doesn't do nearly so well on bread-and-butter computing.
One of the major problems they highlight is the poor quality of compiler code generation. Because of Itanium's EPIC design the chip relies heavily on the code it runs being efficiently formatted. This means that the compiler itself has to work out the most effective way to order the instructions it spits out, so that they flow freely through the chip without creating conflicts for internal resources or waiting for results from each other. Furthermore, instructions have to be bundled together in groups that are given to the processor in one go, and the relationship between instructions within a group — and with those in subsequent groups — is critical for efficient work.
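As a loose sketch of that grouping problem (in Python, with made-up register names; on a real Itanium, groups are delimited by explicit stop bits and can span the chip's three-instruction bundles, which this ignores), a compiler packing independent instructions into groups might work like this:

```python
# Toy bundler: pack (dest, sources) instructions into groups of up to three,
# echoing Itanium's three-instruction bundles, such that no instruction in a
# group consumes a value produced earlier in the same group.
# Register names are illustrative only.

def bundle(instructions, width=3):
    """Greedy grouping: close the current group when it is full or on a conflict."""
    groups, current, produced = [], [], set()
    for dst, srcs in instructions:
        conflict = any(s in produced for s in srcs)
        if conflict or len(current) == width:
            groups.append(current)
            current, produced = [], set()
        current.append((dst, srcs))
        produced.add(dst)
    if current:
        groups.append(current)
    return groups

prog = [("r1", ()), ("r2", ()), ("r3", ("r1",)), ("r4", ()), ("r5", ("r3",))]
for g in bundle(prog):
    print(g)
```

Each dependency forces a group boundary here, so five instructions end up in three groups rather than two; a smarter compiler would reorder independent work to keep the groups full, which is exactly the job the researchers say today's compilers do poorly.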
None of the above makes it easy to generate good code, but some of Itanium's features make it even harder. One major issue is instruction latency — the number of clock cycles needed to separate two instructions where one produces a result and the other consumes it. If you ask for the result too soon, the chip stalls, holding up processing for the current and subsequent instruction groups, which can slow things down dramatically.
Most of the common instructions used in application software have a latency of one, so all you have to do is make sure the consuming instruction is in a later group than the producer. However, many instructions necessary for operating system work have latencies of between two and five cycles; some have twelve, and a few run as high as 36. Scheduling these efficiently is very difficult, while the cost of losing that many cycles in a stall is high.
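The scheduling problem can be sketched with a toy model (instruction names and latency values are illustrative, not real Itanium timings, and it assumes one instruction issues per cycle where the real chip issues several): the further a long-latency producer can be separated from its consumer by independent work, the fewer cycles are lost to stalls.

```python
# Toy model of latency-aware scheduling: one instruction issues per cycle
# (a simplification), and a consumer stalls until its operand is ready.
# Instruction names and latencies are hypothetical, not real Itanium figures.

def stall_cycles(schedule, latencies):
    """Count cycles lost to stalls in a linear schedule of (op, dest, sources)."""
    ready = {}              # register -> cycle at which its value is available
    cycle = stalls = 0
    for op, dst, srcs in schedule:
        need = max((ready.get(s, 0) for s in srcs), default=0)
        if need > cycle:    # operand not ready yet: the pipeline stalls
            stalls += need - cycle
            cycle = need
        ready[dst] = cycle + latencies[op]
        cycle += 1
    return stalls

latencies = {"add": 1, "mov_cr": 12}        # one "slow" system-level operation

naive = [("mov_cr", "r1", ()),              # long-latency producer...
         ("add", "r2", ("r1",)),            # ...consumed immediately: big stall
         ("add", "r3", ()), ("add", "r4", ())]

reordered = [("mov_cr", "r1", ()),          # same work, but the independent
             ("add", "r3", ()), ("add", "r4", ()),  # adds fill some latency
             ("add", "r2", ("r1",))]

print(stall_cycles(naive, latencies), stall_cycles(reordered, latencies))  # 11 9
```

Note that reordering recovers only two cycles here: hiding a twelve- or 36-cycle latency needs a long run of independent instructions, which systems code rarely has to hand.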
There are many such issues in Itanium, often associated with critical aspects of operating system design. Operating system calls, where an application passes control to code of a higher privilege, require the chip to change internal levels of protection and reconfigure itself to use trusted resources. Doing this efficiently requires extensive and experimental hand-coding in assembly language and a level of knowledge of the chip's internals that the researchers say is just not available. It's also the sort of thing that can be expected to change between different versions of the processor, which makes the task of compiler writers even harder. Yet getting it wrong, especially with Linux and other operating systems which rely heavily on message passing, imposes a major performance penalty.
Another factor the researchers uncovered is the curious fact that Itanium is not virtualisable. Virtual systems rely on being able to hide some information from programs that don't know they're running on a virtual processor, and on making other information available to the software that's managing the virtualisation. Itanium has instructions that don't fulfil these requirements, meaning they can't be used in virtual systems — if they're found in a program, the virtualisation manager has to strip them out and replace them with its own code.
The researchers' conclusions make interesting reading. "The EPIC approach has proven a formidable challenge to compiler writers, and almost five years after the architecture was first introduced, the quality of code produced by the available compilers is often very poor for systems code. Given this time scale, the situation is not likely to improve significantly for quite a number of years." In particular, they single out the GNU Compiler Collection (GCC) at the heart of Linux development as one of the worst offenders.
That may be pessimistic. The Gelato organisation, dedicated to Linux on Itanium, recently held a workshop to consider these very issues — at least everyone's talking to each other.
But meanwhile, as the researchers point out, the performance of systems on Itanium depends on expert production of hand-crafted assembly code, a task made difficult by the lack of general Itanium experience and Intel's reluctance to produce sufficiently detailed documentation. That's particularly surprising, given that one of the researchers comes from HP — Intel's closest technical partner in Itanium design and the source of many of the ideas underlying the architecture. It may also be significant that Microsoft's delayed Windows Server 2003 Compute Cluster Edition (CCE), designed to run on precisely the sort of high-end clustered systems that Itanium targets, has dropped Itanium support from its first release: here too, one would expect privileged support from Intel that may not have been forthcoming.
Technical difficulties are one thing: nobody working in high performance computing expects a free ride. Technical difficulties that could be solved by greater co-operation and the availability of essential information are harder to excuse, and Intel should be worried that its marketing efforts are not being matched by comprehensive support to the workers at the code face. It would be a shame if the billion-dollar, billion-transistor Itanium effort foundered for lack of a few bits of paper.