Intel's cache problems, what they mean and how to solve them

If you want to get really intimate with hardware, there's nothing like low-level, high-performance coding to remove all barriers between your mind and the silicon. One result, aside from the difficulties it creates during dinner party small talk, is that you get to find out how stuff actually works – which can be very different to how the chip companies tell it.

Take a look at this blog entry. It's from a chap who's on the team for x264, an open source video encoder library. This is a fascinating area: you have to deal with a lot of real-world data very quickly, and modern processors have large amounts of support available if you've got the chops to handle vector mathematics.

But all is not as it seems. In particular, he reports that Intel has a consistent and rather painful problem with its on-processor caches – one that AMD has managed to avoid.

A cache works by making a copy of a large chunk of external, slow memory that it can then feed quickly to the processor in smaller chunks as required. As it's far more efficient to do one large read externally than lots of little ones, this speeds things up enormously.

The size of the large chunk is called a line; Intel, as is common, has 32- or 64-byte lines. If the processor wants to read one byte from memory, the cache intercepts it, does a 32- or 64-byte read of the area containing that byte, passes the one byte back and then expects to fulfil further reads from the same area without bothering the slow main memory.

As you may expect, an awful lot of work goes into designing the cache systems to optimise speed under lots of circumstances. One particular problem is a cacheline split – that's where the processor asks for memory that overlaps the end of one line and the beginning of the next. The cache has to do two long reads to fulfil one processor request – bad news.

In most systems, this never happens. Data is always aligned to a 32- or 64-byte boundary; processors, compilers, memory architecture and software alike have rules to keep it that way. Some processors even crash if you try anything else.

But with video encoding, you can't stick to that rule. A lot of modern compression works by following patterns as they move across the screen, using various mathematical methods to spot where something's moved to and then just encoding that movement as a vector. Because objects can be anywhere on screen – and thus in display memory – you have to be able to scoop stuff up with no restrictions connected with the underlying memory architecture. Quite often, you'll just have to read across cache lines, because that's where the real world has put your data.

On AMD, you get a small penalty. On Intel, you get a huge penalty – equivalent to having to go to the next level of cache and move all the data in anew, no matter whether a real read is needed or not. Sometimes, it's far worse. Even when you factor in all the times that there's no split, the average speed of important functions is halved.

Exactly why this happens, nobody (outside Intel) knows. It has the savour of a clever shortcut that saved a lot of gates or neatly fixed some particular problem: the downside would have been recognised, but judged unlikely to happen and unimportant when it did. Hardware engineering is full of this kind of thinking: squeezing the best functionality out of a circuit against deadline and other limitations means having to decide in advance what's going to matter in the real world. Combine that with fallible human foresight and ever more systemic complexity, and it's rather surprising that chip sets work as well as they do. It also makes it easier to understand why drivers, especially those written by third parties, are often late, buggy or poor performers.

What can be done? Look at the way the x264 developer blog works: solid engineering information shared and explained. In the case of the Intel cacheline problem, four possible workarounds are discussed in some detail: it's open source, so made to be passed around. It often takes a lot of blood and tears to find out what's going wrong in these cases – and the instinct can be to hoard what is so expensively bought. Yet when it's shared, its value increases: community amplifies knowledge.

The next step is to spread the awareness of sharing back to the chip manufacturers. It may seem instinctively right to keep secret the details of mishaps and architectural quirks, but they'll be found out anyway. Instead of being thought obstructive, why not get the kudos for honesty and community spirit? That goes a lot further than marketing budget ever can in the minds of users: it'll get your products working better, sooner, and it makes you an attractive place to work when you're out hiring the good guys.

After all, there's nothing smarter than making your mistakes work for you.
