
Cosmic threat to tomorrow's computers

Written by Rupert Goodwins, Contributor

Intel's first two billion transistor processor, details of which appeared yesterday, is quite the beast. Tukwila is four cores of 2GHz Itanium embedded in 30 megabytes of on-chip cache, together with Intel's next-generation memory controller/cross-point interconnect - which, as any supercomputer designer will tell you, is as important as the details of the rest of the logic.

What's most fun, if you're not impressed by muscle-car statistics, is that this next-generation chip is made out of last season's cloth. It is - or will be, since we ain't seen one yet - stamped out of 65 nanometre process tech. Back in the less awesome (but rather more useful) world of more boring PC-type processors, Intel's moved on to 45nm - which goes faster, uses less power and should be more profitable.

So why didn't Intel make Tukwila faster, more efficient and more profitable?

There are lots of possible reasons, some of which Intel will talk about. During the telephone briefing we got before the Tukwila announcement, Justin Rattner appeared unsure why Itanium had stuck with 65nm. (Hmmm.)

One reason that was mentioned, and which grabs my imagination, is SER, the Soft Error Rate. Soft errors happen when a memory or logic circuit returns the wrong answer without there being anything wrong with the circuit itself - the problem goes away the next time new data comes in.

Back in the dawn of silicon circuitry time, SER was a big problem and it took a while to track down the source: tiny traces of radioactive contaminants in chip packaging - well within normal environmental levels - which were blipping alpha radiation into the chips.

Any sort of radiation can blunder into a transistor and change its operating conditions - and when those conditions are necessary to store or move data, bang goes your bit. Alpha radiation is particularly bad news, as an alpha particle is enormous - two protons and two neutrons - and although it's easily stopped by a few centimetres of air or a sheet of cigarette paper, neither is normally present between chip packaging and the die it surrounds.

There was much faffery, and after a while the packaging material suppliers got their act together and their ceramics pure. Although SER was still present from other causes, it went away as a serious issue in mass production devices.

Now, it's coming back. The reason is the size of the components on the chips: they're so small that they're affected more by the sorts of radiation which aren't normally that important - but are much harder to get rid of. The worst, amazingly enough, are cosmic rays of the sort pouring down to Earth from exploding galaxies, neutron stars and other energetic events. These are largely shielded by the atmosphere, but more than enough get through to sea level to cause problems. Engagingly, this also means that SER goes up with altitude - you'll get five times as many just 2,600 feet above sea level, and if you're living in Denver (5,280 feet ASL) expect ten times as many.

Cosmic rays are now the biggest source of soft errors - and they're common enough that on average, a gigabyte of RAM will have one soft error every couple of weeks. There are lots of ways to find and fix single-bit errors - and to spot larger problems, which is often more than good enough. Now, in order to maintain a high level of reliability, Tukwila has had to adopt some of those ways not just in memory, but in the latches which hold data temporarily as it moves around the processor's logic, maths and instruction handling circuits.
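To put that RAM figure in perspective: at one soft error per gigabyte per fortnight, a server with 32GB of unprotected memory would see a flipped bit roughly every ten hours. The usual fix is a SECDED code - single-error correct, double-error detect - from the extended Hamming family. Below is a minimal Python sketch of the idea on 4-bit words; it's purely illustrative (real ECC hardware does this over 64-bit words in parallel logic, and nothing here is Intel's actual scheme).

```python
# Illustrative SECDED: extended Hamming(8,4).
# Encodes 4 data bits into 8; corrects any single flipped bit,
# detects (but cannot fix) any two.

def encode(d1, d2, d3, d4):
    """Pack 4 data bits into the 8-bit codeword [p1,p2,d1,p3,d2,d3,d4,p0]."""
    p1 = d1 ^ d2 ^ d4               # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4               # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]
    return word + [sum(word) % 2]   # overall parity: double-error detection

def decode(word):
    """Return (data_bits, status); fixes one flip, flags two."""
    w = list(word)
    s = ((w[0] ^ w[2] ^ w[4] ^ w[6])           # syndrome bit 1
         | (w[1] ^ w[2] ^ w[5] ^ w[6]) << 1    # syndrome bit 2
         | (w[3] ^ w[4] ^ w[5] ^ w[6]) << 2)   # syndrome bit 4
    overall = sum(w) % 2        # odd => an odd number of bits flipped
    if s and overall:           # single flip: the syndrome is its position
        w[s - 1] ^= 1
    elif s:                     # syndrome set but parity even: two flips
        return None, "double error detected"
    return [w[2], w[4], w[5], w[6]], "ok"

word = encode(1, 0, 1, 1)
word[5] ^= 1                    # a cosmic-ray strike flips one bit
assert decode(word) == ([1, 0, 1, 1], "ok")
```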

I haven't seen the paper which discusses Tukwila, so I don't know which technique, if any, Intel's disclosing - but whatever it is, it will involve a significant number of transistors, none of which will improve throughput, and all of which will suck power.
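One textbook latch-hardening technique - and I stress this is an illustration of where those transistors go, not a claim about what Tukwila actually does - is triple modular redundancy: store three copies of each bit and take a majority vote on the way out.

```python
# Triple modular redundancy (TMR), sketched in Python: one logical bit
# stored three times, with a two-out-of-three vote masking a single flip.

def majority(a, b, c):
    """Majority vote: a single corrupted copy is outvoted by the other two."""
    return (a & b) | (b & c) | (a & c)

latch = [1, 1, 1]               # three copies of one stored bit
latch[0] ^= 1                   # a particle strike flips one copy
assert majority(*latch) == 1    # the read-out still sees the right value
```

That's triple the storage plus a voter for every protected bit - extra transistors that add no throughput and burn power, which is exactly the bill described above.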

That's at 65nm. At 45nm the problem will be considerably worse - indeed, Rattner said that he thought Itanium may miss out on 45nm altogether and jump straight to 32nm. My bet is that Intel has some new ideas it wants to try which it can only engineer into the next generation of transistor at 32nm - the problem's going to get so much worse that it may need fundamental changes in basic design.

Which does raise a question or two, mostly about how the 45nm processors that do exist will cope with soft errors. It's not something that semiconductor companies like talking about in public (customers who most worry about SER tend to be good at keeping secrets), but they'll have to start.

Oh, and next time your computer crashes for no good reason - it might just be because it got hit by cosmic crap spat out by a rotating neutron star 50,000 years ago. Cooler excuse than a dodgy sound driver, at least.
