Making faster chips that do more is the only thing that keeps Intel going. Like all chipmakers, it announces better products with monotonous regularity, backed up with fanfares about finer physics and smarter designs. Behind the scenes, however, there's an equally constant and much less well publicised battle in progress: new processes and innovative designs only work once you've got the bugs out.
Testing and fixing 400-million-transistor chips is as hard as it sounds. Although you can design and check a new circuit in simulation, there comes a time when you have to press the button and create the first silicon wafer. Experience says that each new major design has thirty to forty show-stopping bugs that have to be tracked down, repaired and re-checked before you get something you can ship.
In the old days, testing a circuit built out of discrete parts was a matter of slapping on probes, looking at voltages, soldering in new parts and rewiring the design until it worked. In the age of sub-micron, half-billion part single chips, the techniques are exactly the same -- but instead of probes, you use laser beams and infra-red microscopes that can see through silicon, and soldering in new parts involves carving micron-wide holes through the chips and beaming new connections through vacuum chambers using particle accelerators.
The first problem when your chip is down is finding the miscreant part. The hard part isn't working out where to look: the real fun is that these days, chips are mounted upside-down in their packaging. All the working connections are covered by the network of pins carrying signals between the circuit and the outside world: the only part available to the testers is the backside of the chip, a featureless expanse of silicon with no parts exposed.
Here, physics gives the testers their first break: silicon is transparent to infra-red (IR) light. If you shave off as much of the backside as possible and hook up an IR microscope, you can see the individual devices as they operate. Moreover, if you focus a very low power IR laser on a transistor and monitor the reflection, you can detect tiny changes in the beam as the transistor switches between on and off. This is the exact equivalent of sticking a voltmeter on a particular component in an old television set.
Another quirk of transistor design is that every ten thousand or so times a transistor switches, it gives off a solitary photon of IR light. Stick your chip in front of a camera and watch for those photons, and you can trace complex events as they percolate across your chip. You have to count individual photons over quite a long time, and do probabilistic analysis to screen out noise, but suddenly quite subtle interactions on your circuit become visible.
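The probabilistic screening can be pictured with a small sketch. This is not Intel's method, just one plausible way to ask whether a photon count from one spot on the chip is more than the detector's background noise would produce by chance. The function names, the dark-count rate, the exposure time and the significance threshold `alpha` are all assumptions made for illustration.

```python
import math

def poisson_tail(k, mean):
    """Probability that a Poisson(mean) source yields k or more counts."""
    if k <= 0:
        return 1.0
    # Sum P(X = i) for i = 0 .. k-1, then take the complement.
    term = math.exp(-mean)
    cdf = term
    for i in range(1, k):
        term *= mean / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def is_switching(photon_count, dark_rate, exposure_s, alpha=1e-3):
    """True if the count is unlikely to be background noise alone.

    dark_rate is the assumed background count rate (counts per second)
    for this spot; alpha is an assumed false-alarm probability.
    """
    expected_noise = dark_rate * exposure_s
    return poisson_tail(photon_count, expected_noise) < alpha
```

With an assumed background of one count per second over a ten-second exposure, a reading of 30 photons stands well clear of the noise, while a reading of 12 does not. The longer the exposure, the fainter the activity that can be separated from the background, which is why the counting takes so long.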
These techniques work well in checking for logic errors, but most of the time you're more interested in finding out why your chips don't work quite as fast as you'd like. Faster chips mean fatter prices, which is an equation of intense interest, but working out which part of the circuit is the first to fail as you crank up the gigahertz introduces many more variables than just sorting out a mistake in the wiring.
Once again, physics comes to the rescue. Not only can a laser beam reflect the actions of a transistor, but it can influence them. Turn up the wick a bit and you alter the operating conditions of parts of the circuit: get a chip running just at the edge of failure and scan it with a laser, and when that laser hits the problematic part, the chip will stop working. By matching that scan with the map of the chip, and gradually altering parameters until you get just the one repeatable failure, you can highlight the miscreant.
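In outline, that search is a raster scan wrapped in a parameter sweep. The sketch below is a toy model under stated assumptions, not a description of the real equipment: `chip_passes(x, y, margin)` is a hypothetical callback that runs the test pattern with the laser parked at grid position (x, y) and reports whether the chip still works, and `margin` stands in for whatever operating parameter the testers adjust.

```python
def find_sensitive_sites(chip_passes, width, height, margin, repeats=3):
    """Return grid positions where the chip fails on every repeated run."""
    sites = []
    for y in range(height):
        for x in range(width):
            # Keep only repeatable failures: every run at this spot must fail.
            if all(not chip_passes(x, y, margin) for _ in range(repeats)):
                sites.append((x, y))
    return sites

def isolate_single_failure(chip_passes, width, height, margins):
    """Step through margins until exactly one site fails repeatably."""
    for margin in margins:
        sites = find_sensitive_sites(chip_passes, width, height, margin)
        if len(sites) == 1:
            return margin, sites[0]
    return None  # no setting in the sweep isolated a single site
```

The returned coordinates are what would then be matched against the map of the chip to name the offending part.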
By now, you've expended a few million dollars in custom-built test equipment, a few million more in building the chip under test in the first place, and you have a solid idea of where it's going wrong. But you still have a broken chip. In the old days, you'd puzzle out a fix, rework the design, send it back to production and wait for another iteration to come back for testing. If you were lucky, you'd then have a fully working part. You might instead just uncover another bug, and have to go into the cycle again. In Intel parlance, this is called 'peeling the onion': it can turn a good product into an expensive also-ran if schedules slip while the competition ship.
Peeling the onion is now a job for the nanosurgeons, which is a term for the engineers who edit and rewire components on-chip. This time, a beam of gallium ions is focused above the affected area on a faulty chip, and the backside etched away by combining the high-energy ions with corrosive chemicals. Faulty parts can then be eroded away completely or partially excised, and new connections formed by laying down a thin wire of metal ions to another exposed area. When a new part needs to be added, the engineers find the nearest spare -- between 1 and 3 percent of a chip is given over to these normally unused 'bonus' components -- and wire to that. The resultant patched chip can be passed back to the testers in a matter of days, and the fix verified. By cutting out the entire refabrication cycle, this can remove months of delay from the debugging process.
One final tuning process involves checking the circuit for abnormal power usage. The chip is run in test mode, while an ultra-sensitive IR imaging system produces a thermal map of the silicon. Any area that uses more power than the rest appears in stark relief: the system is so exquisitely sensitive that temperature differentials of 0.01 degrees Celsius can be detected. This shows up leakage currents flowing in chips in hibernation or suspension.
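Reading such a thermal map comes down to comparing each spot with the rest of the die. As a minimal sketch, assuming the map arrives as a grid of temperatures in degrees Celsius and taking the median as the baseline (both are assumptions, as is the function name), the hot spots could be picked out like this:

```python
from statistics import median

def find_hotspots(thermal_map, threshold_c=0.01):
    """Return (x, y) cells warmer than the die's median by more than threshold_c."""
    baseline = median([t for row in thermal_map for t in row])
    hotspots = []
    for y, row in enumerate(thermal_map):
        for x, temperature in enumerate(row):
            if temperature - baseline > threshold_c:
                hotspots.append((x, y))
    return hotspots
```

The default threshold mirrors the 0.01 degree sensitivity quoted above. A real system would also have to allow for sensor noise and for the heat that spreads out from each source.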
All these techniques are in use on Intel's 90nm architectures. Because of limitations due to the wavelength of IR light -- already a factor of ten larger than the components it investigates, and thus unsuited to pinpoint fine details -- another generation of test technologies is required for the 65nm chips coming up next. Intel isn't discussing those, nor how it proposes to test complex three-dimensional structures as found in advanced transistor geometries and nanotech devices. That it feels free to talk about the current test systems is an indication of their maturity -- something also shown by the way Intel uses them to produce batches of fixed chips to feed back into its verification process.
Other companies have similar processes: in fact, Silicon Valley works in part by chip companies working with test equipment makers to develop these new ideas and letting them sell the results to all comers. The original company gets patents and licence fees, first dibs on the equipment and a lead of a year or two in using the techniques, but new concepts are quickly shared out between companies who appear on the surface to be sworn enemies. In this, as much as in the basic physics, chip companies ensure a regular flow of all the technologies necessary to keep new developments coming out of the labs and onto the shelves.