As Steve Jobs took to the stage yesterday to launch iTunes Europe, the whole enterprise temporarily failed. The company website, the machine that turns music into money, was pushed off the Internet by a denial-of-service attack. It wasn't alone: Microsoft, Google and Yahoo were all hit. Exactly what happened is still foggy, but it looks as if Akamai -- the company that handles the DNS translation of Web site names into their numeric Internet addresses -- was the focus of the attack.
At the other end of the Web content spectrum, thousands of bloggers were also left staring at blank screens. Dave Winer, the chap behind the free hosting service weblogs.com, had unilaterally decided that the service was costing him too much time and money, and had thrown three thousand blogs into the void without warning. Short of flying the hammer and sickle at a Republican Party convention, it's hard to know how to create more antagonism in such short order.
Again and again, the seismic events small and large that shake the online world have one thing in common -- the existence of single points of failure. With viruses and worms, the factor is usually Microsoft: not that its code is necessarily worse than any other, but that its ubiquity amplifies a single small fault into global vulnerability. With the Internet infrastructure, latent problems in Cisco routers or the reliance on one company to run top-level domains build weak points into a system whose underlying protocol is superbly fault-tolerant.
It's wrong to think of this as primarily a technical issue. Air-accident investigators know that pilot error or a mechanical malfunction is almost never the root cause of a crash: instead, a chain of events leads to the denouement. At any point along that chain, the disaster could have been averted: the weather is bad, so a rapid change in altitude is needed. The pilot is tired and distracted by nearby thunderstorms, so goes past the new level. Air traffic control is overworked, so the mistake goes unnoticed. Another aircraft on an adjacent level has a faulty collision warning system. Result: calamity. Change any one of those factors and the story is dramatically different.
The onus on anyone with responsibility for producing a technical system of importance, whether it's a public Web service or an internal corporate IT system, is to understand these chains of events and to engineer not just the technologies but the skein of factors that surround them. At one level, this is obvious and expected: you wouldn't run a server farm without having a fail-over system in place, or have your comms room reliant on mains electricity without a UPS backup. But as you move up the layers of abstraction, the importance of diversity gets discounted. This isn't because it's any less vital to have alternative strategies in place, but because we have learned to think that it's too difficult.
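At the lowest layer, the fail-over principle is simple enough to sketch. The snippet below is an illustration only -- the provider names and the fetch function are invented, not anyone's real infrastructure -- but it captures the shape of the idea: try the primary, and on failure fall through to independent alternatives rather than giving up.

```python
# Illustrative sketch of fail-over across independent providers.
# Provider names and the fetch callable are hypothetical.

def fetch_with_failover(providers, fetch):
    """Try each provider in turn; return the first successful result.

    providers: ordered list of provider identifiers (primary first).
    fetch: callable taking a provider and returning a result,
           raising an exception on failure.
    """
    errors = {}
    for provider in providers:
        try:
            return fetch(provider)
        except Exception as exc:  # real code would catch specific errors
            errors[provider] = exc
    # Only reached when every provider has failed.
    raise RuntimeError(f"all providers failed: {errors}")

# Usage: a stand-in fetch in which the primary happens to be down.
def fake_fetch(provider):
    if provider == "primary.example.com":
        raise ConnectionError("primary unreachable")
    return f"content from {provider}"

result = fetch_with_failover(
    ["primary.example.com", "backup.example.net"], fake_fetch)
print(result)  # content from backup.example.net
```

The point is not the dozen lines of Python but the precondition they encode: a second, genuinely independent provider has to exist before the loop has anywhere to fall through to.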
There used to be a golden rule in electronic design -- always have a second source. If your product depended on a unique component available from a single company, you were at heightened risk of commercial blackmail, random disaster or supplier incompetence. Whenever possible, design out such parts. With product lifecycles now so short and the urge to get a quick unique advantage so strong, this rule is often ignored -- what would a factory fire at Hitachi's 4GB 1" drive plant do for Apple's iPod Mini strategy?
Yet how many specifications for new IT systems include a requirement to be able to switch to an entirely different supplier for all key components, hardware and software, should any of those components prove dangerously fallible? For suppliers, lock-in is a most desirable state of grace: if you're a Microsoft shop or a major SAP installation, then expect every effort to keep things that way. Might this involve increasing the cost of switching? Most certainly. Do you have to accept that? No.
Open source might seem to be the answer, but it's not invulnerable. It certainly makes a fault-tolerant strategy easier, but by no means guarantees it: while the SCO nonsense is falling apart in slow motion, software patents remain a potent source of danger for everyone. The history of technology in the 20th century is full of cases where resourceful and aggressive companies have used every legal weapon to increase control. Do not assume that just because you have the source and the rights to use it as you will, those rights are necessarily inviolable. They should be -- but "should be" isn't enough.
You cannot assume that any aspect of your strategy is untouchable. Risk analysis is required at every level, even -- especially -- at those levels where your suppliers are assuring you of safety. Suppliers should be able to answer the question: if we want to chuck you out partially or completely, how would we make the transition? And when you have your strategy, you should test it -- but that's a story for another day.
Apple, Google and friends came back onto the Net in short order: it is possible to switch DNS servers quickly, if you are prepared. And those bloggers on weblogs.com who had backups are reloading their lives onto other sites that run compatible software: it's those without the backups who are keening the loudest. The rule for happiness is: don't be at the mercy of others.
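The bloggers' lesson lends itself to one last sketch: a backup only counts if you can verify it and restore from it. The code below is a minimal illustration, not anyone's actual tooling -- the post structure, file name and checksum scheme are all invented for the example.

```python
# Minimal sketch: snapshot data to JSON and verify the round trip.
# Structure and filenames are invented for illustration.
import hashlib
import json
import os
import tempfile

def snapshot(posts, path):
    """Write posts to path; return a checksum of the stored bytes."""
    data = json.dumps(posts, sort_keys=True).encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
    return hashlib.sha256(data).hexdigest()

def restore(path, checksum):
    """Read posts back, refusing to trust a corrupted file."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != checksum:
        raise ValueError("backup corrupted or incomplete")
    return json.loads(data)

# Usage: back up, then prove the backup actually restores.
posts = {"2004-06-16": "iTunes Europe launches; the DNS goes dark."}
path = os.path.join(tempfile.mkdtemp(), "blog-backup.json")
digest = snapshot(posts, path)
assert restore(path, digest) == posts
```

An untested backup is a hope, not a strategy -- which is why the restore step belongs in the routine, not in the disaster.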