When it comes to computer foul-ups, there are so, so many to choose from. Why, just in the last 12 months, we had a major internet security hole, Heartbleed, that hit pretty much every "secure" web server on the planet. That was followed by Shellshock, which had the potential to be even worse.
But, hey, did any of those almost cause World War III? Black out the power for tens of millions? Smack the stock market with a 22.6 percent loss? Knock out 10 percent of the internet? I think not!
No, for the real bad ones, we need to go back a few years.
On September 26, 1983, Soviet Air Defence officer Stanislav Petrov was having just another quiet night at work at Serpukhov-15, the secret bunker from which the Soviet Union monitored its early-warning satellites, when the missile launch alarms went off. The satellites were reporting that the US had just launched a first-strike nuclear missile attack. Petrov was the man on the spot. If he had reported that the USSR was under attack, the Soviet Union's leadership might have launched a counter-strike, and none of us would be reading this now.
Instead, Petrov took it upon himself not to forward news of the "attack". He told The Washington Post that he "had a funny feeling in my gut" that the alarm was false. He had more than just a feeling: the early-warning radar wasn't showing anything, and he reasoned that the US would surely not open an attack with only a handful of missiles. His human logic and courageous decision saved the world from a real possibility of World War III starting by computer error.
Long before the dot-com collapse of the early 2000s, technology contributed to the New York Stock Exchange crash of 1987. The market had been nervous about SEC investigations of insider trading for months, and then, on October 19, 1987, Black Monday, their jitters turned into a panic.
Stock owners rushed to leave the market, and the early computer trading programs hit their stop-loss triggers and sent out sell orders. Too, too many sell orders. The printers couldn't keep up with the orders, leaving everyone lagging behind, blind to what was really happening in the market. Panic fed on panic, and by day's end, the Dow had lost 22.6 percent of its value and the "Greed is Good" go-go '80s were done.
On November 2, 1988, I was at work at NASA's Goddard Space Flight Center in the data communications branch. Everything was fine. Then, our internet servers, running SunOS and VAX/BSD Unix, slowed to a stop.
We didn't know it yet, but we were fighting the Robert Morris Internet Worm. Within 24 hours, before a patch was out, 10 percent of the internet was down and the entire global network had slowed to a crawl.
Unlike the hundreds of thousands of hackers who would follow him, Morris hadn't meant to damage the internet. He thought his little experiment would spread far more slowly and not cause any real problems. He was wrong. He had created not only the first serious computer worm, but also the first distributed denial-of-service (DDoS) attack.
On January 15, 1990, more than half of the AT&T network crashed. For nine hours, over 75 million long-distance calls went awry.
The villain? An error in a single line of code in a software patch made weeks earlier to AT&T's computer-operated electronic switches (4ESS).
It started when the 4ESS switch in New York City went down to perform a routine self-test. It informed the other switches that it could not take any more calls until further notice. After the reset, it began to distribute the call signals that had backed up while it was offline.
So far, so good, but when the other switches received a message that a call from New York was on its way, they began to update their records. Ten milliseconds later, another New York message arrived before the first message had been handled. This second message overwrote other communications data. The switch was smart enough to realize this, and started its own routine self-test.
You can see the cascade of starts, stops, and restarts rolling back and forth over the AT&T phone system. I've always thought it amazing that the problem only lasted for nine hours.
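That cascade -- a reset switch whose back-to-back status messages knock its neighbors into resets of their own -- can be sketched as a toy simulation. Everything here (the switch names, the topology, the trigger rule) is illustrative; it is nothing like AT&T's actual 4ESS code:

```python
# Toy model of the 1990 cascade: when a switch resets, its burst of
# status messages corrupts each neighbor's state, sending that
# neighbor into its own self-test reset -- and the cycle repeats.
from collections import deque

# Hypothetical three-node network; the real AT&T network had 114 switches.
neighbors = {
    "NYC": ["Boston", "Chicago"],
    "Boston": ["NYC", "Chicago"],
    "Chicago": ["NYC", "Boston"],
}

def simulate(first_to_reset, rounds):
    """Replay the first few resets of the cascade."""
    pending = deque([first_to_reset])
    log = []
    for _ in range(rounds):
        switch = pending.popleft()
        log.append(f"{switch} resets")
        # Its back-to-back messages trip each neighbor in turn.
        pending.extend(neighbors[switch])
    return log

print(simulate("NYC", 5))
```

Note that the cascade never dies out on its own: every reset seeds more resets, which is why engineers ultimately had to reduce the message load and roll back the patch rather than wait for the network to settle.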
Intel was betting a lot on its new flagship Pentium chip in 1994. Then, mathematics professor Thomas Nicely discovered that the new chip made floating-point division errors. The root cause was that the divider in the Pentium floating point unit (FPU) was missing five of the 1,066 entries in its division lookup table.
Technically, only engineers and mathematicians were likely to ever see the problem, but the episode proved to be Intel's greatest public relations disaster of all time.
After all, you try to get companies to buy your new high-priced CPU when jokes like the following begin making the rounds.
"Q: How many Pentium designers does it take to screw in a light bulb?"
"A: 1.99904274017, but that's close enough for non-technical people."
You know you're having a bad day when your ship, the USS Yorktown, a Ticonderoga-class cruiser, is stopped dead in the water and has to be towed back to base.
The Yorktown had been retrofitted with 27 computers running dual 200MHz Pentium processors as part of the Navy's Smart Ship program. It wasn't the chip that was the problem this time. No, it was the operating system, Windows NT, that gets the "credit" for this foul-up.
A sailor entered a zero in a database field. Almost any other system would have caught the resulting attempt to divide by zero, but not this one. One thing led to another, and in a matter of seconds, the computer-controlled propulsion system was down and the Yorktown was going nowhere. Eventually, the ship was towed back to its home port of Norfolk, Virginia. The Navy tried to keep the incident quiet, but in 1998, the news leaked.
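Guarding against that kind of input is a one-line check. A minimal sketch in Python -- the function and its behavior are my own invention, not anything resembling the Smart Ship software:

```python
def safe_divide(numerator, denominator):
    """Divide, but treat a zero denominator as bad input to be
    flagged -- not an exception left free to crash the system."""
    if denominator == 0:
        return None  # caller decides how to report the bad field
    return numerator / denominator

print(safe_divide(10, 4))  # 2.5
print(safe_divide(10, 0))  # None
```

The lesson isn't the check itself; it's that input validation has to happen before the bad value propagates into something safety-critical.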
After a journey of 416 million miles and almost 10 months, NASA's Mars Climate Orbiter was ready to go into orbit. Then, the orbiter flew too low in the atmosphere and was destroyed.
The mistake this time? One contractor's software reported thruster data in imperial units -- pound-seconds -- while the JPL navigation team's software expected metric newton-seconds when navigating the spacecraft. Whoops.
The real problem and the real lesson, as Dr Edward Weiler, then NASA's associate administrator for Space Science, said at the time, "was not the error, it was the failure of NASA's systems engineering, and the checks and balances in our processes to detect the error".
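The standard defense against this class of bug is to make units part of the type, so mixed-unit arithmetic either converts explicitly or fails loudly. A toy sketch -- the `Length` class and unit names are my own invention, not JPL code:

```python
METERS_PER_FOOT = 0.3048  # exact, by international definition

class Length:
    """Toy unit-tagged length; every value is stored in meters."""
    def __init__(self, value, unit):
        if unit == "m":
            self.meters = value
        elif unit == "ft":
            self.meters = value * METERS_PER_FOOT
        else:
            raise ValueError(f"unknown unit: {unit!r}")

    def __add__(self, other):
        # Only another Length can be added, so a raw number in the
        # wrong unit can't silently slip into a calculation.
        return Length(self.meters + other.meters, "m")

total = Length(1, "ft") + Length(1, "m")
print(round(total.meters, 4))  # 1.3048
```

Real navigation software uses far richer unit libraries, but the principle is the same: the conversion happens once, at the boundary, instead of being left to every engineer's memory.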
It must have come as quite a surprise to the 8,500 patients of St Mary's Mercy Medical Center in Grand Rapids, Michigan, that they were dead. They certainly didn't know it!
The hospital's new patient management system had marked as dead anyone having procedures done from October 25 through to December 11, 2002. If that's all there was to it, this would have been scary, but not that big a deal. Unfortunately, the system had also notified Social Security and the patients' insurance companies that they were "dead".
Eventually, all the records were cleared up. I see this particular episode as a harbinger of the future. Thirteen years later, our identities are more bound than ever to IT systems outside of our control.
On August 14, 2003, a power systems operator in Ohio was annoyed by an alarm about a power flow problem. So, he turned it off.
Hours later, as evening fell on a hot summer's day, blackouts spread across eight US states and Canada, knocking out 256 power plants and leaving 50 million people in the dark. What had happened? Afterwards, it was discovered that the Unix-powered General Electric energy management system, GE Energy's XA/21, had mishandled the situation thanks to a race condition that stalled its alarm subsystem.
The net result was that FirstEnergy Corp, the Ohio utility where the blackout began, wasn't aware of just how bad the power grid was until the demand was overwhelming one power plant after another. It took more than a week for power to be fully restored.
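A race condition is easy to demonstrate and easy to fix once you know it's there; the hard part, as FirstEnergy's operators found, is knowing. A minimal Python sketch of the pattern, with a shared counter standing in for the alarm queue -- nothing here resembles the XA/21 itself:

```python
import threading

alarm_count = 0
lock = threading.Lock()

def log_alarms(n):
    global alarm_count
    for _ in range(n):
        # Without the lock, two threads can read the same value and
        # one increment is silently lost -- a classic race condition.
        with lock:
            alarm_count += 1

threads = [threading.Thread(target=log_alarms, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(alarm_count)  # 400000 with the lock; unpredictable without it
```

The insidious part is that the unlocked version passes most of the time, which is exactly why the XA/21 bug survived testing and only surfaced under the heavy load of that August afternoon.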
On an average day, Knight Capital Group traded $21.5 billion in stocks. Then, between 9.30am and 10am EST on August 1, 2012, the company botched a software upgrade, and its trading programs began buying high and selling low on 150 different stocks. By the time the bleeding had stopped, KCG had lost $440 million on trades. The company had had only $364.8 million in cash and equivalents on June 30.
Knight Capital survived, but its near-death experience showed just how fragile even a company that deals regularly with billions of dollars can be.
I'm reminded of an old IT saying: "To foul up is human, but to really foul up requires a computer." As all these cases show, it's all too true.