The top 10 IT disasters of all time

Summary: From faulty satellites nearly causing World War III to the Millennium Bug, poorly executed technology has had a lot to answer for over the years.

Following the loss of the personal records of some 25 million child benefit recipients by Her Majesty's Revenue & Customs this month, the UK government will be acutely aware of how quickly mismanagement of technology can lead to serious problems.

While technology wasn't to blame per se in the HMRC data loss, there are plenty of recorded examples where faulty hardware and software have cost the organizations concerned dearly, both financially and in terms of reputation--and resulted in some near misses for the public.

Here's our considered list of some of the worst IT-related disasters and failures. The order is subjective--with number one being the worst--so feel free to comment using Talkback below if you disagree or have suggestions for disasters we may have missed.

1. Faulty Soviet early warning system nearly causes WWIII (1983)
The threat of computers purposefully starting World War III is still the stuff of science fiction, but accidental software glitches have brought us too close in the past. Although there have been numerous alleged events of this ilk, the secrecy around military systems makes it hard to sort the urban myths from the real incidents.

However, one well-recorded example happened back in 1983, and was the direct result of a software bug in the Soviet early warning system. The system reported that the United States had launched five ballistic missiles. However, the duty officer for the system, one Lt Col Stanislav Petrov, claims he had a "funny feeling in my gut", and reasoned that if the U.S. was really attacking it would launch more than five missiles.

The trigger for the near apocalyptic disaster was traced to a fault in software that was supposed to filter out false missile detections caused by satellites picking up sunlight reflections off cloud-tops.

2. The AT&T network collapse (1990)
In 1990, 75 million phone calls across the U.S. went unanswered after a single switch at one of AT&T's 114 switching centers suffered a minor mechanical problem and shut down the center. When the center came back up soon afterwards, it sent a message to other centers, which in turn caused them to trip, shut down and reset.

The culprit turned out to be an error in a single line of code--not hackers, as some claimed at the time--that had been added during a highly complex software upgrade. American Airlines alone estimated this small error cost it 200,000 reservations.

3. The explosion of the Ariane 5 (1996)
In 1996, Europe's newest unmanned satellite-launching rocket, the Ariane 5, was intentionally blown up just seconds after taking off on its maiden flight from Kourou, French Guiana. The European Space Agency estimated that total development of Ariane 5 cost more than $8bn (£4bn). On board Ariane 5 was a $500 million (£240 million) set of four scientific satellites created to study how the Earth's magnetic field interacts with the solar wind.

According to a piece in the New York Times Magazine, the self-destruction was triggered by software trying to stuff "a 64-bit number into a 16-bit space."

"This shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data--the sideways velocity of the rocket--from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software," the article stated.
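
To make the quoted failure mode concrete, here is a minimal Python sketch of a narrowing conversion from a 64-bit floating-point value to a 16-bit signed integer. It is an illustration only, not the Ariane flight code (which was not written in Python); the function names and the sample values are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a signed 16-bit integer


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit integer, refusing values that do not fit."""
    n = int(value)  # truncate toward zero
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return n


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp out-of-range values to the nearest 16-bit limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    print(to_int16_checked(1234.5))      # a small value fits
    print(to_int16_saturating(40000.0))  # clamped instead of failing
    try:
        to_int16_checked(40000.0)        # too large for 16 bits
    except OverflowError as err:
        print("overflow:", err)
```

The point of the sketch is that the overflow itself is unremarkable; what matters is what the surrounding system does when the conversion fails. An identical backup running identical code will hit the same error on the same input.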

4. Airbus A380 suffers from incompatible software issues (2006)
The Airbus issue of 2006 highlighted a problem many companies can have with software: what happens when one program doesn't talk to another? In this case, the problem was caused by two halves of the same program, the CATIA software that is used to design and assemble one of the world's largest aircraft, the Airbus A380. This was a major European undertaking and, according to Business Week, the problem arose with communications between two organizations in the group: French Dassault Aviation and a Hamburg factory.

Put simply, the German system used an out-of-date version of CATIA and the French system used the latest version. So when Airbus was bringing together two halves of the aircraft, the different software meant that the wiring on one did not match the wiring in the other. The cables could not meet up without being changed.

The problem was eventually fixed, but only at a cost that nobody seems to want to put an absolute figure on. But all agreed it cost a lot, and put the project back a year or more.

5. Mars Climate Orbiter metric problem (1998)
Two spacecraft, the Mars Climate Orbiter and the Mars Polar Lander, were part of a space program that, in 1998, was supposed to study the Martian weather, climate, and water and carbon dioxide content of the atmosphere. But a problem occurred when a navigation error caused the orbiter to fly too low in the atmosphere and it was destroyed.

What caused the error? A sub-contractor on the NASA program had used imperial units (as used in the U.S.), rather than the NASA-specified metric units (as used in Europe).
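
A unit mismatch of this kind is easy to sketch. The Python example below assumes, purely for illustration, that the quantity being exchanged is an impulse reported in pound-force seconds and read as if it were newton-seconds; the function name and the sample value are hypothetical, and only the conversion factor is a fixed fact.

```python
LBF_TO_NEWTON = 4.4482216152605  # one pound-force expressed in newtons


def lbf_s_to_newton_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTON


reported = 100.0                        # hypothetical figure, in pound-force seconds
misread = reported                      # consumer wrongly treats it as newton-seconds
correct = lbf_s_to_newton_s(reported)   # what the figure means in metric units

# The misread value understates the true impulse by the conversion factor.
print(f"understated by a factor of {correct / misread:.2f}")
```

Neither number looks wrong on its own, which is why a bare float passed between two teams can hide the error; carrying the unit in the name or the type is one way to make the mismatch visible.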

6. EDS and the Child Support Agency (2004)
Business services giant EDS waded in with this spectacular disaster, which assisted in the destruction of the U.K.'s Child Support Agency (CSA) and cost the taxpayer over a billion pounds.

EDS's CS2 computer system somehow managed to overpay 1.9 million people and underpay around 700,000, partly because the Department for Work and Pensions (DWP) decided to reform the CSA at the same time as bringing in CS2.

Edward Leigh, chairman of the Public Accounts Committee, was outraged when the National Audit Office subsequently picked through the wreckage: "Ignoring ample warnings, the DWP, the CSA and IT contractor EDS introduced a large, complex IT system at the same time as restructuring the agency. The new system was brought in and, as night follows day, stumbled and now has enormous operational difficulties."

7. The two-digit year-2000 problem (1999/2000)
Many IT vendors and contractors did very well out of the billions spent to avoid what many feared would be the disaster related to the Millennium Bug. Rumors of astronomical contract rates and retainers abounded. And the sound of clocks striking midnight in time zones around the world was followed by... not panic, not crashing computer systems, in fact nothing more than New Year celebrations.

So why include it here? That the predictions of doom came to naught is irrelevant, as we're not talking about the disaster that was averted, but the original disastrous decision to use double digits for the date field in computer programs, and to keep using them for longer than was either necessary or prudent. A report by the House of Commons Library pegged the cost of fixing the bug at £400 billion. And that is why the Millennium Bug deserves a place in the top 10.
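
The underlying flaw can be shown in a few lines. This Python sketch uses hypothetical function names and sample years; it simply contrasts date arithmetic on two-digit and four-digit years across the century boundary.

```python
def years_between_yy(start_yy: int, end_yy: int) -> int:
    """Elapsed years when only the last two digits of each year are stored."""
    return end_yy - start_yy


def years_between_yyyy(start_year: int, end_year: int) -> int:
    """Elapsed years when the full four-digit year is stored."""
    return end_year - start_year


if __name__ == "__main__":
    print(years_between_yy(65, 99))        # 1965 to 1999: works as expected
    print(years_between_yy(65, 0))         # 1965 to 2000: "00" sorts before "65"
    print(years_between_yyyy(1965, 2000))  # the same interval with full years
```

With two digits, the year 2000 is indistinguishable from 1900, so any interval, age or expiry calculation that crosses the boundary comes out negative or a century off.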

8. When the laptops exploded (2006)
It all began simply, but certainly not quietly, when a laptop manufactured by Dell burst into flames at a trade show in Japan. There had been rumors of laptops catching fire, but the difference here was that the Dell laptop managed to do it in the full glare of publicity and video captured it in full color.

(Unfortunately, the video capturing the incident appears to have vanished from the web. If you happen to own a copy, please send it to us as it should make interesting viewing again.)

"We have captured the notebook and have begun investigating the event," Dell spokeswoman Anne Camden reported at the time, and investigate Dell did. At the end of these investigations the problem was traced to an issue with the battery/power supply on the individual laptop that had overheated and caught fire.

It was an expensive issue for Dell to sort out. As a result of its investigation Dell decided that it would be prudent to recall and replace 4.1m laptop batteries.

Company chief executive Michael Dell eventually laid the blame for the faulty batteries with the manufacturer of the battery cells--Sony. But that wasn't the end of it. Apple reported issues for iPods and MacBooks and many PC suppliers reported the same. Matsushita alone has had to recall around 54 million devices. Sony estimated at the time that the overall cost of supporting the recall programs of Apple and Dell would amount to between ¥20 billion (£90m) and ¥30 billion.

9. Siemens and the passport system (1999)
It was the summer of 1999, and half a million British citizens were less than happy to discover that their new passports couldn't be issued on time because the Passport Agency had brought in a new Siemens computer system without sufficiently testing it and training staff first. Hundreds of people missed their holidays and the Home Office had to pay millions in compensation, staff overtime and umbrellas for the poor people queuing in the rain for passports. But why such an unexpectedly huge demand for passports? The law had recently changed to demand, for the first time, that all children under 16 had to get one if they were traveling abroad.

Tory MP Ann Widdecombe summed it up well while berating the then home secretary, Jack Straw, over the fiasco: "Common sense should have told him that to change the law on child passports at the same time as introducing a new computer system into the agency was storing up trouble for the future."

10. LA Airport flights grounded (2007)
Some 17,000 passengers were stranded at Los Angeles International Airport earlier this year because of a computer problem. The problem that hit systems at the United States Customs and Border Protection (USCBP) agency was a simple one, caused by a piece of lowly, inexpensive equipment.

The device in question was a network card that, instead of shutting down as perhaps it should have done, persisted in sending the incorrect data out across the network. The data then cascaded out until it hit the entire network at the USCBP and brought it to a standstill. Nobody could be authorized to leave or enter the U.S. through the airport for eight hours. Passengers were not impressed.

(Note: We have purposely omitted incidents that resulted in loss of life.)

Topics: Laptops, CXO, Dell, Hardware, Software, IT Employment

About

Colin Barker is based in London and is Senior Reporter for ZDNet. He has been writing about the IT business for some 30-plus years. He still enjoys it.

Talkback

87 comments
  • Why omitions?

    (Note: We have purposely omitted incidents that resulted in loss of life.)

    Why omit these incidents??? THESE are the ones that count!
    If some flaky software kills hundreds or thousands I would want to know.
    GrimmReaperSound
    • Why omitions? Why not "omissions"?

      Although not an IT disaster, that's certainly a spelling disaster...
      mseiler1
    • Software failures resulting in loss of life...

      Possibly because there's been a number of civil engineering incidents (mainly underdesigned structures) that got embroiled in controversy over the question of whether they were errors in design software or errors on the part of civil engineers not sanity-checking the results?
      Resuna
    • DIA's baggage handling system

      This was one for the textbooks:
      http://www.computerworld.com/managementtopics/management/project/story/0,10801,102405,00.html

      Wow, after more than 10 years, it's still not fixed.
      debuggist
      • Colorado in General

        Hey, don't just go for the DIA baggage fiasco. In Colorado there are the DMV, welfare, and so many IT disasters that Colorado is Mecca to IT disasters.
        thomasderk9
        • Hmmm...

          Wonder if there's something waiting in NORAD like Joshua.
          AbbydonKrafts
  • Purposely omitted errors that cost lives?

    Errors like the Therac-25 disaster seem much worse than delays at LAX caused by a $10 network card.
    eljay001
    • Therac-25...now there's a blast from the past

      I have lost count of the number of times I've been told with a straight face "Come on, bugs in software have never been responsible for any loss of life." I guess the biggest problem is that these incidents get covered up so well. It's a shame that the author of this article seems to want to play an active role in the coverup by purposely omitting them. Problems only get better by shining a spotlight on them, not by ignoring them. The current state of software development is a disgrace. I've worked with one company in the past that even seems to relish in their defects. They took to calling bugs in their software "treasures". Unfortunately it's going to take a disaster of epic proportions to open up some eyes.
      jasonp9
      • This article is too soft with the omission of deadly IT failures

        Interesting article, BUT it is way too soft with the omission of deadly IT failures. I agree that THOSE are the ones MOST worthy of mentioning and reflecting upon. Do we chalk this up to "political correctness"? Is it the fear of lawsuits? Or is it just plain lack of knowledge because the scope is too big?

        racingmustang
        racingmustang
    • Yes, I'm offended by this...

      Yes, I'm offended by this... Mr. Barker didn't do his homework and missed other important errors. I'm sure the author can compile a "Top 10 IT Disasters that Killed People" list. Maybe he is more concerned about money and not about people.

      Around here the Therac problems are used as a case study to teach what not to do when building a complex system. It didn't kill hundreds of people, but those who died by the Therac had a horrible death... and just because some stupid engineers didn't know how to design software.

      That's just an example, let's see if the author can make it better.


      Regards,

      MV
      MV_z
  • Forgot the Windows one that almost killed thousands

    You remember, the one where the FAA switched the reliable UNIX systems that controlled RADAR with Windows systems that needed to be rebooted every couple months. The Windows system crashed and many planes almost did.

    Just goes to show why you don't use Windows for mission critical apps.
    itguy08
    • I see you think you know more than the FAA

      To quote from your cited CNN article:

      "FAA officials said the problem did not present a danger for the planes or their passengers."

      So how did a communications glitch suddenly become something that "almost killed thousands?"
      Confused by religion
      • Win PC's never used in production

        The FAA ATC (Air Traffic Control) never used any Windows OS at a controller position.

        Flight Service Stations, manned facilities that read weather forecasts and accept flight plans from pilots who are unable to access a computer (not the scheduled airlines), may use Win PC's.

        Since they are not providing separation from other aircraft (the only function ATC performs that could place an aircraft in peril, despite what you see on TV and in the movies), this is not an issue.

        SK
        skykingoh
      • You think the FAA Would admit to gross negligence?

        Please. You think they are going to admit that a poor computer system choice would have put thousands in danger?

        But think about it: In flight communication to the ground is a vital thing pilots must do. It ensures you are on the right course, far enough from other planes, runway landing instructions, etc. Lose that communication and bad things can happen.
        itguy08
    • And they replaced the Unix systems, why?

      Because they were so reliable? Guess the Unix systems just couldn't scale up?
      GuidingLight
      • Logical fallacy

        Your argument is that the fact that the Unix systems were replaced proves that the Unix OS is unable to scale up. Leaving aside the utterly laughable stupidity of such an asinine claim, your argument is that there could only be one possible reason for replacing the ten-years-old Unix-based system with a Windows one, when clearly there are many possible reasons for such a decision, including bribery, malfeasance, and stupidity among many others.

        Furthermore, the facts do not support your claim. In fact, the systems replaced were ten years old, which means that not only was the hardware out of date, but also extremely expensive to maintain due to unavailability of legacy parts.

        The decision to change to Windows was made not to improve reliability or scalability, but to reduce costs, by allowing the use of cheap commodity hardware. In fact, the same goal could easily have been met by switching to Linux or BSD as the OS, and that would probably have eliminated the need to reboot the servers on a scheduled basis to prevent them from crashing.
        bmerc
      • Some Idiot decided "Windows Everywhere"

        when it belongs NOWHERE.

        Probably what happened was this:

        System was 10 years old, let's bid it out.
        Windows system was cheapest in the short term (as they all are)
        Long term, Windows ends up costing way more than the alternatives.
        itguy08
        • that's just your opinion itguy08 and not a very good one at that

          that's just your opinion, itguy08, and not a very good one at that. windows has been running the world for a long time; to hear you speak, it's an os that no one uses because it's crap.

          i guess that's why every college has courses on its use and deployment.

          i can understand that you might have had a bad experience with it but to say it's the worst of the worst is nothing but your opinion.

          i have used Linux, windows and mac and i can tell you all of them will hang and crash, and 99.99% of the time it's the poorly coded 3rd party software that crashes, not the os itself.

          Microsoft has and is used in many so called mission critical functions a lot more than any other os out there and if it were as bad as you say the whole world would be at a stand still.

          and from what i see the world is running right along with 95% of the world running windows. so your argument just does not hold water.
          SO.CAL Guy
          • what's multitasking?

            M$ o/s's have crash-preparation software built in, in the form of multitasking that they use.
            They tell each app "you get to run so much code, then you have to give up the cpu until it is your turn again". This works great until the app freezes. Then for all intents and purposes, your computer freezes. Unix type o/s's tell each app "you get the cpu for X time, and then you have to wait". If the app freezes, tough luck bozo, the rest of its work continues unabated.

            I have managed to freeze both windows and Linux computers. Guess which one does it more often, and by what overwhelming amount.
            rmjivaro
          • Utter nonsense

            What you say is true... for Windows 3.x. Windows has had preemptive multitasking since Windows 95, and Windows NT (which is what 2000, XP, and Vista are based on) has *always* had it.

            Neither Unix nor Windows "tells" the app anything. When a process gets the CPU, a timer interrupt is set. When the interrupt fires, it is trapped by the kernel interrupt handler which transfers control back to the scheduler (note: there are some conditions in which the process loses the CPU before the timer fires, e.g. it waits for blocking IO; that's not relevant here however). A frozen user-mode application cannot stop this from happening. In Windows 3.x, which used cooperative multitasking, it was indeed the responsibility of the application to relinquish control back to the scheduler. But in a preemptive multitasking environment, which every modern OS is, this is simply not the case.

            I have not managed to freeze either Windows NT, Linux or Solaris in the last 10 years without it being the fault of a buggy driver or a hardware malfunction.
            Dilandau