Categorizing failure and lining up ducks

To really understand IT failures, it's helpful to organize, categorize, and line them up like ducks. I group IT failures into management process tribulations and technical glitches.

Categorizing failure

To really understand IT failures, it's helpful to organize, categorize, and line them up like ducks. I'm considering a scheme that groups IT failures into management process tribulations and technical glitches.

The management process tribulations category covers technology implementation projects that go awry. For example, when you read about the latest government project fiasco, with millions down the drain, you're talking about a management process tribulation.

J.Crew's recent website fiasco is a great example of an implementation project gone wrong. Here's how the company described its problem:

During the second quarter of fiscal 2008 we implemented certain Direct channel systems upgrades which impacted our ability to capture, process, ship and service customer orders. As a result, our Direct sales growth rate was lower than recent quarterly trends. We expect the impact of the systems upgrades to continue into the second half of fiscal 2008.

Rather than a single, bounded point of failure, the mess resulted from breakdowns during a larger implementation process. In this case, management seems to have pressured IT to release the site before completing full scalability testing.

Most project-related failures fall into the management process tribulations category.

Technical glitches occur when a hardware, software, or network component fails unexpectedly. The failure could be software bugs in a previously deployed system, a network link that goes down, or a hardware component that stops working.

As an example, Netflix's recent shipping problems were caused by hardware that failed suddenly. From the company's event post-mortem:

On Monday, 8/11, our monitors flagged a database corruption event in our shipping system. Over the course of the day, we began experiencing similar problems in peripheral databases until our shipping system went down....

With some great forensic help from our vendors, root cause was identified as a key faulty hardware component. It definitively caused the problem yet reported no detectable errors.

The failure wasn't connected with an ongoing technical deployment (at least the company didn't disclose that fact); rather, the hardware just died.

Another technical glitch occurred at Los Angeles International Airport last year. A Customs and Border Protection office suffered an equipment malfunction:

Around 1:30 p.m., the CBP experienced problems accessing its database containing information on international travelers. After [many] hours of troubleshooting, the issue was finally resolved at 11:45 p.m. The real culprit: a failed router.

The problem arose when hardware just died, a clear indicator of a technical glitch.


Technical glitches are generally easier to analyze from a management perspective than management process tribulations. Since specific technical malfunctions generally cause these failures, troubleshooting and repair are typically highly technical functions.

On the other hand, failures caused by management process tribulations usually involve project teams and complex human interactions. These failures are often rooted in organizational culture, project management maturity, and so on.

Regardless of cause or category, IT failures remain a significant cause of business interruption. Understanding the causes of past failures is the best way to prevent both tribulations and glitches in the future.


Please share your thoughts on this classification scheme via talkback, email, or Twitter. I've started writing a book on IT failures and would love your comments.

[Image from tunachilli.]

