Netflix post-mortem: hardware failure and poor transparency

Massive shipping delays last month at Netflix were caused by hardware failure. Netflix gets the dunce award for lack of transparency in the face of disaster.
Written by Michael Krigsman, Contributor

Massive shipping delays last month at Netflix were caused by hardware failure. Although diagnosing hardware problems can be tough, Netflix gets the dunce award for lack of transparency in the face of disaster.

Here's the post-mortem analysis from Mike Osier, head of IT Operations at Netflix:

On Monday, 8/11, our monitors flagged a database corruption event in our shipping system. Over the course of the day, we began experiencing similar problems in peripheral databases until our shipping system went down. It was going to be a long night.

We suspected hardware and moved the shipping system to an isolated environment, gradually getting DVD shipments moving again. Eventually the system was repaired and shipping returned to normal conditions. With some great forensic help from our vendors, root cause was identified as a key faulty hardware component. It definitively caused the problem yet reported no detectable errors. We’ve taken steps to fortify our shipping system with the acquisition of additional equipment and worked with our vendors to verify we’re in good shape elsewhere.


There are two significant points to consider:

  1. Hardware failure, especially involving network and communications equipment, can be a nightmare to troubleshoot and repair.
  2. Appropriate end-user communications are a critical part of managing IT downtime.

Hardware failure. When hardware fails, symptoms can appear as software, database, or telecom link problems unrelated to the specific equipment that's flaky. For example, ComputerWorld reported on hardware-related problems at Kennedy Airport (emphasis added):

Initially, American's parent company, AMR Corp., said the malfunction yesterday was in software that controlled the baggage-sorting conveyor belt in American's bag room at JFK. However, airline spokesman Tim Wagner said today that the glitch was caused by a hardware issue involving the network between the computer software that controls the sorting function and the baggage conveyor belts. Wagner said the software was working, the conveyors were working, but some of the network hardware was failing.

Along the same lines, I wrote about a Los Angeles Airport failure that sounds similar to Netflix:

Assuming this [failure] to be a wide-area network problem, CBP called Sprint, its carrier, to test the lines. After three fruitless hours of remote testing, Sprint finally sent technicians on-site. Another three hours passed before Sprint finally concluded that transmission lines were not the problem.... The real culprit: a failed router. Update: Turns out it was a bad NIC card. Customs is planning network upgrades so the problem doesn’t happen again.

Lack of transparency. During times of failure, communication with end-users is critical if you value their continued loyalty. In this case, Netflix's post-mortem was anemic and their status updates were too vague.

Netflix should have disclosed which hardware failed, why repairs took so long, and specifically what it has done to prevent future problems. Investors should ask why management's backup and contingency plans handled this mission-critical failure so badly.

Given the financial implications for Netflix, as described by Larry Dignan, the company's crisis management fell short:

These issues are obviously going to cost Netflix some dough. First, the company is losing revenue. That slippage will result in an earnings hit. Meanwhile, Netflix will have to account for reimbursing subscribers (currently credits to one-third of the subscriber base).

It's also worth noting there are currently 145 responses to the post-mortem blog post, most of them negative; that should tell you something.

Stacksafe's Jonah Paransky offers an excellent seven-point framework for communicating IT failure:

  1. Have a communication plan in place and ready to go
  2. Direct communication with your customers is the number one concern
  3. Be prepared to communicate over multiple channels
  4. Over-communicating is better than under-communicating
  5. Expect the failure to become public
  6. Humor probably isn’t the right call
  7. Don’t underestimate the communication necessary after the failure is resolved

Netflix looks bad when measured against this list. Although it offered several superficial blog posts describing status during the outage, Netflix never really disclosed what happened or why.

The company has much to learn about the user relations aspect of IT downtime. Google, Salesforce, and Amazon take system transparency seriously; Netflix should do the same.

