Here's how Amazon ruined my Christmas: After devouring a lovely rib roast with a porcini-spinach stuffing (recipe here in case your stomach is now growling), we all curled up on the couch with hot cocoa and turned on Netflix streaming to watch classic Christmas movies (and past Doctor Who Christmas Specials)... only to get an error message. That's right, in case you missed it, Netflix was down for many users in North America on Christmas Eve and Christmas Day due to issues with Amazon's Elastic Load Balancing (ELB) service in the US East region. It's interesting to note that this is at least the third time issues with the ELB service have caused problems for Netflix, and each time the company has made improvements to prevent a recurrence.
You might be thinking that "ruin" is a strong word to describe what happened to me (and many others) on Christmas Eve, but I use it to illustrate a point: Even though this particular outage was probably not the most severe (in duration or number of customers impacted), it may well be the most costly for Netflix. Why? Because of TIMING. I've been saying for a while that the timing and duration of an outage are more telling indicators of availability performance and business impact than a count of "nines" (99.99%, 99.999%, etc.). If this same outage had occurred just a day or two earlier, the impact would have been significantly different. And unfortunately for Netflix, because of the timing, this is an outage that many customers will remember.
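To make the "nines versus timing" point concrete, here is a minimal Python sketch. The 12-hour outage length and the viewer counts are hypothetical numbers chosen purely for illustration (they are not Netflix figures), and the function names are my own. It shows that two outages of the same length produce exactly the same availability percentage while the impact differs with when they happen.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target such as 99.99."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float) -> float:
    """Yearly availability percentage for a given total downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


def viewer_hours_lost(duration_hours: float, viewers_per_hour: float) -> float:
    """Crude impact measure: outage length weighted by demand at that time."""
    return duration_hours * viewers_per_hour


# Hypothetical inputs, for illustration only.
OUTAGE_HOURS = 12
ORDINARY_EVENING_VIEWERS = 1_000_000
HOLIDAY_EVENING_VIEWERS = 3_000_000

if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% allows {budget:.2f} minutes of downtime per year")

    # Same duration means the same "nines", whatever the date.
    nines = availability_pct(OUTAGE_HOURS * 60)
    print(f"A {OUTAGE_HOURS}-hour outage yields {nines:.3f}% yearly availability")

    # The demand-weighted impact is where timing shows up.
    ordinary = viewer_hours_lost(OUTAGE_HOURS, ORDINARY_EVENING_VIEWERS)
    holiday = viewer_hours_lost(OUTAGE_HOURS, HOLIDAY_EVENING_VIEWERS)
    print(f"Ordinary day: {ordinary:,.0f} viewer-hours lost")
    print(f"Holiday:      {holiday:,.0f} viewer-hours lost")
```

The availability figure is identical in both cases because it only counts minutes; the demand-weighted number is the one that changes with the calendar. A real cost model would also need to account for things like customer goodwill, which a simple multiplication cannot capture.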
I write this not to be punitive towards Amazon or Netflix (or any of the other services that experienced downtime on the 24th/25th), but as a reminder/cautionary tale that:
- Downtime will happen at the worst possible time. When designing continuity plans, it's prudent to hope for the best but plan for the worst. The universe tends to be cruel and somewhat random, so assume your outage will land exactly when demand is highest. Any calculation of the cost of downtime must account for this.
- The cloud is not inherently resilient. Netflix is one of the most mature implementations of cloud resiliency that I have seen, and it still experiences outages. You, not your cloud provider, are responsible for the resiliency of the applications you deploy in the cloud. If you architect your applications to withstand the loss of systems or sites (Netflix, for example, uses its Chaos Monkey and Chaos Gorilla tools to test for this), you will be far better able to withstand failures at the cloud provider.
- Don't take away my Doctor Who Christmas specials. Seriously, don't do it.