As of the latest update this afternoon on Amazon's Service Health Dashboard, only a handful of customers are still waiting for their EBS and RDS instances to be restored after Thursday's harrowing outage. But for everyone involved (not least Amazon's own operations staff) it's been a very long four days (see latest Techmeme discussion). What are the lessons to learn?
1. Read your cloud provider's SLA very carefully
Amazingly, this almost four-day outage has not breached Amazon's EC2 SLA, which, as a FAQ explains, "guarantees 99.95% availability of the service within a Region over a trailing 365 [day] period." Since it is the EBS and RDS services rather than EC2 itself that have failed (and all the failures have been restricted to Availability Zones within a single Region), the SLA has not been breached, legally speaking. That's no consolation for those affected, of course, nor is it any excuse for the disruption they've suffered. But it certainly gives pause for thought.
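To put the 99.95% figure in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a plain 365-day window measured in hours, and the second function asks the purely hypothetical question of what availability a four-day outage would imply if it did count against the SLA (which, as noted above, it does not). The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # trailing 365-day window, expressed in hours


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year that still satisfy the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def implied_availability_pct(downtime_hours: float) -> float:
    """Availability over the year if the service were down for this long."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


# Downtime budget under a 99.95% commitment: a little under four and a half hours.
budget_hours = allowed_downtime_hours(99.95)

# Hypothetical: a full four-day (96-hour) outage counted against the same window.
four_day_availability = implied_availability_pct(4 * 24)

print(f"Allowed downtime at 99.95%: {budget_hours:.2f} hours per year")
print(f"Availability after a 96-hour outage: {four_day_availability:.2f}%")
```

On these assumptions, an outage of this length would consume the annual downtime budget roughly twenty times over, which is why the scope of the SLA matters so much.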
2. Don't take your provider's assurances for granted
Many of the affected customers were paying extra to host their instances in more than one Availability Zone (AZ). Amazon actually recommends this course of action to ensure resilience against failure. Each AZ, according to Amazon's FAQ, "runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone." Unfortunately, this turned out to be a technical specification rather than a contractual guarantee. It will take Amazon quite some effort to repair the reputational damage this event has brought upon it.
Justin Santa Barbara, founder and CEO of FathomDB, was forthright in his blog post on Why the sky is falling:
"AWS broke their promises on the failure scenarios for Availability Zones ... The sites that are down were correctly designing to the 'contract'; the problem is that AWS didn't follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don't know at this point."
While it's easy to be wise after the event, Amazon's vulnerability to this type of failure may have been visible on a deep-enough due diligence exercise. As Amazon competitor Joyent's Chief Scientist Jason Hoffman notes on the company's blog, "This is not a 'speed bump' or a 'cloud failure' or 'growing pains', this is a foreseeable consequence of fundamental architectural decisions made by Amazon."
3. Most customers will still forgive Amazon its failings
However badly they've been affected, providers have sung Amazon's praises in recognition of how much it's helped them run a powerful infrastructure at lower cost and effort. Many prefaced criticisms with gratitude for what Amazon had made possible, such as BigDoor's CEO Keith Smith:
"AWS has allowed us to scale a complex system quickly, and extremely cost effectively. At any given point in time, we have 12 database servers, 45 app servers, six static servers and six analytics servers up and running. Our systems auto-scale when traffic or processing requirements spike, and auto-shrink when not needed in order to conserve dollars."
4. There are many ways you can supplement a cloud provider's resilience
As O'Reilly's George Reese points out, "if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model." It's useful to review the techniques customers have used to minimize their exposure to failures at Amazon.
Twilio, for example, didn't go down. Although the company hasn't explained exactly what its exposure was to the affected Northern Virginia Availability Zones, it has described its architectural design principles in the first entry on its new engineering blog by co-founder and CTO Evan Cooke. These include decomposing resources into independent pools, building in support for quick timeouts and retries, and having idempotent interfaces that allow multiple retries of failed requests. Of course, all this is easier said than done if all your experience is in designing tightly coupled enterprise application stacks that assume a resilient local area network. Cooke's post goes on to describe some of the characteristics that make Twilio's architecture capable of operating in this more fault-tolerant manner. To start with, "Separate business logic into small stateless services that can be organized in simple homogeneous pools." Another step is to partition the reading and writing of data: "if there is a large pool of data that is written infrequently, separate the reads and writes to that data ... For example, by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance."
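To make the retry and idempotency principles concrete, here is a minimal sketch. It is not Twilio's code: `FlakyService`, `call_with_retries` and the idempotency-key scheme are hypothetical names invented for illustration, and the remote timeout is simulated by raising `TimeoutError` rather than by a real network call. The point it shows is that a client can safely retry a failed request only if the server recognises a repeated request and does not apply its side effect twice.

```python
import time
import uuid


class FlakyService:
    """Hypothetical stand-in for a remote service whose replies sometimes get lost."""

    def __init__(self, failures: int):
        self.failures = failures  # how many upcoming calls will time out
        self.processed = {}       # idempotency key -> stored result
        self.charges = 0          # how many times the side effect really ran

    def charge(self, key: str, amount: int) -> str:
        # Idempotent interface: a key seen before does not repeat the side effect.
        if key not in self.processed:
            self.charges += 1
            self.processed[key] = f"charged {amount}"
        # Simulate the work succeeding but the response timing out.
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("response lost")
        return self.processed[key]


def call_with_retries(operation, attempts=3, base_delay=0.01):
    """Run operation(), retrying on timeout with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            time.sleep(base_delay * 2 ** attempt)


service = FlakyService(failures=2)
request_key = str(uuid.uuid4())  # the same key is reused for every retry
result = call_with_retries(lambda: service.charge(request_key, 10))
print(result, "| side effects applied:", service.charges)
```

Without the key check in `charge`, the two retries in this example would have applied the charge three times, which is why retries and idempotent interfaces are listed together.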
Another site that didn't go down is Netflix, which runs all its infrastructure in the Amazon cloud. Again, it's not clear how exposed its operations were to the affected Amazon resources, but a Hacker News thread usefully summarizes some of the principles employed.
5. Building in extra resilience comes at a cost
Bob Warfield describes how a previous company used Amazon.com infrastructure in a way that allowed it to "bring back the service in another region if the one we were in totally failed within 20 minutes and with no more than 5 minutes of data loss." As he goes on to say, the choices you make about the length of outage you're prepared to support have consequences for the cost your customers or enterprise must fund. "Smart users and PaaS vendors will look into packaging several options because you should be backed up to S3 regardless, so what you’re basically arguing about and paying extra for is how 'warm' the alternate site is and how much has to be spun up from scratch via S3."
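The trade-off Warfield describes can be framed as two numbers: how long you can afford to be down, and how much recent data you can afford to lose. The sketch below compares two recovery plans against the targets from his example (20 minutes of outage, 5 minutes of data loss). Everything else is an assumption made for illustration: the `RecoveryPlan` class, the simple additive model of recovery time, and in particular the timings given to the "cold" and "warm" plans, which are invented figures rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    name: str
    backup_interval_min: float  # how often data is copied off-site
    spin_up_min: float          # time to start servers in the alternate region
    restore_min: float          # time to load data onto those servers

    def worst_case_data_loss_min(self) -> float:
        # Anything written since the last backup is lost.
        return self.backup_interval_min

    def recovery_time_min(self) -> float:
        return self.spin_up_min + self.restore_min

    def meets(self, max_outage_min: float, max_data_loss_min: float) -> bool:
        return (self.recovery_time_min() <= max_outage_min
                and self.worst_case_data_loss_min() <= max_data_loss_min)


# Illustrative figures only: a cold site rebuilt from backups versus a warm standby.
cold = RecoveryPlan("cold", backup_interval_min=60, spin_up_min=30, restore_min=45)
warm = RecoveryPlan("warm", backup_interval_min=5, spin_up_min=5, restore_min=10)

for plan in (cold, warm):
    ok = plan.meets(max_outage_min=20, max_data_loss_min=5)
    print(f"{plan.name}: recovery {plan.recovery_time_min()} min, "
          f"data loss up to {plan.worst_case_data_loss_min()} min, meets target: {ok}")
```

The warmer plan meets the target only because more infrastructure is kept running and data is shipped more often, and that is the extra cost being paid for.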
6. Understanding the trade-offs helps you frame what to ask
There are questions you should be asking to satisfy yourself that a cloud service you rely on is not exposing you to a similar failure (or at least that, if it is, you understand this and are willing to bear the consequences in return for a lower cost). Referring to Netflix's practice of randomly killing resources and services in order to test its resilience, Bob Warfield adds this advice:
"That's likely another good question to ask your PaaS and Cloud vendors: 'Do you take down production infrastructure to test your failover?' Of course you'd like to see that and not just take their word for it too."
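The idea of killing resources at random to test failover can be illustrated with a toy simulation. This is not Netflix's tooling: `Pool` and `chaos_round` are hypothetical names, and the "instances" are just strings in memory. Each round removes one randomly chosen instance and then checks whether the pool can still serve a request.

```python
import random


class Pool:
    """Toy pool of interchangeable instances behind one service."""

    def __init__(self, names):
        self.healthy = set(names)

    def kill(self, name: str) -> None:
        self.healthy.discard(name)

    def handle_request(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        # Any surviving instance can serve, since they are assumed stateless.
        return f"served by {sorted(self.healthy)[0]}"


def chaos_round(pool: Pool, rng: random.Random):
    """Kill one random healthy instance, then report whether service survived."""
    victim = rng.choice(sorted(pool.healthy))
    pool.kill(victim)
    try:
        pool.handle_request()
        survived = True
    except RuntimeError:
        survived = False
    return victim, survived


pool = Pool(["app-1", "app-2", "app-3"])
rng = random.Random(0)  # seeded so that runs are repeatable
results = [chaos_round(pool, rng) for _ in range(3)]
for victim, survived in results:
    print(f"killed {victim}: service {'survived' if survived else 'FAILED'}")
```

In this toy model the service survives until the last instance is removed. A real test of this kind is valuable precisely because production systems tend to have hidden dependencies that a simple model like this one does not capture.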
7. Lack of transparency may be Amazon's 'Achilles heel'
Several affected customers have complained of the lack of useful information forthcoming from Amazon during the outage. BigDoor CEO Keith Smith wrote, "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner." GoodData's Roman Stanek called on Amazon to tear down its wall of secrecy:
"Our dev-ops people can't read from the tea-leaves how to organize our systems for performance, scalability and most importantly disaster recovery. The difference between 'reasonable' SLAs and 'five-9s' is the difference between improvisation and the complete alignment of our respective operational processes ... There should not be communication walls between IaaS, PaaS, SaaS and customer layers of the cloud infrastructure."
Amazon's challenge in the coming weeks is to show that it is prepared to give its customers the information they need to build in that resilience reliably. If it does not meet that need and allows others to do better, it may gradually start losing the dominant position it holds today in IaaS provision.