Amazon's Web Services outage: End of cloud innocence?

Cloud computing is learning the harsh reality of resiliency as Amazon Web Services' outage has crossed its second day. Meanwhile, startups and a host of other AWS customers are in uncharted waters. What have we learned?
Written by Larry Dignan, Contributor


On Wednesday, the common belief was that startups could build their entire infrastructure on AWS. Set the servers up and forget them. Features like availability zones, offered for an extra fee, were supposed to eliminate any single point of failure. Some startups took advantage of them and others didn't.

Given that AWS' Northern Virginia data center has been out of whack for more than 24 hours, it's clear you need to procure more than one cloud. You need a backup for your cloud provider's backup.

Also: Amazon's N. Virginia EC2 cluster down, 'networking event' triggered problems

The good news for AWS customers is that the service appears to be coming online again. Amazon said in its most recent update:

2:41 AM PDT We continue to make progress in restoring volumes but don't yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.

6:18 AM PDT We're starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we'll reach a point where a minority of these stuck volumes will need to be restored with a more time-consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we'll let folks know. As volumes are restored, they become available to running instances; however, they will not be able to be detached until we enable the API commands in the affected Availability Zone.

The AWS fallout is going to be far and wide. Here's a look at some of the key issues:

The blame game only goes so far. First, it's clear that Amazon's communication could be better. But data centers do fail, and it's up to customers to make sure their supply chain, which in the Web's case means Amazon, is backed up. Amazon failed. So did some of its customers, who didn't plan for that failure. Startups will have to plan better; their customers aren't going to give them a complete free pass.

Amazon will get better. To call this debacle a learning experience would be an understatement. Communication will improve, and availability zones are likely to become availability regions. Service level agreements (SLAs) will matter more. Gartner's Lydia Leong has a great overview of what went wrong. Here's what she said about SLAs and Amazon:

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store (EBS) and Relational Database Service (RDS) which weren’t, and neither of those services have SLAs.
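Leong's back-of-the-envelope figure is easy to check. A minimal sketch of the arithmetic (the function name here is illustrative, not part of any AWS tooling):

```python
# Allowed annual downtime under an availability SLA.
def allowed_downtime_hours(sla_percent: float, hours_per_year: float = 365 * 24) -> float:
    """Hours per year a provider can be down without breaching the SLA."""
    return (1 - sla_percent / 100) * hours_per_year

# EC2's 99.95% SLA permits roughly 4.4 hours of downtime per year,
# consistent with the "about 4.5 hours" figure quoted above.
print(round(allowed_downtime_hours(99.95), 1))
```

A 99.9% SLA, by contrast, would allow nearly 9 hours a year, which is why the decimal places in an SLA matter more than they look.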

Architecture will garner more attention. Bob Warfield noted:

Most SaaS companies have to get huge before they can afford multiple physical data centers if they own the data centers. But if you’re using a Cloud that offers multiple physical locations, you have the ability to have the extra security of multiple physical data centers very cheaply. The trick is, you have to make use of it, but it’s just software. A service like Heroku could’ve decided to spread the applications it’s hosting evenly over the two regions or gone even further afield to offshore regions.

This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for a relatively few apartments, and the operational flexibility to make it easy via dashboard to automatically allocate their tenants to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that ultimately will make it easier to scale the software too, assuming its usage is decomposable to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.

This then, is a pretty basic question to ask of your infrastructure provider: “How easy do you make it for me to access multiple physical data centers with attendant failover and backups?”
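Warfield's "few apartments" idea can be sketched in a few lines. The sketch below is purely illustrative: it assumes a small tenant directory that pins each cell (apartment) to a region and supports migration between them. None of these class or method names correspond to a real AWS or Heroku API.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """An 'apartment': one deployable unit of the app, pinned to a region."""
    name: str
    region: str  # e.g. "us-east-1", "us-west-1"

class TenantDirectory:
    """Hypothetical mapping of tenants to cells, with dashboard-style migration."""
    def __init__(self, cells):
        self.cells = {c.name: c for c in cells}
        self.assignment = {}  # tenant_id -> cell name

    def place(self, tenant_id: str, cell_name: str) -> None:
        self.assignment[tenant_id] = cell_name

    def migrate(self, tenant_id: str, new_cell: str) -> None:
        # In practice this would also replicate the tenant's data first.
        self.assignment[tenant_id] = new_cell

    def region_for(self, tenant_id: str) -> str:
        return self.cells[self.assignment[tenant_id]].region

# Spread tenants across two regions; evacuate one when it fails.
directory = TenantDirectory([Cell("apt-1", "us-east-1"), Cell("apt-2", "us-west-1")])
directory.place("acme", "apt-1")
directory.migrate("acme", "apt-2")   # us-east-1 is down; move the tenant west
print(directory.region_for("acme"))
```

The point of the sketch is Warfield's: once tenants map to a few apartments rather than one giant one, moving them out of a failing region is a directory update plus a data copy, not a re-architecture.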

Welcome to the new world of cloud computing. You'll need multiple cloud providers. Resiliency still matters, whether the infrastructure is physical or virtual. You wouldn't rely on a single supplier for steel, would you? Going forward, you'll use AWS, Rackspace, and maybe a few others.
