Amazon outage ends cloud innocence

Cloud computing learned the harsh reality of resiliency as Amazon Web Services' outage crossed into its second day. Meanwhile, start-ups and a host of other AWS customers are in uncharted waters.
Written by Larry Dignan, Contributor

On Wednesday, the common belief was that start-ups could build their infrastructure on AWS completely. Set the servers up and forget them. Things like availability zones — for an extra fee — would mean you'd get no single point of failure. Some start-ups took advantage of that and others didn't.

Then the outage arrived, bringing down the internet services of some customers such as Reddit and Quora.

Given that AWS' Northern Virginia datacentre has been out of whack for more than 24 hours, following a "networking event" that led to problems with how data is mirrored, it's clear you need to procure more than one cloud. You need a backup for your cloud provider's backup.

Amazon has been working on making a full recovery and hasn't yet been able to carry out a post mortem, according to its service health site.

Yet one thing's for sure: the AWS fallout is going to be far and wide. Here's a look at some of the key issues:

The blame game only goes so far. First, it's clear that Amazon's communication could be better. But datacentres do fail and it's up to customers to make sure their supply chain — in the web's case Amazon — is backed up. Amazon failed. So did some of its customers for not planning better.

Amazon will get better. To call this debacle a learning experience would be an understatement. Communication will improve. And availability zones (AZ) are likely to become availability regions.

Service level agreements (SLAs) will matter more. Gartner research vice president Lydia Leong has a great overview of what went wrong. Here's what she said about SLAs and Amazon:

Amazon's SLA for EC2 is 99.95 per cent for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store (EBS) and Relational Database Service (RDS) which weren't, and neither of those services have SLAs.
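The arithmetic behind Leong's downtime figure is easy to check. The sketch below is only an illustration, assuming a 365-day year and that the SLA percentage applies to the whole year; the function name is made up for this example. It gives roughly 4.4 hours, in line with the "about 4.5 hours" she cites.

```python
# Assumption: a plain 365-day year (8,760 hours), ignoring leap years.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime a provider can incur in the period
    while still meeting the stated availability percentage."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    # 99.95 per cent is the EC2 multi-AZ figure quoted above.
    print(f"{allowed_downtime_hours(99.95):.2f} hours per year")
```

Each extra "nine" cuts the allowance tenfold, which is why the exact SLA wording, and which services it covers, matters more than the headline number.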

Architecture will garner more attention. SmoothSpan principal Bob Warfield noted:

Most SaaS companies have to get huge before they can afford multiple physical datacentres if they own the datacentres. But if you're using a cloud that offers multiple physical locations, you have the ability to have the extra security of multiple physical datacentres very cheaply. The trick is, you have to make use of it, but it's just software. A service like Heroku could've decided to spread the applications it's hosting evenly over the two regions or gone even further afield to offshore regions.

This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for a relatively few apartments, and the operational flexibility to make it easy via dashboard to automatically allocate their tenants to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that ultimately will make it easier to scale the software too, assuming its usage is decomposable to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.

This then, is a pretty basic question to ask of your infrastructure provider: "How easy do you make it for me to access multiple physical datacentres with attendant failover and backups?"
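Warfield's "apartments" idea can be sketched in a few lines. This is a toy model under stated assumptions, not how Heroku or any provider actually works: the class, its methods and the apartment names are all hypothetical, and it tracks only which tenant lives where, not the harder job of moving the tenant's data.

```python
from dataclasses import dataclass, field


@dataclass
class Placement:
    """Tracks which 'apartment' (an independent deployment,
    for example one per region) each tenant currently lives in."""
    apartments: list[str]
    tenant_home: dict[str, str] = field(default_factory=dict)

    def assign(self, tenant: str) -> str:
        """Place a tenant in the least-loaded healthy apartment."""
        load = {a: 0 for a in self.apartments}
        for home in self.tenant_home.values():
            if home in load:  # ignore tenants still parked on a removed apartment
                load[home] += 1
        target = min(self.apartments, key=lambda a: load[a])
        self.tenant_home[tenant] = target
        return target

    def migrate(self, tenant: str, target: str) -> None:
        """Move one tenant to a named apartment."""
        if target not in self.apartments:
            raise ValueError(f"unknown apartment: {target}")
        self.tenant_home[tenant] = target

    def evacuate(self, failed: str) -> list[str]:
        """Take an apartment out of service and rehome its tenants."""
        self.apartments = [a for a in self.apartments if a != failed]
        if not self.apartments:
            raise RuntimeError("no healthy apartment left")
        moved = [t for t, home in self.tenant_home.items() if home == failed]
        for tenant in moved:
            self.assign(tenant)
        return moved


if __name__ == "__main__":
    # Hypothetical apartment names, standing in for two regions.
    placement = Placement(apartments=["east", "west"])
    for name in ["a", "b", "c", "d"]:
        placement.assign(name)
    print(placement.evacuate("east"), placement.tenant_home)
```

The point of the sketch is the shape of the design: if tenants are spread across a few apartments and can be reassigned in software, losing one location becomes a migration problem rather than a total outage.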

Welcome to the new world of cloud computing. You'll need multiple cloud providers. Resiliency still matters whether the infrastructure is real or virtual. You wouldn't have one supplier for steel, would you? Going forward you'll use AWS, Rackspace and maybe a few others.

Via ZDNet US
