Between the Lines

Larry Dignan, Andrew Nusca and Rachel King

Amazon's Web Services outage: End of cloud innocence?

By Larry Dignan | April 22, 2011, 7:27am PDT

Cloud computing is learning the harsh reality of resiliency as Amazon Web Services’ outage has crossed its second day. Meanwhile, startups and a host of other AWS customers are in uncharted waters.

As of Wednesday, the common belief was that startups could build their infrastructure entirely on AWS: set the servers up and forget them. Features like availability zones, for an extra fee, were supposed to eliminate any single point of failure. Some startups took advantage of them and others didn’t.

Given that AWS’ Northern Virginia data center has been out of whack for more than 24 hours, it’s clear you need to procure more than one cloud. You need a backup for your cloud provider’s backup.

Also: Amazon’s N. Virginia EC2 cluster down, ‘networking event’ triggered problems

The good news for AWS customers is that the service appears to be coming online again. Amazon said in its most recent update:

2:41 AM PDT We continue to make progress in restoring volumes but don’t yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.

6:18 AM PDT We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we’ll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.

The fallout from the AWS outage will be far-reaching. Here’s a look at some of the key issues:

The blame game only goes so far. First, it’s clear that Amazon’s communication could be better. But data centers do fail, and it’s up to customers to make sure their supply chain (in the Web’s case, Amazon) is backed up. Amazon failed. So did some of its customers, by not planning better. Startups will have to plan better, because their own customers aren’t going to give them a completely free pass.

Amazon will get better. Calling this debacle a learning experience would be an understatement. Communication will improve. And availability zones are likely to become availability regions.

Service level agreements (SLAs) will matter more. Gartner’s Lydia Leong has a great overview of what went wrong. Here’s what she said about SLAs and Amazon:

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store (EBS) and Relational Database Service (RDS) which weren’t, and neither of those services have SLAs.
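Leong’s downtime figure is straightforward arithmetic on the SLA percentage. The sketch below reproduces it, assuming a 365-day year and a simple "allowed unavailability times hours in the period" reading of the SLA; the function name is illustrative and the calculation ignores how Amazon actually measures or credits downtime. At 99.95%, it works out to 4.38 hours a year, in line with her "about 4.5 hours."

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime a provider can accrue in the period without breaching the SLA."""
    unavailable_fraction = 1 - sla_percent / 100
    return unavailable_fraction * period_hours


if __name__ == "__main__":
    # Compare a few common SLA levels, including the 99.95% quoted above.
    for sla in (99.9, 99.95, 99.99):
        print(f"{sla}% uptime allows {allowed_downtime_hours(sla):.2f} hours of downtime per year")
```

The takeaway is that even a strong-sounding SLA leaves room for hours of downtime a year, and only for the services the SLA actually covers.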

Architecture will garner more attention. Bob Warfield noted:

Most SaaS companies have to get huge before they can afford multiple physical data centers if they own the data centers. But if you’re using a Cloud that offers multiple physical locations, you have the ability to have the extra security of multiple physical data centers very cheaply. The trick is, you have to make use of it, but it’s just software. A service like Heroku could’ve decided to spread the applications it’s hosting evenly over the two regions or gone even further afield to offshore regions.

This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for a relatively few apartments, and the operational flexibility to make it easy via dashboard to automatically allocate their tenants to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that ultimately will make it easier to scale the software too, assuming its usage is decomposable to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.

This then, is a pretty basic question to ask of your infrastructure provider: “How easy do you make it for me to access multiple physical data centers with attendant failover and backups?”
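Warfield’s "apartments" idea can be made concrete with a small sketch. The code below is a hypothetical illustration, not anything Heroku or Amazon ships: the `TenantDirectory` class, its method names and the region labels are all assumptions. It only tracks which tenant lives in which apartment; actually moving data between data centers is the hard part and is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TenantDirectory:
    """Tracks which 'apartment' (a hypothetical region or data center) hosts each tenant."""
    apartments: list[str]
    placement: dict[str, str] = field(default_factory=dict)

    def assign(self, tenant: str) -> str:
        """Place a tenant in the least-loaded apartment (ties go to the first listed)."""
        load = {name: 0 for name in self.apartments}
        for apartment in self.placement.values():
            load[apartment] += 1
        target = min(self.apartments, key=lambda name: load[name])
        self.placement[tenant] = target
        return target

    def migrate(self, tenant: str, target: str) -> None:
        """Move one tenant to a specific apartment, as an operator might from a dashboard."""
        if target not in self.apartments:
            raise ValueError(f"unknown apartment: {target}")
        self.placement[tenant] = target

    def evacuate(self, failed: str) -> list[str]:
        """Reassign every tenant in a failed apartment across the surviving ones."""
        survivors = [name for name in self.apartments if name != failed]
        if not survivors:
            raise RuntimeError("no surviving apartment to fail over to")
        moved = [t for t, a in self.placement.items() if a == failed]
        for tenant in moved:
            del self.placement[tenant]  # clear stale placements before rebalancing
        self.apartments = survivors
        for tenant in moved:
            self.assign(tenant)
        return moved


if __name__ == "__main__":
    directory = TenantDirectory(["us-east", "us-west"])  # illustrative region names
    for tenant in ("alpha", "beta", "gamma"):
        directory.assign(tenant)
    directory.evacuate("us-east")
    print(directory.placement)
```

The point of the sketch is the shape of the design: placement is data, not architecture, so spreading tenants out or evacuating a failed location becomes a software operation rather than a rebuild.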

Welcome to the new world of cloud computing. You’ll need multiple cloud providers. Resiliency still matters whether the infrastructure is real or virtual. You wouldn’t have one supplier for steel, would you? Going forward, you’ll use AWS, Rackspace and maybe a few others.


Larry Dignan is Editor in Chief of ZDNet and SmartPlanet as well as Editorial Director of ZDNet's sister site TechRepublic.

Disclosure

Larry Dignan

Larry Dignan has nothing to disclose. He doesn’t hold investments in the technology companies he covers.

Biography

Larry Dignan

Larry Dignan is Editor in Chief of ZDNet and SmartPlanet as well as Editorial Director of ZDNet's sister site TechRepublic. He was most recently Executive Editor of News and Blogs at ZDNet. Prior to that he was executive news editor at eWeek and news editor at Baseline. He also served as the East Coast news editor and finance editor at CNET News.com. Larry has covered the technology and financial services industry since 1995, publishing articles in WallStreetWeek.com, Inter@ctive Week, The New York Times, and Financial Planning magazine. He's a graduate of the Columbia School of Journalism and the University of Delaware.


Talkback Most Recent of 36 Talkback(s)

  • RE: Amazon's Web Services outage: End of cloud innocence?
    Frustrating as outages are, everything breaks at some point, and I'm sure Amazon and other cloud providers are learning some good lessons from this mess. One of them should be how to handle crisis PR, which thus far, isn't going well. http://crawfordpr.com/2011/04/22/crisis-pr-for-amazon-the-cloud-is-falling-the-cloud-is-falling/
    kschackai
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    That outage was my bad. Just testing a few backdoors. Nothing to worry about people, the Cylons are here to protect you!
    Cylon Centurion
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @Cylon Centurion 0005
    Just as I've mentioned hundreds of times before... Cloud computing is a nice addition but not a solution. Anyone ignorant enough to think that the web is safe is sorely mistaken. Attackers have been proving my point for the past couple of months by stealing certificates, breaking into all sorts of email systems and so on. Also, anyone clever enough to hack into a cloud server would also be smart enough to know how to make a virus spread and hit other servers as well as redundant backups. I sure hope the providers will be on top of these things. Also, good luck streaming anything you've paid for when you're offline. Sorry, but I like to keep what I buy in hand, not in the "clouds" that they are trying to make so appealing. Also, how would gamers ever expect to play real games OTA? They have gaming computers for a reason. Hardware is sold to power the OS for a reason. If we scale back and dumb down, then what is left? I laugh to think of all the people turning into sheep, while those who know computers, systems and how to manipulate things will be the wolves among the sheep. China sits on 10-year-old OSes, some moving to newer ones, and is overtaking parts of the web... I highly doubt they are dumb enough to push everything to the cloud... What army would you have left? Too many things to think about, but cloud is a nice addition or option, never a solution.
    audidiablo
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @audidiablo You have said more than I can imagine. This is only the beginning of what can and will go wrong. This is just one of the possible problems that the cloud faces. What if a cloud provider goes out of business? What if a cloud provider decides to raise its rates or reduce its level of service? What happens if, due to some circumstance, the cloud provider loses all the data it has saved (without having a reliable backup)? What's to keep a cloud provider (or someone else) from looking at your data? Is there insurance to cover this yet?

    Businesses will try to cut corners to save money. Maybe, Amazon will not add redundant servers because it would be too expensive.

    There was once a time when people would place their money into several banks, in case one or two closed down. (Those were the people who survived the depression.)

    All I can say is wake up, cloud users. These things are going to happen, even at a critical time in your business. You must analyze the risk (the worst possible event and how often it is likely to occur) and weigh it against the benefits. Is it worth it? I will not bet my life (or life savings) on the cloud.
    jimlonero
    22nd Apr
  • was it some sabotage?
    @Cylon Centurion 0005
    from a company that starts with M has 'soft' in its name and resides in Redmond, WA?
    Linux Geek
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @Linux Geek
    More likely a company that begins with G.
    llamasaki
    23rd Apr
  • The frakking toasters did it?
    @Cylon Centurion 0005: I knew it! I'm going to need a few copies of Number Eight for close examination.
    bob@...
    22nd Apr
  • Amazon is A Technology Company?
    I thought they sold ... things.
    PMC-CON
    22nd Apr
  • Wake up call for the providers.
    100% uptime is a reality and what makes cloud computing a viable solution. However, the solution has to be designed around a fully redundant architecture. (Which I'm sure it wasn't and that's why there was an outage). Shame on Amazon for reverting to their SLAs. What a cop out. They sold companies on a solution that they didn't implement properly. Now the legal ramifications of lost revenue are rearing up and they're scrambling. The only thing that happened here is that they should have deployed a more robust disaster tolerant solution and they didn't. They got caught with their hands in the cookie jar. They designed a network and solution that skimped on the redundancies. It will be interesting to see how the legal liability of data reliability will be handled from this point on. The whole purpose of the WORLD moving to a cloud computing environment is to offset the responsibilities of the individual from having to worry about their data. This offers a great opportunity for a GLOBAL centralization of resources by the largest ENTERPRISE players. However, if they want to play in this space then they should embrace the costs that are associated with accepting this responsibility. The age of backup is nearing an end as this is merely a restore solution and doesn't protect users from downtime. However, real-time redundant data computing, storage, and connectivity is available, but much more costly. If anyone from Amazon is reading this, please pass it on that you should have redundant data centers, with redundant networks within the cloud. This way the only outage a user of your services should ever have to worry about is if their internet connection goes down.
    7EPlusInc
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @7EPlusInc ... Amen brother, I agree with you.
    I can't believe a place like Amazon doesn't run redundant servers and have backups of their own backups. It's no solution to expect the customer to double-cloud, cloud being a misnomer at best, when Amazon should be doing it already, along with colocations and gobs of verification data.
    I'm not happy to see anyone lose money, but I am glad to see we're finally starting to get reality checks on these (mis-named) clouds. It's a dumb concept and Amazon could have done a lot better. But they didn't. Neither will others, even in the face of more events like this one.
    tom@...
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @tom@... Odds are they do have redundant servers. You can build as much fault tolerance into a system as you want and there will still be scenarios where it can go down. Google has gone down in the past, for example, and given they're based on the Beowulf concept, that's quite the statement. Ever heard the phrase "don't put all your eggs in one basket"?
    ITSamurai
    23rd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @7EPlusInc
    Good theory, welcome to reality!
    Eddy-ICUR12
    22nd Apr
  • RE: Amazon's Web Services outage: End of cloud innocence?
    @7EPlusInc 100% uptime will never be a reality. Quick example - no matter how many backups you have, no matter how robust the cloud is - your connection to the internet can /always/ fail. Even if you had fiber, a T3 backup, and your grandmother's 56k, if your ISP's core router goes down you lose. With local services the outside can't reach you, but you can continue to work inside. Outsource everything to the cloud and now not only is revenue lost from the outside but you're losing productivity from your entire workforce.
    ITSamurai
    23rd Apr
