Amazon's Web Services outage: End of cloud innocence?

Cloud computing is learning the harsh reality of resiliency as Amazon Web Services' outage has crossed its second day. Meanwhile, startups and a host of other AWS customers are in uncharted waters.

On Wednesday, the common belief was that startups could build their infrastructure on AWS completely. Set the servers up and forget them. Things like availability zones---for an extra fee---would mean you'd get no single point of failure. Some startups took advantage of that and others didn't.

Given that AWS' North Virginia data center has been out of whack for more than 24 hours, it's clear you need to procure more than one cloud. You need a backup for your cloud provider's backup.

The good news for AWS customers is that the service appears to be coming online again. Amazon said in its most recent update:

2:41 AM PDT We continue to make progress in restoring volumes but don't yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.

6:18 AM PDT We're starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we'll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we'll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.

The AWS fallout is going to be far and wide. Here's a look at some of the key issues:

The blame game only goes so far. First, it's clear that Amazon's communication could be better. But data centers do fail and it's up to customers to make sure their supply chain---in the Web's case Amazon---is backed up. Amazon failed. So did some of its customers for not planning better. Startups will have to plan better. Customers aren't going to give startups a free pass completely.

Amazon will get better. To call this debacle a learning experience would be an understatement. Communication will improve. And availability zones are likely to become availability regions. Service level agreements (SLAs) will matter more. Gartner's Lydia Leong has a great overview of what went wrong. Here's what she said about SLAs and Amazon:

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store (EBS) and Relational Database Service (RDS) which weren’t, and neither of those services have SLAs.
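
The arithmetic behind that downtime figure is easy to check. The sketch below is only an illustration: it assumes a 365-day year and treats the SLA percentage as a simple fraction of total hours, which may not match how Amazon actually measures availability. The function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a provider can accrue in the period without breaching the SLA."""
    return period_hours * (1 - sla_percent / 100)


# Compare a few common availability targets.
for sla in (99.0, 99.9, 99.95, 99.99):
    print(f"{sla}% uptime allows {allowed_downtime_hours(sla):.2f} hours of downtime per year")
```

For 99.95% the budget works out to roughly 4.4 hours a year, in line with the "about 4.5 hours" Leong cites.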

Architecture will garner more attention. Bob Warfield noted:

Most SaaS companies have to get huge before they can afford multiple physical data centers if they own the data centers. But if you’re using a Cloud that offers multiple physical locations, you have the ability to have the extra security of multiple physical data centers very cheaply. The trick is, you have to make use of it, but it’s just software. A service like Heroku could’ve decided to spread the applications it’s hosting evenly over the two regions or gone even further afield to offshore regions.

This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for a relatively few apartments, and the operational flexibility to make it easy via dashboard to automatically allocate their tenants to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that ultimately will make it easier to scale the software too, assuming its usage is decomposable to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.

This then, is a pretty basic question to ask of your infrastructure provider: “How easy do you make it for me to access multiple physical data centers with attendant failover and backups?”
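
Warfield's "apartments" idea can be sketched in a few lines. This is a toy model under loud assumptions: the class, its method names, and the region labels are all hypothetical, and a real migration would have to move data and traffic, not just update a lookup table.

```python
class TenantPlacement:
    """Toy tenant-to-apartment map; an 'apartment' is one deployment in one location."""

    def __init__(self, apartments):
        self.apartments = list(apartments)
        self.location = {}  # tenant name -> apartment name

    def _load(self):
        # Count how many tenants currently live in each apartment.
        counts = {apartment: 0 for apartment in self.apartments}
        for apartment in self.location.values():
            counts[apartment] += 1
        return counts

    def assign(self, tenant):
        """Place a tenant in the least-loaded apartment (ties go to the first listed)."""
        counts = self._load()
        target = min(self.apartments, key=lambda a: counts[a])
        self.location[tenant] = target
        return target

    def evacuate(self, failed):
        """Move every tenant out of a failed apartment into the healthy ones."""
        healthy = [a for a in self.apartments if a != failed]
        if not healthy:
            raise RuntimeError("no healthy apartment left to migrate to")
        counts = self._load()
        moved = []
        for tenant, apartment in sorted(self.location.items()):
            if apartment == failed:
                target = min(healthy, key=lambda a: counts[a])
                self.location[tenant] = target
                counts[target] += 1
                moved.append(tenant)
        return moved


# Hypothetical region labels and tenant names, for illustration only.
placement = TenantPlacement(["us-east", "us-west"])
for name in ("a", "b", "c", "d"):
    placement.assign(name)
moved = placement.evacuate("us-east")
```

The point of the sketch is the one Warfield makes: if placement is just data, spreading tenants across locations and moving them during an outage is a software decision rather than a hardware purchase.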

Welcome to the new world of cloud computing. You'll need multiple cloud providers. Resiliency still matters whether the infrastructure is real or virtual. You wouldn't have one supplier for steel, would you? Going forward you'll use AWS, Rackspace and maybe a few others.

Talkback

36 comments
  • RE: Amazon's Web Services outage: End of cloud innocence?

    Frustrating as outages are, everything breaks at some point, and I'm sure Amazon and other cloud providers are learning some good lessons from this mess. One of them should be how to handle crisis PR, which, thus far, isn't going well. http://crawfordpr.com/2011/04/22/crisis-pr-for-amazon-the-cloud-is-falling-the-cloud-is-falling/
    kschackai
  • RE: Amazon's Web Services outage: End of cloud innocence?

    That outage was my bad. Just testing a few backdoors. Nothing to worry about people, the Cylons are here to protect you!
    The one and only, Cylon Centurion
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @Cylon Centurion 0005
      Just as I've mentioned hundreds of times before... Cloud computing is a nice addition but not a solution. Anyone ignorant enough to think that the web is safe is sorely mistaken. Attackers have been proving my point for the past couple of months, stealing certificates and breaking into all sorts of email systems and so on. Also, if someone is clever enough to hack into a cloud server, they would also be smart enough to know how to make a virus spread and hit other servers as well as redundant backups. I sure hope the providers will be on top of these things. Also, good luck streaming anything offline that you've paid for. Sorry, but I like to keep what I buy in hand, not in the "clouds" they are trying to make so appealing. Also, how would gamers ever expect to play real games OTA? They have gaming computers for a reason. Hardware is sold to power the OS for a reason. If we scale back and dumb down, then what is left? I laugh to think of all the people turning into sheep, while those who know computers and systems and how to manipulate things will be the wolves among the sheep. China sits on 10-year-old OSes, with some moving to newer ones, and is overtaking parts of the web... I highly doubt they are dumb enough to push everything to the cloud... What army would you have left? Too many things to think about, but the cloud is a nice addition or option, never a solution.
      audidiablo
      • RE: Amazon's Web Services outage: End of cloud innocence?

        @audidiablo You have said more than I can imagine. This is only the beginning of what can and will go wrong. This is just one of the possible problems that the cloud faces. What if a cloud provider company goes out of business? What if a cloud provider decides to up its rates for service or reduce its level of service? What happens if, due to some circumstance, the cloud provider loses all the data that it has saved (without having a reliable backup)? What's to keep a cloud provider (or someone else) from looking at your data? Is there insurance to cover this yet?

        Businesses will try to cut corners to save money. Maybe, Amazon will not add redundant servers because it would be too expensive.

        There was once a time when people would place their money into several banks, in case one or two closed down. (Those were the people who survived the depression.)

        All I can say is wake up, cloud users. These things are going to happen, even at a critical time in your business. You must analyze the risk (the worst possible event and how often it occurs) and weigh it against the benefits. Is it worth it? I will not bet my life (or life savings) on the cloud.
        jimlonero
    • was it some sabotage?

      @Cylon Centurion 0005
      from a company that starts with M has 'soft' in its name and resides in Redmond, WA?
      Linux Geek
      • RE: Amazon's Web Services outage: End of cloud innocence?

        @Linux Geek
        More likely a company that begins with G.
        llamasaki
    • The frakking toasters did it?

      @Cylon Centurion 0005: I knew it! I'm going to need a few copies of Number Eight for close examination.
      bob@...
  • Amazon is A Technology Company?

    I thought they sold ... things.
    PMC-CON
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @PMC-CON

      Are you serious?
      aep528
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @PMC-CON

      Amazon is Internet Walmart
      MLHACK
  • Wake up call for the providers.

    100% uptime is a reality and what makes cloud computing a viable solution. However, the solution has to be designed around a fully redundant architecture. (Which I'm sure it wasn't, and that's why there was an outage.) Shame on Amazon for falling back on their SLAs. What a cop out. They sold companies on a solution that they didn't implement properly. Now the legal ramifications of lost revenue are rearing up and they're scrambling. The only thing that happened here is that they should have deployed a more robust, disaster-tolerant solution and they didn't. They got caught with their hands in the cookie jar. They designed a network and solution that skimped on the redundancies. It will be interesting to see how the legal liability for data reliability will be handled from this point on. The whole purpose of the WORLD moving to a cloud computing environment is to relieve individuals of having to worry about their data. This offers a great opportunity for a GLOBAL centralization of resources by the largest ENTERPRISE players. However, if they want to play in this space then they should embrace the costs associated with accepting this responsibility. The age of backup is nearing an end, as backup is merely a restore solution and doesn't protect users from downtime. However, real-time redundant computing, storage, and connectivity is available, but much more costly. If anyone from Amazon is reading this, please pass it on that you should have redundant data centers, with redundant networks within the cloud. This way the only outage a user of your services should ever have to worry about is their own internet connection going down.
    7EPlusInc
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @7EPlusInc ... Amen brother, I agree with you.
      I can't believe a place like Amazon doesn't run redundant servers and have backups of their own backups. It's no solution to expect the customer to double-cloud, cloud being a misnomer at best, when Amazon should be doing it already, along with colocations and gobs of verification data.
      I'm not happy to see anyone lose money, but I am glad to see we're finally starting to get reality checks on these (mis-named) clouds. It's a dumb concept and Amazon could have done a lot better. But they didn't. Neither will others, even in the face of more events like this one.
      tom@...
      • RE: Amazon's Web Services outage: End of cloud innocence?

        @tom@... Odds are they do have redundant servers. You can build as much fault tolerance into a system as you want and there will still be scenarios where it can go down. Google has gone down in the past, for example, and given they're based off the Beowulf concept, that's quite the statement. Ever heard the phrase "don't put all your eggs in one basket"?
        ITSamurai
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @7EPlusInc
      Good theory, welcome to reality!
      Eddy-ICUR12
    • RE: Amazon's Web Services outage: End of cloud innocence?

      @7EPlusInc 100% uptime will never be a reality. Quick example: no matter how many backups you have, no matter how robust the cloud is, your connection to the internet can /always/ fail. Even if you had fiber, a T3 backup, and your grandmother's 56k, if your ISP's core router goes down you lose. With local services the outside can't reach you, but you can continue to work inside. Outsource everything to the cloud and now not only is revenue lost from the outside but you're losing productivity from your entire workforce.
      ITSamurai
    • Outage does not have to be in Amazon's neighborhood.

      @7EPlusInc The problem of the cloud is that an outage does not have to happen in the neighborhood of Amazon (or any Cloud provider).

      If some idiot cuts the main cable to the company's building, the company will lose all connection to its data. If Comcast decides to cut off any access to your company's internet provider for some reason, you are out of luck.

      The Amazon outage is just a little drop in the bucket of water that will eventually drown the "cloud customers".
      wackoae
  • RE: Amazon's Web Services outage: End of cloud innocence?

    Chili's?
    gioroc
  • RE: Amazon's Web Services outage: End of cloud innocence?

    Clouds... hold moisture and dump it or evaporate, and not on a schedule.

    Redundancy. There was a news story some years ago about a California company that had contracted two service providers because they wanted a good deal of failover built into their service operation. After a storm and flash flood somewhere out in the eastern side of the state, their service to the East went offline. Turns out that BOTH ISPs had their backbone fiber strung over the same bridge, which was washed out in the flood.

    Yeah, the cloud is a really great idea... NOT.
    notme403@...
  • Would you rather have your business or compensation?

    If you are indifferent then the compensation package is a good one - but is it risk free?
    If you are an entrepreneur prepared to litigate - if all you care about is the bottom line - OK, it's your 'business' model.
    But if it's your business you want, then you have to look after it - and that means backing it up, taking all necessary precautions, taking the residual risk yourself, and taking responsibility. Not passing all that off to an entrepreneur who is looking after his business.

    And the bigger that service provider is, the bigger his legal team when your push meets his shove.
    PassingWind
  • No going back...

    First of all, I didn't know that Foursquare ran on Amazon. Foursquare is one of the best social apps and one of the few that really takes advantage of location for good business reasons.

    Despite this outage (and cloud is so, so new) I think that Amazon Cloud is really proving the architecture.

    Will there be better, more robust, faster cloud technology?

    Sure, in the same way our power grid has been improving since the Edison days of DC generators and bare bulbs.
    jabailo1