AWS outage reveals backup cheapskates

Summary: Was Amazon to blame for the Instagram, Netflix, Pinterest and Pocket outages? According to analysts, the affected companies were simply being too cheap.

Amazon Web Services' (AWS) hiccup over the weekend saw a number of web services suffer outages, but the issue had less to do with Amazon and more to do with individual companies not using cloud services to their full potential, according to analysts.

Intelligent Business Research Services advisor Jorn Bettin laid the blame for the outage on providers failing to utilise cloud services as they should.

He said that the real issue wasn't that a huge cloud-services giant like Amazon had stumbled over a storm, but that the affected customers (Instagram, Pinterest, Pocket and Netflix, all hit by Amazon's weekend outage) hadn't used the cloud's ability to create geographically redundant links.

"They could operate at a higher level of redundancy, so that these sort of outages would only have a minimal impact on them. It's a matter of cost," Bettin said.

Bettin noted, however, that the cost of creating redundant connections, so that a natural disaster in one part of the world won't affect services in another, could double a customer's cloud bill. And despite calls for Amazon to pick up that bill and shield customers from the risk, he said that this isn't the way cloud services should be treated.

"It doesn't really make sense on a global scale that everyone relies on Amazon as, let's say, the ultimate risk manager for everything. That would be a dangerous proposition."

Instead, he said that Amazon's current hands-off model, which gives customers the option to choose whether they want to pay to mitigate the risk, is more logical.

"Amazon's doing the right thing here of giving the customer the ability to do these switches from one [geographic] system to another."

What this means, though, is that several companies have looked at their bottom line and decided that maintaining 100 per cent uptime isn't worth the cost of mitigating the risk. Bettin said that these organisations tend to be small and, in order to turn any sort of profit, have to be cutthroat with their costs. This is something that the cloud has enabled, but it also puts them at significant risk.

"They're effectively putting all their eggs in one basket. This whole topic is about managing levels of redundancy."

Gartner research vice president for IT services Jim Longwood sees the issue as problematic for Amazon, however, saying that after the two most recent incidents, the psychological "third-strike" rule is in effect, and Amazon will be doing its utmost to prevent a repeat.

"You can bet your bottom dollar that Amazon will respond very strongly to this, and take remedial action, because if it does happen too often, it will affect their brand and their potential market penetration."

It will definitely be an opportunity for competitors, according to Longwood.

"All the competitors are making mileage from [this incident] already," he said, adding that this competition would then drive customer demand for reliability and availability.

"This is going to drive competition for improved services, particularly [for] reliability and availability. In two years' time, the tolerance will be much less, because they should be much more reliable. We're going to see more of [these incidents], but hopefully less frequently. Certainly, you'll see it amongst the smaller, newer service providers coming into this environment."

About Michael Lee

A Sydney, Australia-based journalist, Michael Lee covers a gamut of news in the technology space, including information security, state government initiatives, and local startups.

Talkback

12 comments
  • Cheapskates and Blame Spin

    The IaaS "Cloud", regardless of vendor, is a set of building blocks used by the customer's system designer and integrator. Poor assembly means poor operation. Obtaining redundancy to another part of the globe isn't near double the cost if you follow the any providers recommendations. Expect competitors to lump on the blame, right behind the designers. I mean, you don't expect the designers and managers to admit "It worked just like they told us and we didn't build redundancy." It's a hard job market out there.
    ichype2much
  • Cloud services are only as good as the users

    My company depends on AWS and uses the VA availability zone, but we also make use of the redundant services and didn't have any problems. I am surprised that the companies that had issues didn't use the service correctly and invest to maintain their uptime. The outages would never have occurred if their systems had been set up with disaster recovery in mind. As I said, we were NOT affected, and we make use of AWS and invest heavily to maintain high availability.
    ljrain
  • Not everyone impacted was careless...

    I am glad that you weren't impacted by the recent outage.

    Unfortunately, for some, Elastic Load Balancers for multi-zone deployments completely failed with the outage (and were inaccessible to make changes). We're all still awaiting a detailed response from Amazon, but it's my understanding that their backup power systems in northern Virginia failed almost completely, at the same time as the primary systems, and that ELB and Beanstalk were completely down for many. The management console and API functions were also inaccessible for these resources, as they were located in us-east-1d, I assume.

    I'm anxiously awaiting a response from Amazon... companies like Netflix, too, use multi-region deployments, and were impacted.

    Yeah, I'm sure most people who were impacted were careless, but you should know that there are some who were impacted who were not.
    aptitudedude
  • Couldn't agree more

    Hear hear, well said. Totally agree and have blogged about this attitude at http://blog.next-genit.co.uk/2012/04/building-for-amazon.html
    imcdnzl
  • Blaming paying customers...

    Doesn't make sense. Period. It doesn't matter whose "fault" it is.

    It's as simple as that.

    Netflix, Pinterest, etc. are Amazon's customers in this case, and they should be treated as such. They only use AWS because it was considered the cheapest option, and it was only considered the cheapest option because Amazon had led users to believe downtime would be minimal even without expensive backup options.

    So saying "oops, we lied; it turns out our service has a high-profile outage at least once a year, so you'll need to pay 2-3 times as much to prevent it" is just plain silly. If that's the case, then the services might as well go with a cheap-o private datacenter. Not because it won't go down, but because it will generate less negative publicity when it does.

    Amazon needs to understand this and respond accordingly. Either reduce the price of redundancy, improve your data center uptime, or lose cloud business.

    Of course, this doesn't mean Netflix's own *customers* should let them off the hook. It's a matter of perspective.
    Popnfresh100@...
    • My sentiments exactly.

      Cloud providers promising a certain uptime and failing to deliver cannot possibly be the fault of the customers. Some, if not most, of the customers who did not pay extra for redundancy were counting on the promised uptime and figured, on a cost-benefit analysis, that it was economically advantageous to skip the redundancy. When Amazon failed to deliver on its promise of uptime, it became Amazon's failure. Period.

      The previous poster pointing out that some companies who had payed extra for redundancy were still affected only adds to my resolve that this was Amazon's fault, and not the fault of the managers who trusted them. Of course, ultimately it is always a person's individual responsibility to ensure their own systems remain available. Unfortunately, assuming full responsibility is unrealistic, as a person must place his trust in some other entity that provides the systems or services he chooses. Because those entities choose to sell their systems and services on the promise that he can trust them, if they fail to deliver on those promises, it is their liability.
      techadmin.cc@...
      • words, disaster recovery, etc.

        The word is paid; payed has a completely different meaning.
        Subject at hand: I don't know much about this cloud business, but wouldn't it make more sense for someone the size of Amazon to have several redundant data centers around the country, say one in each time zone, to avoid such a problem? I don't know what happened, but if one center went down, the outage wouldn't be seen by customers, except for maybe a difference in speed. Bottom line: if you are going to offer services, then be able to do so. Have disaster recovery systems set up and ready to go. There are several disaster recovery companies out there, and even a magazine devoted to the process, the Disaster Recovery Journal. (I do not work for them; I do subscribe to their newsletter.)
        I don't subscribe to any of the systems affected nor to Amazon's services of any sort. So I never saw a problem. That is the risk anyone takes when depending on someone else to handle your data, or when subscribing to internet services such as Netflix.
        dhays
  • *Yawn*

    How predictable that cloud supporters would not lay the blame with themselves for recommending such a stupid idea in the first place. Now their sage advice is that you have to buy two clouds from different providers. If only those lazy, good-for-nothing customers had bought two clouds, then there wouldn't have been a problem... stupid customers...

    In the next chapter, massive data theft from cloud provider leaves thousands of companies exposed.
    12312332123
  • No cheapskates; just complexity, and a couple of bugs.

    So here's the link to the root cause analysis from Netflix, and here's the Amazon report. No cheapskates; just complexity. See the piece in my blog, and the GigaOM piece to which it refers.

    Bettin owes a bunch of mea culpas to a bunch of engineers in Amazon, Netflix, and elsewhere.
    geoffarnold
  • that's what AWS is for

    Isn't that the point of cloud services - so THEY handle the redundancies for us?
    wkriski@...
  • Cloud Maturity (or lack thereof)

    No Cloud Provider can afford complete redundancy in multiple regions. The pricing would be 25-50% higher (at least). I see several issues coming out of this and other cloud outages:

    1. A storm should NOT result in user impact, yet Amazon, Microsoft, and Google have all suffered from facility-related issues. Either, after multi-billion dollar investments, someone decided not to go the final inch, or they have all chosen the wrong architects, or they have operations people who don't have experience in enterprise-class environments. I've run several large data centers, a few in Dallas, TX, where storms constantly threatened our operations. When the radar showed a storm approaching, we would switch to diesel AHEAD of the storm and stay on diesel until 30-60 minutes after the storm was gone. Have I had "facility" failures in my 25 years of running enterprise data centers? You bet, but our redundancy, protection systems, and operations tactics prevented any user impact.

    2. Along with the previous commenters, I believe Amazon should have a specific service offering for multi-region redundancy (beyond their availability zones). They should be able to do this at a reasonable additional charge (20-30%). The offering should include a guarantee of resource availability plus the automation to perform the failover; a rough sketch of that automation follows these points. Today, Amazon leaves it up to clients to architect, plan, and implement such a capability. If Amazon wants a major role in enterprise-class, mission-critical services, this is required.

    3. My perception is that when these providers have an outage, the level of communications from the provider is extremely poor. This must be addressed. They should be providing detailed updates every 15 minutes. I looked through the blog for last year's outage and there were 2, 3, and even 5-hour gaps in updates. Unacceptable! Maybe the cloud providers should consider putting an optional account/delivery management layer in place, plus mechanisms to keep their account teams constantly in the technical discussion loops during such an outage so they can effectively communicate with their client(s).
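
    A minimal sketch of the failover automation such an offering could wrap (Python, with hypothetical endpoint and region names; the promote step is a placeholder for whatever a provider would actually automate, such as repointing DNS and activating reserved standby capacity):

        import time
        import urllib.request
        import urllib.error

        PRIMARY_HEALTH_URL = "https://app.us-east-1.example.com/health"  # hypothetical
        STANDBY_REGION = "us-west-2"
        FAILURES_BEFORE_FAILOVER = 3   # tolerate transient blips
        CHECK_INTERVAL_SECONDS = 30

        def healthy(url, timeout=2):
            """True if the health check answers HTTP 200 within the timeout."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False

        def watchdog():
            """Fail over once the primary misses several consecutive checks."""
            consecutive_failures = 0
            while True:
                if healthy(PRIMARY_HEALTH_URL):
                    consecutive_failures = 0
                else:
                    consecutive_failures += 1
                    if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                        # Placeholder: repoint DNS / bring up standby capacity.
                        print("primary unhealthy; failing over to", STANDBY_REGION)
                        return
                time.sleep(CHECK_INTERVAL_SECONDS)

        if __name__ == "__main__":
            watchdog()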

    In summary, this all points to the lack of maturity of today's external cloud industry. It might be acceptable to consumers and SMB clients, but enterprise-class clients will demand more. Until they get it, internal private clouds will be the preferred enterprise solution.
    Ken Cameron
  • Amazon is not the only cloud provider

    At least a few promoters encouraged decision-makers to believe that “if it’s in the cloud, it’s automatically safe”, in the words of one acquaintance. That’s cheap, it’s wrong, and it’s time to move past it.

    On the other hand, it’s no wiser to jump all the way to the other side and conclude, “If it’s critical to your business, do it yourself.” The local McDonald’s doesn’t disconnect from “the grid” and run its own generators; Starbucks doesn’t insist on owning its own coffee plantations. Similarly, it is perfectly legitimate for an organization to recognize that information technology (IT) plays a crucial role, yet simultaneously outsource at least some of its IT elements.

    A sensible conclusion is more like “If it’s critical to your business, analyze for yourself how to achieve the reliability you need.” AWS isn’t designed for the ultimate in reliability, but competing Cloud providers and architectures are.

    http://www.real-user-monitoring.com/little-mystery-in-amazons-outage/
    thomasmckeown55