Managing risk in the wake of Amazon's cloud outage

Summary: With the recent Amazon Cloud outage, many are suggesting that Cloud Computing itself is now in question and will be harder to sell. I disagree.



You hear that, Mr. Anderson? That is the sound of inevitability... Risk! With the recent Amazon Cloud outage, many are suggesting that Cloud Computing itself is now in question and will be harder to sell. I disagree.

Whether you’re considering the implementation of a cloud strategy, taking on a merger in the hope of increasing revenue, or thinking about adopting a new, emerging technology, there is risk involved. How you manage that risk is critical to business continuity.

The first step in managing risk is to understand the types of risk that organizations face. According to Robert S. Kaplan and Anette Mikes, in their recent article Managing Risks: A New Framework, risk falls into one of three categories: Preventable, Strategic, or External.
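The three categories can be captured in a simple risk register. The sketch below is purely illustrative, with hypothetical entries loosely mirroring the examples discussed in this article; it is not part of Kaplan and Mikes' framework.

```python
from dataclasses import dataclass
from enum import Enum

class RiskCategory(Enum):
    PREVENTABLE = "preventable"  # internal, no upside; avoid outright
    STRATEGIC = "strategic"      # accepted deliberately for potential gains
    EXTERNAL = "external"        # outside the organization's control; mitigate impact

@dataclass
class Risk:
    name: str
    category: RiskCategory
    mitigation: str

# Hypothetical register entries for illustration only
register = [
    Risk("rogue trading", RiskCategory.PREVENTABLE, "controls and training"),
    Risk("large acquisition", RiskCategory.STRATEGIC, "staged integration"),
    Risk("data-center power loss", RiskCategory.EXTERNAL, "redundant sites, failover drills"),
]

for r in register:
    print(f"{r.name}: {r.category.value} -> {r.mitigation}")
```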

Preventable risk – this category of risk is internal, arising from within the organization; these risks are preventable and ought to be avoided, as they add no value to the organization. They arise from employees’ actions that are inappropriate, unethical, or downright illegal. No doubt you have seen or taken part in training intended to educate against and prevent this sort of behavior.

The New York Times, citing an internal report at the bank, reported that the JPMorgan Trading Loss May Reach $9 Billion. JPMorgan's initial estimate was $2 billion when it disclosed the trade in May, although CEO Jamie Dimon said then that the loss could grow. Given the magnitude of the recent trading losses, it appears that simple online training is not enough to combat greed. This was definitely preventable risk!

Strategic risk – this category of risk is one that the organization accepts as part of a new plan in an effort to generate higher returns. It is not inherently bad; it is a risk accepted as part of a strategic plan to capture potential gains.

An example of this is Microsoft's purchase of aQuantive. On Friday, CNNMoney reported that Microsoft spent $6.3 billion in cash to buy the online display advertising company aQuantive in 2007. Microsoft bought the company in an effort to beef up Bing, but never made money on its online services decision. On Monday, the company wrote off almost the entire value of the acquisition, taking a $6.2 billion write-down.

External risk – some risks arise from events that occur outside of, and beyond the control of, an organization. These include events that are natural, political, or economic in nature, and they ought to be identified and planned for in an effort to mitigate their impact. The Amazon Cloud outage is a good example. Sites like Netflix, Instagram, and Pinterest were offline for hours.

The Boston Globe reported “The weekend’s disruption happened after a lightning storm caused the power to fail at the Amazon Web Services center in Northern Virginia containing thousands of computer servers. For reasons Amazon was still unsure of on Sunday, the data center’s backup generator also failed.”

The key thing to note here is that the backup generator failed. The purpose of the backup generator is to allow systems to come down gracefully; it has nothing to do with site redundancy. So we can assume that for sites like Netflix, Instagram, and Pinterest, not having a redundant site, as banks do, was a risk they were willing to accept.

That is to say, they were not willing to accept an outage should a server go down (the value of Cloud Computing), but they were willing to accept an outage should the site go down.
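The distinction can be made concrete: server-level redundancy keeps a site serving while any of its servers is up, while site-level redundancy fails a whole site over to another location. The sketch below illustrates the difference with hypothetical names; it is not Amazon's API or any specific site's architecture.

```python
import random

class Site:
    """A site (data center) holding several servers."""
    def __init__(self, name, servers):
        self.name = name
        self.servers = {s: True for s in servers}  # server name -> healthy?

    def healthy(self):
        return any(self.servers.values())

    def serve(self, request):
        up = [s for s, ok in self.servers.items() if ok]
        if not up:
            raise RuntimeError(f"site {self.name} is down")
        return f"{random.choice(up)} handled {request}"

def route(request, sites):
    # Server-level redundancy: a site keeps serving while any server is up.
    # Site-level redundancy: fall through to the next site when one is down.
    for site in sites:
        if site.healthy():
            return site.serve(request)
    raise RuntimeError("all sites down")

primary = Site("us-east", ["web-1", "web-2"])
backup = Site("us-west", ["web-3"])

primary.servers["web-1"] = False           # one server fails...
print(route("GET /", [primary, backup]))   # -> web-2 handled GET /

primary.servers["web-2"] = False           # ...then the whole site fails
print(route("GET /", [primary, backup]))   # -> web-3 handled GET /
```

A site relying only on the first kind of redundancy survives server failures but not a site failure, which is the trade-off described above.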

So the cloud outage was bad, but it ought not to reflect badly on Cloud Computing. It ought to reflect poorly on the affected companies' level of planning with regard to business continuity, and it may cause many of Amazon's customers to take a second look at their risk profile.

Are there other examples that you can think of that better exemplify the risks highlighted above? Talk Back and Let Me Know.

About Gery Menegaz

Gery Menegaz is a Chief Architect for IBM with more than 20 years supporting technologies in the financial, medical, pharmaceutical, insurance, legal, and education sectors. IBM is his full-time employer; he writes for ZDNet as a freelancer.

Log in or register to join the discussion
  • Another uninformed opinion...

    Menegaz writes, "So, we can assume that for sites like Netflix, Instagram and Pinterest that not having a redundant site, such as banks have, was a risk they were willing to accept." But to assume such a thing would be absurd.

    If the writer had taken the trouble to read the Netflix blog posting in which they explained their site's response to the failure, he would have seen that they did indeed have "a redundant site" (several, actually), and that the failure was due to a bug in their (complex) cluster management code.

    Derrick Harris and I discussed this over at GigaOM, in a piece titled "How to deal with cloud failure: Live, learn, fix, repeat." Mr. Menegaz could learn something from it.
    • Netflix

      Geoff - I did not see the Netflix blog posting. If that is the case, then my bad. Still, it does not take away from the point of the article with regard to risk management. Thanks for your comment.
      • GigaOM

        Geoff - You got me! The article quotes you on what you "suspect" happened. LOL!

        Here are the facts regarding the failure from Amazon:
  • Cloud Computing Value

    We at Mosaic Technology think it’s great to see this article stand up for cloud computing after the recent Amazon outage. It would be a shame to see the value of Cloud Computing negatively affected after such an event.

    Mosaic Technology
  • Who to blame

    “…having the cloud outage was bad, but it ought not to reflect badly on Cloud Computing…”

    An interesting statement. The cloud failed, but don't blame it.

    In part I agree. Don’t blame the cloud, blame yourself for using the cloud in the first place.
    • Cloud should have an MVP

      The marketing of Cloud should have a Minimum Viable Product (MVP) consistent with the most likely understanding by laypeople. So all Cloud providers should consider providing at least a dual-server failover/load-balanced service as the minimum product. And they should *not* sell anything below that basic level, except to, say, "advanced" clients (like developers testing out their code).
    • Not the cloud

      If you feel the need to blame something/someone then may I suggest Amazon rather than a 'blanket blame' of cloud?

      Amazon failed to notice and/or act on the bugs present in their system.
  • Cloud Maturity

    Gery, I am torn on this subject. First of all, I believe Cloud Computing (in ALL of its definitions) is definitely the way of the future, to some degree in the short term, and to a huge degree in the long term. However, I do believe that the Amazon outage, in addition to their outage last year, and to Google's and Microsoft's Ireland outages, and several others, is cause for the industry to step back and assess their current state and where they need to go. The following are a few points to ponder:

    1. Facilities: A storm should NOT take down a data center. A backup diesel generator should NOT take down a data center. Why didn't the second backup generator kick in? Why didn't they switch to diesel generators 30-60 minutes before the storm hit? I have run several large data centers, including some in the Dallas, TX area, where storms are a daily threat in Spring and Summer. My Operations Center had a TV with the Weather Channel (and one with CNN). When storms were predicted, we would activate our storm team (a combination of facilities, network, and operations staff & managers) to get ready by ensuring everything was in tip-top shape and ready to go. 30-60 minutes prior to any storm cell coming at us, we would switch to diesel, ride out the storm, then allow things to settle down for 1-2 hours before switching back to utility. I NEVER suffered user impact from a Storm. I know that this is a common practice in enterprise data centers, especially in Texas, Oklahoma, Kansas, etc. Why not Amazon, Google, Microsoft, and the other cloud providers?

    2. The technology and automation are available today to perform automatic failover to another site, including out-of-region. Amazon and the other providers need to put that option on the table as an optional part of their services, not as something a client has to build themselves. This is the second time an outage has affected clients who were spread across multiple Amazon availability zones. Once is a mistake, twice is stupidity.

    3. Communications: the Amazon status blog referred to this incident as "performance problems", never mentioning an "outage". The post mortem is clearer. Last year, I reviewed the status blog and I was amazed that there were 2, 3 and even 5-hour gaps in status updates. In my enterprise data center world, I usually had an SLA requiring status updates to the users every 15 minutes. I had two conference calls going, one for the crisis team and one for the communication team. We had processes in place to keep the comm team completely updated blow by blow. That comm team then spread the message amongst the user community.

    In the end, I believe we are seeing the need for increased maturity on the part of the cloud providers. They definitely need to look at their service offerings from a large enterprise perspective, pay more attention to the basics, and push harder on expanding their offerings for enterprise class clients. In the interim, those enterprise class clients will probably lean in the direction of internal private clouds, with only experimentation and very gradual movement towards hybrid clouds. I also believe that outages like these open a market for cloud aggregation, where you use two vendors to give you a high-availability capability. Meanwhile, consumers and SMBs are going to continue to suffer from such incidents until the vendors step up.
    Ken Cameron
    • Very Helpful Comments

      Ken, your analysis was very interesting and insightful, and I learned a lot by reading it. Thanks!
      Andy Parsons
    • Re: Cloud Maturity

      Hey Ken,

      This is very well explained. I agree with you in saying that Amazon is responsible for this outage and not Cloud technology.

      Thanks for sharing your experience with us.

    • Lightning Strikes Twice


      Thanks for the comment. I agree. Amazon could have handled the outage better. Let's see if it does a better job of managing its recent outage.

      Gery Menegaz
  • the cloud

    Do you all remember I, Robot and WarGames? This is what I see: the cloud will get set up, people will use it, and then one day the cloud quits for good, and everything you have put online is gone. My mom said all things mess up. They make it look good, but watch out: it will crash, just like the hard drive in your notebook or desktop computer, and they will not be able to get your stuff back, so you start from scratch. I know because I have had my hard drive quit, so I know to back it up. Just do not back it up online.
  • IT Should Treat Cloud Same as On-Premise - Same Risks, but Amazon Magnified

    How many IT backup scenarios fail? Many, whether on-premise or Cloud, whether roll-your-own infrastructure or appliance. It requires more science than art, a dedication to testing, and frankly some luck.

    The problem is that Amazon and other IaaS providers have to be WAY better at RAS (yes, that old IBM term still applies: reliability, availability, scalability and/or security - your choice) than your average IT shop, because EC2 is operating for MANY companies, not just one.

    When IT makes a decision to move to public Cloud they must take that into consideration. Managing your CSP purely at the SLA layer is a huge mistake. IT has to get into the CSP's business in detail, has to be impressed and convinced that the CSP's HA and backup implementations are superior to what IT already produces.

    It is the same issue as with outsourcing: you can't throw a complex process over the wall, like a call center or your computing infrastructure, just to save money when you are in fact putting your business at risk.
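    One practical way to go beyond managing a provider "at the SLA layer" is to probe the provider's endpoints yourself and keep your own availability record, rather than relying on the provider's status page. The sketch below is a hedged illustration; the window size and simulated probe results are assumptions, not a real monitoring product.

```python
from collections import deque

class AvailabilityProbe:
    """Keep an independent, rolling record of provider health
    instead of trusting the provider's own status reports."""

    def __init__(self, window=100):
        self.results = deque(maxlen=window)  # rolling window of probe outcomes

    def record(self, ok: bool):
        self.results.append(ok)

    def availability(self) -> float:
        if not self.results:
            return 1.0  # no data yet; assume healthy
        return sum(self.results) / len(self.results)

probe = AvailabilityProbe(window=10)
for ok in [True] * 8 + [False] * 2:  # simulated probe results
    probe.record(ok)

print(f"measured availability: {probe.availability():.0%}")  # -> 80%
```

    In practice the `record` calls would be driven by periodic HTTP or API health checks against the provider, and the measured figure compared against the contracted SLA.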
    • Additional transparency would help

      Agree with your assessment as well as @Ken Cameron above. In my opinion, the "Cloud" brand takes an outsized PR hit, which is the direct fault of the providers. If you are selling premium products like Cloud services, don't confuse the market with "Cloud-Lite" (aka IaaS/MaaS/AaaS) when clients balk at the cost and complexity of proper Cloud husbandry.

      Cloud providers may benefit from publishing stats on actual impacted technologies (Cloud vs IaaS %) when their data centers implode. I personally know a handful of Cloud clients who had minimal or no impact during the AWS outages. Without this information, all the IaaS types get lumped into the "big" number of Cloud outage clients. The inflated Cloud numbers get reported which is killing the brand.
      Tired Tech
  • Why shut down generators?

    Menegaz writes: "The purpose of the back-up generator is to allow systems to come down gracefully, it has nothing to do with site redundancy."

    Let's look at that. If the generators will produce enough electricity to hold the system for 5 minutes while we shut things down, why would we then turn them off? Why not just keep them running another 5 minutes, or 2 hours, or two days, until the power comes back up?

    The UPS at my house is there to keep the systems running long enough to shut them down nice, but the generators at my data center are there to keep it running until utility power has been restored. Utility power fails, batteries carry the load long enough to spin up the generators, and the generators hold the load until power is manually transferred back to utility power. Servers still have nice, conditioned power and they just keep spinning. This has been the case at the corporate data centers as well as CoLo facilities I've worked with. There are typically very large fuel tanks buried near the generators with fuel for a number of days that, combined with agreements with fuel providers, see to it that in an emergency these generators receive more fuel right after the hospitals and airports get their delivery.
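    The claim about buried fuel tanks is easy to sanity-check with back-of-the-envelope arithmetic. The burn rate and tank capacity below are illustrative assumptions, not figures from any specific facility.

```python
# Rough runtime estimate for a diesel generator farm.
# All numbers are illustrative assumptions.
generators = 4
gallons_per_hour_each = 70   # assumed full-load burn rate per generator
tank_gallons = 20_000        # assumed buried-tank capacity

total_burn = generators * gallons_per_hour_each  # gallons per hour overall
runtime_hours = tank_gallons / total_burn

print(f"runtime on stored fuel: {runtime_hours:.1f} hours "
      f"(~{runtime_hours / 24:.1f} days)")
```

    Under these assumed numbers the stored fuel alone covers roughly three days, which is why refueling agreements, not tank size, become the limiting factor in an extended outage.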

    If the only issue is power, my servers should keep running on generator for weeks at a time, if necessary. If they're suddenly underwater, that's a different issue entirely. Multiple, geographically disparate locations absolutely make sense, but I'm not as quick as you are to count the first one out.
  • A Great Upcoming Webinar on Cloud Computing

    I really liked the way you have explained the three types of risk. The Amazon Cloud outage is a good example of external risk, but that does not mean it's the end of the cloud. It's very important for companies to plan for their risks and act accordingly. I think you will be interested in an upcoming webinar hosted by Gartner and Infosys, "Enterprise IT: Staying Relevant in the Cloud Era," on Tuesday, July 24, 2012. To register, please follow the link below.
    Vivian Thomas