Managing risk in the wake of Amazon's cloud outage
Summary: With the recent Amazon Cloud outage, many are suggesting that Cloud Computing has been called into question and that it will now be more difficult to sell. I disagree.

You hear that Mr. Anderson? That is the sound of inevitability...Risk! With the recent Amazon Cloud outage, many are suggesting that Cloud Computing has been called into question and that it will now be more difficult to sell. I disagree.
Whether you’re considering the implementation of a cloud strategy, taking on a merger with the hopes of increasing revenue, or thinking about implementing a new, emerging technology, there is risk involved. How you manage that risk is critical to business continuity.
The first step in managing risk is to understand the types of risk that organizations face. According to Robert S. Kaplan and Anette Mikes in their recent article Managing Risks: A New Framework, risk falls into one of three categories: preventable, strategic, or external.
Preventable risk – this category of risk is internal: it arises from within the organization, is preventable, and ought to be avoided because it adds no value to the organization. These risks arise from employees’ actions that are inappropriate, unethical or downright illegal. No doubt you have seen or taken part in training intended to educate employees and prevent this sort of behavior.
The New York Times, citing an internal report at the bank, reported that JPMorgan's trading loss may reach $9 billion. JPMorgan's initial estimate was $2 billion when it disclosed the trade in May, although CEO Jamie Dimon said then that the loss could grow. Given the enormity of the recent trading losses, it appears that simple online training is not enough to combat greed. This was definitely preventable risk!
Strategic risk – this category of risk is one that the organization accepts as part of a new plan in an effort to generate higher returns. This risk is not inherently bad; it is accepted as part of a strategic plan to capture potential gains.
An example of this is the Microsoft purchase of aQuantive. On Friday, CNNMoney reported that Microsoft spent $6.3 billion in cash buying online display advertising company aQuantive in 2007. Microsoft bought the company in an effort to beef up Bing, but never made money on its online services decision. On Monday, the company wrote off almost the entire value of the acquisition, taking a $6.2 billion write-down.
External risk – some risks arise from events that occur outside an organization and are beyond its control. These include natural disasters and political or economic events, and they ought to be identified and planned for in an effort to mitigate their impact. The Amazon Cloud outage is a good example. Sites like Netflix, Instagram and Pinterest were offline for hours.
The Boston Globe reported “The weekend’s disruption happened after a lightning storm caused the power to fail at the Amazon Web Services center in Northern Virginia containing thousands of computer servers. For reasons Amazon was still unsure of on Sunday, the data center’s backup generator also failed.”
The key thing to note here is that the backup generator failed. The purpose of the backup generator is to allow systems to come down gracefully; it has nothing to do with site redundancy. So, we can assume that for sites like Netflix, Instagram and Pinterest, not having a redundant site, such as banks have, was a risk they were willing to accept.
That is to say, they were not willing to accept an outage should a server go down (the value of Cloud Computing), but were willing to accept an outage should the site go down.
So, the cloud outage was bad, but it ought not to reflect badly on Cloud Computing. It ought to reflect poorly on these companies’ level of planning with regard to business continuity, and it may cause many of Amazon’s customers to take a second look at their risk profile.
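To put rough numbers on the difference between accepting a single-site outage and paying for a redundant site, here is a minimal sketch. It assumes each site fails independently and uses a made-up 99.9% availability figure per site; neither assumption comes from Amazon or from the sites mentioned, and real outages are often correlated across sites, so treat the result as an optimistic upper bound. The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` sites is up.

    Assumes sites fail independently, which is an idealization.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - site_availability) ** sites


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    assumed_site_availability = 0.999  # illustrative assumption, not a measured figure
    for n in (1, 2):
        a = combined_availability(assumed_site_availability, n)
        print(f"{n} site(s): availability {a:.6f}, "
              f"about {expected_downtime_hours(a):.2f} hours down per year")
```

Under these assumptions, one site implies about 8.76 hours of downtime a year, while two independent sites imply well under a minute. Whether that gap is worth the cost of a second site is exactly the risk decision described above.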
Are there other examples that you can think of that better exemplify the risks highlighted above? Talk Back and Let Me Know.
Talkback
Another uninformed opinion...
If the writer had taken the trouble to read the Netflix blog posting in which they explained their site's response to the failure, he would have seen that they did indeed have "a redundant site" (several, actually), and that the failure was due to a bug in their (complex) cluster management code.
Derrick Harris and I discussed this over at GigaOM, in a piece titled "How to deal with cloud failure: Live, learn, fix, repeat." Mr. Menegaz could learn something from it.
Netflix
GigaOM
Here are the facts regarding the failure from Amazon: http://aws.amazon.com/message/67457/
Cloud Computing Value
Mosaic Technology
http://www.mosaictec.com
Who to blame
An interesting statement. The cloud failed, but don’t blame it.
In part I agree. Don’t blame the cloud, blame yourself for using the cloud in the first place.
Cloud should have an MVP
Not the cloud
Amazon failed to notice and/or act on the bugs present in their system.
Cloud Maturity
1. Facilities: A storm should NOT take down a data center. A backup diesel generator should NOT take down a data center. Why didn't the second backup generator kick in? Why didn't they switch to diesel generators 30-60 minutes before the storm hit? I have run several large data centers, including some in the Dallas, TX area, where storms are a daily threat in Spring and Summer. My Operations Center had a TV with the Weather Channel (and one with CNN). When storms were predicted, we would activate our storm team (a combination of facilities, network, and operations staff & managers) to get ready by ensuring everything was in tip-top shape and ready to go. 30-60 minutes prior to any storm cell coming at us, we would switch to diesel, ride out the storm, then allow things to settle down for 1-2 hours before switching back to utility. I NEVER suffered user impact from a Storm. I know that this is a common practice in enterprise data centers, especially in Texas, Oklahoma, Kansas, etc. Why not Amazon, Google, Microsoft, and the other cloud providers?
2. The technology and automation are available today to perform automatic failover to another site, including out-of-region. Amazon and the other providers need to put that on the table as an optional part of their services, not as something a client has to build themselves. This is the second time an outage has affected clients who were spread across multiple Amazon availability zones. Once is a mistake, twice is stupidity.
3. Communications: the Amazon status blog referred to this incident as "performance problems", never mentioning an "outage". The post mortem is clearer. Last year, I reviewed the status blog and I was amazed that there were 2, 3 and even 5-hour gaps in status updates. In my enterprise data center world, I usually had an SLA requiring status updates to the users every 15 minutes. I had two conference calls going, one for the crisis team and one for the communication team. We had processes in place to keep the comm team completely updated blow by blow. That comm team then spread the message amongst the user community.
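The 15-minute status-update rule in point 3 is easy to check mechanically. The following is a minimal sketch, assuming you have the timestamps of the posted updates; the function name `find_update_gaps` and the sample timestamps are made up for illustration and are not taken from any real status page.

```python
from datetime import datetime, timedelta


def find_update_gaps(update_times, max_gap=timedelta(minutes=15)):
    """Return (previous, next, gap) tuples where consecutive updates are too far apart."""
    ordered = sorted(update_times)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt - prev
        if gap > max_gap:
            gaps.append((prev, nxt, gap))
    return gaps


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    updates = [
        datetime(2012, 1, 1, 20, 0),
        datetime(2012, 1, 1, 20, 10),
        datetime(2012, 1, 1, 22, 40),
        datetime(2012, 1, 1, 22, 50),
    ]
    for prev, nxt, gap in find_update_gaps(updates):
        print(f"No update between {prev:%H:%M} and {nxt:%H:%M} ({gap})")
```

Run against a real incident log, a check like this would flag the multi-hour silences described above.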
In the end, I believe we are seeing the need for increased maturity on the part of the cloud providers. They definitely need to look at their service offerings from a large enterprise perspective, pay more attention to the basics, and push harder on expanding their offerings for enterprise class clients. In the interim, those enterprise class clients will probably lean in the direction of internal private clouds, with only experimentation and very gradual movement towards hybrid clouds. I also believe that outages like these open a market for cloud aggregation, where you use two vendors to give you a high availability capability. Meanwhile, consumers and SMBs are going to continue to suffer from such incidents until the vendors step up.
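The two-vendor idea boils down to trying sites in priority order and using the first one that passes a health check. Here is a minimal sketch of that selection logic only; `Site`, `is_healthy` and `choose_active_site` are hypothetical names, not any provider's API, and a real setup would also need data replication and traffic redirection (for example via DNS), which are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Site:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe supplied by the caller


def choose_active_site(sites: Sequence[Site]) -> Optional[Site]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        try:
            if site.is_healthy():
                return site
        except Exception:
            # A probe that raises is treated the same as an unhealthy site.
            continue
    return None


if __name__ == "__main__":
    # Stand-in probes; real ones would query each vendor's deployment.
    sites = [
        Site("vendor-a-primary", lambda: False),
        Site("vendor-b-standby", lambda: True),
    ]
    active = choose_active_site(sites)
    print(active.name if active else "no healthy site")
```

The design choice here is a strict priority order: the standby only takes traffic when the primary's probe fails, which keeps the logic simple at the cost of paying for idle capacity.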
Very Helpful Comments
Re: Cloud Maturity
This is very well explained. I agree with you in saying that Amazon is responsible for this outage and not Cloud technology.
Thanks for sharing your experience with us.
Cheers,
Murtaza
Lightning Strikes Twice
Thanks for the comment. I agree. Amazon could have handled the outage better. Let's see if Salesforce.com does a better job of managing their recent outage.
Gery Menegaz
the cloud
IT Should Treat Cloud Same as On-Premise - Same Risks, but Amazon Magnified
The problem is Amazon and other IaaS providers have to be WAY better at RAS (yes, that old IBM term still applies, reliability, availability, scalability and/or security - your choice) than your average IT shop, because EC2 is operating for MANY companies, not just one.
When IT makes a decision to move to public Cloud they must take that into consideration. Managing your CSP purely at the SLA layer is a huge mistake. IT has to get into the CSP's business in detail, has to be impressed and convinced that the CSP's HA and backup implementations are superior to what IT already produces.
It is the same issue as with outsourcing: You can't throw a complex process over the wall, like call center or your computing infrastructure, just to save $ when you are in fact putting your business at risk.
Additional transparency would help
Cloud providers may benefit from publishing stats on actual impacted technologies (Cloud vs IaaS %) when their data centers implode. I personally know a handful of Cloud clients who had minimal or no impact during the AWS outages. Without this information, all the IaaS types get lumped into the "big" number of Cloud outage clients. The inflated Cloud numbers get reported which is killing the brand.
Why shut down generators?
Let's look at that. If the generators will produce enough electricity to hold the system for 5 minutes while we shut things down, why would we then turn them off? Why not just keep them running another 5 minutes, or 2 hours, or two days, until the power comes back up?
The UPS at my house is there to keep the systems running long enough to shut them down nice, but the generators at my data center are there to keep it running until utility power has been restored. Utility power fails, batteries carry the load long enough to spin up the generators, and the generators hold the load until power is manually transferred back to utility power. Servers still have nice, conditioned power and they just keep spinning. This has been the case at the corporate data centers as well as CoLo facilities I've worked with. There are typically very large fuel tanks buried near the generators with fuel for a number of days that, combined with agreements with fuel providers, see to it that in an emergency these generators receive more fuel right after the hospitals and airports get their delivery.
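The sequence described above (utility fails, batteries bridge, generators carry the load, an operator transfers back) can be written as a small state machine. This is a minimal sketch under simplifying assumptions: it ignores battery exhaustion and fuel levels, and the names `PowerSource` and `next_power_source` are hypothetical.

```python
from enum import Enum


class PowerSource(Enum):
    UTILITY = "utility"
    BATTERY = "battery"
    GENERATOR = "generator"


def next_power_source(current, utility_ok, generator_ready, operator_transfer_back=False):
    """Return the power source for the next step of the sequence.

    Simplified: battery runtime and generator fuel are not modeled.
    """
    if current is PowerSource.UTILITY:
        # Utility failure drops the load onto the batteries.
        return PowerSource.UTILITY if utility_ok else PowerSource.BATTERY
    if current is PowerSource.BATTERY:
        # Batteries only bridge the gap until the generators have spun up.
        return PowerSource.GENERATOR if generator_ready else PowerSource.BATTERY
    # On generator: stay there until utility is back AND an operator
    # manually transfers the load.
    if utility_ok and operator_transfer_back:
        return PowerSource.UTILITY
    return PowerSource.GENERATOR


if __name__ == "__main__":
    state = PowerSource.UTILITY
    state = next_power_source(state, utility_ok=False, generator_ready=False)
    print(state)  # load is now on battery
    state = next_power_source(state, utility_ok=False, generator_ready=True)
    print(state)  # generators have picked up the load
```

Note that the return of utility power alone does not move the load off the generators; the manual transfer is a separate, deliberate step, as in the description above.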
If the only issue is power, my servers should keep running on generator for weeks at a time, if necessary. If they're suddenly underwater, that's a different issue entirely. Multiple, geographically disparate locations absolutely make sense, but I'm not as quick as you are to count the first one out.