Business Discontinuity: Poorly communicated Cloud outages

Poorly communicated outages at Twitter, Facebook, Amazon, Microsoft, Google, Salesforce.com and other providers have raised significant doubts about the Cloud's resiliency.
Written by Gery Menegaz, Contributor

Business Continuity is supposed to safeguard organizations from revenue loss, opportunity loss, and adverse impacts to a firm's reputation, and to ensure compliance with contractual Service Level Agreements (SLAs).

While it is easier to establish the loss of a customer for companies like Amazon and Netflix, it is trickier for the likes of Twitter, Facebook, Instagram, or Pinterest.

The inability to post a picture to your timeline, or create a tweet, likely will not cause you to stop using Facebook, Instagram, or Twitter. However, the inability of Amazon's or Salesforce.com's customers to carry on with business as usual is a markedly more significant event.

Thus, we would expect the communication that follows an outage at a Facebook or a Twitter to be less detailed than that from an Amazon or a Salesforce.com. This is because companies like Amazon and Salesforce.com need to explain to their customers why they missed a Service Level Agreement.

The Cloud is made up of a multitude of complex systems, software and hardware, and there is no reason to think that these would fail more or less often than, say, a non-cloud datacenter.

Still, the poor communication that has come from some of these cloud providers has some asking: Is the Cloud Really Ready for Business?

Twitter Outage

Twitter was down for the second time in 5 weeks, brought down by a glitch from within Twitter's data centers: “the coincidental failure of two parallel systems at nearly the same time”.

The first incident, in June, was caused by a “cascading bug”, though I have not seen the details, or a root cause analysis, of the bug. Of the latest outage, Twitter's vice president of engineering Mazen Rawashdeh wrote, "...it was due to this infrastructural double-whammy. We are investing aggressively in our systems to avoid this situation in the future."

The net effect of the Twitter outage was a blow to reputation and reliability. Additionally, it forced Twitter users to go outside and interact with other people. Yikes!

GoogleTalk Outage

GoogleTalk, used for voice, text, and video chats, went down at 6:40 AM last week, according to GoogleTalk Service Details. The level of communication detail provided by Google could be summed up as follows: It's down, we're working on it...still working on it...it's back up. Apologies.

Microsoft Windows Azure Outage

“We are experiencing an availability issue in the West Europe sub-region, which impacts access to hosted services in this region. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers,” was posted on Microsoft's Service Dashboard.

The 2.5-hour Windows Azure outage was not well communicated. All we knew was that they were working on it and that it eventually got fixed.
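To put a 2.5-hour outage in SLA terms, here is a minimal sketch of the arithmetic. The 30-day measurement period and the 99.95% availability target are assumptions chosen for illustration, not figures taken from Microsoft's contract, and the function names are hypothetical.

```python
# Hypothetical helpers for illustration; not taken from any provider's SLA tooling.
HOURS_PER_DAY = 24


def availability(outage_hours: float, period_days: int = 30) -> float:
    """Return the fraction of the period during which the service was up."""
    period_hours = period_days * HOURS_PER_DAY
    return (period_hours - outage_hours) / period_hours


def allowed_downtime_hours(sla_target: float, period_days: int = 30) -> float:
    """Return the downtime budget, in hours, implied by an availability target."""
    return (1.0 - sla_target) * period_days * HOURS_PER_DAY


# Assumption: a 30-day month and a 99.95% target.
ASSUMED_TARGET = 0.9995
achieved = availability(2.5)
budget = allowed_downtime_hours(ASSUMED_TARGET)

print(f"Achieved availability: {achieved:.4%}")
print(f"Downtime budget: {budget * 60:.1f} minutes")
print("SLA met" if achieved >= ASSUMED_TARGET else "SLA missed")
```

Under those assumptions, a single 2.5-hour outage works out to roughly 99.65% availability for the month, against a budget of about 21.6 minutes of downtime. That gap is the kind of thing a customer would expect a provider to explain.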

Amazon Outage

Amazon EC2 has experienced a couple of outages from the same datacenter, “…one, in late June, was sparked by a violent thunderstorm which cut power, setting up a chain of events that put many Amazon customers offline for hours.” A lightning strike, or any other sort of weather event, should NOT take out a datacenter.

In Amazon's words: “Our US East-1 Region consists of more than 10 datacenters structured into multiple Availability Zones. These Availability Zones are in distinct physical locations and are engineered to isolate failure from each other.”

Perhaps having all of the data centers on a single grid ought to be reconsidered?

The outage brought to light issues with both software (RDS) and hardware (Elastic Load Balancing), in addition to the weather issues.

The net effect of the outage was that several high-profile customers, like Netflix, Pinterest, and Instagram, were taken offline. It was reported that Amazon lost customers because of the outage, but I have not seen specific numbers.

The information about the incident was slow to come out, but eventually Amazon provided a fairly detailed statement outlining several issues and what it intended to do about them.

The chances that you will be hit by lightning are 1 in 310,000,000, according to the National Weather Service. What are the odds that this will happen again to Amazon?

Salesforce.com Outage

The Salesforce.com disruption came in the wake of a power outage and appears to be due to a setting on the search servers' network interface cards: “Technology Team has identified and repaired a problem impacting NA5 search for large indexes. The search servers, when restarted, defaulted to a sub-optimal packet size setting on the host network interface card. As such, large indexes were broken up and handled inefficiently.”
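Salesforce.com did not publish the actual packet size values, so the following is only a rough sketch of why such a setting matters. It assumes two common Ethernet sizes (1500 bytes and 9000-byte "jumbo" frames), 40 bytes of TCP/IP header per packet, and a hypothetical 100 MiB index transfer; none of these numbers come from the Salesforce.com statement.

```python
def packets_needed(payload_bytes: int, mtu_bytes: int, header_bytes: int = 40) -> int:
    """Count the packets needed to carry payload_bytes when each packet
    holds at most (mtu_bytes - header_bytes) bytes of payload."""
    per_packet = mtu_bytes - header_bytes
    if per_packet <= 0:
        raise ValueError("MTU must be larger than the header overhead")
    # Integer ceiling division avoids floating-point rounding.
    return -(-payload_bytes // per_packet)


# Assumption: a 100 MiB index; the real index sizes were not disclosed.
index_bytes = 100 * 1024 * 1024

# Assumption: 1500 and 9000 bytes are illustrative packet sizes only.
for mtu in (1500, 9000):
    print(f"packet size {mtu}: {packets_needed(index_bytes, mtu)} packets")
```

With these assumed values, the smaller packet size needs roughly six times as many packets to move the same index, which is one plausible reading of "broken up and handled inefficiently".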

Could this happen to a non-cloud server? Absolutely. You just would not hear about it.

Facebook Outage

The fear of losing a customer or user was summed up neatly in the movie The Social Network.

“Okay, let me tell you the difference between Facebook and everyone else, we don't crash EVER! If those servers are down for even a day, our entire reputation is irreversibly destroyed! Users are fickle, Friendster has proved that. Even a few people leaving would reverberate through the entire userbase. The users are interconnected, that is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go, don't you get that?”

Did the other dominos go?

No. The most interesting aspect of the intermittent Facebook outage was the AnonOp to stop the rumor that Anonymous had targeted Facebook. Aside from this, Facebook kept their root cause close to the vest.

Why is Facebook so cryptic about its outage?
Facebook is not alone in being cryptic. Both Twitter and Google were not so much evasive as tight-lipped about the issues they experienced. The cost of these DR incidents to the Cloud is that, as a whole, the technology takes a hit.

Rather than looking at the individual issues and root causes, many will simply lump them all together, point to the Cloud, and call the whole thing into question. These Cloud issues are no different from the sorts of issues that large, complex environments in corporate America experience every day. The difference is that they happen in the Cloud, in public view.

The bottom line is that more communication is needed.

What do you think?
