Business Discontinuity: Poorly communicating Cloud outages

Summary: Poorly communicated outages at Twitter, Facebook, Amazon, Microsoft, Google, Salesforce.com and other providers have raised significant doubts about the Cloud's resiliency.

Business Continuity is supposed to safeguard organizations from revenue loss, opportunity loss, and adverse impacts to a firm's reputation, and to ensure compliance with contractual Service Level Agreements (SLA).

While it is easier to establish the loss of a customer for companies like Amazon and Netflix, it is trickier for the likes of Twitter, Facebook, Instagram, or Pinterest.

The inability to post a picture to your timeline or create a tweet likely will not cause you to stop using Facebook, Instagram, or Twitter. However, the inability of Amazon's or Salesforce.com's customers to carry on with business as usual is a markedly more significant event.

Thus, we would expect the communication that follows an outage at Facebook or Twitter to be less detailed than what follows one at Amazon or Salesforce.com, because companies like Amazon and Salesforce.com need to explain to their customers why they missed a Service Level Agreement.

The Cloud is made up of a multitude of complex systems, software and hardware, and there is no reason to think that these would fail more or less often than, say, those in a non-cloud datacenter.

Still, the poor communication that has come from some of these cloud providers has some asking: Is the Cloud Really Ready for Business?

Twitter Outage

Twitter was down for the second time in 5 weeks, brought down by a glitch from within Twitter's data centers: “the coincidental failure of two parallel systems at nearly the same time”.

The first incident, in June, was caused by a “cascading bug”, though I have not seen the details, or a root cause analysis, of the bug. Twitter's vice president of engineering, Mazen Rawashdeh, wrote: "...it was due to this infrastructural double-whammy. We are investing aggressively in our systems to avoid this situation in the future."

The net effect of the Twitter outage was a blow to reputation and reliability. Additionally, it forced Twitter users to go outside and interact with other people. Yikes!

GoogleTalk Outage

GoogleTalk, used for voice, text, and video chats, went down at 6:40 AM last week, according to GoogleTalk Service Details. The level of communication detail provided by Google could be summed up as follows: It's down, we're working on it...still working on it...it's back up. Apologies.

Microsoft Windows Azure Outage

“We are experiencing an availability issue in the West Europe sub-region, which impacts access to hosted services in this region. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers,” was posted on their Service Dashboard.

The 2.5-hour Windows Azure outage was not well communicated. All we knew was that they were working on it and that it eventually got fixed.
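To put a 2.5-hour outage in SLA terms, here is a rough sketch of the arithmetic. The 99.95% monthly target and the 30-day month are assumptions chosen for illustration, not figures taken from Microsoft's actual SLA, and the function names are hypothetical.

```python
# Rough availability arithmetic for a single outage in one month.
# ASSUMPTIONS: a 30-day month and a 99.95% monthly target are
# illustrative values, not terms from any provider's real SLA.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent, days_in_month=30):
    """Downtime budget (in minutes) that a monthly SLA target permits."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100 - sla_percent) / 100


def achieved_availability(outage_minutes, days_in_month=30):
    """Availability (in percent) for a month with the given downtime."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100 * (1 - outage_minutes / total_minutes)


if __name__ == "__main__":
    outage = 2.5 * 60  # the 2.5-hour outage, in minutes
    budget = allowed_downtime_minutes(99.95)
    print(f"Downtime budget at 99.95%: {budget:.1f} minutes")
    print(f"Availability after the outage: {achieved_availability(outage):.2f}%")
```

Under these assumptions the monthly budget is roughly 21.6 minutes, so a single 150-minute outage leaves availability near 99.65% and uses up several months' worth of allowance. That is the kind of gap a provider would be expected to explain to its customers.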

Amazon Outage

Amazon EC2 has experienced a couple of outages from the same datacenter: “…one, in late June, was sparked by a violent thunderstorm which cut power, setting up a chain of events that put many Amazon customers offline for hours.” A lightning strike, or any other sort of weather event, should NOT take out a datacenter.

Our US East-1 Region consists of more than 10 datacenters structured into multiple Availability Zones. These Availability Zones are in distinct physical locations and are engineered to isolate failure from each other.

Perhaps having all of the data centers on a single grid ought to be reconsidered?

The outage brought to light issues with both software (RDS) and hardware (Elastic Load Balancing), in addition to the weather issues.

The net effect of the outage was that several high profile customers like Netflix, Pinterest, and Instagram were taken off-line, and it was reported that Amazon lost customers because of the outage, but I have not seen specific numbers.

Information about the incident was slow to come out, but eventually Amazon provided a fairly detailed statement outlining several issues and what they intended to do about them.

The chances that you will be hit by lightning are 1/310,000,000, according to the National Weather Service; the odds that this will happen again to Amazon?

Salesforce.com Outage

The Salesforce.com disruption came in the wake of a power outage and appears to have been due to settings on the servers' network interface cards: “Technology Team has identified and repaired a problem impacting NA5 search for large indexes. The search servers, when restarted, defaulted to a sub-optimal packet size setting on the host network interface card. As such, large indexes were broken up and handled inefficiently.”

Could this happen to a non-cloud server? Absolutely. You just would not hear about it.

Facebook Outage

The fear of losing a customer or user was summed up neatly in the movie The Social Network.

Okay, let me tell you the difference between Facebook and everyone else, we don't crash EVER! If those servers are down for even a day, our entire reputation is irreversibly destroyed! Users are fickle, Friendster has proved that. Even a few people leaving would reverberate through the entire userbase. The users are interconnected, that is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go, don't you get that?

Did the other dominos go?

No. The most interesting aspect of the intermittent Facebook outage was the AnonOp to stop the rumor that Anonymous had targeted Facebook. Aside from this, Facebook kept their root cause close to the vest.

Why is Facebook so cryptic about its outage?

Facebook is not alone in being cryptic. Both Twitter and Google were, if not evasive, tight-lipped about the issues they experienced. The cost of these DR incidents to the Cloud is that the technology as a whole takes a hit.

Rather than looking at the individual issues and root causes, many will simply lump them all together, point to the Cloud, and call the whole thing into question. These Cloud issues are no different from the sorts of issues that large, complex environments in corporate America experience every day. The difference is that they happen in the cloud.

The bottom line is that more communication is needed.

What do you think?

Topics: Cloud, Amazon, Google, Microsoft, Salesforce.com

Gery Menegaz

About Gery Menegaz

Gery Menegaz is a Chief Architect for IBM with more than 20 years supporting technologies in the financial, medical, pharmaceutical, insurance, legal and education sectors. My Full-Time Employer is IBM. I write as a freelancer for ZDNet.

Talkback

9 comments
  • Superman still puts his pants on one leg at a time

    Every datacenter has a vulnerability which is inevitably uncloaked given enough time. Consumers of "cloud" services are, for the most part, a savvy lot. Conserving cash by offloading your applications onto someone else's datacenter bypassing up front capital expenditures, is smart business. Not managing expectations and vacuous explanations for outages, is just plain foolish. Transparency needs to be forthcoming from cloud providers and their customers since the bimonthly whitewashing sessions are killing the "cloud" brand.
    Tired Tech
    • Cloud Brand

      Thanks for your comment. Completely agree.
      gery.menegaz
  • thoughts

    "A lightning strike, or any other sort of weather event, should NOT take out a datacenter. "

    You might as well say that nuclear weapons should not take out a datacenter. Severe weather is, well, severe. Sometimes the best you can do is damage control, as prevention is unreasonable.

    "The chances that you will be hit by lightning are 1/310,000,000, according to the National Weather Service;"

    It's a bit different for buildings than for people, as buildings are taller/larger than people, and are immobile. Buildings have a much higher chance of being struck. The tallest buildings in an area are likely to get hit several times in a thunderstorm.

    "The bottom line is that more communication is needed."

    Agreed. I'm a big fan of more communication, and with today's technology there's really no excuse for a lack of communication.
    CobraA1
  • 2nd Thought

    Thanks for your comment.

    I think that a weather event such as a tsunami, like what happened in Fukushima, would qualify as something where prevention may be unreasonable. A lightning strike ought to be manageable. There are a few things that Amazon could have done to avoid the outage.

    We can agree to disagree on this one.

    Glad you agreed with me on the communication bit. I think that the brand is taking a hit because of some of the coverage. I am certain that when Microsoft had the interruption, they had a SWAT call with customers letting them know what was up. The fact that the rest of us were not included is simply unfortunate.
    gery.menegaz
  • Not just with outages

    They need to be transparent about data breaches, especially when customer data is compromised.
    NoAxToGrind
  • What do you expect?

    Put everyone's eggs in the same basket...

    All you get is one omelet that can't be cooked.
    jessepollard
    • Proper cloud design...

      would have a tandem omelet cooking in another pan (zone). When the first one burns, you fallback to the other pan and enjoy it with fresh coffee. If you want to guarantee yourself a hot breakfast every morning, then you need to fork over the cash in advance. Given enough money, you can fly from NY to LA and the omelet will be delivered to you locally with zero latency. Only fly in the ointment, is that this daily omelet breakfast costs $85 on the cloud.
      Tired Tech
      • Western Omelet Supreme

        To build one from a base cheese omelet, additional ingredients (data stores) will add another $30 to the costs.
        Tired Tech
        • I'm hungry, are you hungry?

          Wanna grab some breakfast?
          happyharry_z