What's the right response to a cloud outage?

Summary: Microsoft's Azure storage service in the South Central US region has been offline for more than 24 hours. The outage has brought one web-based business to a grinding halt. But if you were expecting an angry response, prepare to be surprised.

TOPICS: Cloud, Microsoft

Last week Amazon's cloud-based infrastructure went down hard, knocking Netflix offline on what should have been one of its most heavily trafficked days of the year, starting Christmas Eve and running well into Christmas Day.

This week it's Microsoft's turn to suffer a severe cloud outage. At 3:16 PM (UTC) on December 28, Microsoft reported that its Azure Storage service for the South Central US region was experiencing "partial availability." In an update a few hours later on its service dashboard, the company noted that the outage was affecting its worldwide Management Portal.

Six hours after the initial report, this notice appeared:

The repair steps are taking longer because it involves recovery of some faulty nodes on the impacted cluster. We expect this activity to take a few more hours.

An update 12 hours after the initial report noted that "Impact to Service Management operations and new VM creation jobs has been fully mitigated. Remainder of the recovery process to restore Storage service is underway." But three hours later, some 15 hours after the initial outage was acknowledged, came this bad news:

The repair steps are still underway to restore full availability of Storage service in the South Central US sub-region. This repair process is likely to take a significant amount of time.

That's an understatement. Currently, the outage has lasted for more than 24 hours, with no indication of when it will be fully repaired.

One of the affected customers is Soluto, which runs a worldwide PC diagnostics tool for Windows users. Here's how Roee Adler, Chief Product Officer at Soluto, described the impact on his company's web-based service:

Running a cloud service surely has its challenges, but I believe it’s the future of consumer products and most technology in general. We (Soluto) rely our service on Microsoft Azure, which we chose as our scalable big data platform because we could build stuff really fast on top of it using our favorite tool: Visual Studio. We now run on hundreds of machines and deal with close to 100M data transactions per day from which we extract quick fascinating insights for our users, which is fun and cool.

For over 24 hours now, we’re down. It’s horrible. Seeing Google Real-Time Analytics show this image is.. well… heart breaking at best, and murderous-thoughts-invoking at worst.

The graphic showed a big fat goose egg, zero users worldwide.

Ironically, Soluto originally chose to move to Azure back in 2010, after its own self-hosted infrastructure was unable to handle a sudden spike in traffic. So what's the company's response this time? Adler continues:

But every cloud provider has its glitches, and to be frank, every software or hardware company ever has had its glitches. We know people are working hard and around the clock to fix this failure, so instead of complaining, we decided to send our community to transmit positive karma in the direction of the people spending their weekend restoring the service instead of with their families. Who knows- maybe it’ll speed the restoration process :)

And you know what? He's right. Cloud services fail. In July of this year, Amazon had an outage that temporarily shut down Netflix, Pinterest, and Instagram. In May of last year, Microsoft's business email services were offline for 9 hours. In a separate incident a few months earlier, Google's free and paid email services suffered an outage that lasted more than 30 hours. And so it goes.

I'm not aware of any cloud-based service that offers a 100% uptime guarantee, because such a promise is impossible to keep. If you can afford redundant storage in multiple zones, that's a good alternative, but that option is too expensive for anything but mission-critical services.
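To put rough numbers on that, here is a minimal Python sketch of the arithmetic behind uptime figures. The function names are hypothetical, and the multi-zone formula assumes zones fail independently of one another, which is an optimistic simplification rather than anything a provider promises.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given availability (0.0 to 1.0) permits in a period."""
    return (1.0 - availability) * period_hours


def observed_availability(outage_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period a service was up, given its total outage hours."""
    return 1.0 - outage_hours / period_hours


def combined_availability(per_zone, zones):
    """Availability when the service is up as long as at least one zone is up.

    Assumption: zone failures are independent. Correlated failures, or a
    failover mechanism that itself breaks, make the real figure lower.
    """
    return 1.0 - (1.0 - per_zone) ** zones


if __name__ == "__main__":
    # Downtime a 99.9% target allows over one year.
    print(f"{allowed_downtime_hours(0.999):.2f} hours")
    # Yearly availability if a single 36-hour outage were the only downtime.
    print(f"{observed_availability(36):.4%}")
    # Two zones at 99% each, under the independence assumption.
    print(f"{combined_availability(0.99, 2):.4%}")
```

A 99.9% target leaves about 8.76 hours of downtime per year, so a single outage of 36 hours or more exceeds that budget several times over. The multi-zone figure holds only if failover works when it is needed.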

As more and more services move to the cloud, this sort of outage is inevitable (but hopefully rare). And on the part of customers, responding to it with equanimity and professionalism is essential.

The most recent update from Microsoft, more than 36 hours after the initial outage, is that the problem is still not fully repaired: "We made good progress in executing the repair steps to restore full availability of Storage service in the South Central US sub-region. We continue to work to resolve this issue at the earliest and our next update will be before 7:30AM PST on 12/30/2012. We apologize for any inconvenience this causes our customers."

Update: The outage is now repaired. Microsoft's most recent update says, "As of 12/30/2012, at approximately 9:42 pm PST, full service functionality has been restored to the Storage service in the South Central US sub-region. All customers should have access to their data. Please accept our apologies for the interruption and issues it has caused our customers."

  • Use the cloud, but don't rely on it, I say.

    Use the cloud, but don't rely on it, I say. Have a plan B.

    Never was anything wrong with having a plan B.
    • What is the point of the cloud, if it can not really survive traffic peaks,

      ... as Soluto's own system could not?

      If they still had their own infrastructure, they would surely repair it much faster than 24+ hours.
      • Nothing's 100%

        Nothing's 100%, and even high availability on the server end doesn't mean high availability on the client end.

        The point is not the uptime of the server. The point is the uptime of the entire system. We're so hyperfocused on the uptime of the servers we forget the rest of the system isn't so great.
  • Ed please follow up. The response depends on the steps taken

    to avoid the problem. If the problem is just in the South Central datacenter, then the question to ask is why Soluto's service and data aren't also running in the north (or west or east) datacenters. Yes, it costs a little bit more for geo-redundancy, but that's what people are paying Soluto for and what Soluto should be paying MS for. If they have geo-redundancy architected in, then the question to ask is why it didn't work. The first is a Soluto problem, the second is an MS problem. Everyone knows it's reasonable to expect and plan for an entire datacenter (even two) to go offline. I'm tired of everyone blaming the "cloud" when something goes wrong with a service that didn't take geo-redundant precautions. That's an app architecture problem. That's what moving to the cloud is supposed to entail, not just someone else hosting your non-cloud-architected crap in offsite IaaS. Is Soluto trying to pull a PR sleight-of-hand move here and get everyone looking at MS when they didn't either design or pay for geo-redundancy, or was there a problem with Azure's geo-redundancy not working that MS isn't opening up about? Soluto's customers deserve to know if it's the former; MS's other Azure customers need to know if it's the latter.
    Johnny Vegas
    • Also good to have a plan B anyways

      Well, even if the cloud provider stays up, things can still go wrong on the other side of the wire. ISP issues, a mobile device leaves the range of cell towers, construction worker hits a data line, mother nature takes stuff out, etc.

      Frankly, I'm all for having a plan B - having access to data and computational power locally as well as "in the cloud." When all is said and done, a "pure" cloud solution with no local caching or storage at all is actually an unreliable solution.
    • Well...

      Soluto's data is in a single zone. Remember, this is a free consumer service, not a mission-critical business service. Having an outage like this is annoying but not even close to fatal.
      Ed Bott
      • The upside, Ed,

        is that at least they are not waiting for MS tech support to fix a broken MS-signed secure boot key, as the Linux Foundation still is ;)

        That would be a loooong wait indeed.
        • That would be almost as worrisome as when Google's outages occur

          where the users are then wondering if it's just a glitch, or did Google just up and shut down yet another service they were using.
          William Farrel
    • Redundancy failed, too.

      I work for another company affected by the outage. Our data is in the north region as a backup, and the failover also has failed. So our redundancy in the future will need to be a whole other provider or in-house server.
      • Waiting ...

        for Johnny Vegas to respond to milespj's post.

        For mission-critical business services (repeating ZDNet blogger Ed Bott's terminology), it's not just geo-redundancy as mentioned by Johnny Vegas, it's also cloud service provider redundancy. Whether it's Amazon's, Google's, Microsoft's, etc. or one's own standby solution. KA-CHING!
        Rabid Howler Monkey
      • What failed?

        What failed in your north backup exactly? My company is thinking about this approach as well.
        • There was no redundancy at all

          What failed is the redundancy itself: we could not use the north region "copy" / backup of our data. Microsoft has finally acknowledged that on the service status page. It is also worth noting that Microsoft can't give you a way to access the other region or switch over to it. So Rabid is on the money with his comment. You need to be redundant across providers and not just pay one provider for so-called redundancy.

          Also, I highly doubt that an established company like Soluto wouldn't have a geographic backup within Azure, so they are probably in the same boat as us: a backup exists, but it's worthless. Well, more appropriately, they were in the same boat as us. We switched to a different boat Friday afternoon :-)
          • And when that new provider fails

            will you switch to someone else, and so on?

            I've not seen a single provider able to do what you want them to do - 24/7 guaranteed
            William Farrel
  • What's the right response to a cloud outage?

    The cloud is a rather complex system, and even with redundancies built in, something can go wrong somewhere. The network itself is so complex that even a minor glitch can take the whole system down, and add to that the threat of hacking. As for the back end, database systems have been with us since long before the internet as we know it existed and have for the most part been robustly engineered, yet from time to time we hear of big data centers plagued by systemic failures ... Integrating the two is a big undertaking, comparable to a moonshot if not more.
  • Cloud-based services

    are a new arena and there will be outages, trial and error, no matter the platform.

    I recall that early in the web hosting business we had to feel our way around quite a bit. I had my hosting biz co-located in NJ, just across the river from NYC, at the NOC where cPanel was being developed. About half of my customers were from outside the US because, at that time, they did not have the kind of hosting and reseller services we offered.

    Looking back, what we had to do to keep things up and running was so different from today. One time, while visiting a reseller customer in Switzerland, we had to log in to two of his servers via a tethered cell phone and laptop to do some repairs while hiking on a mountain top. A sonic screwdriver would have come in handy back then. Where's the Doctor when you need him?

    As cloud time goes by and these outages are documented there will be more ways to prevent them. Good luck to the team(s) working on this one.
  • Response

    "Transmit positive karma" has a whiff of ironic fatalism to this reader. Reading further into their description of whom they are sending it to, it appears to me that they are suggesting their demand is low because of the holidays, and they are indirectly acknowledging that the timing has some element of mitigation.

    I presume Azure has more than one customer in Central America. Should we consider this representative of the e-mails Microsoft has been receiving?
    • Just to clarify...

      South Central US (United States) is NOT equivalent to Central America!
      As for being representative of emails to Microsoft, that would be a big stretch,
      since only this one customer, Soluto, is the topic of this discussion.
  • Nothing new there.

    Use a centralized service, expect failures.

    That was why distributed systems under individual control always worked.

    You depend on any single vendor... same problem. When a failure occurs you need an alternate for the same service.

    Making the software depend on a single service is just stupid.

    And that is what MS depends on.
    • Don't try to make this a "Microsoft" thing.

      It's a cloud computing thing. And Microsoft is hardly the only or even the first player in the cloud space.
      • didn't intend to.

        In reality, you would need three separate vendors, all with compatible services, and contract support to automatically balance usage from one service to the other two.

        In addition, contract penalties to cover three things:

        1) direct expenses for access to data - in other words, the fee for service during the time period where access is normal.

        2) indirect expenses caused by your failure to support your own customers - and this expense may be 0 due to the automatic load balancing, provided it really works.

        3) fees caused by migrating the load to the other two services (the various services may not have the same expenses...)

        This devolves into the expense caused by the lock-in of using MS services... which has already been listed as a problem promoted by MS. Therefore, MS (having caused the problem) should also be responsible for paying for it.

        The other vendors (as far as I know) already have compatible APIs for use... But MS prefers to lock their users away from being safe - hence the promotion of using the nonportable VS for applications.