What's the right response to a cloud outage?

Microsoft's Azure storage service in the South Central US region has been offline for more than 24 hours. The outage has brought one web-based business to a grinding halt. But if you were expecting an angry response, prepare to be surprised.
Written by Ed Bott, Senior Contributing Editor

Last week Amazon's cloud-based infrastructure went down hard, knocking Netflix offline on what should have been one of its most heavily trafficked days of the year, starting Christmas Eve and running well into Christmas Day.

This week it's Microsoft's turn to suffer a severe cloud outage. At 3:16 PM (UTC) on December 28, Microsoft reported that its Azure Storage service for the South Central US region was experiencing "partial availability." In an update a few hours later on its service dashboard, the company noted that the outage was affecting its worldwide Management Portal.

Six hours after the initial report, this notice appeared:

"The repair steps are taking longer because it involves recovery of some faulty nodes on the impacted cluster. We expect this activity to take a few more hours."

An update 12 hours after the initial report noted that "Impact to Service Management operations and new VM creation jobs has been fully mitigated. Remainder of the recovery process to restore Storage service is underway." But three hours later, some 15 hours after the initial outage was acknowledged, came this bad news:

"The repair steps are still underway to restore full availability of Storage service in the South Central US sub-region. This repair process is likely to take a significant amount of time."

That's an understatement. As of this writing, the outage has lasted more than 24 hours, with no indication of when it will be fully repaired.

One of the affected customers is Soluto, which runs a worldwide PC diagnostics tool for Windows users. Here's how Roee Adler, Chief Product Officer at Soluto, described the impact on his company's web-based service:

"Running a cloud service surely has its challenges, but I believe it’s the future of consumer products and most technology in general. We (Soluto) rely our service on Microsoft Azure, which we chose as our scalable big data platform because we could build stuff really fast on top of it using our favorite tool: Visual Studio. We now run on hundreds of machines and deal with close to 100M data transactions per day from which we extract quick fascinating insights for our users, which is fun and cool.

"For over 24 hours now, we’re down. It’s horrible. Seeing Google Real-Time Analytics show this image is.. well… heart breaking at best, and murderous-thoughts-invoking at worst."

The graphic showed a big fat goose egg: zero users worldwide.

Ironically, Soluto originally chose to move to Azure back in 2010, after its own self-hosted infrastructure was unable to handle a sudden spike in traffic. So what's the company's response this time? Adler continues:

"But every cloud provider has its glitches, and to be frank, every software or hardware company ever has had its glitches. We know people are working hard and around the clock to fix this failure, so instead of complaining, we decided to send our community to transmit positive karma in the direction of the people spending their weekend restoring the service instead of with their families. Who knows- maybe it’ll speed the restoration process :)"

And you know what? He's right. Cloud services fail. In July of this year, Amazon had an outage that temporarily shut down Netflix, Pinterest, and Instagram. In May of last year, Microsoft's business email services were offline for 9 hours. In a separate incident a few months earlier, Google's free and paid email services suffered an outage that lasted more than 30 hours. And so it goes.

I'm not aware of any cloud-based service that offers a 100% uptime guarantee, because such a promise is impossible to keep. If you can afford redundant storage in multiple zones, that's a good alternative, but that option is too expensive for anything but mission-critical services.

As more and more services move to the cloud, this sort of outage is inevitable (but hopefully rare). For customers, responding with equanimity and professionalism is essential.

The most recent update from Microsoft, more than 36 hours after the initial outage, is that the problem is still not fully repaired: "We made good progress in executing the repair steps to restore full availability of Storage service in the South Central US sub-region. We continue to work to resolve this issue at the earliest and our next update will be before 7:30AM PST on 12/30/2012. We apologize for any inconvenience this causes our customers."

Update: The outage is now repaired. Microsoft's most recent update says, "As of 12/30/2012, at approximately 9:42 pm PST, full service functionality has been restored to the Storage service in the South Central US sub-region. All customers should have access to their data. Please accept our apologies for the interruption and issues it has caused our customers."
