Microsoft's December Azure outage: What went wrong?

Microsoft has published a detailed postmortem, with commitments for new Windows Azure storage features, following a service outage at the end of 2012.
Written by Mary Jo Foley, Senior Contributing Editor

At the end of 2012, Windows Azure was down for two-plus days for some of Microsoft's public-cloud customers making use of Microsoft's US South region datacenters.


On January 16, Microsoft officials delivered their postmortem on what happened on the Windows Azure blog.

According to the company, the outage affected 1.8 percent of the Windows Azure Storage accounts that were in one "storage stamp," or cluster with multiple racks of storage nodes, in the US South region. Making matters worse, as my ZDNet colleague Ed Bott, who covered the outage, noted, the health dashboard — designed to alert customers to service issues — wasn't working because it was reliant on the cluster that went down.

"Due to the extraordinary nature and duration of this event we are providing a 100 percent service credit to all affected customers on all storage capacity and storage transaction charges for the impacted monthly billing period," wrote Mike Neil, author of the postmortem on the Windows Azure blog. (The credits will be applied proactively.)

The detailed blog post identified three issues that led to the December storage outage. Some of the nodes didn't have node protection turned on. The monitoring system that should have detected this kind of problem had a defect, so no alarms fired and no escalation occurred. On top of this, a transition to a new primary node triggered a reaction that led to the incorrect formatting of other nodes. Normally, according to the post, Azure should have survived the simultaneous failure of two nodes within a stamp, as the system keeps three copies of data spread across three separate fault domains.

"However, the reformatted nodes were spread across all fault domains, which, in some cases, led to all three copies of data becoming unavailable," Neil explained.
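The durability model Neil describes can be illustrated with a toy sketch (this is not Azure's actual code; the placement scheme and function names here are invented for illustration): three replicas, one per fault domain, survive any two independent node failures, but a correlated fault that reformats nodes in every fault domain can wipe out all three copies at once.

```python
# Illustrative model (not Azure's actual code) of why three replicas across
# three fault domains survive two node failures, but a fault that cuts
# across all fault domains can destroy every copy.

FAULT_DOMAINS = 3

def place_replicas(blob_id, nodes_per_domain=4):
    """Place one replica of a blob in each fault domain (hypothetical scheme)."""
    node = hash(blob_id) % nodes_per_domain
    return [(domain, node) for domain in range(FAULT_DOMAINS)]

def available_copies(replicas, failed_nodes):
    """Count replicas that survive a given set of failed (domain, node) pairs."""
    return sum(1 for replica in replicas if replica not in failed_nodes)

replicas = place_replicas("blob-123")

# Two nodes failing in distinct fault domains: one copy still survives.
two_node_failure = set(replicas[:2])
assert available_copies(replicas, two_node_failure) == 1

# The December incident: reformatted nodes spanned *all* fault domains,
# so in some cases every replica was hit at once.
correlated_failure = set(replicas)
assert available_copies(replicas, correlated_failure) == 0
```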

The team came up with a Plan A (restoring data in place) and a Plan B (failing over to geo-redundant storage) and chose Plan A.

Going forward, the Azure team has committed to allow customers to choose between durability and availability, Neil said. To enable this, Microsoft is working to add a handful of new features to Azure. These include:

  • Geo-Replication for Queue data for Windows Azure Storage accounts

  • Read-only access to a customer's storage account from the secondary location 

  • Customer-controlled failover for Geo-Replicated Storage Accounts, by allowing customers to "prioritize service availability over data durability based on their individual business needs," which will be provided via a programming interface to trigger failover of a storage account.
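The last two features can be sketched as a toy model (the class and method names below are illustrative assumptions; Microsoft had not published the actual interface at the time): reads fall back to the read-only secondary when the primary is down, and a customer-triggered failover promotes the secondary, trading possible loss of un-replicated writes for restored availability.

```python
# Hypothetical sketch of the planned features; every name here is invented
# for illustration and is not a real Azure API.

class GeoReplicatedAccount:
    """Toy model of a geo-replicated storage account."""

    def __init__(self):
        self.primary = {}        # authoritative copy
        self.secondary = {}      # geo-replicated copy (real replication is async)
        self.primary_up = True

    def write(self, key, value):
        if not self.primary_up:
            raise RuntimeError("primary unavailable; writes blocked until failover")
        self.primary[key] = value
        self.secondary[key] = value   # simplified: replicate synchronously

    def read(self, key):
        # Read-only secondary access: fall back when the primary is down.
        source = self.primary if self.primary_up else self.secondary
        return source[key]

    def trigger_failover(self):
        # Customer-controlled failover: promote the secondary, accepting that
        # any un-replicated writes are lost (availability over durability).
        self.primary, self.secondary = self.secondary, {}
        self.primary_up = True

acct = GeoReplicatedAccount()
acct.write("k", "v")
acct.primary_up = False
assert acct.read("k") == "v"       # served from the secondary
acct.trigger_failover()
acct.write("k2", "v2")             # writes accepted again after failover
```

The point of the sketch is the trade-off Neil names: triggering failover restores availability immediately, but anything not yet replicated to the secondary is gone.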

One of the most vocal Azure customers affected by the late December outage was Windows diagnostics vendor Soluto.

In early January, the company sent an apology note to its users for the "62 hours of unexpected downtime" and complete inability to reach the company's services.

In that note (a copy of which I saw thanks to Shaun Jennings), Soluto officials noted they decided to go with Azure over Amazon Web Services because the company believed it could develop its solution faster on top of it, despite the fact that Amazon was the more mature platform.

"In addition, we got lots of help from Microsoft by being added to their BizSpark One program: we got both great pricing and the highest level of support," the note said. Officials said the decision paid off.

Despite being hit by both the storage outage and a subsequent Ajax content-delivery-network problem, Soluto is sticking with Azure, said Roee Adler, chief product officer.

"We will be adding redundancy for some critical elements, probably on Amazon, but for now we're sticking with Azure for the most part," Adler said.
