Microsoft's December Azure outage: What went wrong?

Summary: Microsoft has published a detailed postmortem, with commitments for new Windows Azure storage features, following a service outage at the end of 2012.

At the end of 2012, Windows Azure was down for two-plus days for some of Microsoft's public-cloud customers making use of Microsoft's US South region datacenters.

On January 16, Microsoft officials published their postmortem of the outage on the Windows Azure blog.

According to the company, the outage affected 1.8 percent of the Windows Azure Storage accounts that were in one "storage stamp," or cluster with multiple racks of storage nodes, in the US South region. Making matters worse, as my ZDNet colleague Ed Bott, who covered the outage, noted, the health dashboard — designed to alert customers of service issues — wasn't working because it was reliant on the cluster that went down.

"Due to the extraordinary nature and duration of this event we are providing a 100 percent service credit to all affected customers on all storage capacity and storage transaction charges for the impacted monthly billing period," said Mike Neil, author of the post-mortem post on the Windows Azure blog site. (The credits will be applied proactively.)

The detailed blog post noted that there were three issues that led to the December storage outage. Some of the nodes didn't have node protection turned on. The monitoring system for detecting this kind of problem had a defect, resulting in no alarms or escalation. On top of this, a transition to a new primary node triggered a reaction that led to an incorrect formatting of other nodes. Normally, according to the post, Azure should have survived the simultaneous failure of two nodes of data within a stamp, as the system keeps three copies of data spread across three separate fault domains.

"However, the reformatted nodes were spread across all fault domains, which, in some cases, led to all three copies of data becoming unavailable," Neil explained.
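To make the fault-domain point concrete, here is a minimal sketch of the idea, not Microsoft's implementation. It assumes a toy stamp of six nodes in three fault domains, with each piece of data stored once per domain; the node names, domain labels, and placements are all hypothetical.

```python
# Toy model; node names, fault-domain labels and placements are all hypothetical.
NODE_DOMAIN = {
    "n1": "fd-A", "n2": "fd-A",
    "n3": "fd-B", "n4": "fd-B",
    "n5": "fd-C", "n6": "fd-C",
}

# Each item keeps three copies, one in each fault domain.
PLACEMENT = {
    "data-1": ["n1", "n3", "n5"],
    "data-2": ["n2", "n4", "n6"],
    "data-3": ["n1", "n4", "n6"],
}


def domains_hit(failed_nodes):
    """Return the set of fault domains that contain a failed node."""
    return {NODE_DOMAIN[node] for node in failed_nodes}


def unavailable(placement, failed_nodes):
    """Return the items whose copies all sit on failed nodes."""
    failed = set(failed_nodes)
    return sorted(
        item for item, replicas in placement.items()
        if all(node in failed for node in replicas)
    )


# Two failed nodes: every item still has at least one healthy copy.
two_failures = unavailable(PLACEMENT, ["n1", "n3"])
# Failures spread across all three fault domains can take out every copy.
spread = unavailable(PLACEMENT, ["n1", "n3", "n5"])
print(two_failures)
print(spread)
```

In this toy layout, losing any two nodes leaves every item readable, while losing one node in each fault domain can make an item unavailable. That is the situation Neil describes.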

The team came up with a Plan A (restoring data in place) and a Plan B (geo-redundant storage and failover). They decided to go with A.

Going forward, the Azure team has committed to letting customers choose between durability and availability, Neil said. To enable this, Microsoft is working to add a handful of new features to Azure. These include:

  • Geo-Replication for Queue data for Windows Azure Storage accounts

  • Read-only access to a customer's storage account from the secondary location 

  • Customer-controlled failover for Geo-Replicated Storage Accounts, allowing customers to "prioritize service availability over data durability based on their individual business needs." Microsoft plans to provide this through a programming interface that triggers failover of a storage account.
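The trade-off behind customer-controlled failover can be sketched in a few lines. This is a toy model only: the class and method names are hypothetical and are not an Azure API, and it assumes that replication to the secondary location lags behind writes to the primary.

```python
class GeoReplicatedAccount:
    """Toy storage account with a primary and a lagging secondary copy."""

    def __init__(self):
        self.primary = []
        self.secondary = []
        self.active = "primary"

    def write(self, record):
        """Append a record to whichever location is currently active."""
        target = self.primary if self.active == "primary" else self.secondary
        target.append(record)

    def replicate(self):
        """Copy primary writes the secondary has not received yet."""
        self.secondary.extend(self.primary[len(self.secondary):])

    def failover(self):
        """Switch to the secondary and return the writes that never replicated."""
        lost = self.primary[len(self.secondary):]
        self.active = "secondary"
        return lost

    def read_all(self):
        """Return a copy of the records at the active location."""
        source = self.primary if self.active == "primary" else self.secondary
        return list(source)


account = GeoReplicatedAccount()
account.write("a")
account.write("b")
account.replicate()
account.write("c")          # written to the primary, not yet replicated
lost = account.failover()   # service is reachable again, minus recent writes
print(lost)
print(account.read_all())
```

Under these assumptions, failing over restores access right away but drops the write that had not yet reached the secondary. A customer who waits for the primary to be restored keeps that write but stays offline in the meantime.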

One of the most vocal Azure customers affected by the late December outage was Windows diagnostics vendor Soluto.

In early January, the company sent an apology note to its users for the "62 hours of unexpected downtime" and complete inability to reach the company's services.

In that note (a copy of which I saw thanks to Shaun Jennings), Soluto officials noted they decided to go with Azure over Amazon Web Services because the company believed it could develop its solution faster on top of Azure, despite the fact that Amazon was the more mature platform.

"In addition, we got lots of help from Microsoft by being added to their BizSpark One program: we got both great pricing and the highest level of support," the note said. Officials said the decision paid off.

Despite being hit by both the storage outage and a subsequent Ajax content-delivery-network problem, Soluto is sticking with Azure, said Roee Adler, chief product officer.

"We will be adding redundancy for some critical elements, probably on Amazon, but for now we're sticking with Azure for the most part," Adler said.

Topics: Cloud, Amazon, Microsoft, Storage

About

Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Talkback

16 comments
  • The risks you take

    "In that note (a copy of which I saw thanks to Shaun Jennings), Soluto officials noted they decided to go with Azure over Amazon Web Services because the company believed it could develop its solution faster on top of it, despite the fact that Amazon was the more mature platform."

    It is not like they can go back in time and change that decision. I am sure there is a contract in place preventing them from switching and it would be quite costly. They will just have to roll with the punches if any other problems arise.
    coastin
    • Or maybe they're 100% sure it's still the right choice.

      You make it sound like they really want to jump to Amazon, but are prevented from doing that.
      It sounds like they're still plenty happy with their choice.

      They probably also read about the Netflix fiasco over the holidays.

      Now Netflix on Amazon, that's a different story. With their recent outages, I wonder if they want to move to another provider but can't. At one time I'm sure they thought it a good decision. Problem is, it's not like they can go back in time and change that decision. I am sure there is a contract in place preventing them from switching and it would be quite costly. They will just have to roll with the punches if any other problems arise.
      William Farrel
      • Funny

        "I am sure there is a contract in place preventing them from switching and it would be quite costly. They will just have to roll with the punches if any other problems arise."

        I think I've read that somewhere....
        coastin
        • Amazon crashed Netflix during Christmas when it matters the most

          If you switch to Amazon you'd get hit harder.
          LBiege
          • Yep it did

            cause a 24-hour holiday outage for many when a single load balancing server had problems. Perhaps if it were not Christmas Day they may have had the problem solved sooner.

            The Azure outage lasted 62 hours and was caused by multiple problems including human error in the setup. I hardly think Netflix was hit harder than Soluto and have not seen any refund from Netflix. So if you are trying to compare you might want to re-check the facts.
            coastin
          • Big holiday like Christmas is when they make money

            They'd rather have a 2-day outage in March than one day on Christmas, trust me.
            LBiege
  • Kudos to Microsoft

    What a refreshing change from certain other companies who would have said:
    You were using it wrong.
    toddbottom3
    • no choice but to come clean

      If it's down, it's down; they can't weasel out of it. It's futile. Reception is more of a grey area. Note, I hate both Apple and MS so I'm fairly unbiased here.
      deathjazz
    • Why bring Apple into this?

      I can't see the point of bringing Apple into this. You're right in that Apple rarely takes responsibility for what goes wrong with their products but I can't see a direct comparison of an Azure outage with something similar at Apple.
      slimjim1989
  • Still it is pretty clear

    that it is a bitter pill for Soluto to swallow:

    "We will be adding redundancy for some critical elements, probably on Amazon, but for now we're sticking with Azure for the most part," Adler said.

    Sounds like they do regret the choice and the fact they ignored the maturity of the Amazon services. They can plan the switch while the contract clock runs down or try and get out of the contract if any more major problems arise.
    coastin
    • Amazon no walk in the park

      Amazon has had numerous issues with the northern VA data center.

      To quote:

      "The new facilities are the latest in a series of data centers Amazon has deployed in Northern Virginia, one of the Internet’s key intersections. The US-East region has been the focus of multiple outages for sites hosted on Amazon Web Services, including two outages in June, and downtime in October and on Christmas Eve."

      http://www.datacenterknowledge.com/archives/2013/01/15/amazon-to-add-capacity-to-us-east-region/
      techvet
    • Mature Amazon services? Ask Netflix how that maturity is working out.

      Just not when they're trying to get Amazon to get them back online during a critical holiday period.

      And I doubt they regret the choice at all.

      "In addition, we got lots of help from Microsoft by being added to their BizSpark One program: we got both great pricing and the highest level of support," the note said. Officials said the decision paid off"

      Yup, sure sounds like a load of regret. ;)

      Are you sure you're not trying to spin this into a negative?
      William Farrel
    • My understanding of it is that they didn't even have redundancy enabled for

      Azure or they wouldn't have been down with an Azure storage node in just one datacenter going down. That was a Soluto not designing/paying for redundancy issue, not an Azure redundancy issue. Amazon is no cure for that.
      Johnny Vegas
  • Kudos to Microsoft for apologizing

    For "62 hours of unexpected downtime". Companies should be more open to apologizing officially to their customers. Like Apple also did after the iOS 6 Maps disaster (but should have done sooner).
    Smalahove
  • Microsoft's December Azure outage: What went wrong?

    Hope cloud service providers will learn from all these outages to identify and resolve any systemic problems left unidentified during the design, development, and deployment of the current crop of cloud infrastructures. Although, I believe that the most reliability they can achieve is at par or a little below par of internet reliability (if we want to consider internet reliability as the hallmark for cloud computing, that is).
    kc63092@...
    • Keeping the Internet (Cloud) going

      takes lots of hard work and dedication from the techs who are hands-on. During Hurricane Katrina, in New Orleans, .com registrar and web hosting provider directnic.com's crew of dedicated employees rode out the storm in the French Quarter, rolling 55 gal. drums of fuel up three stories of parking garage every few hours to keep their DNS and hosting servers up and running. When the US Military Commander stopped a water tanker on the river bridge, the directnic crew members got in touch with the media to let them know it was much needed water for the SW Bell Main Southern US communications switch cooling towers. If the water had been held up a few minutes longer vital communication would have been down.

      You gotta hand it to the technicians in the trenches.
      coastin