Microsoft to provide Azure users with 33 percent credit for February outage

Summary: Microsoft officials have posted a detailed analysis of what led to a widespread leap-year-day outage of its Azure public cloud service.

TOPICS: Microsoft, Outage

Microsoft is giving many of its Windows Azure users a 33 percent credit "due to the extraordinary nature" of the February 29 cloud-service outage caused by a leap-year bug.

Microsoft officials said all customers of its Azure Compute, Access Control, Service Bus and Caching services will get the credit for the entire billing month for those services, whether or not their service was affected. Microsoft execs shared that information, along with a play-by-play dissection of what caused the widespread outage, in a blog entry posted at 9 pm ET on Friday, March 9.

The widespread Azure outage began around 9 pm ET on February 28. Customers in Europe, North America and other areas were hit by a series of rolling problems over the course of two days. Many said they weren't able to access the Azure dashboard, which was basically the only means by which Microsoft was sharing information about the status of the different Azure services. The outage was largely resolved by the morning (ET) of March 1.

The leap-year bug caused an initial outage, which in turn led to a second one. Bill Laing, the head of Microsoft's server and cloud team, explained what happened:

"The leap day bug immediately triggered at 4:00PM PST, February 28th (00:00 UST February 29th) when GAs (guest agents) in new VMs tried to generate certificates. Storage clusters were not affected because they don’t run with a GA, but normal application deployment, scale-out and service healing would have resulted in new VM creation. At the same time many clusters were also in the midst of the rollout of a new version of the FC (fabric controller), HA (host agent) and GA."
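
The excerpt quoted above does not spell out the faulty date logic. One common way this class of bug arises is computing an expiry date by bumping only the year of the current date, which produces the nonexistent February 29, 2013 when run on leap day. The Python sketch below illustrates that general pattern under that assumption; it is not Microsoft's actual guest-agent code, and the function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Hypothetical buggy helper: bump only the year.

    Raises ValueError on Feb 29 because the following year has no Feb 29.
    """
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Hypothetical fix: fall back to Feb 28 when the target date does not exist."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only Feb 29 can fail here; clamp to the last valid day of February.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

# The naive version fails only on leap day, so ordinary testing can miss it.
try:
    naive_one_year_later(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print("naive helper failed on leap day:", naive_failed)
print("safe helper returns:", safe_one_year_later(leap_day))
```

Because the naive helper behaves correctly on every other day of the year, a bug of this kind tends to surface only if tests deliberately feed in February 29 or run with a simulated clock.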

Laing said Microsoft is taking steps to prevent future time-related bugs with new testing procedures, improvements in dashboard service availability, and a commitment to provide alternate communication channels when outages happen.

About

Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Talkback

3 comments
  • It's not easy getting leap years working;-)

    This company is priceless.

    Take the credit, or move to a unix cloud platform?
    Richard Flude
    • Unix has no bugs, right?

      So of course, one outage on Azure is going to chase you to a unix cloud platform, like, say, AWS, which has never had a multi-day outage before.

      <facepalm>
      jdzions
    • CostCloud

      While an unfortunate occurrence, this event clearly demonstrates how many of the risks of outsourcing part of your infrastructure to the Cloud are offset by service level agreements. When errors are committed by a private IT department, there is no compensation. I'm speaking about the Cloud in general, not just Azure.
      scH4MMER