Microsoft is issuing many of its Windows Azure users with a 33 percent credit "due to the extraordinary nature" of the February 29 cloud-service outage caused by a leap-year bug.
Microsoft officials said all customers of its Azure Compute, Access Control, Service Bus and Caching services will get the credit for the entire billing month or months for those services, whether or not their service was affected. Microsoft execs shared that information, as well as a play-by-play dissection of what caused the widespread outage, in a blog entry posted at 9 pm ET on Friday, March 9.
The widespread Azure outage began around 9 pm ET on February 28. Customers in Europe, North America and other areas were hit by a series of rolling problems over the course of two days. Many said they weren't able to access the Azure dashboard, which was basically the only means by which Microsoft was sharing information about the status of the different Azure services. The outage was largely resolved by the morning (ET) of March 1.
The leap-year bug caused a first outage, which then led to a secondary outage. Bill Laing, the head of Microsoft's server and cloud team, explained what happened:
"The leap day bug immediately triggered at 4:00PM PST, February 28th (00:00 UST February 29th) when GAs (guest agents) in new VMs tried to generate certificates. Storage clusters were not affected because they don’t run with a GA, but normal application deployment, scale-out and service healing would have resulted in new VM creation. At the same time many clusters were also in the midst of the rollout of a new version of the FC (fabric controller), HA (host agent) and GA."
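Microsoft's post-mortem attributes the certificate failure to a validity date computed by adding one to the current year, which on February 29 produces a date that does not exist. The Python sketch below illustrates that class of bug and one possible guard; it is not Microsoft's code, and the function names `naive_valid_to` and `safe_valid_to` are hypothetical.

```python
from datetime import date


def naive_valid_to(start: date) -> date:
    """Return an expiry date one year out by bumping only the year.

    This mirrors the class of bug described above: on February 29 the
    target year has no such day, so date.replace raises ValueError.
    """
    return start.replace(year=start.year + 1)


def safe_valid_to(start: date) -> date:
    """Same idea, but fall back to February 28 when the day is missing.

    Falling back to the 28th is an assumption made for this sketch;
    rolling forward to March 1 would be an equally valid policy.
    """
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only reachable for Feb 29 when the following year is not a leap year.
        return start.replace(year=start.year + 1, day=28)
```

Calling `naive_valid_to(date(2012, 2, 29))` raises `ValueError`, while `safe_valid_to` returns February 28, 2013. Either fallback works as long as the choice is deliberate and covered by a test that runs against leap-day inputs.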
Laing said Microsoft is taking steps to prevent future time-related bugs through new testing procedures. The company is also improving the availability of the dashboard service and committing to provide alternate communication channels when outages happen.