Microsoft's Windows Azure has a meltdown

Customers on Microsoft's Azure cloud platform are reporting they've been down for hours.
Written by Mary Jo Foley, Senior Contributing Editor

Customers are reporting that Microsoft's Windows Azure cloud platform has been experiencing a major meltdown in geographies across the world.

The Register is reporting that problems began around 9 p.m. ET on February 28. This morning, as of 11 a.m. ET on February 29, I am still hearing from customers affected by the problem.

One of my readers wrote in:

"The Windows Azure Service Management is / has been down worldwide for about 12 hours or more. Have a look at the Service Dashboard of Windows Azure (if you can reach it): https://www.windowsazure.com/en-us/support/service-dashboard/. It looks like the Service Management works again, but despite that I just see more warnings and errors popping up on the dashboard latest hours...

"I'm really astonished how this can happen world wide, and for such a long time. And glad we don't have anything in production yet (just playing around so far). How reliable and mature is Azure at the moment?"

I can't access the Azure dashboard myself.

Update No. 1 (11:25 am ET): I finally got the dashboard to load. I see it saying that the Management Service is still experiencing an outage worldwide. Compute and Access Control services are experiencing "performance degradation."

But another Azure customer told me that "reports are that 6.7, 28 and 35 percent of users are experiencing problems in the three data centers. Report says they’re investigating the cause of the problem."

ZDNet UK reported that the initial Azure problems began with an outage in the Windows Azure Management Service technology, which then spread to the Windows Azure Compute and Access Control parts of the platform. Affected areas included North Europe, North Central US and South Central US regions, ZDNet UK said.

I've seen some speculating on Twitter that all of these problems could stem from some kind of Leap Year bug. Microsoft officials said they had an update for me. I will add it to this post once I get it and will continue tracking the issue.

Update No. 2 (12:05 pm ET): Here's an update from a Microsoft Azure spokesperson. Still no word from Microsoft as to what is causing the rolling series of problems:

"On February 28th, 2012 at 5:45 PM PST Microsoft became aware of an issue impacting Windows Azure service management in a number of regions. Windows Azure engineering teams developed, validated and deployed a fix that resolved the issue for the majority of our customers. Some customers in 3 sub regions - North Central US, South Central and North Europe - remain affected. Engineering teams are actively working to resolve the issue as soon as possible. We will update the Service Dashboard hourly until this incident is resolved."

Update No. 3 (12:30 pm ET): I missed this February 29 piece on Data Center Knowledge, which says Microsoft officials confirmed earlier that a cert issue (which sounds like it is Leap Year-related) does seem to be to blame for at least some of what's gone wrong.

From that post: "Microsoft said the Azure service management problems were caused by 'a cert issue triggered on 2/29/2012' – presumably a date-related glitch with a security certificate triggered by the onset of the Feb. 29th 'Leap Day' which occurs once every four years."

Update No. 4 (3:15 pm ET): No new update from Microsoft for the past three hours, but it doesn't look like things are resolved by a long shot.

I am hearing from more and more customers that they are being affected across a variety of Azure services. A new check on the status dashboard is showing SQL Azure Data Sync is down for most of the U.S. Compute is still iffy in North Central and South Central U.S., as well as Northern Europe. Service Bus is down totally in South US. And Service Management is still totally down worldwide. ZDNet UK is likewise monitoring the dashboard and keeping up with the latest service degradation and outage reports across the Azure service stack.

Update No. 5 (8:00 pm ET): Microsoft's Bill Laing, the head of Server and Cloud, has a new blog post with the latest on the Azure problems. He says the root problem does, indeed, seem to stem from "a time calculation that was incorrect for the leap year."

From Laing's update, which noted that even after a fix was applied, some customers still had issues:

"(S)ome sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues.  Customers should refer to the Windows Azure Service Dashboard for latest status. Windows Azure Storage was not impacted by this issue."

Microsoft plans to share more of its analysis of the root cause of today's outage once it is resolved, Laing added.
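
For readers wondering how a leap-day time calculation can go wrong, here is a minimal Python sketch of one common failure pattern. It is not Microsoft's code, and nothing Microsoft has said so far describes how the actual calculation was written. The function names and the one-year validity period are assumptions made purely for illustration.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical buggy approach: bump the year, keep month and day.

    This works on every day except Feb. 29, because the following
    year has no Feb. 29 and the date constructor rejects it.
    """
    return date(start.year + 1, start.month, start.day)


def safe_one_year_later(start: date) -> date:
    """One possible fix: fall back to Feb. 28 when the target date does not exist."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here; clamp to the last valid day of February.
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    print(naive_one_year_later(date(2012, 2, 28)))  # a normal day works
    try:
        naive_one_year_later(date(2012, 2, 29))
    except ValueError as err:
        print("leap day failure:", err)
    print(safe_one_year_later(date(2012, 2, 29)))
```

In this sketch the naive version behaves for 1,460 days out of every 1,461 and then raises an error on Feb. 29, which is the kind of bug that can sit unnoticed until the calendar triggers it. Clamping to Feb. 28 is only one possible policy; rolling forward to March 1 would be an equally defensible choice.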

Update No. 6 (7:45 am ET on March 1): The dashboard is looking almost all green this morning, with the exception of some ongoing performance degradation in the South Central US region. Otherwise, it looks like it's all systems go for Azure customers.
