Microsoft's Windows Azure has a meltdown

Summary: Customers on Microsoft's Azure cloud platform are reporting they've been down for hours.

Customers are reporting that Microsoft's Windows Azure cloud platform has been experiencing a major meltdown in geographies across the world.

The Register is reporting that problems began around 9 p.m. ET on February 28. This morning, as of 11 a.m. ET on February 29, I am still hearing from customers affected by the problem.

One of my readers wrote in:

"The Windows Azure Service Management is / has been down worldwide for about 12 hours or more. Have a look at the Service Dashboard of Windows Azure (if you can reach it): https://www.windowsazure.com/en-us/support/service-dashboard/. It looks like the Service Management works again, but despite that I just see more warnings and errors popping up on the dashboard latest hours...

"I'm really astonished how this can happen world wide, and for such a long time. And glad we don't have anything in production yet (just playing around so far). How reliable and mature is Azure at the moment?"

I can't access the Azure dashboard myself.

Update No. 1 (11:25 am ET): I finally got the dashboard to load. I see it saying that the Management Service is still experiencing an outage worldwide. Compute and Access Control services are experiencing "performance degradation."

But another Azure customer told me that "reports are that 6.7, 28 and 35 percent of users are experiencing problems in the three data centers. Report says they’re investigating the cause of the problem."

ZDNet UK reported that the initial Azure problems began with an outage in the Windows Azure Management Service technology, which then spread to the Windows Azure Compute and Access Control parts of the platform. Affected areas included North Europe, North Central US and South Central US regions, ZDNet UK said.

I've seen some speculating on Twitter that all of these problems could stem from some kind of Leap Year bug. Microsoft officials said they had an update for me. I will add it to this post once I get it and will continue tracking the issue.

Update No. 2 (12:05 pm ET): Here's an update from a Microsoft Azure spokesperson. Still no word from Microsoft as to what is causing the rolling series of problems:

"On February 28th, 2012 at 5:45 PM PST Microsoft became aware of an issue impacting Windows Azure service management in a number of regions. Windows Azure engineering teams developed, validated and deployed a fix that resolved the issue for the majority of our customers. Some customers in 3 sub regions - North Central US, South Central and North Europe - remain affected. Engineering teams are actively working to resolve the issue as soon as possible. We will update the Service Dashboard hourly until this incident is resolved."

Update No. 3 (12:30 pm ET): I missed this February 29 piece on Data Center Knowledge, which says Microsoft officials confirmed earlier that a cert issue (which sounds like it is Leap Year-related) does seem to be to blame for at least some of what's gone wrong.

From that post: "Microsoft said the Azure service management problems were caused by 'a cert issue triggered on 2/29/2012' – presumably a date-related glitch with a security certificate triggered by the onset of the Feb. 29th 'Leap Day' which occurs once every four years."

Update No. 4 (3:15 pm ET): No new update from Microsoft for the past three hours, but it doesn't look like things are resolved by a long shot.

I am hearing from more and more customers that they are being affected across a variety of Azure services. A new check on the status dashboard is showing SQL Azure Data Sync is down for most of the U.S. Compute is still iffy in North Central and South Central U.S., as well as Northern Europe. Service Bus is down totally in South US. And Service Management is still totally down worldwide. ZDNet UK is likewise monitoring the dashboard and keeping up with the latest service degradation and outage reports across the Azure service stack.

Update No. 5 (8:00 pm ET): Microsoft's Bill Laing, the head of Server and Cloud, has a new blog post with the latest on the Azure problems. He says the root problem does, indeed, seem to stem from "a time calculation that was incorrect for the leap year."

From Laing's update, which noted that even after a fix was applied, some customers still had issues:

"(S)ome sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are actively working to address these remaining issues.  Customers should refer to the Windows Azure Service Dashboard for latest status. Windows Azure Storage was not impacted by this issue."

Microsoft plans to share more of its analysis of the root cause of today's outage once it is resolved, Laing added.
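
Laing's post doesn't spell out the faulty calculation, so what follows is only a generic sketch of how a leap-day time calculation can go wrong. It is not Microsoft's code. It assumes a hypothetical routine that computes a date one year out (say, for an expiry date) by bumping the year and keeping the month and day. The function names are made up for illustration.

```python
from datetime import date
import calendar


def naive_add_one_year(d: date) -> date:
    # Hypothetical buggy approach: keep month and day, increment the year.
    # For Feb 29 this asks for a date that does not exist and raises ValueError.
    return d.replace(year=d.year + 1)


def safe_add_one_year(d: date) -> date:
    # One possible fix: clamp the day to the last valid day of the target month.
    target_year = d.year + 1
    last_day = calendar.monthrange(target_year, d.month)[1]
    return date(target_year, d.month, min(d.day, last_day))


leap_day = date(2012, 2, 29)

try:
    naive_add_one_year(leap_day)
except ValueError as exc:
    print("naive calculation failed:", exc)

print("clamped result:", safe_add_one_year(leap_day))
```

On any other day of the year the two functions agree, which is how a bug of this kind can sit unnoticed until February 29 rolls around.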

Update No. 6 (7:45 am ET on March 1): The dashboard is looking almost all green this morning, with the exception of some ongoing performance degradation in the South Central US region. Looks like it's all systems go for Azure customers.

Topics: Microsoft, Operating Systems, Software, Windows

About

Mary Jo has covered the tech industry for 30 years for a variety of publications and Web sites, and is a frequent guest on radio, TV and podcasts, speaking about all things Microsoft-related. She is the author of Microsoft 2.0: How Microsoft plans to stay relevant in the post-Gates era (John Wiley & Sons, 2008).

Talkback

41 comments
  • It just works. Ouch!

    Ouch!
    cgmergel@...
    • Ya. Virtual proof that...

      ...the cloud doesn't work yet. I have been saying this for the better part of 2 years.

      THE CLOUD IS NOT READY FOR PRIME TIME.

      Believe it.

      Its a fact. Its true and the proof exists in living color. Nobody has it right yet. Not MS not nobody.

      Consumer beware. Be very aware.
      Cayble
  • Imagine that

    A Microsoft product that is not reliable or scalable. Who would have thought?

    Some of us knew all along Windows is not up to the task of REAL computing needs.
    itguy10
  • Was iCloud affected?

    I remember hearing that Azure was going to be part of the management of iCloud and we all have to know, was it affected?
    nucrash
    • Nothing here...

      All my stuff is fine.
      SamWilkinson
      • Vague article...

        It is unclear, but it seems that sites hosted on Azure were fine, but that the management portal was down....Is this what you understand? Our site has been up and running fine and we are in those supposedly affected datacenters.
        gomigomijunk
    • Development only

      I believe that was for development only.

      Makes no sense Apple would spend billions for their NC datacenter and not use it to power iCloud.
      itguy10
      • Your comments...

        rarely make any sense. Still making us it guys look bad I see.
        kstap
    • It's just the Service Management so far...

      It's just the Service Management so far, so while iCloud and other services may seem unaffected, the administrative part of the service (the side Apple is monitoring) appears to be down.

      It's unclear how it'll affect iCloud.
      olePigeon
  • LEAP YEAR!

    Could it be that it's a leap year and today is 2/28/12?
    Could a simple leap day have kicked them in the N*ts?
    Food for thought!
    mrhappysbluegrass@...
    • Y2K all over again LOL

      When CPM was written they never thought that it would still be in use in 2000. When Gates bought it and it became DOS no one thought about it. Then came Y2K. So perhaps the MS folks forgot about leap year too?
      john_gillespie@...
  • Phew! My sites are okay :)

    I have sites hosted in the affected NA datacenter and they're live. Looks like Microsoft may have moved affected sites to unaffected racks - or I was just lucky.
    bitcrazed
    • I spoke too soon.

      Just after I posted this, Pingdom emailed me to tell me our main site was down. It crashed at 09:23 and wasn't back until 00:25!

      As I stated elsewhere, Microsoft really needs to find out why it took so long for their systems to recover from the outage once the patch was applied. We all accept that faults will happen and bugs are a fact of life. But its how quickly they can recover from a major outage that will define their success moving forward.

      It's a shame this happened because, until yesterday, we've been enjoying great service uptime and perf' from Azure.
      bitcrazed
  • Small price to pay

    Nothing special ... any cloud based service will from time to time experience technical difficulties ... doesn't matter if it's one of the big players like Microsoft and Sony or smaller fish like Steam... Yes some people will be affected, yes for some it will mean a "slow work day"... no biggie, it will as with all of those services be fixed... storing your data online DOES mean it might at times be unavailable.. I would guess people understood that when dumping their stuff in the cloud.. it's a small price to pay for having everything everywhere anytime...
    DJK2
    • A 12-hour world-wide outage is a bit more

      than a technical difficulty
      baggins_z
      • Just a click bait headline. A few geographies are having management access

        issues. All the hosted site/services WW are running great with full access to all data.
        Johnny Vegas
    • No Biggie?

      For a Azure customer with say a few hundred employees standing around doing nothing for an entire day, that means thousands of dollars of salary, and maybe up to hundreds of thousands of dollars worth of lost business.

      No biggie indeed.
      anothercanuck
      • AMEN! The CLOUD is NOT the solution its Marketed as

        This is a perfect example why the cloud is not the end all /be all in computing and more importantly shows why the Cloud is just like anything else, a tool that's good for specific scenarios and not an all-around "works for everyone" solution as it is being sold to the public as.

        The push to move everyone to the cloud is :

        1) To create a recurring revenue stream in which the service provider/vendor has full access/control over their clients use of the application/service provided.

        2) Centralize management/storage of as much commerce/activity on the net as possible. If you're the Federal Government or one of its agencies, the Cloud is the best thing to come along since the Patriot Act. The cloud will enable law enforcement to more easily get at anyone's information. And NO level of promises or assurances from any cloud provider will change that.

        The sales staff and even the legal staff for a cloud services provider can promise/guarantee all they like but when it comes down to it, do you really think they will fight as hard as you would to keep the government from snooping around in your data? The answer is of course NO.

        You have only your company to protect whereas the Cloud provider has many and they will NOT go to all ends to protect you and your data like you would. Think carefully about the Cloud.
        BlueCollarCritic
    • RE: Small price to pay....

      "it's a small price to pay for having everything everywhere anytime...", except of course when it's down for hours/days at a time...
      douglas_john_ledet@...
    • Everywhere anytime?

      That has nothing to do with the cloud - those are software features that have existed for a long time.

      And of course you have to take the "anytime" out when its down . . . right? :-)
      Andy Blevins