Microsoft's Azure cloud leap-day meltdown

Summary: Everyone makes mistakes, but for Microsoft to make a killer leap day blunder with its Azure cloud service is inexcusable.

Sometimes Microsoft makes great programs, such as Windows Server 2008 R2 and Windows 7 SP1. And sometimes it blows it, as with Vista and, from what I've seen so far, Windows 8. But every now and again Microsoft fouls up in such a spectacular fashion that I'm left to wonder how anyone can use its software for mission-critical work. There was the London Stock Exchange failure, which is one reason why almost all the world's leading stock exchanges now use Linux. Microsoft's Azure cloud collapse may prove to be a similar turning point for Microsoft's cloud service.

In case you missed it, on the same day Microsoft fans were slapping themselves on the back for getting the Windows 8 Consumer Preview out the door, Microsoft's Windows Azure Platform-as-a-Service (PaaS) cloud suffered a worldwide meltdown. For almost 36 hours, Windows Azure Service Management was down.

Even after Microsoft had a fix in, faults continued to spread across the Azure cloud in America and Northern Europe. Even as some areas came back up, Compute functionality in the North Central US, South Central US and North Europe regions was downgraded or even turned off on a range of Azure services.

What caused Azure to fall down and go boom? Microsoft hasn't really spelled out what happened yet but, according to Bill Laing, Microsoft's Corporate VP of Server and Cloud, "Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year."
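Microsoft hasn't said which calculation went wrong, so what follows is only a hypothetical Python sketch of the general class of leap-day bug, not Azure's actual code. The function names are made up for illustration. It shows how "same date, one year later" breaks on February 29 when the code simply bumps the year, and one possible way to guard against it.

```python
from datetime import date


def one_year_later_naive(start: date) -> date:
    """Hypothetical buggy version: bump the year, keep month and day.

    Works on every day except February 29, where the target date
    (for example February 29, 2013) does not exist.
    """
    return start.replace(year=start.year + 1)


def one_year_later_safe(start: date) -> date:
    """Hypothetical guarded version: fall back to February 28.

    Falling back to February 28 is an assumption made for this sketch;
    rolling forward to March 1 is an equally defensible policy.
    """
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only February 29 can fail here, so clamp the day to 28.
        return start.replace(year=start.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    one_year_later_naive(leap_day)
except ValueError as err:
    print("naive calculation failed:", err)

print("safe calculation:", one_year_later_safe(leap_day))
```

The point is that the naive version passes every test run on the other 1,460 days of a four-year cycle, which is why a test case pinned to February 29 is the only reliable way to catch it.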

Well, who could blame Microsoft for that? I mean how often do we get a leap year... Oh wait, we get a leap year once every four years! Who knew? Apparently not Microsoft's developers.

This is incredible. How in the world can a company the size of Microsoft make such a simple, stupid mistake as not accounting for a leap day in its most important cloud service? How can any business trust a cloud that can go out of service because of a programming blunder that would get a failing mark in a software development 101 class? I don't know. I really don't.

I do know that businesses putting all their computing eggs into one Azure basket led to untold damages. If you want to continue to take chances with Azure, good for you. Just be ready to explain to your board of directors exactly why you thought trusting Azure was a smart move. Good luck with that.

Azure's failure, while an especially spectacular one, reminds me again just how vulnerable any business that puts its trust in the cloud model is. No cloud, not even one built on Linux or open-source cloud technologies such as Eucalyptus and OpenStack, is immune to major problems. You need to carefully plan for cloud failures no matter whose cloud you use.

That said, I will also say that in the open-source model, where with many eyeballs on the code all bugs are shallow, I'm sure that we'll never see a kiddie programming mistake take out a global cloud the way Azure fell apart. Clouds are dangerous enough as they are for enterprises. If you can't trust their code, how can you trust your company's business to them at all?

Talkback

49 comments
  • But ... but ...

    ... no one ever got fired for choosing Microsoft.
    kludd
    • People should start

      Fire 'em for choosing Microsoft.

      How many times must they blunder or how many $$'s must be lost due to Microsoft's Ineptitude?
      itguy10
      • That is why there are so many Linux openings

        people routinely are fired for implementing Linux.

        It is not a question of if it will be hacked or fail, but when. :|
        Tim Cook
    • Clouds? There's more than MS Azure at play here. Reread the article.

      [i]No cloud, not even one built on Linux or open-source cloud technologies such as Eucalyptus and OpenStack is immune to major problems. You need to carefully plan for cloud failures no matter whose cloud you use.[/i]

      Sounds like Steven is more down on Clouds than he is Microsoft.

      [i]That said I will also say that in the open-source model, where with many eyeballs on the code all bugs are shallow, I'm sure that we'll never see a kiddie programming mistake take out a global cloud the way Azure fell apart. Clouds are dangerous enough as they are for enterprises, if you can't trust their code, how can you trust your company's business to them at all?[/i]

      *This* is what everyone should be fighting over. Open-source code is more bug-free than proprietary code? Really!? I wonder what percentage of the Linux fans that post here download and review the source code.

      If Steven's assertion *is* true, then shouldn't the open-source Chromium browser be more bug-free than the proprietary Chrome browser? And shouldn't CyanogenMod be more bug-free than the proprietary Android on the pretty devices you buy? Since Steven doesn't wave his pom-poms for Chromium and CyanogenMod, I say [b]hypocrisy[/b].
      Rabid Howler Monkey
  • Wow back to the tired discredited LSE FUD dead horse again?

    Still ignoring the fact that it was a bug in Accenture's application code and nothing to do with Windows, .NET, SQL, or anything else from MS. That system was faster and better on less hardware than what replaced it and would still be in use today (and even faster on the same hw) if it had been properly tested first. As for Azure, the damages aren't untold from what I've read on ZDNet and elsewhere. It was almost completely contained to the management sites. A small percent of sites saw their performance decrease by up to 6%. The other 90+% had no impact at all. This was a pin prick when compared to the (numerous) Amazon or Google cloud outages. And as for the FOSS "many eyeballs" FUD, well, we've all seen numerous examples of what a crock that is. MS still provides the most secure and reliable stack and outperforms on both end product and developer productivity. You are taking a much bigger risk of failure and security breach, with a lot more explaining to do to the board, if you go FOSS than MS. Yet another lame piece of completely biased FUD. ZDNet, you really should implement the +- for the posts as well as the talkbacks.
    Johnny Vegas
    • The great thing about outsourcing...

      ...is that there's always someone else to blame.

      Never mind that MS apparently failed to test for this particular contingency.

      Reply to otaddy:

      Then maybe customers should be more careful about patronizing big companies.
      John L. Ries
      • I agree, they own it, their responsibility to test and accept it.

        But this is a problem all big companies face: trying to ensure consistency in their efforts. The quality of the various teams across the company varies greatly.

        Not trying to make excuses for MS here, they failed big time in their quality control efforts, but I understand the difficulties of working for a giant company.
        otaddy
    • Dumber than bricks

      That's what you are. There's quite a few cases out there where companies have relied on Azure and been hung out to dry and it has cost them big. So, it's "not many" but it still hurts for those affected. Oh, and just because other providers have had problems does not give Microsoft a pass.

      The rest of your post is pants, starting with your excusing of the LSE fiasco. Microsoft's systems were at the heart of the problems there. Microsoft have not released enough information about the problem and why it's still bouncing around the world. So, how do you think you know anything about it?
      ego.sum.stig
      • re lse i have read back at the time what the cause was and it was an

        application error not anything to do with Windows or .NET or SQL. As for this Azure issue, I know nothing about the root cause beyond what has been said about certificates, but I have also read about the limited scope and impact of it and that's precisely what I addressed. It had zero impact on almost all running services/apps and negligible impact on the few that remained. If you have links to any previous service interruptions please post them, I would love to read them.
        Johnny Vegas
      • Of course he's dumber than bricks

        because we're only supposed to talk about MS's mistakes, we can't point out companies that have relied on Linux and been hung out to dry and it has cost them big, because that's not how things are supposed to work around here.
        William Farrel
      • Microsoft Hater!

        This article is another great example of ignorance and misinformation.
        jhenriks79
    • A bug in Linux based appliance code

      Mr. Vaughan-Nichols proudly points out that Linux is running many a router or switch.

      Until those routers and switches fail, then he is as quiet and secreted as the proverbial church mouse.
      :|
      Tim Cook
      • Just like Android is Linux

        and is touted whenever someone mentions the failures of desktop linux.
        otaddy
      • Get back to him when this level of failure is reported

        As in...

        [i]"According to a Microsoft dashboard update at 2:30pm around 37 percent of Azure Compute services in North Europe, 6.7 percent of North Central US and 28 percent of South Central US, were affected by the problems."[/i]

        And a cynic might suggest that the party in question might just be understating the problems, but that's just a cynic talking.
        ego.sum.stig
    • Agreed

      "Azure Cloud" did NOT go down. The management portals were unavailable, yes. But, Azure did not go down. Both Amazon's and Google's ACTUAL clouds affecting actual businesses and sites have gone down in the last year. See: http://techcrunch.com/2011/09/09/google-explains-its-google-docs-outage/ and http://www.computerworld.com/s/article/9216064/Amazon_gets_black_eye_from_cloud_outage
      cmoya
  • It is rather embarrassing

    You'd think that there would still be 5 or 6 Y2K veterans left at MS to warn people about such things. The moral of the story, however, is that it can happen to just about anyone; the only real thing that can be done to prevent such embarrassments is adequate testing (I think Mr. Ballmer may want to pay his testers a visit real soon).

    Note to SJVN: This article really belongs in your Networking blog, not "Linux and Open Source".
    John L. Ries
    • It's more than embarrassing...

      I do understand the perils of date logic, although I've never known anyone to slip up on "Leap Day" logic before. I'd have thought that entering / exiting Daylight Saving Time would have been the more likely candidates: even our team goofed on that one.

      [b]Although we discovered our error via TESTING![/b]
      Zogg
  • Biased commentary: the only bug-free process of mankind

    Any approach comes with a cost, some of which is founded on risk. This article, and others like it, make it sound like Azure is all alone in this regard. Of course, it is not. No matter where you run, you cannot hide from human error - you can only manage it. In my experience, most companies don't even make the attempt. One thing can be said in favor of Azure (or any cloud provider): it's NOT the Wild West of your average small business IT, worked by cowboys and overseen by Sunday-morning preachers.
    scH4MMER
    • To quote the author

      "Everyone makes mistakes".

      But I hope my mistakes are less obvious (or disastrous) than this one.
      John L. Ries
      • Considering what people think the mistake is

        They have every reason to suspect that the work on Azure was done by "cowboys and overseen by Sunday-morning preachers."
        ego.sum.stig