Microsoft's Azure cloud leap-day meltdown

Everyone makes mistakes, but for Microsoft to make a killer leap day blunder with its Azure cloud service is inexcusable.

Sometimes, Microsoft can make great programs, such as Windows Server 2008 R2 and Windows 7 SP1. And sometimes it can blow it, as with Vista and, from what I've seen so far, Windows 8. But every now and again Microsoft fouls up in such a spectacular fashion that I'm left to wonder how anyone can use its products for mission-critical work. There was the London Stock Exchange failure, which is one reason why almost all the world's leading stock exchanges now use Linux. Microsoft's Azure cloud collapse may prove to be a similar turning point for Microsoft's cloud service.

In case you missed it, on the same day Microsoft fans were slapping themselves on the back for Windows 8 Consumer Preview getting out the door, Microsoft's Windows Azure Platform-as-a-Service (PaaS) cloud suffered a worldwide meltdown. For almost 36 hours, Windows Azure Service Management was down.

Even after Microsoft had a fix in, faults continued to spread across the Azure cloud in America and Northern Europe. Even as Compute functionality came back up in some areas, functionality in the North Central US, South Central US and North Europe regions was downgraded or even turned off on a range of Azure services.

What caused Azure to fall down and go boom? Microsoft hasn't really spelled out what happened yet, but according to Bill Laing, Microsoft's Corporate VP of Server and Cloud, "Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year."

Well, who could blame Microsoft for that? I mean, how often do we get a leap year... Oh wait, we get a leap year once every four years! Who knew? Apparently not Microsoft's developers.
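For anyone wondering what a leap-year time calculation bug can look like, here is a minimal sketch in Python. To be clear, Microsoft hasn't published the offending code, so this is only an illustration of the general class of mistake, not what Azure actually does. The function names are made up for the example.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Hypothetical buggy helper: bump the year, keep month and day.

    Works on 365 days out of every 366, then raises ValueError
    when d is February 29, because that day does not exist next year.
    """
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Hypothetical fix: fall back to February 28 when the day is missing."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only February 29 can fail here; clamp to the last day of February.
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

Whether falling back to February 28 or rolling forward to March 1 is the right answer depends on what the date is for. The point is that someone has to make that choice on purpose, and test it, before leap day arrives.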

This is incredible. How in the world can a company the size of Microsoft make such a simple, stupid mistake as not accounting for a leap day in its most important cloud service? How can any business trust a cloud that can go out of service because of a programming blunder that would get a failing mark in a software development 101 class? I don't know. I really don't.

I do know that businesses that put all their computing eggs into one Azure basket suffered untold damages. If you want to continue to take chances with Azure, good for you. Just be ready to explain to your board of directors exactly why you thought trusting Azure was a smart move. Good luck with that.

Azure's failure, while an especially spectacular one, reminds me again just how vulnerable any business that puts its trust in the cloud model is. No cloud, not even one built on Linux or open-source cloud technologies such as Eucalyptus and OpenStack, is immune to major problems. You need to carefully plan for cloud failures no matter whose cloud you use.

That said, I will also say that in the open-source model, where, with many eyeballs on the code, all bugs are shallow, I'm sure that we'll never see a kiddie programming mistake take out a global cloud the way Azure fell apart. Clouds are dangerous enough as they are for enterprises. If you can't trust their code, how can you trust your company's business to them at all?
