Lesson learned from the Outlook.com outage

Summary: A holistic approach to the entire infrastructure, not just IT components, is necessary for maximum reliability.

TOPICS: Data Centers

When Microsoft’s Outlook.com and Hotmail.com email services started becoming unavailable to users on the afternoon of March 12, it was inevitable that pundits would blame the migration of millions of users from the Hotmail service to the new Outlook.com enterprise. But given the scope of the beta process and the high-profile nature of the migration, it seemed unlikely that Microsoft would have missed something that could take the service offline for the 16 hours of interruption that users experienced.

Microsoft’s root cause analysis, published on Wednesday afternoon, traced the problem not to the IT load equipment in the datacenter, but to a software failure in the HVAC management system after a software upgrade was applied. When the cooling and air management systems failed to operate properly, the temperature spike in the datacenter triggered a cascade of failures as the automated measures in place to protect IT load and data took those servers offline. The shutdowns also apparently affected the automated failover process: rather than being switched to the failover systems, users were locked out of their Hotmail and Outlook.com resources.
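The protective behavior described above can be sketched in a few lines of Python. This is an illustration only, not Microsoft's implementation: the `Server` class, the 40 °C limit, and the `pick_site` helper are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical limit; real thresholds depend on the hardware and the operator.
SHUTDOWN_TEMP_C = 40.0


@dataclass
class Server:
    name: str
    inlet_temp_c: float
    online: bool = True


def apply_thermal_protection(servers, limit_c=SHUTDOWN_TEMP_C):
    """Take every server whose inlet temperature exceeds the limit offline."""
    for server in servers:
        if server.online and server.inlet_temp_c > limit_c:
            server.online = False
    return servers


def pick_site(sites):
    """Return the name of the first site with an online server, else None."""
    for site_name, servers in sites.items():
        if any(server.online for server in servers):
            return site_name
    return None


# Example: cooling fails at the primary site while the secondary stays cool.
sites = {
    "primary": [Server("p1", 45.0), Server("p2", 47.0)],
    "secondary": [Server("s1", 24.0)],
}
before = pick_site(sites)
apply_thermal_protection(sites["primary"])
after = pick_site(sites)
print(before, "->", after)
```

In this sketch the site selection only looks at which servers are still online, so a thermal shutdown at one site shifts traffic to the other. According to Microsoft's account, the real failover did not complete this cleanly, which is why human intervention was needed.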

Microsoft also found itself facing a failure that automated procedures alone could not mitigate and that required actual human involvement. The company made it clear in its explanation of the service interruption that this was not the way the process was supposed to work, and that it was part of the reason the interruption lasted as long as it did.

In many ways this highlights a traditional datacenter problem: facilities management and IT not being a tightly integrated operation but rather two groups with their own goals and priorities. I don’t know if this was the exact case in this incident, but a question that IT should perhaps now ask is “what happens if facilities turn off the A/C in the datacenter?”

As automation becomes the de facto standard for datacenter operations, it seems somewhat obvious that the entire building that houses the datacenter needs to be part of that automation. Smart Building design isn’t a new technology; facilities have been working on automating the behavior of their HVAC and power systems to maximize efficiency in office buildings for years. If software upgrades need to be tested before being deployed on HVAC systems, then that needs to be factored into the overall equation. High-reliability environments have to be able to withstand failures throughout their infrastructure, not just in those components dealing directly with the IT load.

  • HVAC probably ran Windows

    And that was the root cause of the failure.

    Microsoft has no clue about uptime, reliability, or resiliency.
    • Show me one company that has no failure

      Amazon, Google and so on all had a ton, idiot.
    • Where is the evidence?

      I didn't see anywhere that it said the HVAC system was running on Windows.
      Your credibility becomes nothing when you slam something with no proof.
      • Re: Where is the evidence?

        So it's alright for LBiege to call the guy an idiot, for no reason, but because Itguy10 said something negative about MS, you call him (Itguy10) out?
        Talk about your double standards!

        And no, I don't hate or dislike MS. Just pointing out that you can't condone one and not the other......

    • Good findings, now continue your janitor duties.

      • They should let janitors do their work at Microsoft...

        ... instead of having them manage the outlook.com - :D
        No offense to janitors :)
        • Windows - so easy a janitor can do it.

          Pun intended
          William Farrel
    • So what you're saying is that your brain runs on Windows

      Since you always equate failure with MS
      William Farrel
  • I call BS on that explanation. The first part sure fine, the failover part

    is a crock. There should be no way any of that could impact failover. The failover should be designed so that the whole building could go to zero connectivity and zero power immediately without any warning or be completely blown to bits and still work because it has nothing to do with anything on premises. You don't get incremental shutdown in an earthquake or an airplane strike or a terrorist attack. It's routing outside the datacenter that notices the servers there are not responding and sends requests elsewhere. The only thing that should be missing is the data from the last second or two that hadn't made it out to the geo replication backups.
    Johnny Vegas
    • No kidding. Is MS really trying to tell us that

      they have all their eggs in one basket?
  • Feeling Bamboozled?

    So are those who switched over from Gmail now feeling Bamboozled?
  • Oh, dear! Another cloud failure!

    Live by the cloud, die by the cloud.

    • Not really

      Another Microsoft failure.

      There; fixed for you
  • Let me get this straight

    So all of Outlook is housed in one data centre? And if that isn't the case, why should one data centre failure take down the whole system?

    I smell BS or a poorly configured system.
    Alan Smithie
  • Outlook/Hotmail failure

    The domino-like chain of events sounds similar to the Chernobyl disaster in some respects.
  • Lesson learned from the Outlook.com outage

    I had no problems accessing my mail. I must get it from a different data center.