Lesson learned from the Outlook.com outage

A holistic approach to the entire infrastructure, not just IT components, is necessary for maximum reliability.

When Microsoft’s Outlook.com and Hotmail.com email services became unavailable to users on the afternoon of March 12, it was inevitable that pundits would blame the migration of millions of users from Hotmail to the new Outlook.com service. But given the scope of the beta process and the high-profile nature of the migration, it seemed unlikely that Microsoft would have missed something capable of taking the service offline for the 16 hours that users experienced.

Microsoft’s root cause analysis, published on Wednesday afternoon, traced the outage not to the IT load equipment in the datacenter but to a software failure in the HVAC management system that followed a software upgrade. When the cooling and air management systems failed to operate properly, the resulting temperature spike triggered a cascade of failures as the automated measures that protect the IT load and its data took servers offline. The shutdowns also apparently disrupted the automated failover process: rather than being switched to the failover systems, users were locked out of their Hotmail and Outlook.com resources.
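The cascade described above can be made concrete with a small sketch. Everything in it is hypothetical: the `Server` type, the 40 °C threshold, and the assumption that the failover pool shares the overheated space are invented for illustration, and none of it reflects Microsoft's actual systems or the details of its analysis. It only shows how a protective shutdown that works exactly as designed can still leave users with nowhere to go.

```python
from dataclasses import dataclass

# Hypothetical inlet-temperature limit; real limits depend on the hardware.
SHUTDOWN_THRESHOLD_C = 40.0


@dataclass
class Server:
    name: str
    inlet_temp_c: float
    online: bool = True


def protective_shutdown(servers, threshold_c=SHUTDOWN_THRESHOLD_C):
    """Take offline every online server above the threshold; return their names."""
    shut_down = []
    for server in servers:
        if server.online and server.inlet_temp_c > threshold_c:
            server.online = False
            shut_down.append(server.name)
    return shut_down


def route_users(primary, failover):
    """Return the pool that can serve users, or None if users are locked out."""
    if any(s.online for s in primary):
        return "primary"
    if any(s.online for s in failover):
        return "failover"
    return None


if __name__ == "__main__":
    # Assumption for illustration: both pools sit in the same overheated space.
    primary = [Server("mail-1", 45.0), Server("mail-2", 47.0)]
    failover = [Server("mail-fo-1", 46.0)]
    protective_shutdown(primary + failover)
    print(route_users(primary, failover))
```

In this toy model the protection logic does its job on every server, yet the service as a whole has no surviving pool, which is the kind of outcome that only shows up when the cooling system is treated as part of the failure model.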

Microsoft also found itself dealing with a failure that automated procedures alone could not mitigate and that required actual human involvement. The company made clear in its explanation of the service interruption that this was not how the process was supposed to work, and that it was part of the reason the interruption lasted as long as it did.

In many ways this highlights a traditional datacenter problem: facilities management and IT are not a tightly integrated operation but two groups with their own goals and priorities. I don’t know whether that was the case in this incident, but a question that IT should perhaps now ask is “what happens if facilities turns off the A/C in the datacenter?”

As automation becomes the de facto standard for datacenter operations, it seems obvious that the entire building that houses the datacenter needs to be part of that automation. Smart Building design isn’t a new technology; facilities teams have been automating the behavior of HVAC and power systems to maximize efficiency in office buildings for years. If software upgrades to HVAC systems need to be tested before they are deployed, then that testing needs to be factored into the overall equation. High-reliability environments have to be able to withstand failures throughout their infrastructure, not just in the components that deal directly with the IT load.