Microsoft has publicly released a preliminary root cause analysis (RCA) for its September 4 cloud outage, which affected customers worldwide. The Azure engineering teams are continuing to investigate the incident and say they will provide a more detailed analysis "in the weeks ahead."
Impacted customers will receive a credit based on the Microsoft Azure Service Level Agreement in their October billing statements, Microsoft officials said in the post-mortem report.
On September 4, as I blogged originally, a lightning strike hit near Microsoft's South Central US datacenter region, knocking out a number of Azure services, as well as Office 365, which authenticates via Azure Active Directory, for many Microsoft customers worldwide.
Microsoft's post-mortem summary noted that the storm caused "electrical activity on the utility supply, which caused significant voltage swells." These swells caused part of one Azure datacenter to transfer to generator power and shut down the datacenter's cooling systems, even though surge suppressors were in place. The datacenter initially maintained required operational temperatures through a load-dependent thermal buffer in the cooling system, but once that buffer was depleted, temperatures rose and an automated shutdown of devices was initiated.
Some hardware was damaged before it could shut down, including a "significant number of storage servers" as well as other network devices and power units. Onsite teams began attempts to recover the infrastructure, which meant replacing failed hardware, migrating customer data from failed servers to healthy ones and validating that data wasn't corrupted.
For those wondering why Microsoft's datacenter didn't fail over to a backup site: "The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication," officials explained in the post.
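To make that trade-off concrete, here is a minimal Python sketch of asynchronous replication in general, not of Azure's actual implementation. The class `AsyncReplicatedStore` and its methods are hypothetical names invented for illustration. The idea is that writes are acknowledged by the primary before they reach the secondary, so anything still waiting to be shipped at the moment of failover is lost.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy model of a primary site with an asynchronously updated secondary.

    Hypothetical illustration only; names and behavior are assumptions,
    not a description of how Azure Storage geo-replication is built.
    """

    def __init__(self):
        self.primary = []        # records acknowledged to the client
        self.secondary = []      # records that have reached the remote site
        self._pending = deque()  # acknowledged but not yet replicated

    def write(self, record):
        # The write is acknowledged as soon as the primary has it.
        self.primary.append(record)
        self._pending.append(record)

    def replicate(self, max_records):
        # Ship up to max_records pending writes to the secondary.
        shipped = 0
        while self._pending and shipped < max_records:
            self.secondary.append(self._pending.popleft())
            shipped += 1
        return shipped

    def fail_over(self):
        # The primary is gone: whatever was still pending never arrives.
        lost = list(self._pending)
        self._pending.clear()
        return lost


store = AsyncReplicatedStore()
for i in range(5):
    store.write(f"write-{i}")

store.replicate(max_records=3)   # replication lags behind the writes
lost = store.fail_over()
print("lost on failover:", lost)
```

In this toy run the last two acknowledged writes never reach the secondary. Recovering the primary in place, as Microsoft chose to do, avoids that loss at the cost of a longer outage.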
The shutdown of the datacenter impacted many Azure services that depended on the storage servers in that datacenter. Among the services hit: Storage, Virtual Machines, Application Insights, Cognitive Services & Custom Vision API, Backup, App Service (and App Services for Linux and Web App for Containers), Azure Database for MySQL, SQL Database, Azure Automation, Site Recovery, Redis Cache, Cosmos DB, Stream Analytics, Media Services, Azure Resource Manager, Azure VPN gateways, PostgreSQL, Azure Machine Learning Studio, Azure Search, Data Factory, HDInsight, IoT Hub, Analysis Services, Key Vault, Log Analytics, Azure Monitor, Azure Scheduler, Logic Apps, Databricks, ExpressRoute, Container Registry, Application Gateway, Service Bus, Event Hub, Azure Portal IaaS Experiences, Bot Service, Azure Batch, Service Fabric and Visual Studio Team Services (VSTS).
Microsoft says "the vast majority of these services were mitigated by 11:00 UTC on September 5," but acknowledges that full mitigation didn't happen until 08:40 UTC on September 7.
Why were customers outside the South Central US region also affected by this series of events? According to the post, there was "insufficient resiliency for Azure Service Manager," the operations-management service for "classic" resource types. "Although ASM is a global service, it does not support automatic failover," Microsoft execs said. Azure Resource Manager services outside the South Central US region were also impacted because of various dependencies on ASM and other related services.
Azure Active Directory was also impacted, officials said, because authentication traffic from the shut-down datacenter was routed to other sites at the same time that the rate of authentication requests increased. The post details what went wrong with VSTS, Azure Application Insights and other key services during that series of events in early September.
Microsoft execs said they apologize to affected customers and are looking for ways to improve architectural resiliency after this event. The company is doing a detailed forensic analysis of the impacted datacenter hardware and systems; a review of every internal service with dependencies on the Azure Service Manager; an investigation of the possibility of moving these ASM-dependent services to Azure Resource Manager; and an evaluation of future hardware design of storage units to increase resiliency.