On September 28 and September 29 this week, a number of Microsoft customers worldwide were hit by a cascading series of problems that left many unable to access their Microsoft apps and services. On October 1, Microsoft posted its post-mortem about the outages, outlining what happened and the next steps it plans to take to head off this kind of issue in the future.
Microsoft acknowledged it was a service update targeting an internal validation test ring that caused a crash in Azure AD backend services. "A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process," officials said.
Azure AD is designed to be geo-distributed and deployed with multiple partitions across multiple data centers around the world, and is built with isolation boundaries. Microsoft normally applies changes across a validation ring that doesn't include customer data, followed by four additional rings over the course of several days before they hit production. But this week, the SDP didn't correctly target the validation ring because of a defect, and all rings were targeted concurrently, causing service availability to degrade, Microsoft's report says.
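To make the ring idea concrete, here is a minimal Python sketch of a gated, ring-by-ring rollout next to the ungated failure mode described above. It is an illustration only, not Microsoft's SDP: the ring names, the `deploy` and `is_healthy` callbacks, and both function names are hypothetical, and the days of soak time between rings are left out.

```python
from typing import Callable, List

# Hypothetical ring names: one validation ring with no customer data,
# then four further rings, as the article describes.
RINGS = ["validation", "ring-1", "ring-2", "ring-3", "ring-4"]


def staged_rollout(
    rings: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one ring at a time and stop at the first unhealthy ring.

    Returns the rings that received the change, in order.
    """
    touched = []
    for ring in rings:
        deploy(ring)
        touched.append(ring)
        if not is_healthy(ring):
            break  # halt here: later rings never receive the bad change
    return touched


def concurrent_rollout(rings: List[str], deploy: Callable[[str], None]) -> List[str]:
    """The failure mode: every ring is targeted at once, with no health gate."""
    for ring in rings:
        deploy(ring)
    return list(rings)


if __name__ == "__main__":
    # Simulate a change that is unhealthy wherever it lands.
    print(staged_rollout(RINGS, lambda ring: None, lambda ring: False))
    print(concurrent_rollout(RINGS, lambda ring: None))
```

With a change that fails its health check everywhere, the staged version touches only the validation ring, while the concurrent version reaches all five. That gap is the protection that was lost when the defect caused every ring to be targeted at the same time.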
Microsoft engineering knew something was wrong within five minutes of the problem starting. During the next 30 minutes, Microsoft started taking steps to expedite mitigation by scaling out some Azure AD services to handle the load once a mitigation had been applied and by failing over certain workloads to a backup Azure AD authentication system.
Unfortunately, Microsoft's automated rollback failed due to the corruption of SDP metadata. So the team began manually updating the service configuration by bypassing the SDP system. Microsoft says the entire operation was completed by around 8 p.m. ET. Microsoft says "all service instances with residual impact were recovered" more than two hours after that.
Microsoft officials said they've fixed the latent code defect in the Azure AD backend SDP system, fixed the existing rollback system, and expanded the scope and frequency of rollback operation drills. The team still needs to apply more protections to the Azure AD SDP system to prevent these kinds of issues. It also needs to expedite the rollout of the Azure AD backup authentication system to all key services, and to onboard Azure AD scenarios to the automated communications pipeline so that affected customers learn what's going on within 15 minutes of impact.