Microsoft's latest cloud authentication outage: What went wrong
Microsoft says an error in the 'rotation of keys' used for authentication was to blame for a roughly 14-hour Azure outage that took down Office 365, Dynamics 365, Xbox Live and other Microsoft services on March 15.
Microsoft has published a preliminary root cause analysis of its March 15 Azure Active Directory outage, which took down Office, Teams, Dynamics 365, Xbox Live, and other Microsoft and third-party apps that depend on Azure AD for authentication. The roughly 14-hour outage affected a "subset" of Microsoft customers worldwide, officials said.
Microsoft's preliminary analysis of the incident, published March 16 on its Azure Status History page, said that "an error occurred in the rotation of keys used to support Azure AD's use of OpenID, and other, Identity standard protocols for cryptographic signing operations."
Officials said that, as part of normal security practices, an automated system removes keys that are no longer in use. Over the past few weeks, however, one key had been marked as "retain" for longer than normal to support a complex cross-cloud migration. That exposed a bug which caused the retained key to be removed. Microsoft publishes metadata about the signing keys to a global location, its analysis notes. Once that metadata changed around 3pm ET (the start of the outage), applications using these protocols with Azure AD started picking up the new metadata and stopped trusting tokens and assertions that were signed with the removed key.
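The failure mode described above can be illustrated with a toy sketch. It is not Microsoft's code: the key IDs, the `published_keys` dictionary, and the `sign` and `verify` helpers are all hypothetical, and HMAC secrets stand in for the public-key signatures that OpenID-style protocols actually use, purely to keep the example dependency-free. The point it tries to show is that a correctly signed token stops being trusted as soon as its key disappears from the published metadata.

```python
import hashlib
import hmac

# Hypothetical stand-in for the published signing-key metadata: key ID -> key.
# Real identity providers publish public keys; shared HMAC secrets are used
# here only so the sketch runs with the standard library alone.
published_keys = {
    "key-old": b"secret-old",
    "key-retained": b"secret-retained",
}


def sign(payload: bytes, kid: str, key: bytes) -> dict:
    """Issue a toy token: the payload, the signing key's ID, and a signature."""
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"kid": kid, "payload": payload, "sig": sig}


def verify(token: dict, metadata: dict) -> bool:
    """Trust a token only if its key ID is published and the signature matches."""
    key = metadata.get(token["kid"])
    if key is None:
        return False  # the key is no longer published, so the token is rejected
    expected = hmac.new(key, token["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])


token = sign(b"user=alice", "key-retained", published_keys["key-retained"])
before = verify(token, published_keys)

# Simulate the faulty cleanup: the retained key vanishes from the metadata.
metadata_after = {k: v for k, v in published_keys.items() if k != "key-retained"}
after = verify(token, metadata_after)

print(before, after)
```

Nothing about the token itself changes between the two checks; only the metadata the verifier consults does.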
Microsoft engineers rolled the system back to its prior state around 5pm ET, but applications took a while to pick up the rolled-back metadata and refresh with the correct version. A subset of storage resources also required an update to invalidate the incorrect entries and force a refresh.
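Why recovery can lag behind a rollback is easier to see with a small caching sketch. This is an illustration under assumptions, not a description of how Azure AD clients actually behave: the `MetadataCache` class, its one-hour lifetime, and the key names are hypothetical. It shows a client that keeps serving cached metadata until the entry expires or is explicitly invalidated.

```python
import time


class MetadataCache:
    """Caches fetched metadata for ttl_seconds; invalidate() forces a refetch."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self):
        # Drop the cached entry so the next get() fetches fresh metadata.
        self._fetched_at = None


# Simulated publisher and clock (both hypothetical, for illustration only).
source = {"keys": ["key-old"]}  # bad metadata: the retained key is missing
now = [0.0]
cache = MetadataCache(lambda: list(source["keys"]), ttl_seconds=3600,
                      clock=lambda: now[0])

stale = cache.get()                           # client caches the bad metadata
source["keys"] = ["key-old", "key-retained"]  # publisher rolls back
now[0] = 600.0                                # ten minutes later
still_stale = cache.get()                     # cache has not expired yet

cache.invalidate()                            # forced refresh
fresh = cache.get()

print(still_stale, fresh)
```

In this sketch the rollback takes effect for a client only once its cached copy expires or is invalidated, which mirrors the forced refresh the analysis describes for some storage resources.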
"We understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future," the blog post said.
A full root-cause analysis will be published once the investigation is complete, officials said.