Microsoft: Packet loss involving Apple Push Notification Service was latest MFA outage culprit

Microsoft has shared its post-mortem on the October 18 MFA issue that affected a number of North American Azure and Office 365 users.
Written by Mary Jo Foley, Senior Contributing Editor

Microsoft has posted its root-cause analysis of its latest Multifactor Authentication (MFA) meltdown, which happened last week. "Severe packet loss" on a network route between Microsoft and the Apple Push Notification Service (APNS) was to blame for the October 18 issues experienced by several Azure and Office 365 users in North America.

The three-hour issue, which hit users attempting to sign in using MFA, affected 0.51 percent of users in North American tenants using the service, according to Microsoft. It occurred during morning peak traffic in North America -- just before 10 a.m. ET last Friday. Earlier this week, Microsoft's preliminary analysis said the severe packet loss involved a connection between Microsoft and an unnamed third-party service.

Microsoft's write-up of what went wrong explains how its engineers prepared a hotfix to bypass the impacted external service altogether and restore MFA functionality. In the meantime, the external network recovered and packet loss subsided, so the hotfix could be rolled back.

"We sincerely apologize for the impact to affected customers," Microsoft officials said in the analysis. Microsoft is taking steps to improve Azure and its processes to ensure such incidents won't happen in the future, they said.

Among the "next steps" the Azure team is taking, according to the write-up: 

In-progress fine-grained fault domain isolation work has been accelerated. This work builds on the previous fault domain isolation work which limited this incident to North American tenants. This includes:  

- Additional physical partitioning within each Azure region.
- Logical partitioning between authentication types.
- Improved partitioning between service tiers.

Additional hardening and redundancy within each granular fault domain to make them more resilient to network connectivity loss. This includes:

- Improved resilience to request build-up.
- Optimizing network traffic to decrease load on network links.
- Improved instructions to users for self-service in case notifications are not delivered.
- Service restructuring to decrease service impact of network packet loss.
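
Microsoft's write-up does not say how these items will be implemented. As a generic illustration of "resilience to request build-up," one common pattern is a circuit breaker, which stops calling a failing dependency for a while and uses a fallback path instead, so requests do not queue up behind timeouts. The minimal Python sketch below is an assumption-laden example, not Microsoft's design; `CircuitBreaker`, `send_push_notification` and `prompt_for_verification_code` are hypothetical names.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures instead of letting requests pile up."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback):
        # While open, skip the primary path until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down over: allow a trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


# Hypothetical stand-ins for a push-based prompt and a self-service fallback.
def send_push_notification():
    raise TimeoutError("push gateway unreachable")  # simulated packet loss


def prompt_for_verification_code():
    return "ask the user to enter a verification code"


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = [
    breaker.call(send_push_notification, prompt_for_verification_code)
    for _ in range(3)
]
```

In this sketch, the third call never touches the failing push path, because the breaker opened after two consecutive failures.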

Enhanced monitoring for networking latency and various resource usage thresholds. This includes:

- Multi-region and multi-cloud targeted monitoring for the specific type of packet loss encountered.
- Improved monitors for additional types of resource usage. 
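
The write-up gives no detail on how the packet-loss monitoring works. Purely as an illustration of threshold-based monitoring, the Python sketch below tracks the loss rate over a sliding window of recent probes and flags when it exceeds a limit. The class name, window size and threshold are assumptions chosen for the example, not values from Microsoft.

```python
from collections import deque


class PacketLossMonitor:
    """Track the loss rate over the most recent probes and flag breaches."""

    def __init__(self, window=100, threshold=0.05):
        # True = probe answered, False = probe lost; old entries fall off.
        self.results = deque(maxlen=window)
        self.threshold = threshold  # alert when loss rate exceeds this

    def record(self, delivered):
        self.results.append(bool(delivered))

    def loss_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        return self.loss_rate() > self.threshold
```

A caller would feed `record()` with the outcome of each probe on a given network route and check `should_alert()` after each one.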

Last year, Microsoft's Azure and Office 365 services suffered two back-to-back MFA outages. In its root-cause analysis, Microsoft detailed three independent causes, along with monitoring gaps, that left Azure, Office 365, Dynamics and other Microsoft users unable to authenticate for much of the day during the first of the worldwide outages. Microsoft officials described a multi-pronged plan to try to keep this kind of outage from happening again but said some of the required steps might not be completed until January 2019.
