The first root cause showed up as a latency issue in the MFA front-end's communication with its cache services. The second was a race condition in the processing of responses from the MFA back-end server. Both causes were introduced in a code update rollout that began in some datacenters on Tuesday, November 13, and completed in all datacenters by Friday, November 16, Microsoft officials said.
A third identified root cause, triggered by the second, left the MFA back-end unable to process any further requests from the front-end, even though it appeared healthy in Microsoft's monitoring.
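Microsoft hasn't published the faulty code, but the failure mode it describes, a timing race in response handling that quietly wedges request processing while a shallow health probe keeps answering, can be illustrated with a small sketch. The Go program below is hypothetical: the `pending` structure, the lost-wakeup bug, and the `/health` probe are illustrative stand-ins, not Microsoft's implementation.

```go
// Hypothetical sketch of a lost-wakeup race: if a back-end response arrives
// before the request path has registered a waiter for it, the response is
// silently dropped and the request blocks forever -- yet a liveness-only
// health probe keeps reporting "ok", so monitoring sees a healthy service.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pending tracks requests waiting for a paired back-end response.
type pending struct {
	mu   sync.Mutex
	done map[int]chan struct{}
}

// await registers interest in response id, then blocks until it is signalled.
func (p *pending) await(id int) {
	ch := make(chan struct{})
	time.Sleep(time.Millisecond) // widen the race window for the demo
	p.mu.Lock()
	p.done[id] = ch
	p.mu.Unlock()
	<-ch // wedges forever if the response already arrived and was dropped
}

// signal is the response-processing path: it wakes the waiter if one exists.
func (p *pending) signal(id int) {
	p.mu.Lock()
	ch, ok := p.done[id]
	p.mu.Unlock()
	if ok {
		close(ch)
	}
	// No waiter yet? The response is silently lost -- the race.
}

func main() {
	p := &pending{done: make(map[int]chan struct{})}

	// Shallow health probe: answers as long as the process is alive,
	// regardless of whether requests are actually being processed.
	go http.ListenAndServe(":8080", http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "ok") }))

	var wg sync.WaitGroup
	for id := 1; id <= 3; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			p.await(id) // request path waits for the paired response
			fmt.Println("request", id, "completed")
		}(id)
		p.signal(id) // response arrives before the waiter registers
	}

	allDone := make(chan struct{})
	go func() { wg.Wait(); close(allDone) }()

	select {
	case <-allDone:
		fmt.Println("all requests completed")
	case <-time.After(2 * time.Second):
		fmt.Println("requests wedged: responses were lost to the race")
	}
}
```

In this sketch the service "looks fine" to anything that only polls the health endpoint, which mirrors the disconnect Microsoft described between the back-end's actual state and what its monitoring reported.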
Customers in Europe, the Middle East and Africa (EMEA) and Asia-Pacific (APAC) were hit first by these cascading issues. As the day went on, Western European and then American datacenters were affected. Even after engineers applied a hotfix that allowed front-end servers to bypass the cache, the issues persisted. On top of all this, telemetry and monitoring weren't working as expected, officials acknowledged.
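The hotfix Microsoft describes amounts to a cache-bypass fallback. A minimal sketch of that pattern, assuming illustrative names (`lookupCache`, `callBackend`) and a made-up 200ms cache budget rather than anything from the actual service, looks like this:

```go
// Hypothetical sketch of a cache-bypass fallback: try the cache within a
// tight time budget, and on timeout or error go straight to the back-end,
// trading extra back-end load for availability.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getAuthState consults the cache first, then bypasses it if it is slow or failing.
func getAuthState(ctx context.Context, user string) (string, error) {
	cacheCtx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	if v, err := lookupCache(cacheCtx, user); err == nil {
		return v, nil
	}
	// Cache slow or unhealthy: call the MFA back-end directly.
	return callBackend(ctx, user)
}

// lookupCache stands in for the cache-service call; here it simulates the
// latency issue by never answering within the budget.
func lookupCache(ctx context.Context, user string) (string, error) {
	select {
	case <-time.After(5 * time.Second):
		return "cached:" + user, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// callBackend stands in for a direct request to the MFA back-end.
func callBackend(ctx context.Context, user string) (string, error) {
	if ctx.Err() != nil {
		return "", errors.New("backend unavailable")
	}
	return "backend:" + user, nil
}

func main() {
	v, err := getAuthState(context.Background(), "alice")
	fmt.Println(v, err) // falls back to "backend:alice" once the cache times out
}
```

As the incident showed, a bypass like this only helps if the back-end itself is still able to absorb the redirected traffic; here it didn't, because of the race described above.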
Microsoft identified a number of intended next steps to improve the MFA service, including a review of its update-deployment procedures (target completion date: December 2018); a review of its monitoring services (target completion date: December 2018); a review of the containment process meant to keep an issue from propagating to other datacenters (target completion date: January 2019); and an update to the communications process for the Service Health Dashboard and monitoring tools (target completion date: December 2018).
Microsoft officials apologized to affected customers, but made no mention of any planned financial compensation. Microsoft's November 19 Azure status history post has more details about the trail of events leading to the MFA meltdown.