Microsoft: Here's what caused our Azure cloud-computing outage

Microsoft reveals that unseen code errors in its Azure DNS service caused the April 1 outage.
Written by Liam Tung, Contributing Writer

Microsoft has revealed the root cause of the recent outage affecting Azure, which lasted about an hour and was due to a surge in Domain Name System (DNS) requests coupled with a code defect. 

Users reported that Azure Portal, Azure Services, Dynamics 365, and Xbox Live were inaccessible during the worldwide outage, which ran from 21:21 UTC to 22:00 UTC on 1 Apr 2021. Microsoft said in its root cause analysis report that the majority of services had recovered by 22:30 UTC.

While Microsoft quickly confirmed the outage was related to its DNS capabilities, the company's final root cause analysis, published April 4, sheds more light on the cause: a previously unseen code defect in its DNS service that was triggered by excessive DNS client retries.


"Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure," Microsoft states.

"Normally, Azure's layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches."

Microsoft's DNS service was swamped as DNS clients retried failed requests, which put further pressure on the service. Microsoft notes that DNS client retries are considered legitimate DNS traffic, so its volumetric mitigation systems did not drop them, which in turn reduced the availability of its DNS service across multiple regions.
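To illustrate why less efficient caches and client retries compound each other, here is a rough back-of-the-envelope model. It is only a sketch, not Microsoft's analysis: the function name, the parameters, and the assumptions (every retry bypasses the cache and fails with the same probability as the first attempt) are all hypothetical.

```python
def offered_load(base_qps: float, cache_hit_rate: float,
                 failure_rate: float, max_retries: int) -> float:
    """Estimate the queries per second that reach the backend DNS servers.

    Assumptions (illustrative only):
    - a cache hit never reaches the backend;
    - a failed query is retried up to max_retries times;
    - each retry misses the cache and fails with probability failure_rate.
    """
    # Queries that the edge caches cannot answer.
    misses = base_qps * (1.0 - cache_hit_rate)
    # Expected attempts per missed query: 1 + f + f^2 + ... + f^max_retries.
    attempts = sum(failure_rate ** k for k in range(max_retries + 1))
    return misses * attempts


if __name__ == "__main__":
    # Healthy caches, no failures: only cache misses reach the backend.
    print(offered_load(1000, 0.9, 0.0, 3))
    # Degraded caches plus failing queries: misses and retries multiply.
    print(offered_load(1000, 0.5, 0.5, 2))
```

In this toy model, dropping the cache hit rate from 90% to 50% raises backend load fivefold before any retries are counted, and retries then multiply that figure again.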

Microsoft says it mitigated the issue by updating the logic on the volumetric spike mitigation system to protect the DNS service from excessive client retries.    
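Microsoft has not published the updated mitigation logic. As a hedged illustration of the general idea, the sketch below uses a per-client token bucket, a common rate-limiting technique, to cap how many repeated queries for the same name a single client can send. The class names, the rate and capacity values, and the choice to key on client address and query name are assumptions made for this example.

```python
from collections import defaultdict


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = None  # timestamp of the previous request, if any

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per query; refuse when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RetryLimiter:
    """Hypothetical limiter keyed on (client address, query name)."""

    def __init__(self, rate: float = 5.0, capacity: float = 10.0) -> None:
        self.buckets = defaultdict(lambda: TokenBucket(rate, capacity))

    def allow(self, client: str, qname: str, now: float) -> bool:
        return self.buckets[(client, qname)].allow(now)


if __name__ == "__main__":
    limiter = RetryLimiter(rate=1.0, capacity=3.0)
    # Five back-to-back queries from one client for one name at t=0.
    print([limiter.allow("10.0.0.1", "example.com", 0.0) for _ in range(5)])
    # Two seconds later the bucket has refilled enough to answer again.
    print(limiter.allow("10.0.0.1", "example.com", 2.0))
```

A limiter of this kind lets normal traffic and occasional retries through while shedding a burst of rapid retries from one source, which is the sort of traffic the article says the original volumetric systems treated as legitimate.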

The technology giant apologized to affected customers and said it had repaired the code defect so that all requests are handled efficiently in the cache. It has also improved automatic detection and mitigation of anomalous traffic patterns.

This latest outage was not as lengthy as Microsoft's 14-hour Azure outage in mid-March, which was attributed to an error in the rotation of keys used to support Azure AD's use of OpenID.
