The nearly 11-hour outage that hit Microsoft Azure customers earlier this week was due to a performance update Microsoft made to Azure storage services, according to company officials.
Microsoft Azure Corporate Vice President Jason Zander acknowledged the problem and explained the root cause in a November 19 blog post.
On the evening (US Pacific Time) of November 18, customers across the US, Europe and parts of Asia experienced problems with various Azure services. The issue also affected Xbox Live and MSN.com — parts of which rely on Azure — as well as Visual Studio Online and Search.
Further exacerbating the problem was the fact that the Service Health Dashboard and Azure Management Portal both rely on Azure Storage services, which meant those services were not accurately reflecting the impaired state of Azure storage. Many users noted that Azure's status page was reporting that Azure was working fine when it wasn't.
Microsoft officials said a performance update applied to Azure Storage — which the company had previously tested over several weeks with a subset of Microsoft's customer-facing Azure Tables storage service — was the culprit. It wasn't until Microsoft began rolling out the performance update more broadly that the company discovered "an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting," Zander explained.
"The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues," he said.
Microsoft rolled back the change once the issue was detected, but had to restart the storage front ends to fully undo the update. According to the official report on the outage, "Unfortunately the issue was widespread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches."
Some customers are still experiencing "intermittent issues" as a result, Zander said. Microsoft engineering is working with these customers to resolve lingering problems, he said.
Microsoft is committing to making changes to its policies and procedures to head off similar possible problems, Zander said. Among its commitments:
- Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches is always followed.
- Improve the recovery methods to minimize the time to recovery.
- Fix the infinite loop bug in the CPU reduction improvement from the Blob front ends before it is rolled out into production.
- Improve Service Health Dashboard Infrastructure and protocols.
"We apologize for the impact and inconvenience this has caused," the Azure team repeated.