On the evening (US Pacific Time) of November 18, customers across the US, Europe and parts of Asia experienced problems with various Azure services. The issue also affected Xbox Live and MSN.com — parts of which rely on Azure — as well as Visual Studio Online and Search.
Compounding the problem, the Service Health Dashboard and Azure Management Portal both rely on Azure Storage, so neither accurately reflected the impaired state of the storage service. Many users noted that Azure's status page reported that Azure was working fine when it wasn't.
Microsoft officials said a performance update applied to Azure Storage — which the company had previously tested over several weeks with a subset of Microsoft's customer-facing Azure Tables storage service — was the culprit. It wasn't until Microsoft began rolling out the performance update more broadly that the company discovered "an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting," Zander explained.
"The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues," he said.
Microsoft rolled back the change once the issue was detected, but had to restart the storage front ends to fully undo the update. According to the official report on the outage, "Unfortunately the issue was widespread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches."
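To illustrate what "applying production changes in incremental batches" can look like in practice, here is a minimal Python sketch. It is not Microsoft's deployment tooling; the function name `rollout_in_batches` and the `apply_update`, `is_healthy` and `rollback` callbacks are hypothetical, and the sketch assumes a health check that can detect the fault before the next batch starts.

```python
from typing import Callable, List, Sequence


def rollout_in_batches(
    batches: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply an update one batch at a time, halting on the first unhealthy batch.

    Returns the targets that still carry the update when the function returns
    (an empty list if the rollout was aborted and rolled back).
    All callbacks are hypothetical stand-ins for real deployment steps.
    """
    updated: List[str] = []
    for batch in batches:
        for target in batch:
            apply_update(target)
            updated.append(target)
        # Verify the whole batch before touching the next one.
        if not all(is_healthy(target) for target in batch):
            # Undo everything applied so far, newest first, and stop.
            for target in reversed(updated):
                rollback(target)
            return []
    return updated
```

In this sketch a fault that surfaces in the second batch never reaches the third, so the impact is limited to the targets updated so far rather than most regions at once.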
Some customers are still experiencing "intermittent issues" as a result, Zander said, adding that Microsoft engineering is working with them to resolve the lingering problems.