Cloud outages are bad. Poor communication during and after outages makes them even worse. Microsoft officials know this all too well and have a plan to try to improve the way the company handles communication around Azure outages.
I've been noticing for a while that Microsoft is using its Azure status page less and less frequently to notify users of cloud outages. Back in March, when one of Microsoft's most active regions, East US, went down for hours, there was nothing about the issue on the status page -- and very little outcry on Twitter (which has been another barometer of cloud outages).
It turns out this relative quiet is by design. Microsoft has been working to get its cloud users to their individualized Service Health pages, rather than the public-facing Azure Status site. And its Azure Support account on Twitter tries to guide users to look at these pages and/or to direct message that account when they need the most up-to-date information on an outage. (Convincing users to take their gripes off Twitter also makes it harder for us pesky reporters to track outages, which reduces the number of "Azure outage" headlines out there.)
In a blog post this week, Sami Kubba, a principal program manager overseeing Azure's outage communications process, outlined where Microsoft stands and where it's going on the outage communications front. His post is part of a series Microsoft started last year that outlines ways it is seeking to improve Azure reliability, performance, and more.
He noted that Microsoft's goal is to notify all impacted Azure subscriptions within 15 minutes of an outage. Microsoft uses both human beings and automatic notifications to do this. He said automatic notifications via Service Health accounted for more than half of Microsoft's outage communications in the last quarter. Kubba said Microsoft's goal is to continue to reduce the time it takes for the company to notify users of an outage.
"We are also in the early stages of expanding our use of AI-based operations to identify related impacted services automatically and, upon mitigation, send resolution communications (for supported scenarios) as quickly as possible," he added.
Microsoft is currently using the public Azure Status page only to communicate "widespread" outages -- meaning those impacting multiple regions and/or multiple services -- Kubba acknowledged. Microsoft is communicating directly with impacted customers in-portal via Service Health for more than 95 percent of current incidents. Kubba attributed this ratio to the vast majority of outages affecting only a "very small 'blast radius' of customer subscriptions."
(Azure Service Health is a suite of experiences that provide personalized guidance and support for Azure service issues, including outages and even planned maintenance. Azure Service Health is composed of Azure status, the Service Health service, and Resource Health.)
Microsoft is working to make this same kind of outage notification system consistent across its other cloud products, including Microsoft 365 and Power Platform, Kubba said. Already, customers can see the M365 Status account on Twitter herding users toward their portals and direct messages when problems occur.
As I've noted in the past, this system works for admins and those with admin access to their cloud accounts. But unless IT is notifying users internally of what's happening when outages occur, many users still turn to Twitter when an Office 365 outage happens to find out whether others are affected and whether anyone has any information.
Kubba did say that customers can request post-incident reports for smaller outages (bigger ones will have publicly shared PIRs) and noted the team is continuing to try to make things even more transparent and to show users concrete steps Microsoft is taking to try to head off related types of outages going forward.