Microsoft: Azure is at 99.995% uptime but we can do more

Microsoft is working on multiple fronts to try to improve Azure reliability. Here are some of the initiatives underway.

Microsoft is taking steps to improve Azure's reliability beyond its stated current 99.995% average uptime. Microsoft Azure Chief Technology Officer Mark Russinovich outlined some of these steps in a July 15 blog post.
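For a sense of scale, an average uptime figure can be converted into implied downtime with simple arithmetic. The sketch below is my own back-of-the-envelope calculation, not something from Russinovich's post; it assumes a 365-day year and treats the percentage as a flat yearly average. The function name is made up for illustration. On those assumptions, 99.995% works out to roughly 26 minutes of downtime a year.

```python
# Back-of-the-envelope conversion from average uptime to yearly downtime.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


# Compare the stated Azure figure with two other commonly quoted levels.
for pct in (99.9, 99.99, 99.995):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime -> about {minutes:.1f} minutes of downtime per year")
```

An average like this says nothing about how the downtime is distributed: it could be many brief blips or one long regional outage.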

(I'm not sure what spurred Microsoft to post about this today. Maybe it's timed to coincide with the Microsoft Inspire partner conference and/or the Microsoft Ready sales kick-off this week? Perhaps it's JEDI competition-related? Start of the new fiscal year? Got me.)

In the blog post itemizing the coming improvements, Russinovich acknowledges that Azure was affected by "three unique and significant incidents" over the past year: the datacenter outage in the South Central US region in September 2018; back-to-back Azure Active Directory Multi-Factor Authentication problems in November 2018; and DNS maintenance issues in May this year. (Note: This isn't an exhaustive list. There have been a few additional Azure-related outages in the last 12 months, such as one in January.)

Russinovich said Microsoft has created a new Quality Engineering team in his CTO office, which will work alongside its Site Reliability Engineering (SRE) team on finding new ways to make Azure even more reliable.

Russinovich said Microsoft has several other initiatives underway to improve the resiliency of its cloud service. He said Microsoft is working to bring Availability Zones to the next ten largest Azure regions between now and 2021. Availability Zones are already live in the ten largest Azure regions and are meant to help protect customers from datacenter-level failures. The zones are located inside Azure regions and offer independent power sources, networking, and cooling. Enabled regions have a minimum of three separate zone locations.

Microsoft is extending its safe deployment practice framework to include software-defined infrastructure changes such as networking and DNS. This framework is meant to ensure that all code and configuration changes in Azure pass through a specific set of stages (dev/test, staging, private previews, a hardware diversity pilot, and longer validation periods) before rolling out to region pairs. Microsoft is also investing more in zero-impact and low-impact update technologies such as hot patching, live migration, and in-place migration.
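The general idea of a staged rollout like this can be sketched in a few lines. To be clear, this is not Microsoft's implementation: the stage names are taken from the list above, while the `roll_out` function and the `is_healthy` callback are hypothetical stand-ins for whatever validation Microsoft actually performs at each step.

```python
from typing import Callable, Optional

# Stage names follow the article's list; everything else here is hypothetical.
STAGES = [
    "dev/test",
    "staging",
    "private preview",
    "hardware diversity pilot",
    "longer validation period",
    "region pairs",
]


def roll_out(is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Advance a change through STAGES in order.

    Returns the name of the stage where the rollout halted, or None if
    the change passed every stage.
    """
    for stage in STAGES:
        if not is_healthy(stage):
            # Stop here; later stages never receive the change.
            return stage
    return None


# Example: a change that fails validation during the hardware diversity pilot.
halted_at = roll_out(lambda stage: stage != "hardware diversity pilot")
print(f"Rollout halted at: {halted_at}")
```

The point of the ordering is that a bad change gets caught while its blast radius is still small, well before it reaches production region pairs.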

Microsoft currently prioritizes data retention over time-to-restore. But some customers said they'd like the option to make this trade-off decision themselves, so Microsoft is previewing the ability for customers to initiate their own failover at the storage-account level.

Microsoft's Project Tardigrade service is meant to detect hardware failures or memory leaks before they occur, so that Azure can briefly freeze virtual machines and move the potentially affected workloads to another host. Microsoft has not provided any information on when this service will be available in preview or final form.

"The capability of continuous, real-time improvement is one of the great advantages of cloud services, and while we will never eliminate all such risks, we are deeply focused on reducing both the frequency and the impact of service issues while being transparent with our customers, partners, and the broader industry," Russinovich said.