Microsoft wants its Azure servers to be as durable as tardigrades

Microsoft is working to improve the reliability of its Azure infrastructure on multiple fronts, including via 'Project Tardigrade,' its effort to enable a cloud app to survive platform failure.

Written by Mary Jo Foley, Senior Contributing Editor May 14, 2019 at 8:53 a.m. PT

Microsoft is working on multiple fronts to improve the resiliency of its Azure datacenters. During his "Inside Azure datacenter architecture" presentation at Build 2019 last week, Mark Russinovich, the Chief Technology Officer for Azure, outlined some of the reliability-specific areas on which the company is focusing.

One of the efforts Russinovich described is known as "Project Tardigrade." As Russinovich reminded the Build audience, a tardigrade (the microscopic animal also known as a "water bear" or "moss piglet") is one of the most durable creatures ever discovered. It can survive in outer space and at extreme temperatures.

With Tardigrade, Microsoft's goal is to enable a cloud app to survive platform failure.

"We want our servers to be like tardigrades," Russinovich said. "We don't want to have to reboot the virtual machines (VMs)" when things go bad, he added. With Tardigrade, the "VMs get frozen in RAM, with their state preserved." The operating system then resumes on a fresh server.

Russinovich didn't provide any details as to when this technology will be rolled out, but he did show a demonstration of it working during his Build presentation.

Update (May 14). There was a Microsoft Research project known as Tardigrade. Here's a research paper dated May 2015 that detailed Microsoft's vision for Tardigrade as "leveraging lightweight virtual machines to easily and efficiently construct fault-tolerant services." Microsoft contacted me late in the day on May 14 to say that the two Tardigrades are not related.

A summary from the MSR Tardigrade research paper:

"Tardigrade (is) a system that deploys an existing, unmodified binary as a fault-tolerant service. Tardigrade replicates the service on several machines so that it continues running even when some of them fail. Yet, it keeps the service states synchronized so clients see strongly consistent results."

Tardigrade, as outlined by Microsoft's researchers, used "a lightweight virtual machine (which) is a process sandboxed so that its external dependencies are completely encapsulated, enabling it to be migrated across machines. To let unmodified binaries run within such a sandbox, the sandbox also contains a library OS providing the expected API."

A library OS? Yes, it seems the MSR Tardigrade does have its roots in work that Microsoft did around "Drawbridge."

Drawbridge was a Microsoft research project meant to offer a new form of virtualization for application sandboxing. It relied on picoprocesses (process-based isolation containers with a minimal kernel) and a library OS, an operating system refactored to run as a set of libraries within the context of an application, as Microsoft researchers described it. Microsoft relied on Drawbridge concepts to bring SQL Server to Linux and the Windows Subsystem for Linux to Windows 10.

A Microsoft spokesperson said the Microsoft Research Tardigrade had nothing to do with the Azure Project Tardigrade, in spite of the same name (and what sounded to me like a potentially similar focus). The spokesperson said the Azure Project Tardigrade is a brand-new initiative.

Microsoft is also looking to improve its datacenter reliability by rolling out more availability zones across the world, as Russinovich told GeekWire last week. Availability zones are meant to help protect customers from datacenter-level failures. The zones are located inside Azure regions, and each offers independent power, networking and cooling. There is a minimum of three separate zones in each enabled Azure region.
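
To make the idea concrete, here is a minimal sketch of why spreading an app across three zones helps. It is purely illustrative and not Azure's actual placement logic: the zone names, the round-robin placement, and both helper functions are assumptions made up for this example.

```python
from itertools import cycle


def place_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so they are spread evenly.

    Hypothetical helper for illustration; not an Azure API.
    """
    zone_cycle = cycle(zones)
    return {f"replica-{i}": next(zone_cycle) for i in range(replica_count)}


def survives_zone_failure(placement, failed_zone):
    """Return True if at least one replica sits outside the failed zone."""
    return any(zone != failed_zone for zone in placement.values())


# Hypothetical zone names standing in for the three zones of one region.
zones = ["zone-1", "zone-2", "zone-3"]
placement = place_replicas(3, zones)

for zone in zones:
    print(zone, "fails, service still up:", survives_zone_failure(placement, zone))
```

With one replica in each of three zones, losing any single zone still leaves two replicas running. Put all three replicas in one zone and a single datacenter-level failure takes the whole service down, which is the scenario availability zones are meant to prevent.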

While Microsoft officials often claim that Microsoft has more cloud regions worldwide than any other cloud provider, relatively few of the Azure regions support availability zones. AWS, for its part, defines a "region" as a geographic location where it operates a cluster of availability zones. AWS currently has 64 availability zones in 21 regions.