Failure is not an option for Netflix's service-oriented architecture

A Netflix software engineer describes how the world's largest pure cloud service keeps on delivering.

Netflix keeps coming up as the ideal poster child for service-oriented architecture and cloud done in a big way. As my ZDNet colleague Steven J Vaughan-Nichols recently pointed out: "Netflix, without doubt, is the largest pure cloud service."

Netflix - Image assembled by Joe McKendrick

That means extraordinary attention needs to be paid to resiliency. Ben Christensen, senior software engineer for the API Platform at Netflix, pointed out: "unmitigated system failures can impact the user experience, a product's image, and a company's brand and, potentially, revenue."

In a new post at O'Reilly Programming, Christensen said failure is not an option for Netflix's SOA-based infrastructure. The key, he said, is to isolate failures or hiccups within application instances. A tool the company has built to accomplish this is Hystrix, which focuses on failure isolation and graceful degradation. "It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such 'minor mistakes' that led to major user impact," he said.

The problem statement on the Hystrix site puts it bluntly:

Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point. If not isolated from these external failures, the host application is at risk of being taken down with them. For example, running an application that depends on 30 services that each have 99.99 percent uptime we get ... 3 million failures out of every 1 billion requests, or more than two hours of downtime per month, even if all dependencies have excellent uptime ... Reality is generally worse.
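
The arithmetic behind those figures can be checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes the 30 dependencies fail independently, that every request touches all of them, and that a month has 30 days. None of those assumptions are spelled out in the quoted passage, and the variable names are made up for this example.

```python
# Rough check of the quoted figures, under the assumptions stated above.
DEPENDENCIES = 30
UPTIME_PER_DEPENDENCY = 0.9999      # 99.99 percent
REQUESTS = 1_000_000_000
HOURS_PER_MONTH = 30 * 24           # assumption: a 30-day month

# A request succeeds only if every one of the 30 dependencies is up.
combined_uptime = UPTIME_PER_DEPENDENCY ** DEPENDENCIES
failure_rate = 1 - combined_uptime

expected_failures = failure_rate * REQUESTS
downtime_hours = failure_rate * HOURS_PER_MONTH

print(f"combined uptime:   {combined_uptime:.2%}")
print(f"expected failures: {expected_failures:,.0f} per billion requests")
print(f"downtime:          {downtime_hours:.2f} hours per month")
```

Combined uptime comes out at roughly 99.7 percent, which is where the figures of about 3 million failed requests per billion and a little over two hours of downtime a month come from.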

To address requirements for uptime across all services, Christensen said his team employs Hystrix to accomplish the following:

  • "Isolate client network interaction using the bulkhead and circuit breaker patterns."

  • "Fallback and degrade gracefully when possible."

  • "Fail fast when fallbacks aren't available and rapidly recover."

  • "Monitor, alert, and push configuration changes with low latency (seconds)."
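
To make the circuit breaker, fallback, and fail-fast ideas in that list more concrete, here is a minimal single-threaded sketch in Python. It is not Hystrix's API or implementation (Hystrix is a Java library). The class name, parameters, and thresholds are hypothetical and chosen only for illustration. The bulkhead pattern, which would cap the threads or connections given to each dependency, is left out to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _is_open(self):
        if self._opened_at is None:
            return False
        # Once reset_timeout has elapsed, allow a trial call through again.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, func, fallback=None):
        if self._is_open():
            # Fail fast: do not touch the struggling dependency at all.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the circuit
            if fallback is not None:
                return fallback()  # degrade gracefully
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky_recommendations():
    # Stand-in for a remote call to a dependency that is currently down.
    raise ConnectionError("dependency unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    # The first two calls hit the dependency and fall back; the third is
    # short-circuited and served from the fallback without calling it.
    print(breaker.call(flaky_recommendations, fallback=lambda: ["default title"]))
```

The point of the design is that once a dependency has failed repeatedly, callers stop waiting on it and get a degraded answer immediately, so one slow service cannot tie up the resources of the application calling it.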

Netflix's resilience challenge comes from the need to monitor and support a range of client types and interactions.

One of the beauties of a service-oriented architecture, if designed well, is loose coupling: IT infrastructure and applications are deployed as independent components. If one element or service fails or is changed, the other components in the service chain are unaffected.
