Failure is not an option for Netflix's service-oriented architecture

Summary: A Netflix software engineer describes how the world's largest pure cloud service keeps on delivering.


Netflix keeps coming up as the ideal poster child for service-oriented architecture and cloud done in a big way. As my ZDNet colleague Steven J Vaughan-Nichols recently pointed out: "Netflix, without doubt, is the largest pure cloud service."

Netflix - Image assembled by Joe McKendrick

That means extraordinary, extraordinary attention needs to be paid to resiliency. Ben Christensen, senior software engineer for the API Platform at Netflix, pointed out: "unmitigated system failures can impact the user experience, a product's image, and a company's brand and, potentially, revenue."

In a new post at O'Reilly Programming, Christensen said failure is not an option for Netflix's SOA-based infrastructure. The key, he said, is to isolate failures or hiccups within application instances. A tool the company has built to accomplish this is Hystrix, which focuses on failure isolation and graceful degradation. "It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such 'minor mistakes' that led to major user impact," he said.

The problem statement on the Hystrix site puts it bluntly:

Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point. If not isolated from these external failures, the host application is at risk of being taken down with them. For example, running an application that depends on 30 services that each have 99.99 percent uptime we get ... 3 million failures out of every 1 billion requests, or more than two hours of downtime per month, even if all dependencies have excellent uptime ... Reality is generally worse.
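
The arithmetic behind those figures can be checked with a short sketch. It assumes the 30 dependencies fail independently and that a month has 30 days; neither assumption is stated in the quoted passage, and the variable names are illustrative only.

```python
# Inputs taken from the quoted example.
DEPENDENCIES = 30
UPTIME_PER_DEPENDENCY = 0.9999      # 99.99 percent
REQUESTS = 1_000_000_000

# Assumption (not in the source): a 30-day month.
HOURS_PER_MONTH = 30 * 24

# Assumption (not in the source): failures are independent, so a request
# succeeds only if every one of the 30 dependencies is up.
combined_uptime = UPTIME_PER_DEPENDENCY ** DEPENDENCIES
failure_rate = 1 - combined_uptime

expected_failures = failure_rate * REQUESTS
downtime_hours = failure_rate * HOURS_PER_MONTH

print(f"combined uptime:    {combined_uptime:.4%}")
print(f"expected failures:  {expected_failures:,.0f} per billion requests")
print(f"downtime per month: {downtime_hours:.2f} hours")
```

Under these assumptions the combined uptime drops to roughly 99.7 percent, which is where the figures of about 3 million failures per billion requests and just over two hours of downtime per month come from.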

To address requirements for uptime across all services, Christensen said his team employs Hystrix to accomplish the following:

  • "Isolate client network interaction using the bulkhead and circuit breaker patterns."

  • "Fallback and degrade gracefully when possible."

  • "Fail fast when fallbacks aren't available and rapidly recover."

  • "Monitor, alert, and push configuration changes with low latency (seconds)."
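
The first three goals in the list above can be illustrated with a toy example. This is a minimal sketch of the general bulkhead and circuit breaker patterns, not Hystrix's actual API; the class name `CircuitBreaker`, its parameters, and the two recommendation functions are all hypothetical.

```python
import threading
import time


class CircuitBreaker:
    """Toy circuit breaker with a semaphore bulkhead and a fallback.

    Illustrative only: this is not how Hystrix is implemented.
    """

    def __init__(self, max_concurrent=2, failure_threshold=3,
                 reset_after=5.0, clock=time.monotonic):
        # Bulkhead: cap how many calls to this dependency run at once.
        self._bulkhead = threading.BoundedSemaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _is_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            if self._clock() - self._opened_at >= self._reset_after:
                # Cool-down elapsed: allow trial calls again, but a single
                # further failure reopens the circuit.
                self._opened_at = None
                self._failures = self._failure_threshold - 1
                return False
            return True

    def _record(self, ok):
        with self._lock:
            if ok:
                self._failures = 0
            else:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = self._clock()

    def call(self, func, fallback):
        # Fail fast: skip the dependency entirely while the circuit is open.
        if self._is_open():
            return fallback()
        # Bulkhead full: degrade instead of queueing behind slow calls.
        if not self._bulkhead.acquire(blocking=False):
            return fallback()
        try:
            result = func()
        except Exception:
            self._record(False)
            return fallback()
        else:
            self._record(True)
            return result
        finally:
            self._bulkhead.release()


# Hypothetical dependency that always fails, and its degraded fallback.
def flaky_recommendations():
    raise TimeoutError("dependency timed out")


def default_recommendations():
    return ["popular titles"]


breaker = CircuitBreaker(failure_threshold=3)
results = [breaker.call(flaky_recommendations, default_recommendations)
           for _ in range(5)]
```

In this sketch the caller always gets an answer: the first three calls hit the failing dependency and fall back, after which the circuit opens and later calls return the fallback without touching the dependency until the cool-down has passed.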

Netflix's resilience challenge comes from the need to monitor and support a range of client types and interactions.

One of the beauties of a service-oriented architecture, if designed well, is loose coupling: IT infrastructure and applications are deployed as independent components. If one element or service fails or is changed, other components in the service chain are unaffected.

  • Good..

    ...Now. Can they have better movies to show?
    • I'd say no sh** since the online catalog is pathetic

      but really the whole premise of the article is a crock. The Netflix service has gone down multiple times. AWS is not reliable without multiple geodiverse redundancies and automatic failover. Users don't care if you isolate problems or not. They don't care if you take 10 minutes or 10 hours to fix a server or a switch or reimage an entire data center. But you'd better be doing it all while seamlessly continuing to serve their data uninterrupted from another unaffected location so your problems don't become their problems. I can't believe Netflix hasn't already migrated their entire set of services to Azure for its greater reliability.
      Johnny Vegas