The biggest cloud app of all: Netflix

The largest pure-play cloud service of all is based on Netflix's open-source stack running on Amazon Web Services.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

Netflix, the popular video-streaming service that takes up a third of all internet traffic during peak hours, isn't just the single largest source of internet traffic. Netflix, without doubt, is also the largest pure cloud service.

Netflix, with more than a billion video delivery instances per month, is the largest cloud application in the world.

At the Linux Foundation's Linux Collaboration Summit in San Francisco, California, Adrian Cockcroft, director of architecture for Netflix's cloud systems team, after first thanking everyone "for building the internet so we can fill it with movies", said that Netflix's Linux, FreeBSD, and open-source based services are "cloud native".

By this, Cockcroft meant that even with more than a billion video instances delivered every month over the internet, "there is no datacenter behind Netflix". Instead, Netflix, which has been using Amazon Web Services since 2009 for some of its services, moved its entire technology infrastructure to AWS in November 2012.

Specifically, depending on customer demand, Netflix's front-end services run on 500 to 1,000 Linux-based Tomcat Java servers and NGINX web servers. These are backed by hundreds of other Amazon Simple Storage Service (S3) and NoSQL Cassandra database servers using the Memcached high-performance, distributed memory object caching system. All of this, and more besides, is distributed across three Amazon Web Services availability zones. Every time you visit Netflix, whether with a device or a web browser, all of these are brought together within a second to show you your video selections.

According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones. Netflix didn't simply take Amazon's word for this. It tested total Amazon Elastic Compute Cloud (EC2) failures with its open-source Chaos Gorilla software. "We go around trying to break things to prove everything is resistant to it," said Cockcroft. Netflix, in concert with Amazon, is working on multi-region EC2 availability. Once in place, an entire EC2 zone failure won't stop Netflix videos from flowing to customers.
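To make the "two out of three zones" claim concrete, here is a minimal sketch of the kind of check a zone-outage test implies. It is not Chaos Gorilla and does not touch AWS; the zone names, instance counts, and per-instance capacity are all made-up assumptions for illustration.

```python
# Hypothetical zone names; real availability zones are named differently.
ZONES = ["zone-a", "zone-b", "zone-c"]


def build_fleet(instances_per_zone):
    """Return a dict mapping each zone to a list of instance ids."""
    return {
        zone: [f"{zone}-i{n}" for n in range(instances_per_zone)]
        for zone in ZONES
    }


def kill_zone(fleet, zone):
    """Simulate a whole-zone outage by emptying that zone (returns a copy)."""
    survivors = dict(fleet)
    survivors[zone] = []
    return survivors


def can_serve(fleet, peak_requests, requests_per_instance):
    """True if the remaining instances have enough capacity for peak load."""
    instance_count = sum(len(ids) for ids in fleet.values())
    return instance_count * requests_per_instance >= peak_requests


def survives_any_single_zone_loss(fleet, peak_requests, requests_per_instance):
    """Check that the service still meets peak load after losing any one zone."""
    return all(
        can_serve(kill_zone(fleet, zone), peak_requests, requests_per_instance)
        for zone in fleet
    )


if __name__ == "__main__":
    fleet = build_fleet(instances_per_zone=300)
    print(survives_any_single_zone_loss(fleet, 60_000, 100))
```

The point of the arithmetic is that surviving on two of three zones means each zone must be able to carry half of peak load, so the fleet as a whole is provisioned at roughly one and a half times what peak demand alone would require.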

That won't be easy, though. The problem isn't so much replicating videos and services across the EC2 zones. Netflix already has its own content delivery network (CDN), Open Connect, and servers placed at local ISP hubs for that. No, the real problem is setting up the Domain Name System (DNS) so that users are directed to the right Amazon zone when one is down. That's because, Cockcroft said, DNS providers have wildly different application programming interfaces (APIs), and those APIs are designed to be hand-managed by an engineer and thus are not at all easy to automate.
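One common way to cope with providers whose APIs differ wildly is to hide each one behind a small shared interface and automate against that. The sketch below assumes this approach; it is not Netflix's implementation. `DnsProvider`, `InMemoryProvider`, `fail_over`, and the `example.test` hostnames are hypothetical names, and a real adapter would wrap an actual provider's API instead of a dictionary.

```python
from abc import ABC, abstractmethod


class DnsProvider(ABC):
    """Hypothetical common interface that each provider adapter implements."""

    @abstractmethod
    def set_record(self, name, target):
        """Point the DNS name at the given target endpoint."""

    @abstractmethod
    def get_record(self, name):
        """Return the current target for the name, or None if unset."""


class InMemoryProvider(DnsProvider):
    """Stand-in adapter that stores records in a dict instead of calling an API."""

    def __init__(self):
        self._records = {}

    def set_record(self, name, target):
        self._records[name] = target

    def get_record(self, name):
        return self._records.get(name)


def fail_over(providers, name, endpoints, is_healthy):
    """Point `name` at the first healthy endpoint on every provider.

    `endpoints` is ordered by preference; `is_healthy` is a caller-supplied
    health check. Returns the endpoint that was chosen.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            for provider in providers:
                provider.set_record(name, endpoint)
            return endpoint
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    providers = [InMemoryProvider(), InMemoryProvider()]
    down = {"east.example.test"}  # pretend this endpoint is unreachable
    chosen = fail_over(
        providers,
        "api.example.test",
        ["east.example.test", "west.example.test"],
        lambda endpoint: endpoint not in down,
    )
    print(chosen)
```

With an interface like this, the failover logic is written once, and the per-provider differences are confined to the adapters.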

The difficulty isn't stopping Netflix from addressing the problem. Indeed, Netflix plans on failure. As Cockcroft titled his talk, Netflix is about dystopia as a service. The question isn't whether something will fail on the cloud, but how to keep working no matter how the clouds or specific services fail. Netflix's services are designed so that, when something goes wrong, they degrade gradually rather than fail completely.
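Gradual degradation usually comes down to giving each dependency a cheaper fallback. The following is a minimal sketch of that idea, not Netflix's code: `with_fallback`, `personalized_rows`, and `popular_rows` are hypothetical names, and the failing service is simulated by raising an exception.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any exception falls back to `fallback`."""

    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of failing: serve the simpler result.
            return fallback(*args, **kwargs)

    return call


def personalized_rows(user_id):
    """Simulated dependency that is currently down."""
    raise TimeoutError("recommendation service unavailable")


def popular_rows(user_id):
    """Static, non-personalized rows that need no backend call."""
    return ["Popular on the service", "New releases"]


# The home page still renders, just without personalization.
home_page_rows = with_fallback(personalized_rows, popular_rows)

if __name__ == "__main__":
    print(home_page_rows("user-123"))
```

The user sees a less tailored page rather than an error, which is the behavior the talk describes.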

As he said, sure, perfection, utopia would be great, but if you're always striving for perfection, you always end up compromising. So instead of striving for perfection, Netflix is continuously updating its systems in real time rather than perfecting them. How fast is that? Netflix wants to "code features in days instead of months; we want to deploy new hardware in minutes instead of weeks; and we want to see instant responses in seconds instead of hours". By deploying on the cloud, Netflix can do all of this.

Sure, sometimes this doesn't work. In December 2012, for example, a failure in AWS's Elastic Load Balancer service in the US-East-1 region brought Netflix down during the Christmas holiday.

On the other hand, the Netflix method of producing code sooner rather than later, and running in such a way that the service keeps going even though some components are — not may, but are — broken and inefficient at any given time, has produced a service that is capable of being the single largest consumer of internet bandwidth. Clearly, it's not perfect, but Netflix's design decision to "create a highly agile and highly available service from ephemeral and often broken components" on the cloud works, and as far as Netflix is concerned, for day to day cloud-based video delivery, that's much better than "perfection" could ever be.
