At the Linux Foundation's Linux Collaboration Summit in San Francisco, California, Adrian Cockcroft, director of architecture for Netflix's cloud systems team, after first thanking everyone "for building the internet so we can fill it with movies", said that Netflix's Linux, FreeBSD, and open-source based services are "cloud native".
Specifically, depending on customer demand, Netflix's front-end services run on 500 to 1,000 Linux-based Tomcat Java servers and NGINX web servers. These are backed by hundreds of other Amazon Simple Storage Service (S3) and NoSQL Cassandra database servers, which use the Memcached high-performance, distributed memory object caching system. All of this, and more besides, is distributed across three Amazon Web Services availability zones. Every time you visit Netflix, whether with a device or a web browser, all of these are brought together within a second to show you your video selections.
According to Cockcroft, if something goes wrong, Netflix can continue to run the entire service on two out of three zones. Netflix didn't simply take Amazon's word for this. It tested total Amazon Elastic Compute Cloud (EC2) failures with its open-source Chaos Gorilla software. "We go around trying to break things to prove everything is resistant to it," said Cockcroft. Netflix, in concert with Amazon, is working on multi-region EC2 availability. Once in place, an entire EC2 zone failure won't stop Netflix videos from flowing to customers.
That won't be easy, though. The problem isn't so much replicating videos and services across the EC2 zones. Netflix already has its own content delivery network (CDN), Open Connect, and servers placed at local ISP hubs for that. No, the real problem is setting up the Domain Name System (DNS) so that users are directed to the right Amazon zone when one is down. That's because, Cockcroft said, DNS providers have wildly different application programming interfaces (APIs), which are designed to be hand-managed by an engineer and thus are not at all easy to automate.
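One common way to automate against providers whose APIs differ is to hide each one behind a shared interface. The sketch below is only an illustration of that idea, not Netflix's approach: the class names, the `fail_over` helper, the zone labels, and the addresses (taken from the reserved documentation range) are all hypothetical, and both "providers" are in-memory stand-ins rather than real DNS services.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class DnsProvider(ABC):
    """Hypothetical common interface over dissimilar provider APIs."""

    @abstractmethod
    def set_targets(self, hostname: str, targets: List[str]) -> None:
        """Point hostname at exactly the given target addresses."""


class InMemoryProviderA(DnsProvider):
    """Stand-in provider that stores one list of targets per hostname."""

    def __init__(self) -> None:
        self.records: Dict[str, List[str]] = {}

    def set_targets(self, hostname: str, targets: List[str]) -> None:
        self.records[hostname] = list(targets)


class InMemoryProviderB(DnsProvider):
    """Stand-in provider that stores one (hostname, target) pair per entry."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []

    def set_targets(self, hostname: str, targets: List[str]) -> None:
        # Drop the old entries for this hostname, then add the new ones.
        self.entries = [e for e in self.entries if e[0] != hostname]
        self.entries.extend((hostname, t) for t in targets)


def fail_over(
    providers: List[DnsProvider],
    hostname: str,
    endpoints: Dict[str, str],
    down_zone: str,
) -> List[str]:
    """Repoint hostname at every zone endpoint except the failed one."""
    healthy = [addr for zone, addr in sorted(endpoints.items()) if zone != down_zone]
    for provider in providers:
        provider.set_targets(hostname, healthy)
    return healthy
```

With an adapter like this, the failover logic is written once, and only the thin provider classes need to know how each vendor's API actually behaves.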
The difficulty isn't stopping Netflix from addressing the DNS problem. Indeed, Netflix plans on failure. As Cockcroft titled his talk, Netflix is about "dystopia as a service". The question isn't whether something will fail on the cloud; it's how to keep working no matter how the clouds or specific services fail. Netflix's services are designed so that, when something goes wrong, they degrade gradually rather than fail completely.
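Gradual degradation of this kind is often implemented as a fallback: if a dependency fails, the caller returns something simpler instead of an error. The following is a minimal sketch under that assumption, not Netflix's code. The function names, the `POPULAR_TITLES` list, and the two fake backends are hypothetical.

```python
from typing import Callable, List

# Hypothetical generic list shown when personalization is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B"]


def recommendations_with_fallback(
    fetch_personalized: Callable[[str], List[str]],
    user_id: str,
) -> List[str]:
    """Return personalized picks, or a generic list if the backend fails."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Degrade: the page still renders, just with less tailored content.
        return list(POPULAR_TITLES)


def broken_backend(user_id: str) -> List[str]:
    """Fake dependency that is always down."""
    raise ConnectionError("recommendation service unavailable")


def healthy_backend(user_id: str) -> List[str]:
    """Fake dependency that always answers."""
    return [f"Pick for {user_id}"]
```

Calling `recommendations_with_fallback(broken_backend, "u1")` returns the generic list, while the same call with `healthy_backend` returns the personalized one, so the user sees a working page in both cases.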
As Cockcroft said, sure, perfection, utopia, would be great, but if you're always striving for perfection, you always end up compromising. So instead of striving for perfection, Netflix continuously updates its systems in real time. How fast is that? Netflix wants to "code features in days instead of months; we want to deploy new hardware in minutes instead of weeks; and we want to see instant responses in seconds instead of hours". By deploying on the cloud, Netflix can do all of this.
On the other hand, the Netflix method of producing code sooner rather than later, and of running so that the service keeps going even though some components are (not may, but are) broken and inefficient at any given time, has produced a service capable of being the single largest consumer of internet bandwidth. Clearly, it's not perfect, but Netflix's design decision to "create a highly agile and highly available service from ephemeral and often broken components" on the cloud works. As far as Netflix is concerned, for day-to-day cloud-based video delivery, that's much better than "perfection" could ever be.