Netflix: How we got a grip on AWS's cloud

Netflix cloud architect Adrian Cockcroft discusses the company's choice of Amazon Web Services for its cloud, the impact of that decision on its developers, and what it's looking for from future cloud technologies
Written by Jack Clark, Contributor

In 2011, Netflix's online video rental service regularly accounted for 30 percent of all US download internet traffic.

To support this combination of huge traffic and unpredictable demand spikes, Netflix has spent the past few years developing a global video distribution system using the Amazon Web Services (AWS) cloud.

By outsourcing to Amazon, the company says it has been able to save on the costs of maintaining and updating a datacentre infrastructure and can react better to demand. However, Netflix cloud architect Adrian Cockcroft still has a number of items on his wishlist — chiefly, faster input-output mechanisms — which are lacking in the cloud.

Before taking on responsibility for overseeing and developing Netflix's cloud, Cockcroft worked at Sun and eBay, where he helped found eBay Research Labs.

According to Cockcroft, moving to the cloud lets an organisation change how its developers work and enables it to dispense with IT operations — a concept he calls 'no-ops'. But he believes companies must still be highly critical in evaluating the types of technology on offer in the cloud.

Q: At the end of February Microsoft's Azure cloud had a severe outage and Amazon has had trouble in the past. How can you be so confident in depending on a single cloud?
A: I think there're some architectural differences — the way that Microsoft has built their cloud, they have much more linkage between regions. They have data replication across the country that is centrally managed so they have to have services that span everything. We haven't used their architecture in any real sense other than looking at it for some storage purposes.

Amazon is very anal about having regions be very separate, so the US east, west and central regions are managed very separately. They don't talk to each other at all, which is actually a pain because there are services we'd like to have cross-region but we can't because they don't want to do the coupling. They also have [separate] availability zones [within each region]. They've had control failures for [Elastic Block Store (EBS)] zones but they've redesigned EBS to stop that.

What are your greatest concerns when it comes to designing a cloud-based architecture?
When we first went to the cloud we started off with a series of pathfinder projects and benchmarks — what is this beast, how does it behave, which facilities are mature, how does this scale and how does it work?

The Netflix architecture is based on the stuff we found that works, and we tended to avoid some of the things that didn't work as well. That is why we don't have a strong dependency on EBS, which has always had performance variance, and there have been a number of outages that have helped us say, 'It's something not to use'. It's relatively low-performing, one of the weak spots in the [AWS] cloud.

The instances available from AWS have similar CPU, memory and network capacity to instances available for private datacentre use, but are currently much more limited for disk I/O. They typically have two internal disks and there are network attached storage options like EBS which can provide a few hundred I/O per second. It's easy in the datacentre to provide thousands or tens of thousands of I/O per second. So that is a gap in cloud offerings from AWS.

The hard thing to do in the cloud is to do high-performance IO [input-output], but that is starting to change as third-party vendors are figuring out ways of connecting high-performance IO externally, and we've worked around it with our [Cassandra] data store architecture.

Amazon themselves now have DynamoDB with solid-state disks behind it which is a very encouraging sign for me — I've been asking for SSDs in the cloud for some time. We're hoping that eventually we can get more access to them than just through DynamoDB.

Many enterprises seem keen on SSDs, so why do you think it has taken Amazon a while to roll them out?
It's purely scale for them. For Amazon to do something they have to do it on a scale that's really mind-boggling. If you think about deploying an infrastructure service with a new type of hardware — if they got it wrong, they can't turn it back out and do it again differently. So they have to over-engineer what they do.

In some ways there are parallels between Apple and Amazon. Apple builds products that take a long time and when they come out they are very well polished. With Amazon they take a long time to get stuff done but when it comes out it is very large scale. There is a long lead time for everything they do, but they have enormous resources and are starting work on these projects earlier than other people and they're having more people working on things.

What we're doing at Netflix is leveraging that investment. Amazon has thousands of people working on AWS and way more engineers than we have working on everything at Netflix. We're able to leverage [that investment] by using the APIs and telling them what we want.

It seems as though the major consumers of cloud are either technology-oriented start-ups or large companies, such as Netflix. What about medium-sized businesses?
Well, most of the people using clouds are start-ups, using five or 10 machines. We started there. Two years ago our production system was a few hundred machines and it just kept growing.

What's generally happening in [medium-sized businesses] is green-field development, so when they want to develop a new application from scratch, engineers are grabbing machines from the cloud and using Amazon as an infrastructure layer. They call it shadow IT. In the end, it may get deployed internally or is deployed in the cloud, but at least it's architecturally ready to be run in the cloud.

How easy is it for Netflix developers to access the AWS cloud that underpins Netflix?
If you think about infrastructure as a service and platform as a service (PaaS), what we've built is a PaaS over the top of the AWS infrastructure, which is as thin a layer as we could build, leveraging as many Amazon features as seemed interesting and useful. Then we put a thin layer over that to isolate our developers from it.

We've started open-sourcing components of that. We've developed something for managing Cassandra called Priam. That's basically a Tomcat server that runs on every instance and controls backups. Since we've open-sourced it, you can now build Cassandra as a service on a cloud platform.

How do you stop your cloud PaaS from being overwhelmed by complexity and dependency issues?
We don't keep track of dependencies. We let every individual developer keep track of what they have to do. It's your own responsibility to understand what the dependencies are in terms of consuming and providing [services].

Everyone is sitting in the middle of a bunch of supplier and consumer relationships and every team is responsible for knowing what those relationships are and managing them. It's completely devolved — we don't have any centralised control. We can't provide an architecture diagram, it has too many boxes and arrows. There are literally hundreds of services running.

I don't worry about it because we've built a decoupled system where every service is capable of withstanding the failure of every service it depends on. The typical environment you have for developers is this image that they can write code that works on a perfect machine that will always work, and operations will figure out how to create this perfect machine for them. That's the traditional dev-ops, developer versus operations contract. But then of course machines aren't perfect and code isn't perfect, so everything breaks and everyone complains to each other.

So we got rid of the operations piece of that and just have the developers, so you can't depend on anybody and you have to assume that all the other developers are writing broken code that isn't properly deployed. And when you write a REST call to them, you might get nothing back or broken code and you just have to deal with that.
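As a rough illustration of that idea, a caller can treat every dependency as unreliable and fall back to a safe default when a call fails, returns nothing, or returns something malformed. The sketch below is not Netflix code; `call_with_fallback`, the stand-in services and the `titles` field are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional


def call_with_fallback(
    fetch: Callable[[], Optional[dict]],
    is_valid: Callable[[dict], bool],
    fallback: dict,
) -> dict:
    """Return the dependency's reply, or `fallback` if the call raises,
    returns nothing, or returns a reply that fails validation."""
    try:
        reply = fetch()
    except Exception:
        # The dependency is down, timed out, or raised an error.
        return fallback
    if reply is None or not is_valid(reply):
        # Nothing came back, or what came back is broken.
        return fallback
    return reply


# Hypothetical stand-ins for remote services in different states.
def broken_service() -> Optional[dict]:
    raise ConnectionError("dependency unavailable")


def empty_service() -> Optional[dict]:
    return None


def malformed_service() -> Optional[dict]:
    return {"titles": "not-a-list"}


def healthy_service() -> Optional[dict]:
    return {"titles": ["A", "B"]}


def has_titles(reply: dict) -> bool:
    # A reply is usable only if it carries a list of titles.
    return isinstance(reply.get("titles"), list)


DEFAULT = {"titles": []}

if __name__ == "__main__":
    for service in (broken_service, empty_service, malformed_service, healthy_service):
        print(service.__name__, call_with_fallback(service, has_titles, DEFAULT))
```

The point of the design is that the caller, not an operations team, decides what a degraded but still working response looks like.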

By making everyone responsible for the robustness of their code, we've ended up training a whole building full of developers to build their code very robustly, and I think it's because we took the crutch of having operations to fix it up for you away from them.

There's some controversy over calling this no-ops. There's a way of organising yourself which is a platform as a service with developers doing everything and automating away the operations function — they're operating it themselves but not spending too much time doing it.

What do you wish to see from Amazon in the future?
The thing I've been publicly asking for has been better IO in the cloud. Obviously I want SSDs in there. We've been asking cloud vendors to do that for a while. With Cassandra, we've had to go onto horizontal scale and use the internal disks and triple replicate across availability zones, so you end up with a triple-redundant data store that is careful not to overload the disks.
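To make the triple-replication idea concrete, here is a minimal sketch of placing one copy of each key in each of three availability zones, so that losing a whole zone still leaves two copies. It is not how Cassandra or Priam actually places data; the zone names, node names and hashing scheme are assumptions made up for the example.

```python
import hashlib
from typing import Dict, List

# Hypothetical cluster layout: three zones, two nodes in each.
ZONES: Dict[str, List[str]] = {
    "zone-a": ["a1", "a2"],
    "zone-b": ["b1", "b2"],
    "zone-c": ["c1", "c2"],
}


def replicas_for(key: str) -> List[str]:
    """Pick one node per zone for `key`, giving three copies in total."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    # Visit zones in a fixed order so placement is deterministic.
    return [nodes[digest % len(nodes)] for _, nodes in sorted(ZONES.items())]


def surviving_replicas(key: str, failed_zone: str) -> List[str]:
    """Return the replicas of `key` that remain if `failed_zone` is lost."""
    return [node for node in replicas_for(key) if node not in ZONES[failed_zone]]


if __name__ == "__main__":
    key = "customer:42"
    print("replicas:", replicas_for(key))
    print("after losing zone-b:", surviving_replicas(key, "zone-b"))
```

Because every key lives in all three zones, reads and writes are spread across the internal disks of many instances rather than concentrated on one volume.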
