Netflix: How we got a grip on AWS's cloud

Summary: Netflix cloud architect Adrian Cockcroft discusses the company's choice of Amazon Web Services for its cloud, the impact of that decision on its developers, and what it's looking for from future cloud technologies.

...a few hundred machines and it just kept growing.

What's generally happening in [medium-sized businesses] is green-field development, so when they want to develop a new application from scratch, engineers are grabbing machines from the cloud and using Amazon as an infrastructure layer. They call it shadow IT. In the end, it may get deployed internally or in the cloud, but at least it's architecturally ready to be run in the cloud.

How easy is it for Netflix developers to access the AWS cloud that underpins Netflix?
If you think about infrastructure as a service and platform as a service (PaaS), what we've built is a PaaS over the top of the AWS infrastructure, which is as thin a layer as we could build, leveraging as many Amazon features as seemed interesting and useful. Then we put a thin layer over that to isolate our developers from it.

We've started open-sourcing components of that. We've developed something for managing Cassandra called Priam. That's basically a Tomcat server that runs on every instance and controls backups. Since we've open-sourced it, you can now build Cassandra as a service on a cloud platform.

How do you stop your cloud PaaS from being overwhelmed by complexity and dependency issues?
We don't keep track of dependencies. We let every individual developer keep track of what they have to do. It's your own responsibility to understand what the dependencies are in terms of consuming and providing [services].

Everyone is sitting in the middle of a bunch of supplier and consumer relationships and every team is responsible for knowing what those relationships are and managing them. It's completely devolved; we don't have any centralised control. We can't provide an architecture diagram because it has too many boxes and arrows. There are literally hundreds of services running.

I don't worry about it because we've built a decoupled system where every service is capable of withstanding the failure of every service it depends on. The typical environment you have for developers is this image that they can write code that works on a perfect machine that will always work, and operations will figure out how to create this perfect machine for them. That's the traditional dev-ops, developer versus operations contract. But then of course machines aren't perfect and code isn't perfect, so everything breaks and everyone complains to each other.

So we got rid of the operations piece of that and just have the developers, so you can't depend on anybody and you have to assume that all the other developers are writing broken code that isn't properly deployed. And when you write a REST call to them, you might get nothing back or broken code and you just have to deal with that.
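
To make that stance concrete, here is a minimal sketch in Python of a caller that treats a dependency as unreliable. It is an illustration only, not Netflix's code: the name `call_with_fallback`, the stand-in services and the default value are all hypothetical, and a real client would also set network timeouts on the underlying request.

```python
import json
from typing import Any, Callable


def call_with_fallback(fetch: Callable[[], str], fallback: Any) -> Any:
    """Call a dependency and return its parsed JSON body.

    If the call raises, returns nothing, or returns something that is
    not valid JSON, return `fallback` instead of propagating the failure.
    """
    try:
        body = fetch()
        if not body:  # empty or missing response
            return fallback
        return json.loads(body)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts;
        # ValueError covers malformed JSON (json.JSONDecodeError).
        return fallback


# Hypothetical stand-ins for a remote REST call.
def broken_service() -> str:
    raise ConnectionError("dependency is down")


def garbage_service() -> str:
    return "<html>500 Internal Server Error</html>"


def healthy_service() -> str:
    return '{"titles": ["A", "B"]}'


DEFAULT = {"titles": []}  # assumed safe default for this caller
```

Called with `broken_service` or `garbage_service`, the function returns `DEFAULT`; called with `healthy_service`, it returns the parsed response. The caller keeps working in all three cases, which is the property Cockcroft describes.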

By making everyone responsible for the robustness of their code, we've ended up training a whole building full of developers to build their code very robustly, and I think it's because we took the crutch of having operations to fix it up for you away from them.

There's some controversy over calling this no-ops. There's a way of organising yourself which is a platform as a service with developers doing everything and automating away the operations function — they're operating it themselves but not spending too much time doing it.

What do you wish to see from Amazon in the future?
The thing I've been publicly asking for has been better IO in the cloud. Obviously I want SSDs in there. We've been asking cloud vendors to do that for a while. With Cassandra, we've had to go onto horizontal scale and use the internal disks and triple replicate across availability zones, so you end up with a triple-redundant data store that is careful not to overload the disks.
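
The triple replication across availability zones can be pictured with a small Python sketch. This is a simplified illustration, not Cassandra's actual placement algorithm: the zone names, node names and the function `place_replicas` are hypothetical, and the only point is that each key gets one copy in each of three zones.

```python
import hashlib
from typing import Dict, List


def place_replicas(key: str, nodes_by_zone: Dict[str, List[str]]) -> List[str]:
    """Pick one node per availability zone to hold a copy of `key`.

    A stable hash of the key selects the node within each zone, so the
    same key always maps to the same set of nodes.
    """
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return [
        nodes[digest % len(nodes)]
        for _zone, nodes in sorted(nodes_by_zone.items())
    ]


# Hypothetical cluster layout: three zones, two nodes in each.
CLUSTER = {
    "zone-a": ["a1", "a2"],
    "zone-b": ["b1", "b2"],
    "zone-c": ["c1", "c2"],
}

replicas = place_replicas("customer:42", CLUSTER)
```

Here `replicas` always holds three nodes, one from each zone, so losing any single zone still leaves two copies of the data.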

Jack Clark has spent the past three years writing about the technical and economic principles that are driving the shift to cloud computing. He's visited data centers on two continents, quizzed senior engineers from Google, Intel and Facebook on the technologies they work on and read more technical papers than you care to name on topics f...
