Crash your cloud, before it crashes itself: Netflix shares tool to help find unknown bugs

Summary: The just-released Chaos Monkey tool lets cloud administrators unleash a mischievous program onto their cloud to randomly break components. But why would anyone want to do this?

Name any major failure that has struck a cloud recently - Amazon, Microsoft, Heroku - and the reason for the failure will be the same: an unforeseen problem.

But it doesn't have to be that way. Netflix, which operates a vast multi-continent video distribution cloud on top of Amazon Web Services, got so annoyed with unforeseen bugs in its own software that it designed a tool named Chaos Monkey to go out into its cloud and break things. The only difference between Netflix's tool and a real outage is that Chaos Monkey runs only in office hours.

"Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee?" Cory Bennett and Ariel Tseitlin wrote in a post to the company's engineering blog on Monday.

"Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."

The tool runs within Amazon Web Services. It seeks out workloads running in Auto Scaling Groups and terminates their virtual machines (instances) at random. This lets companies check how resilient their clouds are and, the theory goes, makes failures occur during office hours, when companies are best equipped to investigate and deal with the effects of the outage.
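To make the selection step concrete, here is a minimal sketch in Python. It is not Netflix's code and makes no calls to Amazon Web Services: the group names, instance IDs, the 9am-to-3pm weekday window and the function names are all assumptions made for illustration. It only reports which instance would be picked from each group.

```python
import random
from datetime import datetime

# Assumed window; the article says only "office hours".
OFFICE_START_HOUR = 9
OFFICE_END_HOUR = 15


def in_office_hours(now):
    """Return True on weekdays between the assumed start and end hours."""
    return now.weekday() < 5 and OFFICE_START_HOUR <= now.hour < OFFICE_END_HOUR


def pick_victims(groups, now, rng):
    """Pick at most one random instance from each non-empty group.

    `groups` maps a (hypothetical) group name to a list of instance IDs.
    Nothing is terminated here; the result is what *would* be terminated.
    """
    if not in_office_hours(now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances
    }


# Made-up groups and instance IDs, purely for illustration.
groups = {
    "api": ["i-001", "i-002", "i-003"],
    "web": ["i-101", "i-102"],
    "batch": [],
}
rng = random.Random(0)  # seeded so repeated runs behave the same

monday_morning = datetime(2012, 7, 30, 10, 0)
monday_3am = datetime(2012, 7, 30, 3, 0)

victims = pick_victims(groups, monday_morning, rng)
skipped = pick_victims(groups, monday_3am, rng)
```

In this sketch `victims` holds one instance each for "api" and "web", the empty "batch" group is ignored, and `skipped` is empty because 3am falls outside the assumed window.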

Administrators can adjust the probability that Chaos Monkey will shut down instances according to the sensitivity of the workload, and can opt certain applications out of the destructive program entirely. The tool also works with other cloud providers.
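A per-application policy of this kind might look something like the following Python sketch. The `GroupPolicy` class, its field names and the `should_terminate` function are hypothetical and are not taken from Chaos Monkey's actual configuration; the sketch only shows how an opt-out flag and a termination probability could combine.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupPolicy:
    """Hypothetical per-application settings, not Chaos Monkey's real ones."""
    enabled: bool = True       # False means the application has opted out
    probability: float = 1.0   # chance of a termination on any given run


def should_terminate(policy, rng):
    """Decide whether this run may terminate an instance in the group."""
    if not policy.enabled:
        return False
    # rng.random() returns a float in [0.0, 1.0), so 0.0 never fires
    # and 1.0 always fires.
    return rng.random() < policy.probability


rng = random.Random(42)
opted_out = GroupPolicy(enabled=False)
sensitive = GroupPolicy(probability=0.0)
robust = GroupPolicy(probability=1.0)

opted_out_results = [should_terminate(opted_out, rng) for _ in range(100)]
sensitive_results = [should_terminate(sensitive, rng) for _ in range(100)]
robust_results = [should_terminate(robust, rng) for _ in range(100)]
```

A sensitive workload would get a low probability or opt out altogether, while a service expected to shrug off instance loss would keep a high one.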

The source code for Chaos Monkey is available online.

Topics: Cloud, Amazon


Jack Clark has spent the past three years writing about the technical and economic principles that are driving the shift to cloud computing. He's visited data centers on two continents, quizzed senior engineers from Google, Intel and Facebook on the technologies they work on and read more technical papers than you care to name on topics f...
