Crash your cloud, before it crashes itself: Netflix shares tool to help find unknown bugs

Summary: The just-released Chaos Monkey tool lets cloud administrators unleash a mischievous program onto their cloud to randomly break components. But why would anyone want to do this?

TOPICS: Cloud, Amazon

Name any major failure that has struck a cloud recently - Amazon, Microsoft, Heroku - and the reason for the failure will be the same: an unforeseen problem.

But it doesn't have to be that way. Netflix, which operates a vast multi-continent video distribution cloud on top of Amazon Web Services, got so annoyed with unforeseen bugs in its own software that it designed a tool named Chaos Monkey to go out into its cloud and break things. The only difference between Netflix's tool and a real outage is that Chaos Monkey runs only in office hours.

"Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee?" Cory Bennett and Ariel Tseitlin wrote in a post to the company's engineering blog on Monday.

"Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."

The tool runs within Amazon Web Services. It seeks out workloads running in Auto Scaling Groups and terminates their virtual machines (instances) at random. This lets companies check how resilient their clouds are and, the theory goes, forces failures to surface during office hours, when companies are best equipped to investigate and deal with the effects of the outage.
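To make the selection step concrete, here is a minimal Python sketch of the idea, not Netflix's actual code. It makes no calls to Amazon Web Services: groups are plain lists of instance IDs, and the function names, the 9am-to-3pm weekday window and the one-instance-per-group rule are all assumptions chosen for illustration.

```python
import random
from datetime import datetime


def in_office_hours(now, start_hour=9, end_hour=15):
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive).

    The 9-to-15 window is an assumption, not a value taken from the tool.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, now, rng):
    """Pick at most one random instance per group, only during office hours.

    `groups` maps a hypothetical group name to a list of instance IDs.
    Returns a dict of group name -> instance ID selected for termination.
    """
    if not in_office_hours(now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances  # skip groups with nothing running
    }


if __name__ == "__main__":
    groups = {"api": ["i-1", "i-2"], "web": ["i-3"], "batch": []}
    rng = random.Random()
    # Monday morning: one instance from each non-empty group is selected.
    print(pick_victims(groups, datetime(2012, 7, 30, 10, 0), rng))
    # 3am: nothing is selected.
    print(pick_victims(groups, datetime(2012, 7, 30, 3, 0), rng))
```

A real tool would then ask the cloud provider to terminate the selected instances; that step is left out here so the sketch runs anywhere.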

Administrators can adjust the probability that Chaos Monkey shuts down instances according to the sensitivity of the workload, and can opt certain applications out of the destructive program entirely. The tool also works with other cloud providers.
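As a rough illustration of how a per-application probability and an opt-out could combine, consider the Python sketch below. It is a hypothetical model rather than the tool's real configuration format: the `GroupPolicy` class, its field names and the `should_terminate` function are invented for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupPolicy:
    """Hypothetical per-application settings (names are assumptions)."""
    enabled: bool = True       # False means the application has opted out
    probability: float = 1.0   # chance, per run, that an instance is terminated


def should_terminate(policy, rng):
    """Decide whether this run terminates an instance for the application."""
    if not policy.enabled:
        return False
    # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    return rng.random() < policy.probability


if __name__ == "__main__":
    rng = random.Random()
    policies = {
        "recommendations": GroupPolicy(probability=1.0),
        "billing": GroupPolicy(probability=0.2),   # sensitive: hit less often
        "payments": GroupPolicy(enabled=False),    # opted out entirely
    }
    for name, policy in policies.items():
        print(name, should_terminate(policy, rng))
```

Lowering `probability` for a sensitive workload makes terminations rarer without exempting it, while `enabled=False` removes the application from consideration altogether.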

The source code for Chaos Monkey is available online.

Jack Clark

About Jack Clark

Currently a reporter for ZDNet UK, I previously worked as a technology researcher and reporter for a London-based news agency.

  • I always wondered

    what Netflix did besides make pop-up windows. Good to know.
    Aloysious Farquart
  • The Issue of Cost

    Not an expert, but I imagine 'stress tests' can be expensive, esp. if done in full seriousness. There will always be 'unforeseen' problems -- but maybe the reason there are so many currently is because makers/owners/users don't put their products and systems fully to the test??