Name any major failure that has struck a cloud recently - Amazon, Microsoft, Heroku - and the reason for the failure will be the same: an unforeseen problem.
But it doesn't have to be that way. Netflix, which operates a vast multi-continent video distribution cloud on top of Amazon Web Services, got so annoyed with unforeseen bugs in its own software that it designed a tool named Chaos Monkey to go out into its cloud and break things. The only difference between Netflix's tool and a real outage is that Chaos Monkey runs only in office hours.
"Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee?" Cory Bennett and Ariel Tseitlin wrote in a post to the company's engineering blog on Monday.
"Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."
The tool runs within Amazon Web Services. It seeks out workloads running in Auto Scaling Groups and terminates their virtual machines (instances) at random. This lets companies check how resilient their clouds are and, the theory goes, forces failures to occur during office hours, when companies are best equipped to investigate and deal with the effects of the outage.
Administrators can adjust the probability that Chaos Monkey shuts down instances according to the sensitivity of the workload, and can opt certain applications out of the destructive program entirely. The tool also works with other cloud providers.
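The selection logic described above can be sketched in a few lines of Python. This is an illustration only, not Netflix's code: the `Group` class, the `pick_victims` function, the 9am-to-3pm weekday window and the one-instance-per-group limit are all assumptions made for the example, and nothing here talks to a real cloud API.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Group:
    """Stand-in for an Auto Scaling Group (hypothetical, not an AWS type)."""
    name: str
    instances: list
    probability: float = 1.0  # chance of one termination per run
    opted_out: bool = False   # True exempts the group entirely


def in_office_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour and end_hour (assumed window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, now, rng):
    """Return (group name, instance id) pairs chosen for termination."""
    if not in_office_hours(now):
        return []  # never break things while nobody is at their desk
    victims = []
    for group in groups:
        if group.opted_out or not group.instances:
            continue
        # Roll the dice against the group's configured probability.
        if rng.random() < group.probability:
            victims.append((group.name, rng.choice(group.instances)))
    return victims


if __name__ == "__main__":
    demo_groups = [
        Group("api", ["i-001", "i-002", "i-003"], probability=0.5),
        Group("billing", ["i-101"], opted_out=True),
    ]
    print(pick_victims(demo_groups, datetime.now(), random.Random()))
```

In this sketch a sensitive workload would be given a low `probability`, while `opted_out=True` removes a group from consideration altogether. A real tool would then pass each chosen instance to the cloud provider's terminate call.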
The source code for Chaos Monkey is available online.