One of DevOps's greatest promises is speeding up software delivery with Continuous Integration and Continuous Delivery (CI/CD). There's just one little problem: To do it efficiently, you need to test the latest patches in production with canary testing. Everyone says they do this, but most do a poor job of it. Now, Google and Netflix have teamed up to release Kayenta, which automates canary analysis.
Andrew Phillips, a Google cloud product manager, explained in an interview that canary testing, like the proverbial canary in a coal mine, exposes a small group of users to new code to see if something goes wrong in production. Typically, this has been done by hand: developers watch their dashboards to see if the canary dies (that is, if the code fails). This method, while popular, makes it far too easy to miss problems.
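To make the "small group of users" part concrete, here is a minimal Python sketch of one common way to carve out a canary population: hash each user ID into a bucket and send a fixed percentage of buckets to the new version. It's a hypothetical illustration, not code from Kayenta or Spinnaker; the function name and the 5 percent split are assumptions.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: float) -> bool:
    """Return True if this user should be served by the canary (new) version.

    Hypothetical helper: hashing the user ID, rather than sampling randomly
    per request, keeps each user pinned to the same version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # canary_percent=5.0 selects buckets 0..499, i.e. about 5% of users.
    return bucket < canary_percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(100_000)]
    share = sum(routes_to_canary(u, 5.0) for u in users) / len(users)
    print(f"canary share: {share:.3f}")
```

Because the bucket depends only on the user ID, a given user doesn't bounce between the old and new code from one request to the next, which keeps the canary's metrics cleaner.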
Kayenta, which is licensed under the Apache 2 license, was developed jointly by Google and Netflix. The program sprang from Netflix's internal canary system. According to Netflix director of delivery engineering Andy Glover, the joint development team spent almost a year getting the code for the open-source version ready for release. It's designed to give enterprise teams the confidence to push production changes quickly by reducing reliance on error-prone, time-intensive, and cumbersome manual or ad-hoc canary analysis.
The program is integrated with Spinnaker, a popular open-source, multi-cloud continuous delivery platform. Spinnaker and Kayenta run on most cloud platforms, including Amazon Web Services (AWS), Azure, Google Cloud, Kubernetes, and OpenStack. In short, no matter what your cloud, chances are you can deploy CI/CD on it with Spinnaker and Kayenta.
Specifically, teams can easily set up an automated canary analysis stage within a Spinnaker pipeline. You decide which metrics Kayenta should measure and use in its tests, and it combines them into an aggregate score for the canary. You then set the scores that count as success. Depending on how the canary's score compares with your thresholds, Kayenta can automatically promote or fail the canary, trigger a human approval path, or roll back the changes.
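The scoring-and-threshold idea can be sketched in a few lines of Python. This is a hypothetical model, not Kayenta's API: the function names, the equal default weights, and the 75/50 cutoffs are assumptions chosen for illustration. Each metric is reduced to a pass/fail verdict, the score is the weighted percentage of metrics that passed, and that score is then mapped onto a pipeline action.

```python
from typing import Dict, Optional


def canary_score(results: Dict[str, bool],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Aggregate per-metric pass/fail verdicts into a 0-100 score.

    Hypothetical scheme: each metric contributes its weight (default 1.0)
    to the total, and counts toward the score only if it passed.
    """
    if not results:
        raise ValueError("no metric results to score")
    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in results)
    if total <= 0:
        raise ValueError("metric weights must sum to a positive number")
    passed = sum(weights.get(name, 1.0) for name, ok in results.items() if ok)
    return 100.0 * passed / total


def decide(score: float, pass_threshold: float = 75.0,
           marginal_threshold: float = 50.0) -> str:
    """Map a score onto an action using two assumed cutoffs."""
    if score >= pass_threshold:
        return "promote"
    if score >= marginal_threshold:
        return "manual_judgment"  # hand the decision to a human approver
    return "rollback"


if __name__ == "__main__":
    results = {"error_rate": True, "p99_latency_ms": True,
               "cpu": False, "heap_used": True}
    score = canary_score(results)
    print(score, decide(score))  # 75.0 promote
```

Using two cutoffs rather than one leaves a middle band where the automation defers to a person, which matches the "human approval path" option alongside automatic promotion and rollback.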
According to Greg Burrell, Netflix's senior reliability engineer, "The quality of the canary version is assessed by comparing key metrics that describe the behavior of the old and new versions. If there is significant degradation in these metrics, the canary is aborted and all of the traffic is routed to the stable version in an effort to minimize the impact of unexpected behavior."
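Here is a deliberately simplified Python sketch of the comparison Burrell describes. It flags a metric as degraded when the canary's mean is worse than the baseline's by more than a fixed relative tolerance; a production canary judge would typically run a statistical test on the two sets of samples instead. The `MetricCheck` type, the 10 percent tolerance, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class MetricCheck:
    name: str
    baseline: List[float]         # samples from the stable (old) version
    canary: List[float]           # samples from the canary (new) version
    higher_is_worse: bool = True  # e.g. error rate or latency
    tolerance: float = 0.10       # allowed relative change before calling it degradation


def is_degraded(check: MetricCheck) -> bool:
    """True if the canary's mean is worse than the baseline's by more than the tolerance."""
    base, new = mean(check.baseline), mean(check.canary)
    if base == 0:
        # Relative change is undefined; treat any move in the bad direction as degradation.
        return new > 0 if check.higher_is_worse else new < 0
    change = (new - base) / abs(base)
    if check.higher_is_worse:
        return change > check.tolerance
    return change < -check.tolerance


def evaluate(checks: List[MetricCheck]) -> str:
    """Return "stable" to route all traffic back to the old version, else "canary"."""
    degraded = [c.name for c in checks if is_degraded(c)]
    if degraded:
        print("aborting canary; degraded metrics:", ", ".join(degraded))
        return "stable"
    return "canary"


if __name__ == "__main__":
    checks = [
        MetricCheck("error_rate", [0.010, 0.012, 0.011], [0.030, 0.028, 0.031]),
        MetricCheck("p99_latency_ms", [200, 210, 205], [204, 208, 206]),
    ]
    print(evaluate(checks))  # error_rate roughly tripled, so this prints "stable"
```

The `higher_is_worse` flag matters because "degradation" points in different directions for different metrics: rising latency is bad, while falling throughput is bad.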
How do you determine what's acceptable or not? That depends on the metrics you feed Kayenta. It currently supports the following metric data sources: Prometheus, Stackdriver, Datadog, and Netflix's Atlas. You can also combine different metric sources into a single analysis: some metrics may come from one source while others come from another.
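Mixing sources comes down to assigning each metric to the system that holds it and pulling everything into one data set before judging. In the Python sketch below, the in-memory dictionaries stand in for real Prometheus and Datadog queries; the `collect` function and the plan format are hypothetical and are not how Kayenta itself is configured.

```python
from typing import Callable, Dict, List

# A metric source is modeled as a function that takes a metric name and returns samples.
MetricSource = Callable[[str], List[float]]

# Hypothetical in-memory stand-ins for two monitoring back ends.
PROMETHEUS_DATA = {"error_rate": [0.010, 0.012, 0.011]}
DATADOG_DATA = {"p99_latency_ms": [200.0, 210.0, 205.0]}

SOURCES: Dict[str, MetricSource] = {
    "prometheus": lambda metric: PROMETHEUS_DATA[metric],
    "datadog": lambda metric: DATADOG_DATA[metric],
}


def collect(plan: Dict[str, str],
            sources: Dict[str, MetricSource]) -> Dict[str, List[float]]:
    """Fetch each metric in the plan from the source it is assigned to."""
    data: Dict[str, List[float]] = {}
    for metric, source_name in plan.items():
        if source_name not in sources:
            raise KeyError(f"no metric source named {source_name!r} (needed for {metric!r})")
        data[metric] = sources[source_name](metric)
    return data


if __name__ == "__main__":
    plan = {"error_rate": "prometheus", "p99_latency_ms": "datadog"}
    print(collect(plan, SOURCES))
    # {'error_rate': [0.01, 0.012, 0.011], 'p99_latency_ms': [200.0, 210.0, 205.0]}
```

Once every metric lands in the same structure, the judging step doesn't need to know which monitoring system each number came from.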
Netflix is eating its own canary. Burrell wrote, "We are in the middle of migrating from our legacy system to Kayenta. Currently, Kayenta runs approximately 30 percent of our production canary judgments, which amounts to an average of 200 judgments per day. Over the next few months, we plan on migrating all internal users to Kayenta."
If it's good enough for Netflix, which runs entirely on the cloud and embraced DevOps early on, it should be good enough for you to try. I see Kayenta as a major step forward in making CI/CD a part of any company's software deployment plans. It's that good.