"By taking away the easy parts of [the] task, automation can make the difficult parts of the human operator's task more difficult."
In other words, automate all the easy things, and what's left for people to do? The hard things.
This maxim has never been truer. When systems become too automated, their behavior in key respects becomes harder and harder to predict, and setting them straight when they go wrong requires deeper and deeper expertise. While we are in a world of dramatically increasing automation -- chatbots, DevOps pipelines, AIOps, and more -- the dark side is increasingly seen in problems such as the Boeing 737 MAX. When human factors are left out of the design process, and humans, therefore, cannot function effectively as a coordinated system with the automation, very bad things can happen.
On a less dramatic note, here at Forrester, we are hearing signals that not all is well on the automation front. A few large and very competent clients have mentioned to me lately that their mean time to restore (MTTR) is drifting upward, which is unexpected given their investments in trying to reduce it. Bob Davis of Plutora (a company that aggregates a lot of operational IT data) confirmed this in a conversation: "We've become sensitive to the topic of MTTR over the past six months as a measure of maturity. As customers get more sophisticated, we're seeing unexpected behavior, with MTTR going up."
Note that MTTR may not ultimately be a great metric to keep tracking; John Allspaw of Adaptive Capacity Labs has criticized it. But as it is such a widespread industry metric, I still believe it is a useful though imperfect signal, especially over larger-scale data sets and longer time horizons.
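For readers who want to watch this signal in their own data, here is a minimal sketch of tracking MTTR over time. It assumes each incident is recorded as a pair of timestamps (detected, restored). The sample records and the `mttr_by_month` helper are hypothetical illustrations, not data or tooling from any client or vendor mentioned here.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (detected_at, restored_at).
incidents = [
    (datetime(2019, 1, 5, 9, 0), datetime(2019, 1, 5, 10, 0)),
    (datetime(2019, 1, 20, 14, 0), datetime(2019, 1, 20, 14, 30)),
    (datetime(2019, 2, 3, 8, 0), datetime(2019, 2, 3, 10, 0)),
    (datetime(2019, 2, 17, 22, 0), datetime(2019, 2, 17, 23, 0)),
]

def mttr_by_month(records):
    """Return {(year, month): mean minutes to restore}, keyed by detection month."""
    buckets = {}
    for detected, restored in records:
        minutes = (restored - detected).total_seconds() / 60
        buckets.setdefault((detected.year, detected.month), []).append(minutes)
    return {month: mean(durations) for month, durations in sorted(buckets.items())}

monthly = mttr_by_month(incidents)
for (year, month), minutes in monthly.items():
    print(f"{year}-{month:02d}: MTTR {minutes:.0f} min")
```

A simple mean like this is easily skewed by a single long outage, which is one reason to read it only as a rough trend over larger data sets and longer time horizons.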
We also have statements from vendors such as Atlassian and Zendesk that the effective lifespan of knowledge articles is shrinking and the incidence of known errors (i.e., repeating incidents) is falling. This means that any given incident, issue, or defect is more likely to be a "zero-day" concern (to borrow a term from security). Such concerns require higher skills -- in classic service desk/NOC terms, the work moves from Tier 1 to Tier 2 or 3.
And finally, there is the problem of Hollnagel's law of stretched systems, which states that "every system is stretched to operate at its capacity; as soon as there is some improvement (for example, in the form of new technology), it will be exploited to achieve a new intensity and tempo of activity." (Thanks to J. Paul Reed of Netflix for tracking the source of this down for me.)
All in all, the contradicting dynamics (a classic balancing feedback problem) can be represented thus:
So what is to be done? It's critical to recognize that this problem is inherent. It won't go away. But in our latest report, there are some recommendations, including:
- Design the human/machine system as a unified whole.
- Embrace safety sciences and resilience engineering, including fields such as engineering psychology and human factors that have long studied these problems.
- Empower teams as your highest-value unit.
- Adopt the SRE perspective on automation.
- Use AI itself to help solve your observability problem.
- Adopt blameless retrospectives.
- Rationalize your automation portfolio.
This post was written by Principal Analysts Charles Betz and Chris Gardner.