3 steps for introducing more chaos into systems (yes, that's a good thing)

Everyone wants stability in their information technology systems, but with too much stability, IT departments lose their edge, one expert warns.
Written by Joe McKendrick, Contributing Writer

Did you know there is such a thing as too much stability in your IT infrastructure?


Stability is supposed to be the goal of every right-minded IT executive, right? But perhaps, just as too much food is too much of a good thing, or too much drink is definitely too much of a good thing, the same applies to IT stability.

That's the view of HP Software Security Evangelist Rafal Los. In a recent post, Los says stability brings four very negative results: "complacency, change resistance, rigidity and a diminished capacity to respond and recover." (Also the results of too much food and drink, by the way.)
Los argues that an IT shop running too smoothly will quickly lose its edge -- so when something goes wrong, it really goes wrong in a big way. "Every organization I’ve ever been a part of has spent countless dollars and immeasurable energy striving for stability in which everything is predictable," he says. "Unfortunately, these are the organizations that recover slowest when the inevitable, unpredictable catastrophe hits." An apt comparison may be "a search-and-rescue team that sits idle for too long can become rusty under pressure without constant drilling and practice."
Instead of striving for stability, IT executives should strive for more resiliency, Los says. In essence, be a little more of a "chaotic" enterprise.

His advice for becoming more chaotic and less steady-state:

  • Automate, incrementally: "Learning from each successive instrumented failure, we can leverage automation to detect and/or compensate faster and more accurately in the future, when an unplanned failure strikes. With enough instrumented chaos, the likelihood of unhandled failures will fall dramatically."
  • Allow some components to fail, on purpose: "Allow components (a switch, virtual machine, content cache, or even a database query) to regularly fail, and your systems and processes to detect and respond. Having built in component-level resiliency and constantly testing for it in a state of controlled chaos, you can be more confident that real failure won’t catch you off guard."
  • Learn to fail fast and recover fast: This "improves your MTtR (Mean Time to Repair) metric." The goal should be continuous improvement and learning, not building completely unbreakable systems.  
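
To make those three ideas a little more concrete, here is a minimal sketch of a controlled-chaos drill in Python. It is not taken from Los's post: the `Component` and `ChaosDrill` classes, the `run_drills` helper, the component names, and the simulated repair times are all hypothetical, and the assumption that repair time shrinks with each drill is built in purely for illustration. The sketch fails one component on purpose per round, detects and restores it, and tracks mean time to repair across rounds.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Component:
    """A hypothetical component (switch, VM, cache) that can be failed on purpose."""
    name: str
    healthy: bool = True


@dataclass
class ChaosDrill:
    """Injects failures deliberately and records simulated repair times."""
    components: list
    rng: random.Random
    repair_times: list = field(default_factory=list)

    def inject_failure(self):
        """Pick one healthy component at random and mark it as failed."""
        healthy = [c for c in self.components if c.healthy]
        victim = self.rng.choice(healthy)
        victim.healthy = False
        return victim

    def detect_and_recover(self, simulated_seconds):
        """Find failed components, restore them, and log the repair time."""
        recovered = []
        for component in self.components:
            if not component.healthy:
                component.healthy = True
                self.repair_times.append(simulated_seconds)
                recovered.append(component.name)
        return recovered

    def mean_time_to_repair(self):
        """Mean time to repair: total repair time divided by number of repairs."""
        if not self.repair_times:
            return 0.0
        return sum(self.repair_times) / len(self.repair_times)


def run_drills(rounds, seed=0):
    """Run a series of fail-on-purpose drills and return the drill record."""
    rng = random.Random(seed)
    drill = ChaosDrill(
        [Component("switch"), Component("vm"), Component("cache")], rng
    )
    for i in range(rounds):
        drill.inject_failure()
        # Assumption for illustration only: each drill teaches the team
        # something, so the simulated repair time drops by 10s per round,
        # with a floor of 5s.
        drill.detect_and_recover(simulated_seconds=max(5.0, 60.0 - 10.0 * i))
    return drill


if __name__ == "__main__":
    result = run_drills(rounds=4)
    print(f"Repairs: {len(result.repair_times)}")
    print(f"Mean time to repair: {result.mean_time_to_repair():.1f}s")
```

In a real environment the repair times would come from monitoring data rather than a formula, and the failure injection would target actual infrastructure under carefully controlled conditions. The point of the sketch is only the loop itself: fail something deliberately, confirm that detection and recovery work, and watch the repair metric over time.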

Yes, some chaos can be a good thing. As Los points out, it's good to keep IT ops, systems and people toned up, and perhaps a little energized. Unfortunately, some readers may feel they're already immersed in a little too much chaos brought on by their organizations. That's another story.

(Photo: James Martin, CNET.)
