Enterprise systems are inherently complex, often involving many business processes, people, and organizations across a company. Given this built-in complexity, it's no surprise that failures abound; it's amazing these systems function at all.
We could make these same comments about any complex, mission critical system. For example, look no further than the space program or health care delivery. In both cases, massive complexity is connected to a need to get things right: failure means potential loss of life.
To say that complicated systems are more prone to break down than simpler systems is obvious. But there are also other, more subtle truths regarding failure and complex systems.
A paper copyrighted in 1998, called How Complex Systems Fail and written by an M.D., Dr. Richard Cook, describes 18 truths about the underlying reasons complicated systems break down. On the surface the list appears surprisingly simple, but deeper meaning is also present. Some of the points are obvious while others may surprise you.
THE EIGHTEEN TRUTHS
The first few items explain that catastrophic failure only occurs when multiple components break down simultaneously:
1. Complex systems are intrinsically hazardous systems. The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. It is the presence of these hazards that drives the creation of defenses against hazard that characterize these systems.
2. Complex systems are heavily and successfully defended against failure. The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.
4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.
5. Complex systems run in degraded mode. A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.
Point six is important because it clearly states that the potential for failure is inherent in complex systems. For large-scale enterprise systems, the profound implications mean that system planners must accept the potential for failure and build in safeguards. Sounds obvious, but too often we ignore this reality:
6. Catastrophe is always just around the corner. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system's own nature.
Given the inherent potential for failure, the next point describes the difficulty in assigning simple blame when something goes wrong. For analytic convenience (or laziness), we may prefer to distill narrow causes for failure, but that can lead to incorrect conclusions:
7. Post-accident attribution accident to a ‘root cause' is fundamentally wrong. Because overt failure requires multiple faults, there is no isolated ‘cause' of an accident. There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.
The next group goes beyond the nature of complex systems and discusses the all-important human element in causing failure:
8. Hindsight biases post-accident assessments of human performance. Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.
9. Human operators have dual roles: as producers & as defenders against failure. The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. This dynamic quality of system operation, the balancing of demands for production against the possibility of incipient failure is unavoidable.
10. All practitioner actions are gambles. After accidents, the overt failure often appears to have been inevitable and the practitioner's actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.
11. Actions at the sharp end resolve all ambiguity. Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of practitioners at the sharp end of the system. After an accident, practitioner actions may be regarded as ‘errors' or ‘violations' but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.
Starting with the nature of complex systems and then discussing the human element, the paper argues that sensitivity to preventing failure must be built in ongoing operations. In my experience, this is true and has substantial implications for the organizational culture of project teams:
12. Human practitioners are the adaptable element of complex systems. Practitioners and first line management actively adapt the system to maximize production and minimize accidents. These adaptations often occur on a moment by moment basis.
13. Human expertise in complex systems is constantly changing. Complex systems require substantial human expertise in their operation and management. Critical issues related to expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or demanding production needs and (2) the need to develop expertise for future use.
14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
15. Views of ‘cause' limit the effectiveness of defenses against future events. Post-accident remedies for "human error" are usually predicated on obstructing activities that can "cause" accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents.
16. Safety is a characteristic of systems and not of their components. Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. The state of safety in any system is always dynamic; continuous systemic change insures that hazard and its management are constantly changing.
17. People continuously create safety. Failure free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance. These activities are, for the most part, part of normal operations and superficially straightforward. But because system operations are never trouble free, human practitioner adaptations to changing conditions actually create safety from moment to moment.
The paper concludes with a ray of hope to those have been through the wars:
18. Failure free operations require experience with failure. Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the "edge of the envelope". It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope.
THE PROJECT FAILURES ANALYSIS
It is not often you read something that completely changes the way you look at IT. This paper How Complex Systems Fail rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management.
It dates from 1998!!. Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own, it is a whole four pages long, and he wrote it with medical systems in mind. But that doesn't matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.
Read this paper. And READ it: none of this 21st Century 10-second-attention-span scanning. READ IT HARD. Blow your service management mind.
My take. This is one of the most insightful and important papers on failure I have read. Although focused on health care delivery, the lessons are equally applicable to large enterprise software systems.
[Photo of the long tail of failure from iStockphoto.]