The Ironies of Reliability


Reliability promotes failures. Failures promote reliability

When a system has been reliable long enough, production pressure causes operators to drive it harder. Over time, operators become less careful as the trauma of the last failure wears off. More workload is applied, new features are introduced, and so on, until the system strays again into the danger zone (e.g. a level of load once thought to be dangerous). This time it sails through smoothly, boosting the operators' confidence in the robustness of their system. Eventually the system does fail, and after several such failures safety again becomes a higher priority than production: effort is put into making the system safe again, and the cycle begins once more.

Reliability requires failure

Failure is necessary to improve reliability because it provides important information about system behavior in rarely seen states. Failure is also an important source of real-world information and training for the operators and designers of the system. It follows that

A system without failures is unreliable

A system which has never failed is inherently unreliable. Its operators have no information on its possible failure modes or its behavior under stress. Moreover, they are untrained in handling real-world failure and are often overconfident because of their system’s outstanding record.

There are many ways to fail, but only a few to succeed

To paraphrase Tolstoy:

All working systems are alike; each failed system is faulty in its own way

Correct operation is transient, failure tends to be stable

Due to the natural variation in systems and the statistical bias towards failure (there are many more failure states than working states), a system will spontaneously drift towards failure. Continuous supervision and remediation are required to keep the system in working order; thus, the “normal” state of the system is inherently unstable. We know from personal experience that, left unattended, our systems will fail. In contrast, a failed system is not likely to correct itself spontaneously (although this does happen sometimes). A system may drift into a broken state and stay there forever: failure can be stable.
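
To make the "many more failure states" argument concrete, here is a minimal sketch (a hypothetical Monte Carlo, not a model of any real system): a system of N components, each of which breaks with a small probability per step and never repairs itself. The system counts as working only while every component is healthy, so there is one working state and 2^N - 1 failure states; without remediation the system drifts into failure and stays there.

```python
import random

# Hypothetical illustration: N components, each may break with probability
# p_break per time step and never repairs itself. The system is "working"
# only while all components are healthy: one working state, 2**N - 1 broken ones.
def steps_until_failure(n_components=10, p_break=0.01, max_steps=100_000):
    healthy = [True] * n_components
    for step in range(1, max_steps + 1):
        for i in range(n_components):
            if healthy[i] and random.random() < p_break:
                healthy[i] = False       # spontaneous drift towards failure
        if not all(healthy):
            return step                  # failure reached, and it is stable
    return max_steps

runs = [steps_until_failure() for _ in range(1_000)]
print("mean steps until first failure:", sum(runs) / len(runs))
```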

Failure is the only reliable thing

For a system to be completely reliable, it must avoid every failure, all the time. This is of course infeasible, and the only prediction we can reliably make is “The system will fail eventually”.
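
A back-of-the-envelope way to see this (assuming, purely hypothetically, a constant independent failure probability p per day): the chance of surviving n days is (1 - p)^n, which tends to zero no matter how small p is.

```python
# Hypothetical numbers: a constant 0.1% chance of failure per day.
p = 0.001
for days in (30, 365, 3650, 36500):
    print(days, "days without failure:", round((1 - p) ** days, 4))
# The probability of never failing shrinks towards 0 as the horizon grows.
```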

Fault tolerance introduces failure

Safety devices, often taking the form of replicas, failovers and clusters, are used to deal with component failures in IT systems. While they allow the system to tolerate some failures, they often introduce new kinds of failures of their own, e.g. split brain and replication loops. Moreover, the added complexity of multiple components and the interactions between them makes the system harder to reason about and failure remediation much harder.
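
As a toy illustration (hypothetical names and logic, not any particular clustering product): two replicas that each promote themselves whenever they cannot reach the current primary will both end up primary during a network partition, the classic split brain created by the failover mechanism itself.

```python
# Hypothetical two-node failover sketch. Each standby promotes itself to
# primary whenever it cannot reach the current primary. A network partition
# then leaves both nodes acting as primary at once: split brain, a failure
# mode introduced by the fault-tolerance mechanism itself.
class Node:
    def __init__(self, name, is_primary=False):
        self.name = name
        self.is_primary = is_primary
        self.can_reach_peer = True

    def health_check_peer(self):
        # Naive rule: if the (presumed) primary is unreachable, take over.
        if not self.is_primary and not self.can_reach_peer:
            self.is_primary = True

a = Node("a", is_primary=True)
b = Node("b")

# Network partition: both nodes stay up but lose sight of each other.
a.can_reach_peer = b.can_reach_peer = False
a.health_check_peer()
b.health_check_peer()

print([n.name for n in (a, b) if n.is_primary])  # ['a', 'b'] -> split brain
```

The usual remedy, a quorum or fencing mechanism, is itself more machinery that can misbehave, which is exactly the irony.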
