Why Systems Fail: A Discussion of Robust Engineering Design
Simply put, systems tend to fail due, in part, to a lack of robustness. That is to say, they were never designed to withstand unforeseen variation.
Introduction
Systems such as political institutions, energy distribution networks, interstates and roadways, cities and neighborhoods are, at their core, engineering problem sets. Specifically, they are systems which were designed and built with certain assumptions and criteria in mind. And, like all problems, they are soluble.
Why, then, do these systems tend to break?
The argument I will make here is that the cause is a simple lack of robustness.
Of course, the logical next questions are "What constitutes robustness?" and "How do we know when a solution (i.e. a system) is robust?". Perhaps it is easier to start with examples of systems that are most definitely not robust, that is to say, systems which are fragile.
Examples of Fragility
1. A public transit network which assumes a maximum number of travelers: Of course, the requirement cannot be for the transit system to handle an infinite number of travelers. That is both unreasonable and unnecessary. However, if no consideration is given to a growing population, the system's solution will be static and will encourage other engineers to design their systems around a static populace. This could manifest as a subway system that needs to expand and add more parallel routes but is unable to, because a water system was built adjacent to the existing subway tunnels, thus boxing in our public transit network.
2. A web application running on a single server which assumes a maximum number of requests per second: Because a single server has physical limits on memory and computational speed, there exists a rather immediate performance threshold which no software running on the server can surpass. The linear solution (i.e. a derivation of the original solution in which two servers are used) would generate a host of complications simply because the original software was not designed to function on more than one server. Not surprisingly, an entire field of software engineering has developed to tackle this exact problem, namely Site Reliability Engineering (SRE).
Commonalities
The commonalities between the above examples are fairly straightforward.
The first is that of maximum present efficiency. Or, plainly said, the two systems were designed for the present and not the future. Both the subway system and the web application were engineered towards maximizing current possible production and minimizing current possible cost. By that methodology, the systems are successful, but only for the exact moment in which they were instantiated. At any moment past that point, given an increase or decrease in system load, the systems begin to fracture.
As noted by Andrei A. Klishin from the University of Washington, "the coupling of heterogeneous subsystems restricts subsystem component specifications, and small changes in the design of one subsystem can trigger avalanches of change in connected subsystems" (Klishin et al., 2020).
The second is that of problem expansion. Again, in plain English, the systems caused problems to manifest in the surrounding systems when they required adjustment. In the case of the subway system, once the need for expansion was established, the surrounding electrical and water systems needed to be completely rerouted. In the case of the web application, once the single server reached capacity and a second server was introduced, complications of central storage, state management, and load balancing arose, all of which would require a complete rework of the web application's primary networking logic.
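The state-management complication can be made concrete with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the class names, the visit counter, and the `InMemoryStore`, which merely stands in for whatever shared database or cache a real deployment would use. The round-robin function is a toy substitute for a load balancer.

```python
from __future__ import annotations

from typing import Protocol


class SessionStore(Protocol):
    """Hypothetical interface: anything that can load and save a visit count."""

    def get(self, user: str) -> int: ...
    def put(self, user: str, visits: int) -> None: ...


class InMemoryStore:
    """Stand-in for a shared store (a database or cache in a real deployment)."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    def get(self, user: str) -> int:
        return self._data.get(user, 0)

    def put(self, user: str, visits: int) -> None:
        self._data[user] = visits


class FragileServer:
    """Keeps state in its own memory, so each copy counts visits separately."""

    def __init__(self) -> None:
        self._visits: dict[str, int] = {}

    def handle(self, user: str) -> int:
        self._visits[user] = self._visits.get(user, 0) + 1
        return self._visits[user]


class RobustServer:
    """Keeps no state of its own; every copy reads and writes the shared store."""

    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def handle(self, user: str) -> int:
        # Read-then-write is not atomic; concurrency is ignored in this sketch.
        visits = self._store.get(user) + 1
        self._store.put(user, visits)
        return visits


def round_robin(servers: list, user: str, requests: int) -> list[int]:
    """Toy load balancer: send each request for one user to the next server."""
    return [servers[i % len(servers)].handle(user) for i in range(requests)]


fragile = round_robin([FragileServer(), FragileServer()], "ada", 4)
shared = InMemoryStore()
robust = round_robin([RobustServer(shared), RobustServer(shared)], "ada", 4)
print(fragile)  # [1, 1, 2, 2]: the two servers disagree about the count
print(robust)   # [1, 2, 3, 4]: the count survives the move to two servers
```

The second design costs an extra interface and an extra hop to the store even while only one server exists, which is the kind of up-front expense that designing for the present alone would avoid.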
What's the catch? Efficiency.
It should come as no surprise that a trade-off exists when engineering robustness: efficiency. In fact, the relationship between robustness and efficiency is practically inverse. This doesn't seem to challenge intuition. It would be expected that any additional allocation of resources to a system, without a direct and immediate purpose, would lessen the overall efficiency of the system. More code and more tunnels require more work with no increase in current performance.
Is it worth the trade-off?
Perhaps. It could be argued that robustness can be overdone. Much like the supply and demand model in economic theory, the point at which the robustness and efficiency curves intersect denotes an optimum. At a certain point, the system needs to be considered complete.
The idea is not to build systems that are impervious to change, or so easily changed as to lose the principles behind the solution, but instead systems which do not require dismantling in order to achieve success in a newly adapted environment.
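A rough way to see the inverse relationship is to put numbers on it. The sketch below is an illustration under stated assumptions, not a model from the cited papers: it assumes load grows at a fixed compound rate per year, and the function names and figures (1,000 requests per second, 5% growth) are invented for the example.

```python
import math


def utilization(load: float, capacity: float) -> float:
    """Fraction of built capacity in use today (a crude efficiency measure)."""
    return load / capacity


def years_until_full(load: float, capacity: float, growth_rate: float) -> float:
    """Years until load, compounding at growth_rate per year, reaches capacity."""
    if growth_rate <= 0:
        raise ValueError("growth_rate must be positive")
    if load >= capacity:
        return 0.0
    return math.log(capacity / load) / math.log(1 + growth_rate)


# Assumed figures: 1,000 requests per second today, growing 5% per year.
for capacity in (1000, 1500, 2000):
    used = utilization(1000, capacity)
    years = years_until_full(1000, capacity, 0.05)
    print(f"capacity={capacity}: {used:.0%} used today, full in {years:.1f} years")
```

Under these assumptions, doubling capacity halves present utilization and buys roughly fourteen years before the limit is reached; the system sized exactly for today is fully efficient and out of room immediately.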
How to design with Robustness?
In short, peer into the future. Designing for a new, nonexistent set of problems necessarily requires the designer to imagine the future. An engineer must suppose some potential futures while simultaneously imagining how possible variations in a solution may impact surrounding and existing solutions. Fortunately, the human mind is suited for such a task.
Final Thoughts
The discipline of engineering is no small venture in its own right. And, given that no problem solvable with engineering exists in isolation, the interdependencies among the solutions breed widespread fragility. That is, unless those solutions are engineered with robustness from the outset.
References:
- Klishin, A. A., Kirkley, A., Singer, D. J., & Van Anders, G. (2020). Robust design from systems physics. Scientific Reports, 10(1), 1-16. https://doi.org/10.1038/s41598-020-70980-5
- Fong, J., Filliben, J., Heckert, N., deWit, R., & Bernstein, B. (2008). Robust Engineering Design for Failure Prevention. Proceedings of ASME Pressure Vessels and Piping Division Conference, Chicago, IL. https://tsapps.nist.gov/publication/getpdf.cfm?pubid=152089 (Accessed September 21, 2023)