Published 2026-03-29

Why Complex Systems Fail

Every organization's technology environment is a complex system. Understanding what that means — and what it doesn't — is essential to managing risk effectively.

Complex vs. Complicated

A complicated system has many parts, but they interact in predictable ways. A jet engine is complicated. You can take it apart, understand every component, and put it back together. A complex system has interactions that cannot be fully enumerated, much less predicted. Your IT infrastructure — connected to cloud services, third-party APIs, supply chains, and human operators — is complex.

Charles Perrow formalized this distinction in Normal Accidents (1984). Systems with interactive complexity (components interact in unexpected ways) and tight coupling (changes propagate rapidly, with no buffer) will inevitably produce failures that no one anticipated and that no safety system can prevent.

Perrow developed his framework for safety-critical industrial systems such as nuclear plants, chemical refineries, and aviation. Applying it to information security and IT infrastructure is an extension worth stating explicitly, but it is a principled extension rather than a loose analogy: interactive complexity and tight coupling are properties of automata, and they apply equally to mechanical systems, control systems, networked computers, and software stacks. There is no privileged class of systems that escapes them. The patterns Perrow documented in the 1979 Three Mile Island accident are visible today in cascading cloud outages, in supply-chain compromises that propagate through dependency graphs, and in security tooling whose interactions create failure modes its operators cannot enumerate. The framework travels because the underlying mathematics travels.

System Accidents

Interactive Complexity
Components interact in ways that cannot be fully enumerated. A change in one subsystem produces effects in another subsystem through pathways that were not designed and may not be visible.

Tight Coupling
Changes propagate rapidly through the system with little or no buffer. There is no time to understand what is happening before the effects cascade.

System Accident
A failure that arises from the interaction of components, not from the failure of any individual component. No one made an error. The system did exactly what it was designed to do — and the result was catastrophic.

Three Mile Island. The 2003 Northeast blackout. The 2024 CrowdStrike incident that grounded airlines worldwide, in which a single update to a security product cascaded through millions of systems in hours. None of these reduces to a single bug or a single human error. They are system accidents: the inevitable consequence of building systems more complex than anyone can understand.

Why This Matters Now

Every organization's technology environment has become a complex system. Cloud dependencies, microservices, API integrations, SaaS platforms, AI tools — each connection adds interactive complexity. Each automation adds tight coupling. The number of possible failure modes in a modern enterprise IT environment exceeds what any human or AI can enumerate.

This is not a theoretical concern. If your organization uses cloud services, third-party APIs, or any form of automation, you are operating a complex system whether you manage it as one or not.

More Controls, More Complexity

The natural response to complexity is more complexity. More monitoring tools. More redundant systems. More compliance checklists. More automated controls. Each of these adds value — but each also adds interactions that create new failure modes.

Perrow documented this at Three Mile Island: the safety systems designed to prevent a meltdown actually contributed to it. The operators couldn't distinguish between a real emergency and a safety system malfunction because the safety systems had made the overall system more complex than the operators could comprehend.

This pattern plays out in technology environments every day:

  • Adding a WAF protects against attacks but adds a failure mode — WAF misconfiguration blocks legitimate traffic.
  • Implementing multi-factor authentication reduces unauthorized access but adds a dependency — an MFA outage locks everyone out (a sketch of this coupling follows the list).
  • Deploying to multiple availability zones improves resilience but adds complexity — cross-zone synchronization failures.
  • Automating incident response speeds reaction time but adds tight coupling — automated response to a false positive causes the very outage it was designed to prevent.
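
To make the MFA dependency concrete, here is a minimal sketch in Python. Every name in it is hypothetical, a stand-in for whatever identity stack you actually run; the point is the shape of the coupling, not any particular product:

    # Hypothetical names throughout. The control works exactly as designed,
    # but login now has a hard runtime dependency on the MFA provider.

    class MfaProviderUnavailable(Exception):
        """Raised when the external MFA provider cannot be reached."""

    def check_password(user: str, password: str) -> bool:
        # Stand-in for a real credential check.
        return password == "correct horse battery staple"

    def verify_otp(user: str, otp: str) -> bool:
        # Stand-in for a call to the external MFA provider. During an
        # outage it raises instead of answering.
        raise MfaProviderUnavailable(f"cannot reach MFA provider for {user}")

    def authenticate(user: str, password: str, otp: str) -> bool:
        if not check_password(user, password):
            return False
        try:
            return verify_otp(user, otp)
        except MfaProviderUnavailable:
            # Fail closed: safer against attackers, but a provider outage
            # now locks out every legitimate user at once. Failing open
            # would preserve availability at the cost of silently dropping
            # the control. Either choice is a failure mode that did not
            # exist before the control was added.
            return False

    # During the simulated outage, even correct credentials fail:
    print(authenticate("alice", "correct horse battery staple", "123456"))  # False

Whether to fail open or fail closed is a real decision with real consequences. The deeper point is that adding the control forced that decision into existence.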

Incompressible vs. Compressible Complexity

Not all complexity is the same. Incompressible complexity is essential: you cannot eliminate it without losing functionality. Your authentication system is complex because authentication is inherently complex. That complexity cannot be engineered away, only handled well.

Compressible complexity is the complexity you can reduce. Unnecessary integrations. Redundant tools that overlap. Processes that exist because "we've always done it that way." Compliance theater that adds checkboxes without reducing actual risk. Reducing this kind of complexity is where your energy should go.

The danger is confusing the two. Over-analysis of incompressible complexity wastes resources and creates a false sense that the complexity can be managed away. Ignoring compressible complexity leaves real risk on the table. The skill is knowing which is which — and that skill comes from experience, not from frameworks.

Managing What Cannot Be Prevented

You cannot eliminate system accidents. You can reduce their frequency and limit their impact. Here's how:

  • Reduce unnecessary coupling. Microservices that share a database are tightly coupled no matter how independent their code looks. Shared authentication providers, shared DNS, shared CI/CD pipelines — each creates a coupling that can cascade. Identify single points of failure and decide which ones you can tolerate.
  • Prefer successive approximation over grand plans. Start with what you know, make small improvements, measure results. RESCOR's RAPID methodology (developed in 1992) was built on this principle: frequent, lightweight development cycles produce better results than extensive analysis and planning.
  • Focus on compressible complexity. Don't waste energy analyzing complexity you can't change. Find the unnecessary integrations, the redundant tools, the processes that add overhead without reducing risk. Eliminate those.
  • Build for graceful degradation. Systems that detect and contain failures are more resilient than systems that try to prevent all failures. Circuit breakers, bulkheads, fallback modes, manual overrides — these are more valuable than additional monitoring tools (the first sketch after this list shows a minimal circuit breaker).
  • Measure risk quantitatively. Qualitative risk labels ("low," "medium," "high") hide the actual exposure. STORM quantitative risk measurement reveals whether your controls are actually reducing risk or just redistributing it (the second sketch after this list shows the kind of estimate quantification produces).
  • Accept that some decisions will be wrong. The fear of wrong decisions creates more risk than actually making wrong decisions. Make more, smaller decisions. Measure the results. Adjust.
  • Keep humans in the loop. Automated systems make faster decisions than humans. They also make faster mistakes. In any system where the consequences of a mistake are significant, a human confirmation step between "system proposes" and "action happens" prevents most catastrophic outcomes.
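
The circuit breaker named in the list is the canonical graceful-degradation pattern. Here is a minimal sketch in Python; the names and thresholds are illustrative, not taken from any particular library:

    import time

    class CircuitBreaker:
        """After repeated failures, stop calling the dependency and serve a
        fallback, instead of letting the failure cascade upstream."""

        def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
            self.max_failures = max_failures  # failures before tripping open
            self.reset_after = reset_after    # seconds before retrying the dependency
            self.failures = 0
            self.opened_at: float | None = None

        def call(self, fn, fallback):
            # While open, short-circuit to the fallback until the cool-off expires.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return fallback()
                self.opened_at = None  # cool-off over: try the dependency again
                self.failures = 0
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # trip: degrade, don't cascade
                return fallback()
            self.failures = 0
            return result

Wrapping a flaky dependency, for example breaker.call(fetch_prices, lambda: cached_prices) with hypothetical functions, converts "dependency down, system down" into "dependency down, system degraded."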
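
The source does not describe STORM's internals, so the second sketch is a generic illustration of what "quantitative" buys you over labels: a small Monte Carlo simulation that turns an assumed event frequency and loss-magnitude distribution into an annualized loss estimate. All inputs are invented for illustration:

    import math
    import random

    def poisson(lam: float) -> int:
        # Knuth's method: fine for the small event rates used here.
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    def simulated_annual_losses(freq_per_year: float, median_loss: float,
                                sigma: float, years: int = 100_000) -> list[float]:
        """For each simulated year: Poisson event count, lognormal severity."""
        mu = math.log(median_loss)  # a lognormal's median is e**mu
        return sorted(
            sum(random.lognormvariate(mu, sigma)
                for _ in range(poisson(freq_per_year)))
            for _ in range(years)
        )

    # Invented inputs: ~0.4 incidents/year, $200k median loss, heavy right tail.
    losses = simulated_annual_losses(freq_per_year=0.4, median_loss=200_000, sigma=1.0)
    print(f"expected annual loss: ${sum(losses) / len(losses):,.0f}")
    print(f"95th-percentile year: ${losses[int(0.95 * len(losses))]:,.0f}")

Run it with and without a proposed control's assumed effect on frequency or severity and compare the two distributions. That comparison is exactly what a label like "medium" cannot express.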

The Bottom Line

Your technology environment is a complex system. It will produce failures that no one anticipated. The question is not how to prevent those failures — it's how to limit their impact when they occur. Organizations that accept this reality and design for resilience will outperform organizations that spend their resources trying to prevent the unpredictable.

Continuous, small improvements — what quality management calls successive approximation — produce superior results to extensive analysis and grand strategic initiatives. Start with what you know. Make small changes. Measure the results. Repeat. That is how humans naturally acquire every skill, and it is how organizations naturally improve their resilience.

Related: The Limits of Artificial Intelligence explores how complexity science constrains AI prediction. Should You Trust AI-Generated Code? applies these principles to software development.