Redundancy

Add independent alternatives so one failure doesn’t stop the outcome; test failover so the backup isn’t imaginary.

Author

Reliability engineering & safety science (von Neumann, Shannon; modern SRE and business continuity practice)

model type

about

Redundancy means providing more than one way to achieve a required function. In a series system, the failure of any one link fails the whole; redundancy puts paths in parallel, so other components, suppliers or people can take over. It differs from buffers (time/stock), and it works best when backups are independent and regularly exercised so they hold up under stress.

How it works

Patterns

Active–active (parallel) – multiple units serve at once; one can disappear with no outage.
Active–passive (standby) – secondary takes over on failure; classify as hot/warm/cold by readiness.
2N/N+1/quorum – full duplication (2N), one extra unit (N+1), or majority voting (quorum, as in consensus protocols). RAID is a related storage pattern that relies on mirroring or parity rather than voting.
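
The claim that one unit can disappear with no outage holds only if the remaining units can carry the full load. A minimal sketch of that check, assuming each unit has a known capacity in the same units as demand (the function name `survives_single_failure` and the numbers are illustrative, not from any particular tool):

```python
from typing import Sequence


def survives_single_failure(capacities: Sequence[float], demand: float) -> bool:
    """Return True if demand is still met after losing any one unit.

    Hypothetical helper: capacities and demand share the same unit
    (e.g. requests per second). Losing the largest unit is the worst case.
    """
    if not capacities:
        return False
    remaining_worst_case = sum(capacities) - max(capacities)
    return remaining_worst_case >= demand


# N+1: two units are needed for the load, a third is the spare.
print(survives_single_failure([50, 50, 50], demand=100))  # True
# No spare: losing either unit drops capacity below demand.
print(survives_single_failure([50, 50], demand=100))      # False
# 2N: each unit alone can carry the whole load.
print(survives_single_failure([100, 100], demand=100))    # True
```

The same check applies to active–passive designs if the standby's capacity is included in the list.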

Independence & diversity – spread across vendors/regions/power/failure modes; add design diversity to avoid common-mode failure.

Reliability math (intuition) – in series, component reliabilities multiply, so any one failure stops the system; in parallel, the system succeeds if any path works.
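
The intuition can be made concrete with the standard textbook formulas, under the strong assumption that component failures are statistically independent. The function names and the example reliabilities below are illustrative only:

```python
from math import prod
from typing import Iterable


def series_reliability(reliabilities: Iterable[float]) -> float:
    """All components must work: R = R1 * R2 * ... * Rn."""
    return prod(reliabilities)


def parallel_reliability(reliabilities: Iterable[float]) -> float:
    """At least one path must work: R = 1 - (1 - R1) * ... * (1 - Rn).

    Assumes failures are independent; shared dependencies break this.
    """
    return 1 - prod(1 - r for r in reliabilities)


# Three 99% components in series are weaker than any one of them.
print(round(series_reliability([0.99, 0.99, 0.99]), 6))  # 0.970299
# Two 90% paths in parallel beat either path alone.
print(round(parallel_reliability([0.9, 0.9]), 6))        # 0.99
```

Real systems rarely satisfy the independence assumption fully, so the parallel figure is best read as an upper bound.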

Graceful degradation – shed non-essential features under overload or partial failure to keep the core available.

People & process – cross-training, runbooks, and documentation raise the bus factor.

Data & backups – separate copies, media and locations; verify with restore tests.
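
One way a restore test might be checked is by comparing checksums of the original and the restored copy. This is a sketch under assumptions: the helper names are hypothetical, and in practice the reference checksum would usually be recorded at backup time rather than recomputed from a live source.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """Hypothetical check: the restored file is byte-identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

A matching checksum shows the bytes survived; it does not show that the application can actually start from the restored data, which is what a full restore drill covers.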

use-cases

SRE/IT – multi-AZ/region, load balancers, database replicas, circuit breakers.

Supply chain – dual-source critical inputs; safety stock at bottlenecks.

Operations – spare capacity, alternate routes, manual fallbacks.

Finance – liquidity buffers, diversified facilities, ring-fenced risk.

Org design – deputy roles, rota coverage, shared ownership of key knowledge.

How to apply

Map the function and SPOFs – draw the value stream; mark single points of failure (tech, vendor, person, licence, site).
Choose a pattern per SPOF – N+1 for components, 2N for safety-critical functions, quorum for consensus systems, graceful degradation for peak load.
Ensure independence – separate clouds/regions/power feeds; vendor and design diversity where failure modes could correlate.
Instrument detection & switchover – health checks, timeouts, automated failover with manual override.
Drill it – scheduled game days and restore tests; rotate duties so backups stay warm.
Keep parity – config and data sync for standbys; prevent drift with automation.
Set service levels – target availability/MTTR; place redundancy where the impact or irreversibility is highest.
Review cost vs risk – model expected loss vs capex/opex; keep redundancy where it buys meaningful risk reduction.
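
The detection and switchover step above could be sketched as a small selection routine: prefer units in priority order, skip any that fail a health check, and let an operator override the choice. The names `choose_active`, `is_healthy` and `manual_override` are assumptions made for illustration, not any specific product's API.

```python
from typing import Callable, Optional, Sequence


def choose_active(
    units: Sequence[str],
    is_healthy: Callable[[str], bool],
    manual_override: Optional[str] = None,
) -> Optional[str]:
    """Pick which unit should serve traffic.

    units: candidates in priority order (primary first).
    is_healthy: hypothetical health-check callback supplied by the caller.
    manual_override: if set, the operator's choice wins over automation.
    """
    if manual_override is not None:
        return manual_override
    for unit in units:
        if is_healthy(unit):
            return unit
    return None  # nothing healthy: escalate to a human


# Usage: the primary fails its health check, so the standby is selected.
healthy_now = {"standby"}
print(choose_active(["primary", "standby"], lambda u: u in healthy_now))  # standby
```

A real health check would also need a timeout, so that a hung unit counts as unhealthy rather than blocking the decision.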

pitfalls & cautions

Common-mode failure – “redundant” paths sharing a region, provider, library or process.
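
A rough way to see why this matters: if every redundant unit depends on one shared element, that element's reliability caps the whole arrangement. The sketch below assumes the units fail independently of each other once the shared element is up; the function name and figures are illustrative.

```python
def parallel_with_shared_dependency(
    r_unit: float, n_units: int, r_shared: float
) -> float:
    """Reliability of n parallel units that all sit behind one shared dependency.

    Simplified model: R = r_shared * (1 - (1 - r_unit) ** n_units).
    The result can never exceed r_shared, however many units are added.
    """
    return r_shared * (1 - (1 - r_unit) ** n_units)


# Three 99% replicas in a single region that is itself 99.9% reliable.
print(round(parallel_with_shared_dependency(0.99, 3, 0.999), 6))  # 0.998999
```

Adding a fourth or fifth replica behind the same dependency barely moves the result; removing the shared dependency does.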

Bit-rot – cold backups decay; no one practices the switchover.

Split-brain & inconsistency – unsynchronised replicas; design clear leadership/quorum rules.
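
A common quorum rule is to require a strict majority of the configured members before acting as leader, so that two sides of a network partition cannot both qualify. A minimal sketch, with hypothetical function names:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of the configured cluster."""
    return cluster_size // 2 + 1


def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition may keep accepting writes."""
    return reachable_members >= quorum_size(cluster_size)


# A 5-node cluster splits 3/2: only the larger side keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))  # True False
# A 4-node cluster splits 2/2: neither side has quorum, so both stop.
print(has_quorum(2, 4), has_quorum(2, 4))  # False False
```

The even split shows why odd cluster sizes are usually preferred: a 4-node cluster tolerates no more failures than a 3-node one.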

Complexity tax – more parts mean more failure modes; keep designs simple and observable.

False comfort – redundancy without detection, automation, or runbooks.

Security surface – extra endpoints and creds expand attack surface; pair with controls.
