
Redundancy

Deliberate duplication of critical elements or capacity to avoid single points of failure and reduce downtime. You trade cost and complexity for availability and graceful degradation.
Author
General engineering and operations practice
About
Redundancy is a resilience pattern: add alternate paths or spare capacity so the system still meets its objective when parts fail. It shows up in infra (HA pairs), supply chains (second sources), finance (cash buffers), and teams (cross-training). Redundancy is not waste when it’s targeted at constraints and failure modes.
How it works – what to map
N+1 / 2N capacity – one extra unit beyond need (N+1) or a full duplicate (2N).
Active–active vs active–passive – load shared all the time vs hot/warm/cold standby.
Diversity – different vendors/paths/technologies to avoid common-mode failure.
Buffers – inventory, queues, cash to absorb variance and desynchronise spikes.
Geographic & failure-domain isolation – contain blast radius; avoid correlated outages.
Human redundancy – pairing, shadowing, runbooks; no single indispensable person.
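The N+1 / 2N idea above can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function names `units_required` and `availability_at_least` are hypothetical, and the availability figure assumes units fail independently, which is exactly the assumption that common-mode failures break.

```python
import math


def units_required(peak_load: float, unit_capacity: float, scheme: str = "N+1") -> int:
    """Return how many units to provision under the N, N+1, or 2N scheme."""
    # N is the minimum number of units needed to carry the peak load.
    n = math.ceil(peak_load / unit_capacity)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


def availability_at_least(k: int, total: int, unit_availability: float) -> float:
    """Probability that at least k of `total` units are up.

    Assumption: units fail independently (binomial model).
    """
    a = unit_availability
    return sum(
        math.comb(total, i) * a**i * (1 - a) ** (total - i)
        for i in range(k, total + 1)
    )


if __name__ == "__main__":
    n = units_required(900, 250, "N")        # peak load 900, each unit carries 250
    n_plus_1 = units_required(900, 250, "N+1")
    print(n, n_plus_1)
    print(availability_at_least(n, n, 0.99))         # no spare unit
    print(availability_at_least(n, n_plus_1, 0.99))  # one spare unit
```

With a peak load of 900 and units that carry 250 each, N is 4, so N+1 provisions 5 units and 2N provisions 8. At 99% availability per unit, four units with no spare are all up about 96.1% of the time, while five units with one spare keep at least four up about 99.9% of the time.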
Use cases
Infra & platforms – multi-AZ deployments, redundant links, database replicas, blue/green. In crypto, people often contrast Ethereum (many independently run nodes) with Solana (fewer validators, and several network outages that required coordinated restarts).
Supply chain – dual sourcing, safety stock, alternate logistics lanes.
Finance – liquidity buffers, credit lines, runway targets.
Operations – cross-trained staff who can cover each other's shifts; backup on-call; spare equipment.
Data & compliance – 3-2-1 backups, immutable snapshots, DR drills.
Comms – secondary ISP/SIM, failover routing, out-of-band channels for incidents.
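As one concrete check from the list above, the 3-2-1 backup rule (three copies, on two different media types, with one off-site) can be expressed as a small predicate. The `BackupCopy` type and its fields are hypothetical, and the sketch assumes the list of copies includes the primary one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check: at least 3 copies, on at least 2 media types, at least 1 off-site."""
    media_types = {c.media for c in copies}
    return (
        len(copies) >= 3
        and len(media_types) >= 2
        and any(c.offsite for c in copies)
    )


if __name__ == "__main__":
    copies = [
        BackupCopy("primary", "disk", offsite=False),
        BackupCopy("local-nas", "disk", offsite=False),
        BackupCopy("cloud-archive", "object-storage", offsite=True),
    ]
    print(satisfies_3_2_1(copies))
```

A check like this only confirms that the copies exist on paper. Whether they restore is what the DR drills are for.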
How to apply
Map SPOFs – list single points of failure and their blast radius; include people and vendors.
Set targets – SLOs, RTO/RPO, max queueing delay; price the cost of downtime.
Choose pattern – N+1/2N, active–active, buffers, diversity, or geo isolation.
Decouple & isolate – fail fast, circuit-break, define failure domains.
Test failover – game days, chaos drills, people-out tests; fix runbooks.
Instrument – watch MTTR, takeover time, split-brain risk, and buffer health.
Optimise – remove redundancy kept for its own sake; keep only what materially cuts risk.
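Pricing the cost of downtime, as in the "Set targets" and "Optimise" steps above, can start with rough arithmetic. The sketch below is one simple way to frame it: the function names and all figures are illustrative assumptions, and it treats downtime cost as linear in hours, which real incidents often are not.

```python
HOURS_PER_YEAR = 24 * 365


def expected_annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly downtime cost at a given availability (linear cost model)."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_hour


def redundancy_pays_off(
    availability_before: float,
    availability_after: float,
    cost_per_hour: float,
    annual_redundancy_cost: float,
) -> bool:
    """True if the downtime cost avoided exceeds what the redundancy costs per year."""
    saved = expected_annual_downtime_cost(
        availability_before, cost_per_hour
    ) - expected_annual_downtime_cost(availability_after, cost_per_hour)
    return saved > annual_redundancy_cost


if __name__ == "__main__":
    # Illustrative numbers only: moving from 99% to 99.9% availability.
    print(redundancy_pays_off(0.99, 0.999, cost_per_hour=5000, annual_redundancy_cost=120_000))
    print(redundancy_pays_off(0.99, 0.999, cost_per_hour=100, annual_redundancy_cost=120_000))
```

With these made-up numbers, the same improvement is worth roughly 394,000 a year when an hour of downtime costs 5,000, and under 8,000 a year when it costs 100. The identical spare is a clear win in the first case and waste in the second.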
Pitfalls and cautions
Common-mode failures – “redundant” units share the same bug, power, or provider.
Untested standby – cold spares that don’t start; failover procedures nobody has run.
Complexity tax – more parts → more failure modes; keep interfaces simple.
Hidden coupling – shared configs/keys/queues create synchronous collapse.
Cost without benefit – redundancy where downtime is cheap; spend where impact is real.
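Common-mode failures and hidden coupling can be screened for by listing what each "redundant" unit depends on and flagging anything shared. The following is a rough sketch under assumed inputs: the unit names, attribute keys, and the `shared_dependencies` helper are all hypothetical.

```python
from collections import defaultdict


def shared_dependencies(
    units: dict[str, dict[str, str]],
) -> dict[tuple[str, str], list[str]]:
    """Map each (attribute, value) pair to the units sharing it, if more than one does."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for unit, attrs in units.items():
        for attr, value in attrs.items():
            groups[(attr, value)].append(unit)
    # Keep only dependencies that two or more units have in common.
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical HA pair: separate power feeds, but same provider and software build.
    units = {
        "db-a": {"provider": "cloud-x", "power_feed": "A", "software": "v2.3"},
        "db-b": {"provider": "cloud-x", "power_feed": "B", "software": "v2.3"},
    }
    for (attr, value), names in shared_dependencies(units).items():
        print(f"shared {attr}={value}: {names}")
```

In this example the pair is isolated on power but shares a provider and a software version, so a provider outage or a bug in that build would take out both units at once.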