Resiliency Patterns in Distributed Systems
After completing this topic, you will be able to:
- Explain the relationship between resiliency, availability, and fault tolerance
- Identify the major categories of resiliency patterns and their use cases
- Describe how resiliency patterns work together to build reliable distributed systems
TL;DR
Resiliency is a system’s ability to detect, absorb, and recover from failures while maintaining acceptable service levels. Unlike high availability (which focuses on uptime) or fault tolerance (which masks failures), resiliency acknowledges that failures will happen and designs systems to gracefully degrade and recover. Modern distributed systems combine isolation patterns (bulkheads, circuit breakers), recovery patterns (retries, fallbacks), and coordination patterns (health checks, load shedding) to build resilient architectures that survive real-world chaos.
Cheat Sheet: Resiliency = Detection + Isolation + Recovery. Key patterns: Circuit Breaker (stop calling failing services), Bulkhead (isolate failures), Retry with Backoff (handle transient failures), Fallback (degrade gracefully), Health Checks (detect problems early). Always combine multiple patterns—no single technique provides complete resiliency.
Why This Matters
When Netflix’s Chaos Monkey randomly terminates production instances, the service keeps streaming. When AWS has a regional outage, well-designed systems failover to other regions. When a payment processor experiences latency spikes, resilient e-commerce platforms degrade gracefully rather than timing out every checkout. This is resiliency in action—and it’s what separates systems that survive real-world conditions from those that crumble under pressure.
In system design interviews, resiliency demonstrates mature engineering thinking. Junior engineers often focus on the happy path, assuming perfect network conditions and zero failures. Senior engineers know that distributed systems fail in spectacular and unpredictable ways: network partitions, cascading failures, resource exhaustion, dependency outages, and slow degradation. Interviewers want to see that you understand failure modes and can design systems that remain operational when—not if—things go wrong.
Resiliency matters because modern systems are inherently fragile. A typical microservices architecture might have dozens of services, each with multiple dependencies. If Service A calls Service B, which calls Service C, and C starts timing out, the failure can cascade backward through the entire call chain, taking down services that were otherwise healthy. Without resiliency patterns, a single slow database query can bring down your entire platform. With proper resiliency, you isolate failures, recover quickly, and maintain service for users even when parts of your system are struggling.
The business impact is measurable. Amazon found that every 100ms of latency costs them 1% in sales. Google discovered that a 500ms delay reduces traffic by 20%. When your payment processor goes down during Black Friday, resiliency patterns are the difference between losing millions in revenue and gracefully falling back to an alternative processor. Companies like Netflix, Uber, and Stripe invest heavily in resiliency because downtime is existential—users switch to competitors instantly when services fail.
The Landscape
The resiliency landscape has evolved dramatically as systems moved from monoliths to distributed architectures. In the monolithic era, resiliency meant running redundant servers behind a load balancer and having good backup procedures. If the application crashed, you restarted it. If the database failed, you failed over to a replica. Failures were relatively simple and localized.
Microservices changed everything. Now a single user request might traverse ten services, each with its own database, cache, and external dependencies. Failures became distributed, cascading, and much harder to predict. This complexity drove the development of modern resiliency patterns, pioneered by companies operating at massive scale. Netflix open-sourced Hystrix (circuit breaker library) after learning painful lessons about cascading failures. Google’s Site Reliability Engineering practices codified patterns for building resilient systems. Amazon’s “everything fails all the time” philosophy led to sophisticated failure injection and recovery mechanisms.
Today’s resiliency toolkit spans multiple layers. At the network layer, you have load balancers with health checks and automatic failover. At the application layer, you have circuit breakers, bulkheads, retries, timeouts, and fallbacks. At the data layer, you have replication, backups, and eventual consistency patterns. At the infrastructure layer, you have multi-region deployments, chaos engineering, and automated recovery. These layers work together—resiliency is never a single technique but a defense-in-depth strategy.
The cloud era introduced new resiliency challenges and tools. Cloud providers offer managed services with built-in resiliency (like AWS’s multi-AZ RDS), but they also introduce new failure modes (like regional outages). Service meshes like Istio and Linkerd provide resiliency features (retries, timeouts, circuit breakers) at the infrastructure level, making them automatic rather than requiring application-level implementation. Observability tools like Datadog and Honeycomb make it possible to detect and diagnose failures quickly, which is essential for recovery.
Chaos engineering emerged as a discipline for proactively testing resiliency. Netflix’s Chaos Monkey randomly terminates instances in production to ensure systems can handle failures. Gremlin and other chaos engineering platforms let you inject failures systematically—network latency, CPU exhaustion, disk failures—to validate that your resiliency patterns actually work. This shift from reactive (fixing failures after they happen) to proactive (breaking things intentionally) represents mature resiliency thinking.
Key Areas
Isolation Patterns prevent failures from spreading. When one component fails, isolation patterns contain the damage so it doesn’t cascade to other parts of the system. The bulkhead pattern partitions resources (thread pools, connection pools, memory) so that one failing operation can’t exhaust resources needed by other operations. Circuit breakers detect when a downstream service is failing and stop sending requests to it, preventing cascading timeouts. Rate limiting protects services from being overwhelmed by too many requests. These patterns are your first line of defense—they stop small failures from becoming system-wide outages. Netflix uses bulkheads extensively to ensure that when one API endpoint is slow, it doesn’t block threads needed by other endpoints.
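The bulkhead idea can be sketched with a bounded semaphore: each downstream dependency gets its own concurrency partition, and calls are rejected immediately rather than queued once the partition is full. This is a minimal illustrative sketch, not a production library; the class name and limits are hypothetical.

```python
import threading

class Bulkhead:
    """Limit concurrent calls to one dependency so a slow service
    cannot exhaust threads shared with other operations (sketch)."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queuing when the partition is full.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

Rejecting rather than queuing is the key design choice: queued work still holds memory and keeps callers waiting, which is exactly the resource exhaustion the bulkhead exists to prevent.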
Recovery Patterns help systems bounce back from failures. Retries with exponential backoff handle transient failures (like temporary network glitches) by trying again after waiting progressively longer intervals. Fallbacks provide degraded functionality when primary systems fail—showing cached data when the database is unavailable, or using a backup payment processor when the primary one is down. Timeouts prevent operations from hanging indefinitely, ensuring that failures are detected quickly. These patterns acknowledge that failures happen and focus on recovering gracefully. Stripe’s payment processing uses sophisticated retry logic with idempotency keys to ensure payments eventually succeed even when networks are unreliable.
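A retry loop with exponential backoff and full jitter can be sketched as follows (delays and attempt counts are illustrative, not recommended production values):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry a call that may fail transiently, waiting exponentially
    longer between attempts, with full jitter to avoid thundering herds."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Cap the backoff, then sleep a random slice of it (full jitter)
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

The jitter matters as much as the backoff: if a thousand clients all retry on the same fixed schedule after an outage, their synchronized retries can knock the recovering service back over.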
Detection and Monitoring patterns identify failures quickly so recovery can begin. Health checks continuously probe services to detect when they’re unhealthy, allowing load balancers to route traffic away from failing instances. Heartbeats and watchdogs detect when processes have hung or crashed. Distributed tracing (like Jaeger or Zipkin) helps identify where failures are occurring in complex call chains. Metrics and alerting (like Prometheus and Grafana) provide visibility into system health. Fast detection is critical—the longer a failure goes undetected, the more damage it causes. Google’s SRE practices emphasize that mean time to detect (MTTD) is often more important than mean time to repair (MTTR).
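The heartbeat idea reduces to tracking the last time each instance checked in and declaring it unhealthy once that timestamp goes stale. A minimal sketch with an injectable clock (names and the 5-second timeout are illustrative):

```python
import time

class HeartbeatMonitor:
    """Mark an instance unhealthy if no heartbeat arrives within
    `timeout` seconds (sketch; clock is injectable for testing)."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_beat = {}

    def beat(self, instance_id: str):
        self.last_beat[instance_id] = self.clock()

    def healthy(self, instance_id: str) -> bool:
        last = self.last_beat.get(instance_id)
        # Never-seen instances are treated as unhealthy, not unknown.
        return last is not None and self.clock() - last <= self.timeout
```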
Coordination Patterns manage how services interact during failures. Load shedding deliberately drops low-priority requests when the system is overloaded, preserving capacity for high-priority operations. Backpressure propagates load information upstream so that fast producers slow down when consumers can’t keep up. Graceful degradation systematically disables non-essential features to keep core functionality working. Leader election and consensus algorithms (like Raft or Paxos) ensure that distributed systems can make decisions even when some nodes fail. These patterns are essential for systems that need to remain operational under extreme load or partial failures. During AWS outages, well-designed systems use load shedding to prioritize critical operations like authentication over less critical features like recommendation engines.
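Priority-based load shedding can be sketched as an admission check: below a utilization threshold everything is admitted, above it only critical work gets through. The threshold and priority labels below are illustrative assumptions.

```python
class LoadShedder:
    """Drop low-priority work when utilization crosses a threshold
    (sketch; threshold and priority scheme are illustrative)."""

    def __init__(self, shed_threshold=0.8):
        self.shed_threshold = shed_threshold

    def admit(self, priority: str, current_utilization: float) -> bool:
        if current_utilization < self.shed_threshold:
            return True                  # healthy: admit everything
        return priority == "critical"    # overloaded: critical work only
```

In practice utilization might come from queue depth, CPU, or in-flight request counts, and there are usually more than two priority tiers, but the shape is the same: a cheap check at the front door that protects the expensive work behind it.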
Testing and Validation patterns ensure resiliency actually works. Chaos engineering deliberately injects failures to validate that recovery mechanisms function correctly. Fault injection testing simulates specific failure scenarios (network partitions, slow dependencies, resource exhaustion) in controlled environments. Game days and disaster recovery drills practice failure scenarios with human operators to ensure runbooks are correct and teams know how to respond. Synthetic monitoring continuously exercises critical paths to detect problems before users do. Without testing, resiliency patterns are theoretical—you don’t know if they work until a real failure occurs. Netflix runs Chaos Monkey in production continuously because they learned that untested resiliency measures often fail when you need them most.
Resiliency Pattern Taxonomy
graph TB
subgraph ISO["Isolation Patterns"]
CB["Circuit Breaker<br/><i>Stop calling failing services</i>"]
BH["Bulkhead<br/><i>Partition resources</i>"]
RL["Rate Limiting<br/><i>Prevent overload</i>"]
end
subgraph REC["Recovery Patterns"]
RT["Retry with Backoff<br/><i>Handle transient failures</i>"]
FB["Fallback<br/><i>Degraded functionality</i>"]
TO["Timeout<br/><i>Fail fast</i>"]
end
subgraph DET["Detection Patterns"]
HC["Health Checks<br/><i>Probe service status</i>"]
HB["Heartbeats<br/><i>Detect hung processes</i>"]
DT["Distributed Tracing<br/><i>Track request flows</i>"]
end
subgraph COORD["Coordination Patterns"]
LS["Load Shedding<br/><i>Drop low-priority requests</i>"]
BP["Backpressure<br/><i>Slow down producers</i>"]
GD["Graceful Degradation<br/><i>Disable non-essential features</i>"]
end
Failure["System Failure"] --> ISO
Failure --> REC
Failure --> DET
Failure --> COORD
Resiliency patterns organized into four categories: Isolation (contain failures), Recovery (bounce back), Detection (identify problems), and Coordination (manage system-wide responses). Effective resiliency requires combining patterns from all categories.
How Things Connect
Resiliency patterns form a layered defense system where each layer addresses different failure modes and timescales. Isolation patterns (bulkheads, circuit breakers) operate at millisecond timescales, immediately preventing failures from spreading. Recovery patterns (retries, fallbacks) operate at second timescales, attempting to recover from transient failures. Detection patterns (health checks, monitoring) operate at second-to-minute timescales, identifying when recovery isn’t working. Coordination patterns (load shedding, graceful degradation) operate at minute timescales, managing system-wide responses to sustained problems.
These patterns complement each other in specific ways. Circuit breakers work with retries: retries handle transient failures, but if failures persist, the circuit breaker opens to prevent wasted retry attempts. Bulkheads work with timeouts: bulkheads limit how many threads can be blocked, while timeouts ensure those threads don’t block forever. Health checks work with load balancers: health checks detect failing instances, and load balancers route traffic away from them. Fallbacks work with circuit breakers: when a circuit breaker opens, fallbacks provide alternative functionality.
The relationship between resiliency, availability, and fault tolerance is often confused but important to understand. Availability measures uptime—what percentage of time is the system operational? Fault tolerance means the system continues operating correctly even when components fail, typically through redundancy and failover. Resiliency is broader: it’s the ability to detect failures, contain their impact, recover quickly, and learn from incidents. A system can be highly available (99.99% uptime) but not resilient if it achieves that uptime through luck rather than robust failure handling. Conversely, a resilient system might have planned downtime (lower availability) but recover quickly from unexpected failures.
Resiliency patterns also connect to other system design areas. Scalability patterns (like load balancing and horizontal scaling) improve resiliency by distributing load across multiple instances, so no single instance is a single point of failure. Observability is essential for resiliency—you can’t recover from failures you can’t detect. Security patterns (like rate limiting and authentication) prevent malicious actors from causing failures. Data consistency patterns (like eventual consistency and conflict resolution) help systems remain operational even when network partitions occur.
The key insight is that resiliency is emergent—it comes from how patterns interact, not from any single pattern. A circuit breaker alone doesn’t make a system resilient; it needs to be combined with retries, fallbacks, monitoring, and testing. Netflix’s resiliency doesn’t come from using Hystrix; it comes from combining circuit breakers with chaos engineering, comprehensive monitoring, automated remediation, and a culture that expects and plans for failures. In interviews, demonstrating this holistic understanding—showing how patterns work together—is what distinguishes senior engineers from those who just memorize individual techniques.
Cascading Failure: Without vs With Resiliency Patterns
graph LR
subgraph W1["Without Resiliency"]
U1["User"] --"1. Request"--> A1["Service A<br/><i>No timeout</i>"]
A1 --"2. Call"--> B1["Service B<br/><i>No circuit breaker</i>"]
B1 --"3. Call"--> C1["Service C<br/><i>SLOW/FAILING</i>"]
C1 -."Timeout (30s)".-> B1
B1 -."Threads blocked".-> B1
B1 -."Timeout (30s)".-> A1
A1 -."Threads blocked".-> A1
A1 -."Timeout (30s)".-> U1
Note1["❌ All threads blocked<br/>❌ 30s user timeout<br/>❌ Failure cascades up"]
end
subgraph W2["With Resiliency"]
U2["User"] --"1. Request"--> A2["Service A<br/><i>3s timeout, bulkhead</i>"]
A2 --"2. Call (1s timeout)"--> B2["Service B<br/><i>Circuit breaker</i>"]
B2 --"3. Call"--> C2["Service C<br/><i>SLOW/FAILING</i>"]
C2 -."Timeout (1s)".-> B2
B2 --"Circuit opens"--> B2
B2 --"4. Fallback response"--> A2
A2 --"5. Cached data"--> U2
Note2["✓ Isolated thread pool<br/>✓ 1s response time<br/>✓ Graceful degradation"]
end
Without resiliency patterns (left), a slow Service C blocks threads in Service B, which blocks threads in Service A, causing a cascading failure with 30+ second timeouts. With resiliency patterns (right), circuit breakers detect the failure, bulkheads isolate resources, and fallbacks provide degraded functionality within 1 second.
Pattern Interaction: Circuit Breaker + Retry + Fallback
stateDiagram-v2
[*] --> Closed: Initial state
Closed --> Closed: Success (reset failure count)
Closed --> Open: Failure threshold exceeded<br/>(e.g., 5 failures in 10s)
Open --> HalfOpen: Timeout period expires<br/>(e.g., after 30s)
Open --> Open: All requests fail fast<br/>(no retry attempted)
HalfOpen --> Closed: Test request succeeds
HalfOpen --> Open: Test request fails
note right of Closed
Retry Pattern Active:
- Retry with exponential backoff
- Max 3 attempts
- Add jitter to prevent thundering herd
end note
note right of Open
Fallback Pattern Active:
- Return cached data
- Use backup service
- Degrade gracefully
end note
note right of HalfOpen
Limited Testing:
- Allow 1 request through
- If success, resume normal operation
- If failure, back to Open state
end note
Circuit breaker state machine showing how it interacts with retry and fallback patterns. In Closed state, retries handle transient failures. When failures exceed threshold, circuit Opens and all requests use fallbacks immediately (fail-fast). After timeout, circuit enters Half-Open to test if service recovered.
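The state machine above can be expressed as a small class. This is a minimal sketch of the pattern, not Hystrix or any specific library; the thresholds are illustrative and the clock is injectable so the Half-Open transition can be tested.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker implementing the Closed/Open/Half-Open
    state machine (sketch; thresholds are illustrative, not tuned)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # allow one test request through
            else:
                return fallback()          # fail fast: fn is never invoked
        try:
            result = fn()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self):
        self.state = "closed"
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A half-open test failure reopens immediately; in closed state,
        # reopen only once the failure threshold is exceeded.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A production implementation would add thread safety, a rolling failure-rate window rather than a plain counter, and metrics on every state transition, but the three-state core is exactly this.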
Real-World Context
Netflix pioneered many modern resiliency practices out of necessity. When they migrated from data centers to AWS, they faced a new reality: cloud infrastructure fails frequently and unpredictably. Their response was to embrace failure rather than fight it. They built Hystrix to implement circuit breakers and bulkheads across their microservices. They created Chaos Monkey to randomly terminate production instances, forcing their systems to handle failures continuously. They developed Simian Army (Chaos Gorilla, Chaos Kong) to simulate larger failures like entire availability zone or region outages. This chaos engineering approach proved so valuable that Netflix open-sourced their tools and other companies adopted similar practices. Today, Netflix can lose entire AWS regions and continue streaming because their resiliency patterns are battle-tested in production daily.
Uber’s resiliency strategy focuses on graceful degradation and fallbacks. Their dispatch system has multiple fallback layers: if the primary matching algorithm fails, they fall back to a simpler algorithm; if real-time pricing fails, they use cached prices; if the map service is slow, they show a simplified map. During a major AWS S3 outage that affected many companies, Uber remained operational because their services had fallbacks that didn’t depend on S3. They also use sophisticated load shedding: during peak demand (like New Year’s Eve), they prioritize ride requests over less critical features like viewing ride history. This ensures core functionality works even when the system is overloaded.
Stripe’s payment processing requires extreme resiliency because financial transactions must be reliable. They use idempotency keys to make retries safe—if a payment request is retried due to a network failure, the idempotency key ensures the customer isn’t charged twice. They implement sophisticated retry logic with exponential backoff and jitter to handle transient failures from payment processors. They maintain fallback payment processors so that if their primary processor has issues, they can route transactions to alternatives. They use circuit breakers to detect when a payment processor is failing and stop sending requests to it. Their resiliency patterns are so robust that they maintain 99.99%+ availability even though they depend on external payment networks that are less reliable.
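The server side of the idempotency-key idea can be sketched as a lookup table keyed on the client-supplied key: a retried request replays the stored result instead of re-executing the charge. This is a toy in-memory sketch of the concept, not Stripe's actual implementation; all names are hypothetical.

```python
class PaymentService:
    """Hypothetical server-side dedupe keyed on a client-supplied
    idempotency key, making retries of a charge safe (sketch)."""

    def __init__(self):
        self._results = {}   # idempotency_key -> previously returned result
        self.charges = 0     # how many real charges were executed

    def charge(self, idempotency_key: str, amount_cents: int):
        if idempotency_key in self._results:
            # Retry detected: replay the stored result, charge nothing.
            return self._results[idempotency_key]
        self.charges += 1
        result = {"status": "succeeded", "amount": amount_cents}
        self._results[idempotency_key] = result
        return result
```

A real service would persist the key-to-result mapping durably and expire old keys, but the contract is the same: the client generates a unique key per logical operation and reuses it on every retry of that operation.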
Google’s approach to resiliency is codified in their Site Reliability Engineering practices. They use error budgets to balance feature development with reliability: each service gets a budget for downtime (e.g., 99.9% availability allows 43 minutes of downtime per month), and if they exceed that budget, feature development stops until reliability improves. They practice chaos engineering through DiRT (Disaster Recovery Testing) exercises where they deliberately break production systems to validate recovery procedures. They use load shedding extensively: during overload, they drop low-priority requests (like analytics) to preserve capacity for high-priority requests (like search queries). Their resiliency patterns are deeply integrated into their infrastructure—features like automatic failover and health checking are built into their load balancers and service mesh rather than requiring application-level implementation.
Amazon’s resiliency philosophy is “everything fails all the time.” They design systems assuming that any component can fail at any moment. Their microservices architecture uses bulkheads extensively—each service has isolated resources so that failures don’t cascade. They use cell-based architecture where the system is partitioned into independent cells, and a failure in one cell doesn’t affect others. During AWS outages, Amazon.com often remains operational because their services are distributed across multiple regions and can failover automatically. They practice game days where teams simulate major failures (like losing an entire region) to validate that their runbooks and automation work correctly. This continuous testing ensures that when real failures occur, recovery is automatic and fast.
Multi-Region Failover Architecture (Netflix-Style)
graph TB
subgraph User Layer
Users["Users<br/><i>Global</i>"]
DNS["Route 53 DNS<br/><i>Health-based routing</i>"]
end
subgraph East["Region: us-east-1 (PRIMARY)"]
LB1["Load Balancer<br/><i>Health checks every 10s</i>"]
subgraph SvcEast["Services - East"]
API1["API Service<br/><i>Circuit breakers</i>"]
Stream1["Streaming Service<br/><i>Bulkheads</i>"]
end
Cache1[("Redis Cache<br/><i>Cross-region replication</i>")]
DB1[("Cassandra<br/><i>Multi-region writes</i>")]
end
subgraph West["Region: us-west-2 (STANDBY)"]
LB2["Load Balancer<br/><i>Health checks every 10s</i>"]
subgraph SvcWest["Services - West"]
API2["API Service<br/><i>Circuit breakers</i>"]
Stream2["Streaming Service<br/><i>Bulkheads</i>"]
end
Cache2[("Redis Cache<br/><i>Cross-region replication</i>")]
DB2[("Cassandra<br/><i>Multi-region writes</i>")]
end
Chaos["Chaos Kong<br/><i>Simulates region failure</i>"]
Users --> DNS
DNS --"Primary route"--> LB1
DNS -."Failover (if us-east-1 unhealthy)".-> LB2
LB1 --> API1 & Stream1
LB2 --> API2 & Stream2
API1 & Stream1 --> Cache1 & DB1
API2 & Stream2 --> Cache2 & DB2
Cache1 <-."Async replication".-> Cache2
DB1 <-."Multi-region sync".-> DB2
Chaos -."Randomly terminates".-> East
Netflix-style multi-region architecture with automated failover. DNS health checks detect region failures and route traffic to healthy regions. Cassandra provides multi-region writes for data consistency. Chaos Kong randomly simulates region failures in production to validate failover mechanisms work correctly.
Interview Essentials
Mid-Level
At the mid-level, demonstrate that you understand basic resiliency patterns and when to apply them. When designing a system, proactively mention timeouts, retries, and circuit breakers—don’t wait for the interviewer to ask about failure handling. Explain the difference between transient failures (network glitches, temporary overload) and persistent failures (service down, database corrupted), and how different patterns address each. Show that you understand cascading failures: if Service A calls Service B which calls Service C, and C is slow, explain how that slowness propagates backward and how circuit breakers prevent it. Be able to calculate basic timeout values: if you have a 3-second user-facing timeout and call three services sequentially, each service needs a timeout under 1 second to leave time for network overhead. Mention that retries need exponential backoff and jitter to avoid thundering herd problems. When discussing databases, mention read replicas for resiliency and explain how failover works. Show awareness that resiliency has trade-offs: retries increase latency, circuit breakers can cause false positives, fallbacks might show stale data.
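The timeout-budget arithmetic mentioned above can be captured in a small helper: split the user-facing deadline across sequential downstream calls, reserving a slice for network and serialization overhead. The 10% overhead fraction is an illustrative assumption.

```python
def per_call_timeout(user_budget_s: float, sequential_calls: int,
                     overhead_fraction: float = 0.1) -> float:
    """Split a user-facing deadline across sequential downstream calls,
    reserving a fraction for network overhead (10% is an assumption)."""
    usable = user_budget_s * (1 - overhead_fraction)
    return usable / sequential_calls
```

For a 3-second user budget across three sequential calls, this yields 0.9 seconds per call, matching the "under 1 second" rule of thumb above. Real systems often go further and propagate the remaining deadline with each request so downstream services can give up early.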
Senior
Senior engineers should demonstrate deep understanding of how resiliency patterns interact and the trade-offs between them. When an interviewer asks about handling a failing dependency, don’t just say “use a circuit breaker”—explain the full strategy: circuit breaker to detect failures, bulkhead to isolate them, fallback to provide degraded functionality, and monitoring to alert on-call engineers. Discuss how you’d tune circuit breaker thresholds: too sensitive and you get false positives, too lenient and failures cascade. Explain retry budgets: if you retry three times, you’re amplifying load 4x on the downstream service, which can make failures worse. Show that you understand the difference between fail-fast (circuit breaker) and fail-safe (fallback) strategies and when each is appropriate. Discuss idempotency: retries are only safe if operations are idempotent, so you need idempotency keys for non-idempotent operations like payments. Mention chaos engineering: explain how you’d use fault injection to validate that your resiliency patterns actually work. Be able to discuss real-world examples: how would you handle an AWS region outage? How would you design a system to survive a DDoS attack? Show awareness of the CAP theorem and how it relates to resiliency: during network partitions, you must choose between consistency and availability.
Retry Budget and Load Amplification
graph TB
subgraph ORIG["Original Request Flow"]
Client1["Client"] --"100 req/s"--> Service1["Service A"]
Service1 --"100 req/s"--> Service2["Service B"]
end
subgraph NAIVE["With Naive Retries (3x)"]
Client2["Client"] --"100 req/s"--> ServiceA["Service A<br/><i>Retry 3x on failure</i>"]
ServiceA --"400 req/s<br/>(100 original + 300 retries)"--> ServiceB["Service B<br/><i>OVERLOADED</i>"]
ServiceB -."Fails due to overload".-> ServiceA
Note1["❌ 4x load amplification<br/>❌ Makes failure worse"]
end
subgraph BUDGET["With Retry Budget"]
Client3["Client"] --"100 req/s"--> ServiceA2["Service A<br/><i>Retry budget: 20%</i>"]
ServiceA2 --"120 req/s<br/>(100 original + 20 retries)"--> ServiceB2["Service B<br/><i>Can handle load</i>"]
ServiceA2 --"Remaining failures<br/>use fallback"--> Fallback["Fallback<br/><i>Cached data</i>"]
Note2["✓ 1.2x load amplification<br/>✓ Controlled retry rate<br/>✓ Fallback for excess failures"]
end
Naive retry logic (middle) amplifies load 4x when retrying every failure 3 times, overwhelming the downstream service and making failures worse. Retry budgets (bottom) limit retries to 20% of original traffic, preventing load amplification while using fallbacks for remaining failures. Critical for senior-level understanding of retry trade-offs.
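The retry-budget mechanism in the diagram can be sketched as a simple counter check: retries are admitted only while they stay under a fixed fraction of observed request volume. The 20% ratio matches the diagram; a production version would use a decaying time window rather than raw counters.

```python
class RetryBudget:
    """Admit retries only while they stay under `ratio` of observed
    request volume (sketch; real budgets use a rolling time window)."""

    def __init__(self, ratio=0.2):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self) -> bool:
        if self.retries + 1 > self.requests * self.ratio:
            return False   # budget exhausted: caller should use a fallback
        self.retries += 1
        return True
```

At 100 recorded requests and a 0.2 ratio, exactly 20 retries are admitted, bounding load amplification at 1.2x no matter how many calls fail.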
Staff+
Staff-plus engineers should demonstrate strategic thinking about resiliency across entire systems and organizations. Discuss how you’d build a resiliency culture: error budgets, blameless postmortems, chaos engineering as a regular practice. Explain how you’d prioritize resiliency investments: which services need the most resiliency (critical path, high traffic) versus which can tolerate more failures (batch jobs, analytics). Show understanding of resiliency economics: calculate the cost of downtime versus the cost of resiliency measures, and make data-driven decisions about how much resiliency is enough. Discuss organizational patterns: how do you ensure that teams actually implement resiliency patterns rather than just talking about them? Mention testing strategies: chaos engineering in production, game days, synthetic monitoring. Explain how you’d handle multi-region failover: DNS-based, load balancer-based, or application-based, and the trade-offs of each. Discuss the relationship between resiliency and incident response: resiliency patterns reduce incident frequency and severity, but you still need robust on-call practices and runbooks. Show awareness of emerging patterns: service meshes providing resiliency at the infrastructure layer, serverless architectures with built-in retries and scaling. Be able to critique resiliency decisions: when is eventual consistency acceptable versus when do you need strong consistency? When should you fail fast versus retry aggressively? Demonstrate that you can make nuanced trade-offs based on business requirements, not just apply patterns dogmatically.
Common Interview Questions
How would you design a system to handle a critical dependency (like a payment processor) going down?
Explain the difference between a circuit breaker and a retry with backoff. When would you use each?
Your service is experiencing cascading failures. Walk me through how you’d diagnose and fix the problem.
How would you design a multi-region system that can survive a full region outage?
What’s the difference between high availability and resiliency? Can you have one without the other?
How do you test that your resiliency patterns actually work?
Your retry logic is making failures worse by overwhelming the downstream service. How do you fix this?
Explain how you’d implement graceful degradation for a social media feed during a database outage.
What metrics would you monitor to detect resiliency problems before they cause outages?
How do you balance resiliency with latency? Adding retries and fallbacks increases response time.
Red Flags to Avoid
Not mentioning resiliency patterns at all—assuming the happy path and ignoring failure modes entirely
Only mentioning one pattern (like retries) without discussing how patterns work together
Suggesting unlimited retries or retries without backoff—this causes thundering herd problems
Not understanding the difference between transient and persistent failures
Proposing synchronous calls to many services without timeouts or circuit breakers—this creates cascading failure risk
Ignoring the impact of retries on downstream services—retries amplify load
Not considering idempotency when discussing retries—non-idempotent retries can cause duplicate operations
Treating resiliency as an afterthought rather than designing it in from the start
Not discussing how to test resiliency—untested resiliency patterns often fail when you need them
Confusing high availability (uptime) with resiliency (failure handling and recovery)
Key Takeaways
Resiliency is about detecting, isolating, and recovering from failures—not preventing them. Modern distributed systems will fail; resilient systems handle failures gracefully and recover quickly.
No single pattern provides complete resiliency. Effective resiliency comes from combining isolation patterns (bulkheads, circuit breakers), recovery patterns (retries, fallbacks), detection patterns (health checks, monitoring), and coordination patterns (load shedding, graceful degradation) into a defense-in-depth strategy.
Resiliency patterns have trade-offs that must be carefully balanced. Retries increase latency and amplify load. Circuit breakers can cause false positives. Fallbacks might show stale data. The right balance depends on your specific requirements—financial systems need different resiliency than social media feeds.
Testing is essential—untested resiliency patterns often fail when you need them most. Use chaos engineering to inject failures in production, fault injection testing in staging, and game days to practice incident response. Netflix’s Chaos Monkey approach of continuously breaking production systems ensures resiliency patterns are battle-tested.
Resiliency is broader than high availability or fault tolerance. Availability measures uptime, fault tolerance masks failures through redundancy, but resiliency encompasses detection, containment, recovery, and learning from failures. A truly resilient system combines all three concepts with robust monitoring, automated recovery, and a culture that expects and plans for failures.