Circuit Breaker Pattern: Stop Cascading Failures
After this topic, you will be able to:
- Implement circuit breaker state machines with appropriate threshold calculations
- Configure timeout, failure threshold, and half-open retry parameters for different failure scenarios
- Analyze circuit breaker metrics to tune sensitivity and recovery behavior
TL;DR
Circuit breakers prevent cascading failures by fast-failing requests to unhealthy dependencies instead of waiting for timeouts. When a service detects too many failures, it “opens” the circuit and immediately rejects requests for a cooldown period, then gradually tests recovery. This protects upstream callers from wasting resources on doomed requests and gives downstream services breathing room to recover.
Cheat Sheet: Closed (normal) → Open (fast-fail after threshold) → Half-Open (test recovery) → back to Closed or Open based on probe results.
The Problem It Solves
In distributed systems, a single slow or failing service can trigger a catastrophic cascade. Imagine your payment service calls a fraud detection API that suddenly starts timing out after 30 seconds due to a database issue. Every payment request now waits 30 seconds before failing, exhausting your thread pool. New requests queue up, memory spikes, and soon your entire payment service crashes—even though the fraud API is the only broken component. Worse, your retries hammer the already-struggling fraud service, preventing its recovery.
The core problem is optimistic blocking: callers assume dependencies will respond quickly and block resources (threads, connections, memory) waiting for responses. When a dependency degrades, this optimism becomes toxic. You waste precious resources on requests that will fail anyway, and your retries amplify the load on the struggling service. Traditional timeout-based error handling is too slow—by the time you detect the pattern of failures, you’ve already exhausted resources and contributed to the cascade.
Circuit breakers solve this by detecting failure patterns early and switching to pessimistic fast-failing. Instead of waiting for each request to time out individually, the circuit breaker tracks failure rates and proactively rejects requests when a threshold is crossed. This prevents resource exhaustion in the caller and reduces load on the failing service, giving it a chance to recover.
Cascade Failure Without Circuit Breaker
graph TB
subgraph Without Circuit Breaker
User1["👤 User Requests<br/><i>100 req/s</i>"]
Payment["Payment Service<br/><i>Thread Pool: 200</i>"]
Fraud["Fraud API<br/><i>⚠️ DB Overloaded</i>"]
User1 --"1. POST /payment"--> Payment
Payment --"2. Check fraud<br/>(30s timeout)"--> Fraud
Fraud -."3. Timeout after 30s".-> Payment
Payment --"4. Retry (30s)"--> Fraud
Fraud -."5. Timeout after 30s".-> Payment
Note1["⚠️ Problem:<br/>• 200 threads blocked for 60s each<br/>• Thread pool exhausted in 2 seconds<br/>• New requests queue up<br/>• Memory spikes, service crashes<br/>• Retries hammer struggling Fraud API"]
end
subgraph With Circuit Breaker
User2["👤 User Requests<br/><i>100 req/s</i>"]
Payment2["Payment Service<br/><i>Thread Pool: 200</i>"]
CB["Circuit Breaker<br/><i>State: OPEN</i>"]
Fraud2["Fraud API<br/><i>⚠️ DB Overloaded</i>"]
User2 --"1. POST /payment"--> Payment2
Payment2 --"2. Check fraud"--> CB
CB -."3. Fast-fail (<1ms)<br/>CircuitBreakerOpen".-> Payment2
Payment2 --"4. Fallback:<br/>Allow small payments"--> User2
Note2["✅ Solution:<br/>• No threads blocked<br/>• Thread pool stays healthy<br/>• Fast response to users<br/>• Fraud API gets breathing room<br/>• Gradual recovery testing"]
end
Comparison showing how a degraded fraud API causes cascade failure without circuit breakers (top) versus graceful degradation with circuit breakers (bottom). Without protection, blocked threads exhaust the payment service’s resources. With circuit breakers, requests fail fast and the service remains healthy.
Solution Overview
A circuit breaker wraps calls to external dependencies and monitors their success/failure rates in real-time. It operates as a state machine with three states: Closed (normal operation, requests pass through), Open (fast-fail mode, requests rejected immediately), and Half-Open (testing recovery with limited traffic).
When the circuit is Closed, requests flow normally but failures are counted. If failures exceed a configured threshold within a time window (e.g., 50% failure rate over 10 seconds, or 5 consecutive failures), the circuit “trips” to Open. In the Open state, all requests fail immediately with a circuit breaker exception—no network calls are made, no threads blocked, no timeouts waited. This protects the caller’s resources and stops hammering the failing dependency.
After a cooldown period (e.g., 30 seconds), the circuit transitions to Half-Open and allows a small number of probe requests through. If these probes succeed, the circuit closes and normal traffic resumes. If they fail, the circuit reopens and the cooldown timer resets. This gradual recovery prevents thundering herds where all callers simultaneously retry the moment a service recovers.
The pattern is inspired by electrical circuit breakers in buildings: when a fault is detected, the breaker trips to prevent damage, and you must manually (or automatically) test before restoring power. Netflix’s Hystrix library popularized this pattern in microservices, though Hystrix is now in maintenance mode in favor of lighter-weight alternatives like Resilience4j.
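The three-state machine described above fits in a few dozen lines. A minimal sketch in Python, assuming a consecutive-failure trip condition and a single Half-Open probe; the class and exception names are illustrative, not any library’s API:

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Trips after N consecutive failures, fast-fails during a cooldown,
    then allows a single probe request to test recovery."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                     # injectable for testing
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = State.HALF_OPEN   # cooldown elapsed: allow a probe
            else:
                # fast-fail: no network call, no thread blocked on a timeout
                raise RuntimeError("CircuitBreakerOpen")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.consecutive_failures += 1
        if (self.state is State.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = State.OPEN            # trip, or re-trip after failed probe
            self.opened_at = self.clock()

    def _on_success(self):
        self.consecutive_failures = 0
        self.state = State.CLOSED              # probe succeeded or normal operation
```

Injecting the clock keeps the breaker testable; a production implementation also needs thread safety and a richer definition of “failure” (timeouts, 5xx responses, connection errors).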
Circuit Breaker State Machine
stateDiagram-v2
[*] --> Closed
Closed --> Open: Failure threshold exceeded<br/>(e.g., 50% failures in 10s)
Open --> HalfOpen: Cooldown period elapsed<br/>(e.g., after 30s)
HalfOpen --> Closed: Probe requests succeed<br/>(e.g., 2 out of 3 pass)
HalfOpen --> Open: Probe requests fail<br/>(reset cooldown timer)
Closed: Normal Operation<br/>• Requests pass through<br/>• Track success/failure rates<br/>• Monitor threshold
Open: Fast-Fail Mode<br/>• Reject all requests immediately<br/>• No network calls made<br/>• Return fallback response
HalfOpen: Recovery Testing<br/>• Allow limited probe traffic<br/>• Test if service recovered<br/>• Decide next state
The circuit breaker operates as a three-state machine. It starts Closed (normal operation), trips to Open when failures exceed the threshold, waits for a cooldown period, then transitions to Half-Open to test recovery with probe requests before fully closing or reopening.
How It Works
Let’s walk through a concrete example: your order service calls an inventory service to check stock availability. You configure a circuit breaker with a 50% failure threshold over a 10-second rolling window, requiring at least 5 requests to trip, and a 30-second cooldown.
Step 1: Normal Operation (Closed State) Requests flow through the circuit breaker to the inventory service. The breaker tracks outcomes in a sliding window: [success, success, success, failure, success]. With 1 failure out of 5 requests (20%), the circuit remains Closed. Failures might be timeouts, HTTP 500s, or connection errors—anything you configure as a “failure.”
Step 2: Degradation Detected The inventory service’s database starts struggling. Over the next 10 seconds, you see: [failure, failure, success, failure, failure, failure]. That’s 5 failures out of 6 requests (83%), exceeding your 50% threshold. The circuit breaker trips to Open.
Step 3: Fast-Fail Protection (Open State)
New order requests immediately receive a CircuitBreakerOpenException without calling the inventory service. Your order service can now respond quickly with a fallback: “Unable to check inventory, please try again later” or use cached stock data. Critically, you’re not blocking threads waiting for 30-second timeouts, and you’re not sending traffic to the struggling inventory service. Your thread pool remains healthy, and the inventory service gets breathing room to recover.
Step 4: Recovery Testing (Half-Open State) After 30 seconds, the circuit transitions to Half-Open. The next request is allowed through as a probe. If it succeeds, the circuit closes immediately and normal traffic resumes. If it fails, the circuit reopens and waits another 30 seconds. Some implementations allow multiple probes (e.g., 3 requests) and require a majority to succeed before closing.
Step 5: Full Recovery or Re-Trip If the inventory service has recovered, probes succeed and the circuit closes. If the underlying issue persists, the circuit reopens, protecting your system from another wave of failures. This cycle continues until the dependency stabilizes.
The key insight is that the circuit breaker makes a collective decision based on aggregate failure rates, rather than treating each request independently. This allows fast detection and response to systemic issues.
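The collective decision in the walkthrough can be expressed as a sliding-window tracker. A sketch using the walkthrough’s parameters (50% threshold, 10-second window, minimum of 5 requests); the names are illustrative:

```python
import time
from collections import deque


class SlidingWindowTracker:
    """Rolling-window failure-rate tracker: the aggregate signal that
    decides whether the circuit should trip."""

    def __init__(self, window_seconds=10.0, failure_rate_threshold=0.5,
                 min_requests=5, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.failure_rate_threshold = failure_rate_threshold
        self.min_requests = min_requests
        self.clock = clock
        self.outcomes = deque()                # (timestamp, succeeded) pairs

    def record(self, succeeded):
        self.outcomes.append((self.clock(), succeeded))

    def _evict_expired(self):
        cutoff = self.clock() - self.window_seconds
        while self.outcomes and self.outcomes[0][0] < cutoff:
            self.outcomes.popleft()

    def should_trip(self):
        self._evict_expired()
        total = len(self.outcomes)
        if total < self.min_requests:          # not enough samples to judge
            return False
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / total > self.failure_rate_threshold
```

Fed the walkthrough’s second window of outcomes, the tracker reports 5 failures out of 6 requests and signals a trip.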
Circuit Breaker Preventing Cascade Failure
sequenceDiagram
participant User
participant OrderService
participant CircuitBreaker
participant InventoryService
Note over CircuitBreaker: State: CLOSED (Normal)
User->>OrderService: 1. Place Order
OrderService->>CircuitBreaker: 2. Check Stock
CircuitBreaker->>InventoryService: 3. GET /inventory/check
InventoryService-->>CircuitBreaker: 4. Success (200 OK)
CircuitBreaker-->>OrderService: 5. Stock Available
OrderService-->>User: 6. Order Confirmed
Note over InventoryService: Database starts struggling...
User->>OrderService: 7. Place Order
OrderService->>CircuitBreaker: 8. Check Stock
CircuitBreaker->>InventoryService: 9. GET /inventory/check
InventoryService--xCircuitBreaker: 10. Timeout (30s)
Note over CircuitBreaker: Failure count: 5/6 (83%)<br/>Threshold exceeded!
Note over CircuitBreaker: State: OPEN
User->>OrderService: 11. Place Order
OrderService->>CircuitBreaker: 12. Check Stock
Note over CircuitBreaker: Circuit OPEN - Fast Fail!
CircuitBreaker-->>OrderService: 13. CircuitBreakerOpenException<br/>(no network call, <1ms)
OrderService-->>User: 14. Fallback: Use cached stock data
Note over CircuitBreaker: After 30s cooldown...
Note over CircuitBreaker: State: HALF-OPEN
User->>OrderService: 15. Place Order
OrderService->>CircuitBreaker: 16. Check Stock (probe)
CircuitBreaker->>InventoryService: 17. GET /inventory/check
InventoryService-->>CircuitBreaker: 18. Success (200 OK)
Note over CircuitBreaker: Probe succeeded!<br/>State: CLOSED
CircuitBreaker-->>OrderService: 19. Stock Available
OrderService-->>User: 20. Order Confirmed
Timeline showing how a circuit breaker detects the inventory service degradation (steps 7-10), trips to Open state to fast-fail subsequent requests without blocking threads (steps 11-14), then tests recovery after cooldown (steps 15-20). This prevents the order service from exhausting resources waiting for timeouts.
Threshold Tuning
Configuring circuit breaker thresholds is more art than science, but you can use SLOs and observed failure patterns as guides. Start with these parameters:
Failure Threshold: Set this based on your dependency’s normal error rate. If your inventory service typically has a 1% error rate, a 50% threshold gives you a huge margin before tripping. For critical paths, use lower thresholds (20-30%) to fail fast. For less critical dependencies, use higher thresholds (60-70%) to avoid false positives. A simple formula: threshold = baseline_error_rate × acceptable_degradation_multiplier. If the baseline is 2% and you’ll tolerate 10x degradation, threshold = 2% × 10 = 20%.
Minimum Request Volume: Require a minimum number of requests before the circuit can trip (e.g., 10 requests in the window). This prevents tripping on a single unlucky failure during low-traffic periods. Set it high enough to be statistically meaningful but low enough that normal traffic reaches it within a small fraction of the window, with a floor of 5-10.
Time Window: Use a rolling window a few multiples of your dependency’s P99 latency. If the inventory service normally responds in 200ms (P99), a 500ms-1s window captures enough samples to detect patterns without reacting too slowly. For slower dependencies (5s P99), use 10-15s windows.
Cooldown Period: Set this to your dependency’s expected recovery time plus a buffer. If the inventory service typically recovers from database issues in 10-20 seconds, use a 30-second cooldown. Too short and you’ll re-trip immediately; too long and you delay returning to normal traffic after the dependency recovers. Some implementations apply exponential backoff to successive trips: 30s, then 60s, then 120s.
Half-Open Probe Count: Allow 1-3 probe requests. A single probe is fastest but risky if the service is flapping. Three probes with a “2 out of 3 must succeed” rule is more robust. Calculate based on your confidence interval: if you want 95% confidence the service is healthy, use more probes.
Example calculation for a payment service calling a fraud API with a 99.9% SLO:
- Baseline error rate: 0.1% (from SLO)
- Acceptable degradation: 50x (5% failure rate)
- Threshold: 5%
- Min requests: 20 (at 100 req/s, this is 0.2s of traffic)
- Window: 10s (fraud API P99 is 3s)
- Cooldown: 60s (fraud API recovery typically takes 30-45s)
- Probes: 3, requiring 2 successes
Monitor false positive rates (circuit trips when service is actually healthy) and false negatives (circuit doesn’t trip when it should). Tune thresholds based on these metrics.
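The heuristics above can be bundled into a small starting-point calculator. This sketch encodes the rules of thumb from this section (a multiplicative threshold, a minimum volume of roughly 0.2 seconds of traffic, and recovery time plus a buffer for cooldown); it is not any library’s configuration API:

```python
def circuit_breaker_config(slo_availability, degradation_multiplier,
                           traffic_rps, window_seconds,
                           recovery_seconds, buffer_seconds=15.0):
    """Derive starting circuit breaker parameters from SLO numbers.
    The constants are heuristics from this section, not universal rules."""
    baseline_error_rate = 1.0 - slo_availability         # 0.001 for a 99.9% SLO
    threshold = baseline_error_rate * degradation_multiplier
    # Minimum volume: ~0.2s of traffic, floored and capped to stay sane
    min_requests = max(5, min(50, int(traffic_rps * 0.2)))
    cooldown = recovery_seconds + buffer_seconds         # recovery time + buffer
    return {
        "failure_rate_threshold": threshold,
        "min_requests": min_requests,
        "window_seconds": window_seconds,
        "cooldown_seconds": cooldown,
    }
```

Plugging in the fraud-API example (99.9% SLO, 50x tolerated degradation, 100 req/s, 45s worst-case recovery) reproduces the 5% threshold, 20-request minimum, and 60-second cooldown above.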
Threshold Calculation Decision Tree
flowchart TB
Start(["Configure Circuit Breaker<br/>for Dependency"])
Start --> Q1{"What's the baseline<br/>error rate?"}
Q1 -->|"From SLO/Metrics"| Baseline["Baseline Error Rate<br/><i>e.g., 0.1% for 99.9% SLO</i>"]
Baseline --> Q2{"How critical is<br/>this dependency?"}
Q2 -->|"Critical Path<br/>(payment, auth)"| Critical["Low Tolerance<br/>Multiplier: 10-20x<br/><i>Threshold = 1-2%</i>"]
Q2 -->|"Important<br/>(recommendations)"| Important["Medium Tolerance<br/>Multiplier: 50x<br/><i>Threshold = 5%</i>"]
Q2 -->|"Nice-to-have<br/>(analytics)"| NiceToHave["High Tolerance<br/>Multiplier: 100x<br/><i>Threshold = 10%</i>"]
Critical --> Q3{"What's expected<br/>traffic rate?"}
Important --> Q3
NiceToHave --> Q3
Q3 -->|"High (>10 req/s)"| HighTraffic["Min Requests: 20-50<br/>Window: 5-10s<br/><i>Fast detection</i>"]
Q3 -->|"Medium (1-10 req/s)"| MedTraffic["Min Requests: 10-20<br/>Window: 10-30s<br/><i>Balance speed/accuracy</i>"]
Q3 -->|"Low (<1 req/s)"| LowTraffic["Min Requests: 5-10<br/>Window: 60s+<br/><i>Avoid false positives</i>"]
HighTraffic --> Q4{"What's dependency<br/>recovery time?"}
MedTraffic --> Q4
LowTraffic --> Q4
Q4 -->|"Fast (<10s)"| FastRecovery["Cooldown: 15-30s<br/>Probes: 1-2<br/><i>Quick return to normal</i>"]
Q4 -->|"Medium (10-60s)"| MedRecovery["Cooldown: 30-60s<br/>Probes: 2-3<br/><i>Standard config</i>"]
Q4 -->|"Slow (>60s)"| SlowRecovery["Cooldown: 60-120s<br/>Probes: 3-5<br/><i>Gradual recovery</i>"]
FastRecovery --> End(["Monitor & Tune<br/>Based on Metrics"])
MedRecovery --> End
SlowRecovery --> End
Decision tree for calculating circuit breaker thresholds based on dependency characteristics. Start with baseline error rate from SLOs, adjust for criticality, then configure window and cooldown based on traffic patterns and recovery time. This systematic approach prevents both false positives and delayed detection.
Variants
1. Count-Based Circuit Breaker Tracks the last N requests (e.g., 100) and trips when X fail (e.g., 50). Simple to implement with a ring buffer. Use when traffic is steady and predictable. Downside: it doesn’t account for time; 50 failures over 1 second is very different from 50 failures over 10 minutes. Resilience4j’s default sliding window is count-based.
2. Time-Based Circuit Breaker Tracks failures within a sliding time window (e.g., 50% failure rate in the last 10 seconds). More adaptive to varying traffic patterns. Use when traffic is bursty or unpredictable. Requires more complex bookkeeping (e.g., time-bucketed counters). Netflix’s Hystrix used this approach with a 10-second rolling window; Resilience4j also offers a time-based window as an alternative to its count-based default.
3. Adaptive Circuit Breaker Dynamically adjusts thresholds based on observed baseline error rates. If your dependency normally has a 2% error rate, the circuit might trip at 10%. If error rate drops to 0.5%, the circuit becomes more sensitive and trips at 5%. Use for dependencies with variable but predictable behavior. Typically built on moving-average baselines or statistical process control rather than fixed constants. Google’s SRE teams use adaptive thresholds for some internal services.
4. Per-Endpoint Circuit Breaker
Maintains separate circuit states for different endpoints of the same service (e.g., /inventory/check vs /inventory/reserve). Use when different endpoints have different reliability characteristics. Downside: more memory and complexity. Stripe uses per-endpoint breakers for their payment APIs.
5. Gradual Recovery Circuit Breaker Instead of binary Open/Closed states, gradually increases allowed traffic in Half-Open: 10%, 25%, 50%, 100%. Use for high-traffic systems where a full cutover might overwhelm a recovering service. Requires more sophisticated traffic shaping. Uber uses gradual recovery for critical path services.
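Variant 1’s ring buffer is simple enough to sketch directly. An illustrative count-based window that keeps an O(1) incremental failure count:

```python
class CountBasedWindow:
    """Ring buffer over the last N request outcomes. Trips on failure
    count alone, regardless of how much wall-clock time the N requests
    span (the variant's stated downside)."""

    def __init__(self, size=100, max_failures=50):
        self.size = size
        self.max_failures = max_failures
        self.buffer = [True] * size        # True = success; seeded optimistically
        self.index = 0
        self.failures = 0

    def record(self, succeeded):
        evicted = self.buffer[self.index]
        self.buffer[self.index] = succeeded
        self.index = (self.index + 1) % self.size
        # Adjust the running count by what entered minus what left: O(1)
        self.failures += (not succeeded) - (not evicted)

    def should_trip(self):
        return self.failures >= self.max_failures
```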
Trade-offs
Responsiveness vs. Stability
- Aggressive thresholds (low failure rate, short window): Detect issues quickly and protect callers, but risk false positives during transient blips. Use for critical dependencies where cascading failures are catastrophic.
- Conservative thresholds (high failure rate, long window): Tolerate transient issues without tripping, but slower to detect systemic problems. Use for non-critical dependencies or when false positives are expensive (e.g., customer-facing errors).
- Decision: Set thresholds based on blast radius. For dependencies in the critical path (payment processing), be aggressive. For nice-to-have features (recommendation engine), be conservative.
Fast Recovery vs. Thundering Herd
- Short cooldown + single probe: Recover quickly when the service is healthy, but risk overwhelming a fragile service with a sudden traffic spike. Use when your dependency can handle full load immediately.
- Long cooldown + multiple probes: Gradual recovery prevents thundering herds, but delays return to normal operation. Use when your dependency needs time to warm up (e.g., cold caches, connection pools).
- Decision: Match cooldown to your dependency’s recovery profile. Stateless services can handle short cooldowns; stateful services with caches need longer ramp-up.
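One way to combine jittered cooldowns with gradual traffic ramp-up, as a sketch (the 20% jitter and the doubling backoff are illustrative choices, not a specific library’s behavior):

```python
import random


def jittered_cooldown(base_seconds, trip_count, max_seconds=120.0, jitter=0.2):
    """Exponential backoff on successive trips, with random jitter so many
    client instances don't all probe the recovering service at once."""
    backoff = min(max_seconds, base_seconds * (2 ** (trip_count - 1)))
    return backoff * random.uniform(1.0 - jitter, 1.0 + jitter)


def ramp_admit(fraction):
    """Gradual recovery: admit only `fraction` of traffic while the
    dependency warms back up (cold caches, empty connection pools)."""
    return random.random() < fraction
```

A typical ramp schedule calls ramp_admit with 0.10, then 0.25, 0.50, and finally 1.0 as successive probe batches keep succeeding.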
Granularity vs. Complexity
- Single circuit per service: Simple to implement and reason about, but a failure in one endpoint trips the circuit for all endpoints. Use for small services with uniform behavior.
- Circuit per endpoint: More precise isolation, but higher memory overhead and configuration complexity. Use for large services with heterogeneous endpoints.
- Decision: Start with service-level circuits. Add per-endpoint circuits only if you observe one bad endpoint taking down unrelated functionality.
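Per-endpoint granularity usually reduces to a registry keyed by (service, endpoint). A minimal sketch; make_breaker stands in for whichever breaker factory you use:

```python
from collections import defaultdict


class BreakerRegistry:
    """One breaker per (service, endpoint) pair, so a failing
    /inventory/reserve can't trip the circuit for /inventory/check."""

    def __init__(self, make_breaker):
        # defaultdict lazily creates a breaker the first time a key is seen
        self.breakers = defaultdict(make_breaker)

    def for_endpoint(self, service, endpoint):
        return self.breakers[(service, endpoint)]
```

The memory cost the text mentions is visible here: one breaker object per distinct endpoint ever called.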
Client-Side vs. Server-Side
- Client-side breakers: Each caller maintains its own circuit state. Faster reaction (no network round-trip), but inconsistent state across clients. Use for low-latency requirements.
- Server-side breakers: Centralized state, consistent decisions, but adds latency and a new failure point. Use when you need coordinated behavior (e.g., rate limiting).
- Decision: Default to client-side for simplicity. Use server-side only when you need global coordination or the client is untrusted.
When to Use (and When Not To)
Use circuit breakers when:
- You have synchronous dependencies with variable latency. If your service calls external APIs, databases, or other microservices over the network, circuit breakers prevent slow dependencies from exhausting your resources. Asynchronous message queues don’t need circuit breakers—backpressure mechanisms handle overload.
- Cascading failures are a real risk. In systems with deep call chains (A → B → C → D), a failure in D can cascade back through C, B, and A. Circuit breakers at each hop contain the blast radius. If your architecture is simple (e.g., a monolith with a single database), cascading failures are less likely.
- Your dependencies have unpredictable failure modes. If a dependency can fail slowly (timeouts, retries, connection pool exhaustion), circuit breakers provide fast-fail protection. If failures are always fast (immediate connection refused), timeouts alone may suffice.
- You can provide meaningful fallbacks. Circuit breakers are most valuable when you can degrade gracefully: serve stale data, use default values, or return partial results. If you must fail the entire request anyway, circuit breakers still help by failing fast and preserving resources.
Don’t use circuit breakers when:
- Failures are always transient and fast. If your dependency fails quickly (e.g., immediate 503 responses) and recovers within milliseconds, circuit breakers add overhead without benefit. Just retry.
- You need every request to succeed. For write operations where data loss is unacceptable (e.g., financial transactions), circuit breakers that fast-fail are dangerous. Use retries with idempotency instead, or queue writes for later processing.
- Your traffic is too low. With 1 request per minute, you’ll never accumulate enough samples to make informed decisions. Circuit breakers need steady traffic to detect patterns.
- The dependency is internal and highly reliable. If you’re calling a local cache or an in-process library, circuit breakers are overkill. Save them for network boundaries.
Real-World Examples
Netflix: Hystrix and the Streaming API Netflix pioneered circuit breakers in microservices with their Hystrix library, protecting their streaming API from cascading failures. When a user opens the Netflix app, the API aggregates data from dozens of microservices: user profile, recommendations, continue watching, etc. If the recommendations service degrades, Hystrix’s circuit breaker trips after 50% of requests fail in a 10-second window. The API immediately returns a fallback (generic recommendations or an empty carousel) without waiting for timeouts. This prevented a single slow service from taking down the entire homepage. Netflix observed that circuit breakers reduced P99 latency by 40% during incidents and prevented three major outages in 2014. They’ve since moved to lighter-weight alternatives (Resilience4j, Envoy), but the pattern remains core to their architecture.
Uber: Preventing Payment Cascades Uber’s payment service uses circuit breakers to protect against failures in their fraud detection system. When a rider completes a trip, the payment service calls the fraud API to validate the transaction. During a 2018 incident, the fraud API’s database became overloaded, causing 30-second timeouts. Without circuit breakers, every payment request would have blocked for 30 seconds, exhausting Uber’s payment service thread pool and preventing all transactions. Instead, circuit breakers detected the 80% failure rate within 5 seconds and tripped. The payment service switched to a fallback: allow transactions under $50 without fraud checks, queue larger transactions for async processing. This kept 90% of payments flowing during the 20-minute incident. Uber’s circuit breakers use adaptive thresholds: during peak hours (Friday nights), they’re more sensitive to catch issues before they cascade.
Amazon: DynamoDB Client-Side Breakers Amazon’s DynamoDB client libraries include built-in circuit breakers to protect against partition-level failures. When a DynamoDB partition becomes hot (too many requests to a single key), it starts throttling requests with 400 errors. Without circuit breakers, clients would retry aggressively, making the throttling worse. The DynamoDB client detects throttling patterns (50% throttle rate over 1 second) and opens the circuit for that specific partition key. Requests to other partitions continue normally. After a 10-second cooldown, the client sends probe requests with exponential backoff. This partition-level granularity prevents one hot key from affecting the entire table. Amazon’s internal metrics show this reduces retry storms by 70% and improves P99 latency during throttling events by 60%.
Netflix Hystrix Architecture
graph LR
subgraph Client: Netflix Mobile App
User["👤 User<br/><i>Opens App</i>"]
end
subgraph Netflix API Gateway
API["API Aggregator<br/><i>Thread Pool: 500</i>"]
CB1["Circuit Breaker<br/><i>Profile Service</i>"]
CB2["Circuit Breaker<br/><i>Recommendations</i>"]
CB3["Circuit Breaker<br/><i>Continue Watching</i>"]
CB4["Circuit Breaker<br/><i>Trending</i>"]
end
subgraph Microservices
Profile["Profile Service<br/><i>✅ Healthy</i>"]
Recs["Recommendations<br/><i>⚠️ DB Overloaded</i>"]
Continue["Continue Watching<br/><i>✅ Healthy</i>"]
Trending["Trending<br/><i>✅ Healthy</i>"]
end
User --"1. GET /homepage"--> API
API --"2. Fetch user"--> CB1
API --"3. Get recs"--> CB2
API --"4. Get continue"--> CB3
API --"5. Get trending"--> CB4
CB1 --"Request"--> Profile
Profile --"Success"--> CB1
CB2 -."Circuit OPEN<br/>Fast-fail".-> API
Note1["Fallback:<br/>Generic recommendations<br/>or empty carousel"]
CB2 -.-> Note1
CB3 --"Request"--> Continue
Continue --"Success"--> CB3
CB4 --"Request"--> Trending
Trending --"Success"--> CB4
CB1 --"User data"--> API
CB2 -."Fallback data".-> API
CB3 --"Continue data"--> API
CB4 --"Trending data"--> API
API --"6. Aggregated response<br/>(partial data, fast)"--> User
Netflix’s Hystrix circuit breaker architecture protecting the API gateway from a failing recommendations service. When the recommendations circuit opens due to database overload, the API returns a fallback (generic recommendations) without blocking threads. Other services continue operating normally, preventing a single failure from cascading to the entire homepage.
Interview Essentials
Mid-Level
Explain the three states (Closed, Open, Half-Open) and when transitions occur. Describe a scenario where a circuit breaker prevents cascading failures: “If our payment service calls a fraud API that starts timing out, the circuit breaker detects the pattern of failures and trips to Open. Now payment requests fail fast instead of blocking threads waiting for timeouts. After a cooldown, we test recovery in Half-Open before resuming normal traffic.” Implement a simple count-based circuit breaker with a failure threshold and cooldown timer. Discuss how circuit breakers interact with retries: “Circuit breakers and retries are complementary. Retries handle transient failures; circuit breakers handle systemic failures. If retries aren’t working (failure rate stays high), the circuit breaker trips to stop wasting resources.”
Senior
Design circuit breaker configurations for different dependency types: fast vs. slow, critical vs. non-critical. Calculate appropriate thresholds based on SLOs: “If our fraud API has a 99.9% SLO (0.1% error rate) and we’ll tolerate 50x degradation, we set the threshold at 5%. With 100 req/s traffic, we need at least 20 requests in our window to avoid false positives.” Explain trade-offs between client-side and server-side breakers, and when to use per-endpoint vs. per-service granularity. Discuss monitoring: “We track circuit breaker state transitions, false positive rates (trips when service is healthy), and false negatives (doesn’t trip when it should). We alert on Open state duration and Half-Open probe failure rates.” Describe how circuit breakers fit into a broader resilience strategy with bulkheads, rate limiting, and timeouts.
Staff+
Design a distributed circuit breaker system that coordinates state across multiple instances without a central coordinator. Discuss consistency trade-offs: “Client-side breakers react faster but can have inconsistent state. We can use gossip protocols or shared state in Redis, but that adds latency and a new failure point. For most use cases, eventual consistency is fine—if 10% of instances haven’t tripped yet, they’ll trip within seconds.” Explain adaptive threshold algorithms: “We track baseline error rates with exponential moving averages and trip when current rate exceeds baseline by a configurable multiplier. This adapts to dependencies with variable but predictable behavior.” Discuss organizational challenges: “Circuit breakers require cultural buy-in. Teams must design fallbacks, accept that some requests will fail fast, and resist the urge to disable breakers during incidents. We enforce this with SLO-based alerts: if your service’s error rate spikes when a dependency’s circuit opens, you need better fallbacks.” Describe how to prevent thundering herds during recovery: “We use jittered cooldowns and gradual traffic ramp-up. When a circuit closes, we don’t immediately send 100% traffic. We start at 10%, monitor error rates, then increase to 25%, 50%, 100% over 30 seconds.”
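The exponential-moving-average idea quoted above can be sketched directly (illustrative names; alpha and the floor are tuning choices):

```python
class AdaptiveThreshold:
    """Track the baseline error rate with an exponential moving average
    and trip when the current rate exceeds baseline by a multiplier.
    Assumes `update` is fed one failure-rate sample per window."""

    def __init__(self, multiplier=10.0, alpha=0.05, floor=0.01):
        self.multiplier = multiplier
        self.alpha = alpha            # small alpha = slow-moving baseline
        self.floor = floor            # keep the trip point from collapsing to ~0
        self.baseline = None

    def update(self, error_rate):
        if self.baseline is None:
            self.baseline = error_rate
        else:
            self.baseline = (1 - self.alpha) * self.baseline + self.alpha * error_rate

    def should_trip(self, current_error_rate):
        if self.baseline is None:
            return False              # no baseline yet: stay conservative
        return current_error_rate > max(self.floor, self.baseline * self.multiplier)
```

With a 0.2% baseline and a 10x multiplier, the trip point sits at 2%; as the baseline drifts, the trip point drifts with it.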
Common Interview Questions
How do you prevent false positives (circuit trips when service is actually healthy)?
What’s the difference between a circuit breaker and a timeout?
How do you handle write operations with circuit breakers (can’t just drop writes)?
Should circuit breakers be per-instance or shared across all instances?
How do you test circuit breaker behavior in production without causing outages?
Red Flags to Avoid
Saying circuit breakers eliminate the need for timeouts (you need both—timeouts for individual requests, breakers for aggregate patterns)
Not discussing fallback strategies (circuit breakers are only useful if you can degrade gracefully)
Claiming circuit breakers solve all resilience problems (they’re one tool in a broader strategy)
Ignoring the thundering herd problem during recovery (all clients retrying simultaneously)
Not mentioning monitoring and alerting (circuit breaker state is a critical operational signal)
Key Takeaways
Circuit breakers prevent cascading failures by detecting aggregate failure patterns and fast-failing requests instead of waiting for timeouts. They operate as a state machine: Closed (normal) → Open (fast-fail) → Half-Open (test recovery).
Configure thresholds based on your dependency’s baseline error rate and SLOs. Use aggressive thresholds (low failure rate, short window) for critical dependencies; conservative thresholds for non-critical ones. Require minimum request volumes to avoid false positives during low traffic.
Circuit breakers are most valuable when you can provide meaningful fallbacks (stale data, defaults, partial results). For write operations where data loss is unacceptable, use retries with idempotency instead of fast-failing.
Prevent thundering herds during recovery with gradual traffic ramp-up and jittered cooldowns. Don’t send 100% traffic the moment a circuit closes—test with 10%, then 25%, 50%, 100% over time.
Monitor circuit breaker state transitions, false positive rates, and Half-Open probe failures. Alert on prolonged Open states and tune thresholds based on observed patterns. Circuit breakers are operational signals, not just error handlers.