Circuit Breaker for High-Availability Systems

Intermediate · 30 min read · Updated 2026-02-11

TL;DR

Circuit breakers prevent cascading failures in distributed systems by automatically stopping requests to failing services, giving them time to recover. When error rates exceed a threshold, the breaker “trips” to open state, immediately rejecting requests without attempting the call. After a timeout, it enters half-open state to test if the service has recovered. This pattern is essential for building resilient microservices that fail fast rather than exhausting resources waiting for unresponsive dependencies.

Cheat Sheet: Closed (normal) → Open (failing, reject immediately) → Half-Open (testing recovery) → back to Closed or Open based on test results.

The Analogy

Think of a circuit breaker in your home’s electrical panel. When too much current flows through a circuit (like plugging in too many appliances), the breaker automatically trips to prevent wires from overheating and causing a fire. You can’t use that circuit until you flip the breaker back on—but you shouldn’t do that immediately because the problem might still exist. Similarly, when a service starts failing, the circuit breaker stops sending requests to prevent resource exhaustion. After waiting for the service to potentially recover, it cautiously tests whether things are working again before fully restoring traffic.

Why This Matters in Interviews

Circuit breakers come up in almost every high availability and microservices discussion. Interviewers want to see that you understand how to prevent cascading failures—one of the most common causes of large-scale outages. They’re testing whether you think about failure modes proactively, not just happy-path scenarios. Strong candidates explain the state transitions clearly, discuss timeout and threshold tuning, and connect circuit breakers to broader resilience patterns like bulkheads and retries. The pattern appears in questions about designing API gateways, service meshes, payment systems, and any architecture with external dependencies.


Core Concept

The circuit breaker pattern protects distributed systems from cascading failures by monitoring requests to external services and automatically stopping traffic when those services become unhealthy. Named after electrical circuit breakers that prevent overload, this pattern prevents a failing service from consuming resources across your entire system. Without circuit breakers, when Service A calls failing Service B, threads in Service A get blocked waiting for timeouts. As requests pile up, Service A exhausts its thread pool and becomes unresponsive, which then cascades to services calling Service A.

The pattern works by wrapping service calls in a circuit breaker object that tracks success and failure rates. When failures exceed a configured threshold within a time window, the breaker “opens” and immediately rejects subsequent requests without attempting the call. This fail-fast behavior prevents resource exhaustion and gives the failing service time to recover. After a cooldown period, the breaker enters a half-open state to test whether the service has recovered. If test requests succeed, the breaker closes and normal traffic resumes; if they fail, it reopens.
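The wrapping described above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation; the names (CircuitBreaker, cooldown_seconds) and the simple consecutive-failure trip rule are invented for the example.

```python
# Minimal circuit breaker sketch (illustrative names, not a real library's API).
# States: CLOSED -> OPEN after repeated failures -> HALF_OPEN after a cooldown.
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold   # consecutive failures before tripping
        self.cooldown_seconds = cooldown_seconds     # how long to stay open
        self.state = CLOSED
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = HALF_OPEN          # cooldown elapsed: allow a probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = CLOSED                     # probe succeeded, or normal success

    def _on_failure(self):
        self.failure_count += 1
        if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = OPEN                   # trip, or re-trip after a failed probe
            self.opened_at = time.monotonic()
```

A real implementation would use a sliding-window failure rate rather than a consecutive-failure counter, and would need locking for concurrent callers.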

Circuit breakers are fundamental to building resilient microservices because they isolate failures and prevent them from propagating. They’re typically implemented at the client side of service calls, meaning each service that makes outbound requests maintains circuit breakers for its dependencies. This decentralized approach ensures that failures in one part of the system don’t bring down unrelated services. Modern implementations often include metrics, dashboards, and alerts so teams can monitor circuit breaker states and respond to systemic issues.

How It Works

Step 1: Closed State (Normal Operation) The circuit breaker starts in the closed state, allowing all requests to pass through to the downstream service. It monitors each request’s outcome—success or failure—and maintains a sliding window of recent results. For example, it might track the last 100 requests or all requests in the past 60 seconds. During this state, the system operates normally, but the breaker is constantly evaluating whether the failure rate exceeds its threshold. If a request fails (timeout, exception, or error response), the breaker increments its failure counter.

Step 2: Threshold Evaluation The circuit breaker continuously calculates the failure rate using its configured window. Common configurations include “5 failures out of the last 10 requests” or “error rate above 50% over the past minute.” The threshold should be tuned based on your service’s normal behavior—some services naturally have higher error rates than others. When the failure rate crosses the threshold, the breaker immediately transitions to the open state. This transition happens instantly; there’s no gradual degradation. The breaker also starts a timer for the open state duration.
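The threshold evaluation can be sketched as a small helper that records outcomes in a fixed-size window and reports whether the failure rate crosses the limit. Names (FailureWindow, min_calls) are illustrative; the min_calls guard reflects the common practice of not tripping on a tiny sample.

```python
# Sliding-window threshold check sketch (illustrative, not from a real library).
from collections import deque

class FailureWindow:
    def __init__(self, window_size=10, max_failure_rate=0.5, min_calls=5):
        self.outcomes = deque(maxlen=window_size)  # True = success, False = failure
        self.max_failure_rate = max_failure_rate
        self.min_calls = min_calls                 # don't trip on too few samples

    def record(self, success: bool):
        self.outcomes.append(success)

    def should_trip(self) -> bool:
        if len(self.outcomes) < self.min_calls:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) > self.max_failure_rate
```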

Step 3: Open State (Failing Fast) Once open, the circuit breaker immediately rejects all requests without attempting to call the downstream service. Instead of waiting for timeouts or network errors, it returns a predefined error response (like HTTP 503 Service Unavailable) or executes a fallback function. This fail-fast behavior is crucial: it prevents thread exhaustion, reduces latency for callers, and stops hammering the already-struggling downstream service. The breaker remains open for a configured duration (typically 30-60 seconds), giving the downstream service time to recover. During this period, monitoring systems should alert on-call engineers about the open circuit.
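A sketch of the fail-fast check, with hypothetical names; the exception would typically be translated to an HTTP 503 at the API boundary.

```python
# Fail-fast sketch for the open state: reject immediately instead of letting
# the caller block until a network timeout. All names are illustrative.
class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service; maps to HTTP 503."""

def guarded_call(state: str, func):
    if state == "open":
        # Rejected in microseconds, not after a 30-second timeout.
        raise CircuitOpenError("circuit open: request rejected")
    return func()
```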

Step 4: Half-Open State (Testing Recovery) After the open-state timeout expires, the breaker transitions to half-open state. In this state, it allows a limited number of test requests through to determine if the downstream service has recovered. For example, it might allow 3 requests through while blocking the rest. If these test requests succeed, the breaker assumes the service is healthy and transitions back to closed state, resuming normal traffic. If any test request fails, the breaker immediately reopens and starts another timeout period. This cautious approach prevents overwhelming a service that’s still struggling.
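The half-open decision logic can be sketched as a small probe tracker (illustrative names; real libraries also limit how many callers may probe concurrently).

```python
# Half-open probe sketch: allow a fixed number of test requests; close only
# if all succeed, reopen on the first failure. Names are illustrative.
class HalfOpenProbe:
    def __init__(self, required_successes=3):
        self.required = required_successes
        self.successes = 0
        self.decision = None    # "close", "reopen", or None while still testing

    def record(self, success: bool):
        if self.decision is not None:
            return self.decision
        if not success:
            self.decision = "reopen"      # any failure reopens immediately
        else:
            self.successes += 1
            if self.successes >= self.required:
                self.decision = "close"   # enough evidence of recovery
        return self.decision
```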

Step 5: State Transition and Monitoring The circuit breaker continuously cycles through these states based on service health. Modern implementations expose metrics for each state transition, time spent in each state, and the number of rejected requests. These metrics feed into dashboards and alerting systems. Teams should monitor circuit breaker patterns: frequent open/close cycles might indicate an unstable service, while a breaker stuck open suggests a serious outage. The breaker’s state should also be exposed via health check endpoints so load balancers and orchestration systems can route traffic appropriately.
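A sketch of observable transitions, assuming a simple listener hook; a metrics library would register counters or log emitters here rather than appending to a list.

```python
# Observability sketch: notify listeners on every state transition so
# metrics and alerting can hook in. The hook names are assumptions.
class ObservableState:
    def __init__(self):
        self.state = "closed"
        self.listeners = []

    def on_transition(self, callback):
        self.listeners.append(callback)

    def transition(self, new_state):
        old, self.state = self.state, new_state
        for cb in self.listeners:
            cb(old, new_state)   # e.g. increment a counter, emit a log line

events = []
s = ObservableState()
s.on_transition(lambda old, new: events.append((old, new)))
s.transition("open")
s.transition("half_open")
```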

Circuit Breaker State Transition Flow

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Failure threshold exceeded<br/>(e.g., 5 failures in 10 requests)
    Open --> HalfOpen: Timeout expires<br/>(e.g., after 30 seconds)
    HalfOpen --> Closed: Test requests succeed<br/>(e.g., 3/3 successful)
    HalfOpen --> Open: Test request fails<br/>(any failure)
    Open --> Open: Requests rejected immediately<br/>(fail fast)
    Closed --> Closed: Normal operation<br/>(monitoring failures)
    
    note right of Closed
        All requests pass through
        Tracking success/failure rate
    end note
    
    note right of Open
        All requests rejected
        No calls to downstream service
        Cooldown timer running
    end note
    
    note right of HalfOpen
        Limited test requests allowed
        Determining if service recovered
    end note

The circuit breaker cycles through three states based on downstream service health. Closed state monitors failures, Open state rejects requests immediately, and Half-Open state cautiously tests for recovery. Many implementations also lengthen successive open-state timeouts (exponential backoff) after repeated Half-Open failures, to avoid hammering a struggling service.

Request Flow Through Circuit Breaker States

graph LR
    subgraph CS["Closed State - Normal Operation"]
        Client1[Client] --"1. Request"--> CB1[Circuit Breaker<br/>Closed]
        CB1 --"2. Forward request"--> Service1[Downstream Service]
        Service1 --"3. Response (success)"--> CB1
        CB1 --"4. Return response"--> Client1
        CB1 -."Track: 95% success rate".-> Monitor1[Metrics]
    end
    
    subgraph OS["Open State - Failing Fast"]
        Client2[Client] --"1. Request"--> CB2[Circuit Breaker<br/>Open]
        CB2 --"2. Reject immediately<br/>(503 Service Unavailable)"--> Client2
        Service2[Downstream Service<br/>Not Called]
        CB2 -."Wait for timeout (30s)".-> Timer[Cooldown Timer]
    end
    
    subgraph HO["Half-Open State - Testing Recovery"]
        Client3[Client] --"1. Test request (1 of 3)"--> CB3[Circuit Breaker<br/>Half-Open]
        CB3 --"2. Forward test request"--> Service3[Downstream Service]
        Service3 --"3. Response"--> CB3
        CB3 --"4. Evaluate results"--> Decision{All tests<br/>passed?}
        Decision --"Yes"--> Close[Transition to Closed]
        Decision --"No"--> Reopen[Transition to Open]
    end

Request handling differs dramatically across circuit breaker states. In Closed state, all requests pass through normally. In Open state, requests are rejected immediately without calling the downstream service, preventing resource exhaustion. In Half-Open state, a limited number of test requests determine whether to resume normal operation or reopen the circuit.

Key Principles

Fail Fast, Not Slow When a service is down, waiting for timeouts wastes resources and degrades user experience. Circuit breakers detect failures quickly and reject requests immediately, typically returning an error in milliseconds rather than waiting 30+ seconds for a timeout. This principle is critical in high-throughput systems where thread pools are limited resources. For example, if your API gateway has 200 threads and each blocked request ties up a thread for 30 seconds, you can only handle 6-7 failing requests per second before the entire gateway becomes unresponsive. With circuit breakers, those threads are freed immediately, allowing the gateway to serve other requests successfully.
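The arithmetic behind that claim, using the numbers from the paragraph:

```python
# Back-of-envelope check of the thread-exhaustion numbers above
# (200 gateway threads, 30-second timeout per blocked request).
threads = 200
timeout_s = 30

# Each blocked request holds one thread for the full timeout, so the
# gateway can absorb at most threads / timeout_s failing requests per
# second before every thread is stuck waiting.
max_failing_rps = threads / timeout_s  # about 6.7 requests/second
```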

Automatic Recovery Testing Circuit breakers automatically test for service recovery without manual intervention. The half-open state implements a controlled experiment: send a few requests and see what happens. This is safer than immediately flooding a recovering service with full traffic, which could cause it to fail again. The principle here is progressive restoration—gradually increase traffic as confidence in the service’s health grows. Resilience4j, for instance, lets you configure how many calls are permitted in the half-open state and what failure rate sends the breaker back to open. This automation is essential for 24/7 operations where manual intervention isn’t scalable.

Threshold Tuning Based on Service Characteristics Not all services should have the same circuit breaker thresholds. A payment processing service might trip after 3 consecutive failures because financial transactions require high reliability. A recommendation service might tolerate 50% failures because recommendations are non-critical and can fall back to cached results. The key principle is aligning thresholds with business impact and service SLAs. During interviews, strong candidates discuss how they’d tune thresholds differently for critical vs. non-critical dependencies. For example, Uber’s payment service likely has very sensitive circuit breakers, while their restaurant photo service can tolerate more failures.
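Per-dependency tuning often ends up as configuration like the following sketch; the service names and numbers are invented for illustration.

```python
# Illustrative per-dependency breaker configuration, aligned with criticality.
BREAKER_CONFIG = {
    "payments":        {"failure_threshold": 3,  "window": 10,  "cooldown_s": 60},
    "auth":            {"failure_threshold": 3,  "window": 10,  "cooldown_s": 30},
    "recommendations": {"failure_threshold": 50, "window": 100, "cooldown_s": 15},
}

def config_for(service: str) -> dict:
    # Fall back to a conservative default for unknown dependencies.
    return BREAKER_CONFIG.get(
        service, {"failure_threshold": 5, "window": 20, "cooldown_s": 30}
    )
```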

Bulkhead Isolation Circuit breakers should be implemented per dependency, not globally. If Service A calls both Service B and Service C, it should maintain separate circuit breakers for each. This isolation prevents failures in Service B from affecting calls to Service C—a principle called bulkheading (borrowed from ship design where compartments prevent total flooding). Without this isolation, a single failing dependency could disable all external calls. In practice, this means maintaining a map of circuit breakers keyed by service identifier, and potentially even by endpoint within a service. For instance, a search service might have separate breakers for its autocomplete endpoint and its full-text search endpoint.
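The map-of-breakers idea can be sketched as a small registry; CircuitBreakerStub stands in for a real breaker class, and all names are illustrative.

```python
# Bulkhead sketch: one breaker per dependency, created lazily and keyed by
# service name (or service + endpoint for finer granularity).
class CircuitBreakerStub:
    def __init__(self, name):
        self.name = name
        self.state = "closed"

class BreakerRegistry:
    def __init__(self):
        self._breakers = {}

    def get(self, service: str, endpoint: str = "") -> CircuitBreakerStub:
        key = f"{service}/{endpoint}" if endpoint else service
        if key not in self._breakers:
            self._breakers[key] = CircuitBreakerStub(key)  # isolated per dependency
        return self._breakers[key]

registry = BreakerRegistry()
search_auto = registry.get("search", "autocomplete")
search_full = registry.get("search", "fulltext")
```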

Observable State Transitions Every circuit breaker state change should be logged, metered, and potentially alerted on. When a breaker opens, it’s a signal that something is wrong—either with the downstream service or with your configuration. Teams should have dashboards showing all circuit breaker states across their architecture, with the ability to drill into specific services. This observability principle extends to exposing breaker state via APIs so other systems can react. For example, a load balancer might stop sending traffic to an instance whose circuit breakers are all open, or a monitoring system might automatically create an incident ticket when critical service breakers open.

Bulkhead Isolation with Per-Dependency Circuit Breakers

graph TB
    Service[Service A<br/>API Gateway]
    
    subgraph CBL["Circuit Breaker Layer"]
        CB_B[Circuit Breaker<br/>for Service B<br/><i>Closed</i>]
        CB_C[Circuit Breaker<br/>for Service C<br/><i>Open - Failing</i>]
        CB_D[Circuit Breaker<br/>for Service D<br/><i>Closed</i>]
    end
    
    ServiceB[Service B<br/>Payment Service<br/>✓ Healthy]
    ServiceC[Service C<br/>Recommendation Service<br/>✗ Failing]
    ServiceD[Service D<br/>User Service<br/>✓ Healthy]
    
    Service --"Payment requests"--> CB_B
    Service --"Recommendation requests"--> CB_C
    Service --"User requests"--> CB_D
    
    CB_B --"Forward"--> ServiceB
    CB_C -."Reject immediately<br/>(fail fast)".-> ServiceC
    CB_D --"Forward"--> ServiceD
    
    ServiceB --"Success"--> CB_B
    ServiceD --"Success"--> CB_D

Bulkhead isolation prevents failures in one dependency from affecting others. Service A maintains separate circuit breakers for each downstream service. When Service C fails, only its circuit breaker opens—requests to Services B and D continue normally. This isolation is critical for preventing cascading failures across unrelated system components.


Deep Dive

Types / Variants

Count-Based Circuit Breakers Count-based breakers track a fixed number of recent requests (e.g., the last 100 calls) and calculate the failure rate from that window. When failures exceed the threshold within that window, the breaker trips. This approach is simple to implement and reason about—you always know exactly how many requests are being considered. However, it doesn’t account for time, so a burst of failures from 10 minutes ago can still affect the current state. Count-based breakers work well for high-throughput services where you get enough requests to fill the window quickly. For example, Resilience4j defaults to a count-based sliding window of the last 100 calls. The main drawback is that low-traffic services might take a long time to accumulate enough requests to trip the breaker, during which failures continue to impact users.

Time-Based Circuit Breakers Time-based breakers track all requests within a sliding time window (e.g., the past 60 seconds) and calculate the failure rate from that period. This approach better reflects current service health because old failures naturally age out. It’s particularly useful for services with variable traffic patterns—during low-traffic periods, you still get meaningful failure rates. The implementation is more complex because you need to maintain timestamps and periodically clean up old data. Netflix’s Hystrix, for example, computed its metrics over a rolling 10-second time window divided into buckets. The tradeoff is memory overhead: you’re storing more data than count-based breakers. Time-based breakers are ideal for services where traffic varies significantly throughout the day, like e-commerce sites that see spikes during sales events.
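A time-based window can be sketched with timestamped outcomes that age out of a deque; the injectable clock is just a testing convenience, and all names are illustrative.

```python
# Time-based window sketch: only outcomes from the last `window_seconds`
# count toward the failure rate; older entries age out.
import time
from collections import deque

class TimeWindow:
    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable for deterministic tests
        self.events = deque()                 # (timestamp, success) pairs

    def record(self, success: bool):
        self.events.append((self.clock(), success))

    def failure_rate(self) -> float:
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()             # drop aged-out entries
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)
```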

Adaptive Circuit Breakers Adaptive breakers dynamically adjust their thresholds based on observed service behavior rather than using fixed thresholds. For example, they might use statistical models to detect anomalies in error rates or latency percentiles. If a service normally has a 1% error rate but suddenly jumps to 5%, an adaptive breaker would trip even though 5% might be acceptable for a different service. Google’s SRE practices include adaptive error budgets that work similarly. The advantage is that breakers automatically tune themselves to each service’s characteristics without manual configuration. The downside is complexity and potential for false positives during legitimate traffic spikes. Adaptive breakers are best for large-scale systems with many services where manual tuning isn’t feasible.
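One simple way to sketch the adaptive idea is to track an exponentially weighted mean and variance of the observed error rate and flag rates far above the learned baseline. This is an illustrative anomaly check, not any particular library's algorithm.

```python
# Adaptive-threshold sketch: trip when the current error rate deviates far
# from a learned baseline, rather than at a fixed cutoff.
class AdaptiveThreshold:
    def __init__(self, sensitivity=3.0, alpha=0.1):
        self.mean = 0.0             # exponentially weighted mean error rate
        self.var = 0.0              # exponentially weighted variance
        self.alpha = alpha          # smoothing factor
        self.sensitivity = sensitivity  # how many std-devs counts as anomalous

    def update(self, error_rate: float) -> bool:
        """Record an observed error rate; return True if it looks anomalous."""
        std = self.var ** 0.5
        anomalous = std > 0 and (error_rate - self.mean) > self.sensitivity * std
        diff = error_rate - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous
```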

Concurrent Request Limiting (Semaphore Isolation) Some circuit breaker implementations include semaphore-based concurrency limits in addition to error rate tracking. This variant limits how many concurrent requests can be in-flight to a downstream service, regardless of error rates. For example, you might allow only 10 concurrent requests to a database. When the limit is reached, additional requests are rejected immediately. This prevents thread pool exhaustion even if the downstream service is slow but not failing. Hystrix supports this as an alternative to its default thread pool isolation. The tradeoff is that you’re limiting throughput even when the service is healthy. This variant is crucial for protecting against slow dependencies that don’t necessarily return errors but tie up resources.
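A semaphore-based limit can be sketched with a non-blocking acquire, so excess calls are rejected immediately rather than queued (illustrative names).

```python
# Semaphore-isolation sketch: cap concurrent in-flight calls regardless of
# error rate, so a slow-but-not-failing dependency can't exhaust threads.
import threading

class ConcurrencyLimiter:
    def __init__(self, max_concurrent=10):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, func):
        # Non-blocking acquire: reject immediately when the limit is reached.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("concurrency limit reached: rejecting")
        try:
            return func()
        finally:
            self._sem.release()

limiter = ConcurrencyLimiter(max_concurrent=2)
```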

Fallback-Enabled Circuit Breakers Many circuit breaker implementations support fallback functions that execute when the breaker is open. Instead of just returning an error, the breaker can return cached data, default values, or results from an alternative service. For example, when a product recommendation service’s breaker opens, it might return popular items from a cache rather than showing an error. This pattern combines circuit breaking with graceful degradation. The implementation complexity increases because you need to maintain fallback logic and potentially caching infrastructure. Fallback-enabled breakers are essential for user-facing services where showing something is better than showing nothing. Amazon’s product pages likely use this approach—if personalized recommendations fail, they show generic bestsellers.
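A fallback sketch, with an invented cache and function names; the point is that the degraded path returns something useful instead of an error.

```python
# Graceful-degradation sketch: when the breaker is open, serve generic cached
# results instead of failing. Cache contents and names are illustrative.
POPULAR_ITEMS_CACHE = ["bestseller-1", "bestseller-2", "bestseller-3"]

def fetch_personalized(user_id: str):
    # Stand-in for the real recommendation service call.
    return [f"pick-for-{user_id}"]

def get_recommendations(user_id: str, breaker_open: bool):
    if breaker_open:
        # Degraded but useful: popular items from a cache, no downstream call.
        return {"items": POPULAR_ITEMS_CACHE, "personalized": False}
    return {"items": fetch_personalized(user_id), "personalized": True}
```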

Trade-offs

Threshold Sensitivity: Aggressive vs. Conservative Aggressive thresholds (e.g., trip after 3 failures) provide fast failure detection and protect your system quickly, but they increase the risk of false positives from transient errors. A brief network hiccup could unnecessarily open the breaker, degrading user experience even though the service is actually healthy. Conservative thresholds (e.g., trip after 50% failure rate over 100 requests) reduce false positives but allow more failures to impact users before protection kicks in. The decision framework: use aggressive thresholds for critical dependencies where any failure is concerning (payment processing, authentication) and conservative thresholds for services with naturally higher error rates or where occasional failures are acceptable (analytics, recommendations). In practice, start conservative and tighten based on observed behavior.

Open State Duration: Short vs. Long Short open state durations (10-30 seconds) allow faster recovery when services come back online, reducing the window of degraded service. However, they risk overwhelming a service that’s still struggling by testing recovery too frequently. Long durations (2-5 minutes) give services more time to fully recover but extend the period where users experience degraded functionality. The decision framework depends on your service’s recovery characteristics. Stateless services that can recover instantly (like a restarted container) benefit from short durations. Stateful services that need time to rebuild caches or reconnect to databases need longer durations. Monitor your actual service recovery times and set durations to 2-3x the typical recovery time. For example, if your service usually recovers in 30 seconds, use a 60-90 second open state.

Half-Open Test Volume: Few vs. Many Requests Testing with few requests (1-3) in half-open state minimizes risk to the recovering service but provides less statistical confidence about its health. A single successful request might be lucky, not indicative of full recovery. Testing with many requests (10-20) gives better confidence but could overwhelm a service that’s only partially recovered. The decision framework: use few test requests for services with expensive operations (database writes, external API calls with rate limits) and more test requests for cheap operations (cache reads, simple computations). Also consider the cost of a false negative—if reopening the breaker is very disruptive to users, use more test requests to be confident. One robust approach is to gradually increase test volume: start with 1 request, then 3, then 10, closing the breaker only after sustained success.
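The escalating-probe idea can be sketched as staged probes; the stage sizes (1, then 3, then 10) mirror the example above, and all names are illustrative.

```python
# Progressive half-open sketch: pass successively larger probe stages before
# fully closing; any failure at any stage reopens the breaker.
PROBE_STAGES = [1, 3, 10]   # illustrative stage sizes

class ProgressiveProbe:
    def __init__(self, stages=PROBE_STAGES):
        self.stages = list(stages)
        self.stage = 0
        self.successes_in_stage = 0
        self.decision = None    # "close", "reopen", or None while still testing

    def record(self, success: bool):
        if self.decision is not None:
            return self.decision
        if not success:
            self.decision = "reopen"          # any failure reopens immediately
            return self.decision
        self.successes_in_stage += 1
        if self.successes_in_stage >= self.stages[self.stage]:
            self.stage += 1                   # stage passed, move to the next
            self.successes_in_stage = 0
            if self.stage == len(self.stages):
                self.decision = "close"       # all stages passed
        return self.decision
```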

Granularity: Service-Level vs. Endpoint-Level Breakers Service-level breakers (one breaker per downstream service) are simpler to implement and reason about—you have fewer breakers to monitor and configure. However, they can’t distinguish between healthy and unhealthy endpoints within a service. If one endpoint is failing but others work fine, a service-level breaker would block all traffic. Endpoint-level breakers (separate breaker per endpoint) provide finer-grained control and prevent healthy endpoints from being affected by failing ones. The tradeoff is complexity: more breakers mean more configuration, monitoring, and memory overhead. The decision framework: use service-level breakers for small services with similar endpoints, and endpoint-level breakers for large services with diverse functionality. For example, a payment service might have separate breakers for “authorize” and “capture” endpoints since they have different failure characteristics.

Client-Side vs. Server-Side Implementation Client-side breakers (implemented in the calling service) give each client control over its own resilience and allow different clients to have different thresholds. However, they don’t protect the downstream service from being overwhelmed by many clients simultaneously. Server-side breakers (implemented in the called service or in a proxy) protect the service itself and provide centralized control, but they don’t prevent clients from wasting resources on requests that will be rejected. The best approach is often both: client-side breakers for fast failure and resource protection, plus server-side rate limiting or load shedding for service protection. This defense-in-depth approach is what companies like Google and Netflix use. The decision framework: always implement client-side breakers, and add server-side protection for critical services that many clients depend on.

Threshold Sensitivity Comparison: Aggressive vs Conservative

graph TB
    subgraph AG["Aggressive Threshold - Trip after 3 failures"]
        Req1A[Request 1] --> Fail1A[❌ Failure]
        Req2A[Request 2] --> Fail2A[❌ Failure]
        Req3A[Request 3] --> Fail3A[❌ Failure]
        Fail3A --> TripA[Circuit Opens<br/><i>After 3 requests</i>]
        TripA --> ProtectA[✓ Fast protection<br/>✗ Risk of false positives]
    end
    
    subgraph CO["Conservative Threshold - Trip at 50% over 100 requests"]
        Req1C[Requests 1-50] --> Mix1C[25 success, 25 failures<br/>50% error rate]
        Req2C[Requests 51-100] --> Mix2C[25 success, 25 failures<br/>50% error rate]
        Mix2C --> TripC[Circuit Opens<br/><i>After 100 requests</i>]
        TripC --> ProtectC[✓ Fewer false positives<br/>✗ More failures impact users]
    end
    
    Decision{Service<br/>Criticality?}
    Decision --"Critical<br/>(Payment, Auth)"--> Aggressive
    Decision --"Non-Critical<br/>(Analytics, Recommendations)"--> Conservative

Aggressive thresholds provide fast failure detection but increase false positive risk from transient errors. Conservative thresholds reduce false positives but allow more user-impacting failures before protection activates. Choose aggressive thresholds for critical dependencies where any failure is concerning, and conservative thresholds for services with naturally higher error rates or where occasional failures are acceptable.