Failover in System Design: Active-Passive Guide
After this topic, you will be able to:
- Implement active-passive and active-active fail-over strategies
- Compare fail-over detection mechanisms and their latency implications
- Apply fail-over patterns to design highly available services
TL;DR
Fail-over automatically switches traffic from a failed primary system to a standby backup, ensuring continuous service availability. Active-passive keeps a standby (ideally hot) that takes over when the primary fails, while active-active distributes load across multiple systems that can each handle the full workload. Detection mechanisms (heartbeats, health checks) determine how quickly fail-over occurs, typically ranging from seconds to minutes.
Cheat Sheet: Active-Passive = one handles traffic, one waits | Active-Active = both handle traffic | Detection via heartbeat (network pulse) or health checks (application-level probes) | Watch for split-brain (both think they’re primary) | RTO depends on cold vs hot standby.
The Analogy
Think of fail-over like having a backup pilot on a commercial flight. In active-passive mode, the co-pilot monitors the captain but only takes the controls if the captain becomes incapacitated—there’s a brief moment of transition. In active-active mode, both pilots actively fly the plane together, each capable of handling everything, so if one has a problem the other is already at the controls. The “heartbeat” is like the co-pilot constantly checking “Are you okay?” and taking over the moment they don’t get a response.
Why This Matters in Interviews
Fail-over is a fundamental building block for achieving high availability SLAs (99.9%+) that interviewers expect you to discuss when designing critical systems. You’ll be evaluated on whether you understand the trade-offs between active-passive and active-active, can calculate recovery time objectives (RTO), and recognize the split-brain problem. Senior candidates must explain how fail-over integrates with load balancers, databases, and stateful services. This topic appears in nearly every system design interview for services requiring uptime guarantees—payment systems, messaging platforms, streaming services.
Core Concept
Fail-over is an availability pattern that maintains service continuity by automatically redirecting traffic from a failed component to a healthy backup. When your primary database crashes at 3 AM, fail-over is what keeps your application running without waking up an engineer. The pattern assumes that failures are inevitable—hardware dies, networks partition, software crashes—and builds resilience by maintaining redundant capacity that can assume the primary role within seconds or minutes.
The core mechanism involves three components: a primary system handling production traffic, one or more standby systems ready to take over, and a detection mechanism that identifies failures and triggers the switch. The sophistication lies in how quickly you detect failures, how seamlessly you transfer state, and how you prevent both systems from thinking they’re primary simultaneously (the split-brain problem). Companies like Netflix use fail-over extensively across their microservices architecture, with each service having standby instances across multiple AWS availability zones.
Active-Passive vs Active-Active Architecture Comparison
graph TB
subgraph Active-Passive
AP_LB[Load Balancer]
AP_Primary["Primary Server<br/><i>Handles all writes</i>"]
AP_Standby["Standby Server<br/><i>Idle/Read-only</i>"]
AP_DB[("Shared Database")]
AP_LB --"100% traffic"--> AP_Primary
AP_LB -."0% traffic<br/>(until fail-over)".-> AP_Standby
AP_Primary --"Replication"--> AP_Standby
AP_Primary --> AP_DB
AP_Standby -."Receives updates".-> AP_DB
end
subgraph Active-Active
AA_LB[Load Balancer]
AA_Node1["Node 1<br/><i>Handles writes</i>"]
AA_Node2["Node 2<br/><i>Handles writes</i>"]
AA_DB1[("Database<br/>Partition A")]
AA_DB2[("Database<br/>Partition B")]
AA_LB --"50% traffic"--> AA_Node1
AA_LB --"50% traffic"--> AA_Node2
AA_Node1 --> AA_DB1
AA_Node2 --> AA_DB2
AA_Node1 <-."Sync/Replicate".-> AA_Node2
AA_DB1 <-."Cross-replication".-> AA_DB2
end
Active-passive maintains a hot standby that only takes over during failures, wasting 50% capacity but ensuring single-writer consistency. Active-active distributes load across all nodes for better utilization, but requires coordination for write conflicts or data partitioning.
Split-Brain Scenario and Prevention Mechanisms
graph TB
subgraph Split-Brain Problem
SB_P["Primary<br/><i>Thinks it's active</i>"]
SB_S["Standby<br/><i>Thinks it's active</i>"]
SB_Net["❌ Network Partition"]
SB_C1[Client A]
SB_C2[Client B]
SB_DB1[("Database<br/>Version 1")]
SB_DB2[("Database<br/>Version 2")]
SB_C1 --"Write: balance=$100"--> SB_P
SB_C2 --"Write: balance=$50"--> SB_S
SB_P --> SB_DB1
SB_S --> SB_DB2
SB_P -.-x SB_Net
SB_Net x-.- SB_S
end
subgraph QuorumPrev["Prevention: Quorum-Based Writes"]
Q_P["Primary"]
Q_S["Standby"]
Q_R1[("Replica 1")]
Q_R2[("Replica 2")]
Q_R3[("Replica 3")]
Q_Net["❌ Network Partition"]
Q_P --"Write request"--> Q_R1
Q_P --"Write request"--> Q_R2
Q_P -.-x Q_Net
Q_Net x-.- Q_R3
Q_S -."Cannot reach quorum (2/3)".-> Q_R3
Q_P --"✓ Quorum achieved (2/3)"--> Q_R1
end
subgraph LockPrev["Prevention: Distributed Lock"]
DL_Consensus["Consensus System<br/><i>etcd/ZooKeeper</i>"]
DL_P["Primary<br/><i>Has lease</i>"]
DL_S["Standby<br/><i>No lease</i>"]
DL_P --"1. Heartbeat + renew lease"--> DL_Consensus
DL_S --"2. Request lease"--> DL_Consensus
DL_Consensus --"3. Denied (Primary has lease)"--> DL_S
DL_Consensus -."Lease expires after timeout".-> DL_P
end
Split-brain occurs when network partitions cause both primary and standby to accept writes, creating divergent state. Prevention requires either quorum-based operations (requiring majority consensus) or distributed locks (ensuring only one node holds the write lease at any time).
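The quorum rule in the diagram can be sketched in a few lines. This is an illustrative toy, not a real database client (the `Replica` class and `quorum_write` helper are invented for the example): a write commits only when a majority of replicas acknowledge it, so the minority side of a partition can never accept conflicting writes.

```python
# Toy quorum-gated write: commit only if a majority of replicas ack.

def quorum_write(replicas, key, value):
    """Attempt a write; return True only if a majority of replicas ack."""
    quorum = len(replicas) // 2 + 1          # e.g. 2 of 3, 3 of 5
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)        # raises on partition/timeout
            acks += 1
        except ConnectionError:
            continue                          # unreachable replica
    return acks >= quorum

class Replica:
    """Toy in-memory replica; `reachable` simulates a network partition."""
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}
    def write(self, key, value):
        if not self.reachable:
            raise ConnectionError("replica unreachable")
        self.data[key] = value

# Primary's side of the partition sees 2 of 3 replicas: quorum holds.
primary_view = [Replica(), Replica(), Replica(reachable=False)]
# Standby's side sees only 1 of 3: quorum fails, so it must refuse writes.
standby_view = [Replica(reachable=False), Replica(reachable=False), Replica()]

print(quorum_write(primary_view, "balance", 100))   # True
print(quorum_write(standby_view, "balance", 50))    # False
```

The key property is symmetry: both sides apply the same majority rule, so at most one side of any partition can ever commit.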
Multi-Region Active-Active Fail-Over (Netflix-Style Architecture)
graph TB
subgraph Global Layer
DNS["Route 53 DNS<br/><i>Latency-based routing</i>"]
GLB["Global Load Balancer<br/><i>Health-aware</i>"]
end
subgraph USEast["Region: us-east-1"]
subgraph AZ-1a
E1_App1["App Server"]
E1_Cache1[("Redis")]
end
subgraph AZ-1b
E1_App2["App Server"]
E1_Cache2[("Redis")]
end
subgraph AZ-1c
E1_App3["App Server"]
E1_DB[("Cassandra<br/>Replica 1")]
end
E1_LB["Regional LB<br/><i>Health checks every 5s</i>"]
end
subgraph USWest["Region: us-west-2"]
subgraph AZ-2a
W2_App1["App Server"]
end
subgraph AZ-2b
W2_App2["App Server"]
W2_DB[("Cassandra<br/>Replica 2")]
end
W2_LB["Regional LB"]
end
subgraph EUWest["Region: eu-west-1"]
EU_App["App Server"]
EU_DB[("Cassandra<br/>Replica 3")]
EU_LB["Regional LB"]
end
DNS --> GLB
GLB --"Primary: 60% traffic"--> E1_LB
GLB --"Secondary: 30% traffic"--> W2_LB
GLB --"Tertiary: 10% traffic"--> EU_LB
E1_LB --> E1_App1 & E1_App2 & E1_App3
W2_LB --> W2_App1 & W2_App2
EU_LB --> EU_App
E1_App1 & E1_App2 & E1_App3 --> E1_DB
W2_App1 & W2_App2 --> W2_DB
EU_App --> EU_DB
E1_DB <-."Quorum replication".-> W2_DB
W2_DB <-."Quorum replication".-> EU_DB
EU_DB <-."Quorum replication".-> E1_DB
Failure["❌ AZ-1c fails"]
Failure -."Health check fails".-> E1_App3
E1_LB -."Remove from pool (5s)".-> E1_App3
GLB -."Rebalance: 40% / 40% / 20%".-> W2_LB
Netflix-style multi-region active-active architecture distributes traffic across three regions, each with multiple availability zones. When an AZ fails, the regional load balancer removes failed instances within 5 seconds, and the global load balancer rebalances traffic across remaining healthy regions. Cassandra’s quorum-based replication ensures consistency even during partial failures.
How It Works
Fail-over operates through continuous health monitoring and automated decision-making. The detection layer sends periodic heartbeat signals (typically every 1-5 seconds) from the primary to a monitoring system or directly to the standby. If heartbeats stop arriving—indicating network failure, system crash, or resource exhaustion—the monitoring system initiates fail-over after a timeout period (usually 10-30 seconds to avoid false positives from transient network hiccups).
Once fail-over triggers, the standby system must assume the primary’s identity. For stateless services, this means updating DNS records or load balancer configurations to point to the standby’s IP address. For stateful services like databases, the standby must have up-to-date data (via replication), promote itself to accept writes, and ensure the old primary cannot come back online and create conflicting state. The entire sequence—detection, decision, promotion, traffic redirection—defines your Recovery Time Objective (RTO). A cold standby that must boot from scratch might take 5-10 minutes; a hot standby already running and synchronized can take over in 10-30 seconds.
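The RTO arithmetic is worth making explicit: total recovery time is just the sum of the phase durations. A minimal sketch, with illustrative numbers (not benchmarks):

```python
# Back-of-envelope RTO: sum the duration of each fail-over phase.
# All durations are illustrative examples, not measured values.

def estimate_rto(phases):
    """Total recovery time in seconds for a dict of phase durations."""
    return sum(phases.values())

hot_standby = {
    "detect_failure": 15,     # heartbeat timeout (10-30 s typical)
    "promote_standby": 5,     # already running and synchronized
    "update_routing": 5,      # load balancer config push
}
cold_standby = {
    "detect_failure": 15,
    "boot_and_load": 420,     # boot, load data, warm caches (~5-10 min)
    "promote_standby": 10,
    "update_routing": 5,
}

print(f"hot standby RTO:  {estimate_rto(hot_standby)} s")    # 25 s
print(f"cold standby RTO: {estimate_rto(cold_standby)} s")   # 450 s
```

Breaking RTO into phases like this also shows where to optimize: for hot standbys the detection timeout dominates, while for cold standbys it is the boot-and-warm phase.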
Active-Passive Fail-Over Detection and Promotion Flow
sequenceDiagram
participant M as Monitoring System
participant P as Primary Server
participant S as Standby Server
participant LB as Load Balancer
participant C as Clients
Note over M,S: Normal Operation
loop Every 2 seconds
P->>M: Heartbeat signal
M->>M: Reset timeout counter
end
P->>C: Handle requests
P->>S: Replicate data
Note over M,S: Primary Failure
P-xM: Heartbeat stops
M->>M: Wait 10-30 sec timeout
M->>M: Declare primary failed
Note over M,S: Fail-Over Sequence
M->>S: 1. Trigger promotion
S->>S: 2. Verify data sync
S->>S: 3. Promote to primary
M->>LB: 4. Update routing config
LB->>LB: 5. Point to standby IP
S->>C: 6. Accept new requests
Note over M,S: Total RTO: 15-45 seconds
The fail-over sequence shows how heartbeat monitoring detects failures and orchestrates the promotion of a standby to primary. The 10-30 second timeout prevents false positives from transient network issues, while the multi-step promotion ensures data consistency before accepting traffic.
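The promotion steps in the sequence diagram can be sketched as plain code. Every class here is a stand-in invented for illustration, not a real load balancer or database API:

```python
# Sketch of the promotion sequence: verify sync, promote, redirect traffic.

class Standby:
    def __init__(self, replication_lag_s):
        self.replication_lag_s = replication_lag_s
        self.role = "standby"
    def verify_sync(self, max_lag_s=2.0):
        # Step 2: refuse to serve if replication is too far behind.
        return self.replication_lag_s <= max_lag_s
    def promote(self):
        # Step 3: begin accepting writes.
        self.role = "primary"

class LoadBalancer:
    def __init__(self):
        self.target = "primary-ip"
    def point_to(self, ip):
        # Steps 4-5: update routing config to the standby's address.
        self.target = ip

def fail_over(standby, lb, standby_ip):
    """Run the promotion sequence; abort if replication lag is too high."""
    if not standby.verify_sync():
        raise RuntimeError("standby too far behind; needs manual intervention")
    standby.promote()
    lb.point_to(standby_ip)
    return standby.role

lb = LoadBalancer()
print(fail_over(Standby(replication_lag_s=0.5), lb, "standby-ip"))  # primary
print(lb.target)                                                    # standby-ip
```

Note the ordering: sync is verified before promotion, and routing is updated only after promotion succeeds, so clients never reach a standby that cannot serve them.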
Fail-Over Detection Mechanisms and Timeout Tuning
graph TB
Start(["Primary Operating Normally"])
HB["Send Heartbeat<br/><i>Every 2 seconds</i>"]
Receive{"Heartbeat<br/>Received?"}
Counter["Increment Miss Counter<br/><i>Current: X/3</i>"]
Threshold{"Missed >= 3<br/>heartbeats?"}
Wait["Wait 10-30 sec<br/><i>Grace period</i>"]
HealthCheck["Run Health Checks<br/><i>TCP, HTTP, App-level</i>"]
Healthy{"All checks<br/>pass?"}
FalsePositive["False Alarm<br/>Reset counter"]
TriggerFO["Trigger Fail-Over<br/><i>Promote standby</i>"]
Start --> HB
HB --> Receive
Receive -->|"Yes"| Start
Receive -->|"No"| Counter
Counter --> Threshold
Threshold -->|"No"| HB
Threshold -->|"Yes"| Wait
Wait --> HealthCheck
HealthCheck --> Healthy
Healthy -->|"Yes"| FalsePositive
FalsePositive --> Start
Healthy -->|"No"| TriggerFO
Note1["⚖️ Trade-off:<br/>Short timeout = faster recovery<br/>but more false positives"]
Note2["⚖️ Trade-off:<br/>Long timeout = fewer false alarms<br/>but longer outages"]
Threshold -.-> Note1
Wait -.-> Note2
Fail-over detection uses multiple signals (heartbeat timeouts, health checks) with configurable thresholds to balance false positive rate against recovery time. Production systems typically use 10-30 second grace periods after missing 3 consecutive heartbeats to avoid unnecessary fail-overs from transient network issues.
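The flowchart's miss-counter-plus-confirmation logic fits in one function. A hedged sketch; the threshold and the probe callables are illustrative:

```python
# Detection sketch: count consecutive missed heartbeats, then confirm with
# independent health checks before declaring the primary failed.

def should_fail_over(heartbeats, health_checks, miss_threshold=3):
    """heartbeats: chronological heartbeat results (True = received).
    health_checks: callables returning True when the probe passes."""
    missed = 0
    for received in heartbeats:
        missed = 0 if received else missed + 1   # any heartbeat resets the counter
        if missed >= miss_threshold:
            # Threshold reached: confirm with app-level checks so a transient
            # network blip does not trigger a spurious fail-over.
            return not all(check() for check in health_checks)
    return False

tcp_ok = lambda: True
http_ok = lambda: True
app_dead = lambda: False

# 3 misses but all health checks pass: false alarm, no fail-over.
print(should_fail_over([True, False, False, False], [tcp_ok, http_ok]))   # False
# 3 misses and an app-level check fails: trigger fail-over.
print(should_fail_over([True, False, False, False], [tcp_ok, app_dead]))  # True
```

Tuning `miss_threshold` is exactly the trade-off in the diagram: a lower value shortens detection time but raises the false positive rate.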
Key Principles
Health Detection Accuracy
The monitoring mechanism must distinguish between actual failures and transient network issues. Too sensitive, and you get false positives causing unnecessary fail-overs (which themselves can cause outages). Too lenient, and you delay recovery during real failures. Most production systems use multiple signals—heartbeat timeouts, health check endpoints returning errors, and resource metrics like CPU/memory exhaustion—combined with configurable thresholds.
Example: Discord’s voice servers use both network heartbeats (every 2 seconds) and application-level health checks (can the server encode audio?) with a 15-second timeout. This prevents fail-over during brief network blips while catching actual server crashes within acceptable latency.
State Synchronization
The standby must have sufficiently recent state to take over without data loss or corruption. For databases, this means replication lag must be minimal (ideally zero for synchronous replication, or seconds for asynchronous). For in-memory caches or session stores, you need to decide whether to replicate state continuously or accept that fail-over means losing recent data.
Example: Twitter’s timeline service uses active-passive fail-over with asynchronous replication, accepting that the standby might be 1-2 seconds behind. During fail-over, a small number of recent tweets might need to be re-fetched, but this is preferable to the latency cost of synchronous replication on every write.
Fencing and Split-Brain Prevention
You must ensure the old primary cannot continue operating after fail-over, creating a split-brain scenario where two systems both believe they’re primary and accept conflicting writes. Fencing mechanisms include STONITH (Shoot The Other Node In The Head), which forcibly powers down the old primary, or distributed consensus systems that grant a lease to only one primary at a time.
Example: Cassandra uses a quorum-based approach where a node must communicate with a majority of replicas to accept writes. During network partitions, only the partition with quorum can continue operating, preventing split-brain even if both sides think they’re primary.
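The lease-granting idea behind fencing can be sketched with a toy lease store. This is not the etcd or ZooKeeper API; it only illustrates the rule that a standby cannot promote itself while the primary's lease is unexpired:

```python
# Toy lease-based fencing: only the node holding an unexpired lease may act
# as primary. TTL and node names are illustrative.
import time

class LeaseStore:
    """Single source of truth for who may write. A stand-in for a consensus
    system; a real one replicates this state across a quorum of nodes."""
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at = 0.0
    def acquire(self, node, now=None):
        """Grant or renew the lease; deny if another node holds a live one."""
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder, self.expires_at = node, now + self.ttl_s
            return True
        return False

store = LeaseStore(ttl_s=10)
assert store.acquire("primary", now=0.0)       # primary takes the lease
assert not store.acquire("standby", now=5.0)   # denied: lease still live
assert store.acquire("standby", now=15.0)      # lease expired: standby promotes
print(store.holder)                            # standby
```

The primary must keep renewing (heartbeating) to hold the lease; if it is partitioned away, renewal stops, the lease expires, and exactly one successor can acquire it.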
Deep Dive
Types / Variants
Active Passive
In active-passive (also called master-slave), the primary handles all traffic while the standby remains idle or serves read-only queries. The standby continuously receives replicated data but doesn’t accept writes. When fail-over occurs, the standby promotes itself to primary, updates routing (DNS, load balancer, or virtual IP), and begins accepting writes. This pattern is simpler to implement and reason about because there’s always exactly one writer, avoiding consistency conflicts.
The key decision is hot vs. cold standby. A hot standby runs the full application stack, has data pre-loaded in memory, and can take over in 10-30 seconds. A cold standby is powered off or running minimal processes, requiring 5-10 minutes to boot, load data, and warm caches. Hot standbys waste resources (you’re paying for idle capacity) but provide faster recovery. Most production systems use hot standbys for critical paths and cold standbys for less critical components. The downtime window during fail-over is your RTO—if you promise 99.9% uptime (43 minutes/month), you can afford a few 30-second fail-overs but not 10-minute ones.
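The uptime-budget arithmetic from the last sentence, as a quick calculation (a 30-day month is assumed):

```python
# How many fail-overs fit in an SLA budget? 99.9% uptime allows roughly
# 43 minutes of downtime per 30-day month; each fail-over spends its RTO.

def downtime_budget_s(sla, period_s=30 * 24 * 3600):
    """Allowed downtime in seconds for a given availability target."""
    return (1 - sla) * period_s

budget = downtime_budget_s(0.999)
print(f"monthly budget: {budget / 60:.1f} min")             # 43.2 min
print(f"30 s fail-overs that fit: {int(budget // 30)}")     # 86
print(f"10 min fail-overs that fit: {int(budget // 600)}")  # 4
```

This is why hot standbys pay for themselves at tight SLAs: a handful of cold-standby fail-overs can consume an entire month's budget, while 30-second fail-overs barely dent it.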
Active Active
In active-active (also called master-master), multiple systems simultaneously handle production traffic, each capable of serving the full workload. A load balancer distributes requests across all active nodes, and if one fails, the load balancer simply stops sending it traffic—the remaining nodes absorb the load. This provides faster “fail-over” (really just load redistribution) since there’s no promotion step, and better resource utilization since all capacity is actively used.
The complexity lies in maintaining consistency when multiple systems accept writes. For stateless services (API servers, web frontends), active-active is straightforward—each request is independent. For stateful services (databases, caches), you need either: (1) partitioning where each node owns a subset of data, (2) multi-master replication with conflict resolution, or (3) distributed consensus (Raft, Paxos) to coordinate writes. Active-active works beautifully for read-heavy workloads where you can replicate data everywhere, but write-heavy workloads often fall back to active-passive to avoid coordination overhead.
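Option (1), partitioning, can be sketched with a simple hash-based ownership function. This is a deliberate simplification: real systems typically use consistent hashing so nodes can join and leave without remapping most keys.

```python
# Key-based partitioning for active-active writes: each node owns a disjoint
# slice of the keyspace, so no two nodes ever write the same key.
import hashlib

NODES = ["node-1", "node-2"]

def owner(key, nodes=NODES):
    """Deterministically map a key to exactly one active node."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Every writer computes the same ownership, so concurrent writes to a given
# key always land on the same node and never conflict.
assert owner("user:42") == owner("user:42")
routing = {k: owner(k) for k in ["user:1", "user:2", "user:3", "user:4"]}
print(routing)
```

Because ownership is a pure function of the key, no coordination protocol is needed on the write path; the cost is that a node failure requires reassigning its partition rather than just redistributing load.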
Trade-offs
Recovery Time
- Active-Passive: 10-30 seconds (hot standby) to 5-10 minutes (cold standby) for detection, promotion, and routing updates
- Active-Active: near-instant (1-2 seconds), since the load balancer simply stops routing to the failed node
Decision framework: Choose active-active for user-facing services where even 30 seconds of degraded performance is unacceptable. Choose active-passive for backend systems where brief outages are tolerable and you want simpler consistency guarantees.
Resource Efficiency
- Active-Passive: 50% waste (the standby sits idle) but simpler to operate
- Active-Active: 100% utilization but requires handling partial failures and load redistribution
Decision framework: For expensive resources (large databases, GPU clusters), active-passive waste is costly. For cheap resources (stateless API servers), the operational simplicity of active-passive may be worth the cost.
Consistency Complexity
- Active-Passive: a single writer eliminates conflicts and simplifies replication
- Active-Active: must handle concurrent writes via conflict resolution or coordination protocols
Decision framework: If your data model has complex relationships or requires strong consistency (financial transactions, inventory), active-passive is safer. If you can tolerate eventual consistency or partition data cleanly, active-active provides better availability.
Common Pitfalls
Split-Brain Scenarios
Why it happens: Network partitions can cause both primary and standby to think the other has failed, leading both to accept writes and create divergent state. This is catastrophic for systems requiring consistency—imagine two database primaries both processing payments, creating duplicate charges or inventory conflicts.
How to avoid: Implement fencing mechanisms that prevent the old primary from continuing after fail-over. Use distributed consensus (etcd, ZooKeeper) to grant a lease to exactly one primary, or STONITH to forcibly power down the old primary. For databases, use quorum-based writes where a node must reach a majority of replicas to commit.
Cascading Failures During Fail-Over
Why it happens: When the primary fails and traffic shifts to the standby, the sudden load spike can overwhelm the standby if it wasn’t provisioned for full capacity or if caches are cold. This causes the standby to fail, triggering another fail-over, creating a cascade.
How to avoid: Ensure standbys have identical capacity to primaries and keep caches warm by serving read traffic. Implement gradual traffic shifting during fail-over (10% → 50% → 100% over 30 seconds) to allow the standby to warm up. Add circuit breakers to prevent retry storms during the transition.
Stale DNS Caching
Why it happens: If you use DNS-based fail-over (updating A records to point to the standby), clients and intermediate DNS servers may cache the old IP for minutes or hours based on TTL settings, continuing to send traffic to the failed primary.
How to avoid: Use low TTLs (30-60 seconds) for critical services, though this increases DNS query load. Better yet, use load balancer-based fail-over where the load balancer performs health checks and routes traffic without DNS changes. For client-side fail-over, implement retry logic that tries alternate endpoints.
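Client-side retry against alternate endpoints, as mentioned in the last sentence, can be sketched like this. The hostnames and the `send` function are invented for illustration; a real client would also add backoff and idempotency keys before retrying:

```python
# Client-side fail-over: instead of waiting out a DNS TTL, the client retries
# the same (idempotent) request against a list of alternate endpoints.

ENDPOINTS = ["api-primary.example.com", "api-standby.example.com"]

def request_with_failover(send, endpoints=ENDPOINTS):
    """Try each endpoint in order; return the first successful response."""
    last_error = None
    for host in endpoints:
        try:
            return send(host)
        except ConnectionError as exc:
            last_error = exc            # failed host: fall through to the next
    raise last_error                    # every endpoint failed

def send(host):
    # Simulate the primary being down while the standby answers.
    if host == "api-primary.example.com":
        raise ConnectionError("primary unreachable")
    return f"200 OK from {host}"

print(request_with_failover(send))   # 200 OK from api-standby.example.com
```

This pattern only works safely when requests are idempotent; otherwise a retry after an ambiguous failure (request sent, response lost) can execute the operation twice.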
Real-World Examples
Netflix: Cassandra Database Clusters
Netflix runs Cassandra in active-active mode across three AWS availability zones, with each zone containing a full replica of the data. When an AZ fails (which happens several times per year), the load balancer immediately stops routing to that zone and the remaining two zones absorb the traffic. Because Cassandra uses quorum reads/writes (requiring 2 out of 3 replicas to agree), the system continues operating with strong consistency even during single-AZ failures. This design provides sub-second fail-over with no data loss, supporting Netflix’s 99.99% availability target for streaming.
Discord: Voice Server Infrastructure
Discord uses active-passive fail-over for voice servers with hot standbys. Each voice region has a primary server handling WebRTC connections and a standby server receiving replicated state (user positions, audio routing). Heartbeats run every 2 seconds, and if three consecutive heartbeats fail (a 6-second timeout), the standby promotes itself and sends reconnection messages to all connected clients. Clients automatically reconnect to the new server within 10 seconds. This approach balances cost (standbys are cheaper than running full active-active) with acceptable recovery time for voice chat.
Stripe: Payment Processing API
Stripe’s API uses active-active fail-over across multiple data centers with sophisticated request routing. Each data center can handle the full payment load, and requests are routed to the nearest healthy data center. When a data center fails health checks (measured by error rates and latency), the global load balancer removes it from rotation within 5 seconds. Because payment requests are idempotent (using idempotency keys), clients can safely retry failed requests against a different data center without creating duplicate charges. This design achieves 99.99%+ availability while maintaining strong consistency for financial transactions.
Interview Expectations
Mid-Level
Explain the difference between active-passive and active-active fail-over with concrete examples. Describe how heartbeat-based detection works and typical timeout values. Discuss the trade-off between recovery time and false positive rate. When designing a system, identify which components need fail-over and choose an appropriate pattern based on consistency requirements and cost constraints.
Senior
Design fail-over mechanisms that handle edge cases like split-brain, cascading failures, and partial network partitions. Calculate RTO based on detection timeout, promotion time, and DNS/load balancer propagation. Explain how fail-over integrates with data replication (synchronous vs. asynchronous) and the consistency implications. Discuss operational concerns like testing fail-over regularly, monitoring fail-over latency, and handling fail-back (returning to the original primary). Compare fail-over with other availability patterns like circuit breakers and bulkheads.
Staff+
Architect multi-region fail-over strategies that balance latency, cost, and regulatory requirements (data residency). Design consensus-based fail-over using Raft or Paxos to prevent split-brain in distributed systems. Explain how to achieve zero-downtime fail-over for stateful services by combining synchronous replication, connection draining, and gradual traffic shifting. Discuss the economics of fail-over capacity planning—when to use hot standbys vs. cold standbys vs. active-active based on cost of downtime. Describe how to test fail-over at scale (chaos engineering) without impacting production.
Common Interview Questions
How would you implement fail-over for a stateful service like a database vs. a stateless API server?
What happens if the network between primary and standby fails but both systems are healthy? How do you prevent split-brain?
How do you decide between active-passive and active-active for a payment processing system?
What’s your RTO if you use DNS-based fail-over with a 60-second TTL vs. load balancer-based fail-over?
How would you test that your fail-over actually works without causing a production outage?
Red Flags to Avoid
Not mentioning split-brain or assuming it can’t happen
Claiming fail-over is instant without accounting for detection time, promotion, and routing updates
Choosing active-active for strongly consistent systems without explaining conflict resolution
Not considering the cost of idle standby capacity or how to justify it
Forgetting that fail-over itself can cause outages if not implemented carefully
Key Takeaways
Active-passive uses a standby that takes over when the primary fails (10-30 second RTO), while active-active distributes load across multiple systems for near-instant fail-over (1-2 seconds). Choose based on consistency requirements and acceptable downtime.
Detection mechanisms (heartbeat, health checks) must balance false positives (unnecessary fail-overs) against detection latency (longer outages). Typical production systems use 10-30 second timeouts with multiple signals.
Split-brain (both systems thinking they’re primary) is the most dangerous fail-over failure mode. Prevent it using fencing (STONITH), distributed consensus (Raft/Paxos), or quorum-based operations.
Fail-over isn’t free—hot standbys waste 50% capacity, cold standbys increase RTO, and active-active requires complex coordination. Calculate the cost of downtime vs. the cost of redundancy to choose the right pattern.
Test fail-over regularly in production using chaos engineering. Untested fail-over mechanisms fail when you need them most, often making outages worse instead of better.
Related Topics
Prerequisites
Availability Patterns - Understanding availability fundamentals and SLA calculations
Replication - How data is synchronized between primary and standby systems
Load Balancing - How traffic is distributed and redirected during fail-over
Next Steps
Health Checks - Implementing robust failure detection mechanisms
Disaster Recovery - Broader strategies for handling catastrophic failures
Consensus Algorithms - Preventing split-brain using Raft and Paxos
Related
Circuit Breaker - Complementary pattern for handling downstream failures
CAP Theorem - Understanding consistency trade-offs during partitions