Availability Patterns: Active-Active & Failover

Intermediate · 12 min read · Updated 2026-02-11

After this topic, you will be able to:

  • Compare different availability patterns and their redundancy strategies
  • Analyze trade-offs between active-passive and active-active architectures
  • Evaluate which availability pattern fits specific reliability requirements

TL;DR

Availability patterns are architectural strategies that keep systems operational during failures through redundancy, fail-over, and replication. The core trade-off is between cost (running duplicate infrastructure) and uptime guarantees. Active-passive patterns prioritize cost efficiency with standby replicas, while active-active patterns maximize availability by distributing load across all replicas.

Cheat Sheet: Redundancy = multiple copies of components. Fail-over = switching to backup when primary fails. Replication = keeping data synchronized across copies. Active-passive = one serves, others wait. Active-active = all serve traffic simultaneously.

The Analogy

Think of availability patterns like emergency backup systems in a hospital. The hospital doesn’t just have one power source—it has the main grid (primary), a diesel generator (standby), and battery backup (redundancy). When the grid fails, the generator kicks in automatically (fail-over). Critical systems like life support run on multiple power sources simultaneously (active-active), while less critical systems like hallway lights can wait for the generator to start (active-passive). The hospital pays for this redundancy because downtime literally costs lives. In system design, you’re making the same calculation: how much does downtime cost your business?

Why This Matters in Interviews

Availability patterns are the foundation of every “design a highly available system” question. Interviewers use this topic to assess whether you understand that availability isn’t free—it requires intentional architectural decisions with clear trade-offs. When you’re asked to design Netflix, Uber, or any mission-critical system, your first question should be about availability requirements (99.9% vs 99.99%), which then drives your choice of patterns. Mid-level engineers often jump straight to “use a load balancer” without explaining why or what kind. Senior engineers frame availability as a business decision: “If we need 99.95% uptime, we need at least N+1 redundancy with automated fail-over, which costs X but prevents Y in lost revenue.” This topic separates candidates who memorize architectures from those who design them.


Core Concept

Availability patterns are proven architectural approaches for keeping systems operational when components fail. The fundamental insight is that failures are inevitable—hardware crashes, networks partition, data centers lose power—so high availability requires eliminating single points of failure through redundancy. Every availability pattern is a specific strategy for organizing redundant components and coordinating their behavior during failures.

The relationship between redundancy and availability is mathematical but intuitive: if one server has 99% uptime (down 3.65 days/year), two independent servers with automatic fail-over achieve 99.99% uptime (down 52 minutes/year). This is because both must fail simultaneously for the system to be unavailable. However, redundancy alone isn’t enough—you need fail-over mechanisms to detect failures and switch traffic, plus replication to keep redundant components synchronized. These three concepts—redundancy, fail-over, and replication—form the building blocks of all availability patterns.
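The arithmetic above can be checked directly. A minimal sketch (the formula assumes the replicas fail independently, which correlated failures in practice can violate):

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year implied by an availability fraction."""
    return (1 - availability) * 365 * 24 * 60

def combined_availability(node_availability: float, replicas: int) -> float:
    """N independent replicas: the system is down only if all fail at once."""
    return 1 - (1 - node_availability) ** replicas

# One 99% server: ~3.65 days of downtime per year.
print(round(downtime_minutes_per_year(0.99) / (60 * 24), 2))  # → 3.65
# Two independent 99% servers with instant fail-over: 99.99%.
print(round(combined_availability(0.99, 2), 6))               # → 0.9999
print(round(downtime_minutes_per_year(0.9999), 1))            # → 52.6
```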

The core trade-off is cost versus availability. Running duplicate infrastructure doubles your compute costs. Keeping data synchronized across replicas adds latency and complexity. Active-active configurations require sophisticated load balancing and conflict resolution. Companies like Netflix and Uber accept these costs because downtime is more expensive: Netflix loses subscriber trust and revenue, Uber loses rides and driver earnings. Your job in interviews is to articulate this trade-off explicitly and choose patterns that match business requirements.

Availability Building Blocks: Redundancy, Fail-over, and Replication

graph TB
    subgraph "Availability Pattern Components"
        R["Redundancy<br/><i>Multiple copies of components</i>"]
        F["Fail-over<br/><i>Automatic switching on failure</i>"]
        Rep["Replication<br/><i>Data synchronization</i>"]
    end
    
    subgraph "Example: Database High Availability"
        Primary["Primary DB<br/><i>Serves all traffic</i>"]
        Standby["Standby DB<br/><i>Ready backup</i>"]
        Monitor["Health Monitor<br/><i>Detects failures</i>"]
        
        Primary -."Continuous replication".-> Standby
        Monitor --"Health checks"--> Primary
        Monitor --"Promotes on failure"--> Standby
    end
    
    R -."Enables".-> Primary
    R -."Enables".-> Standby
    F -."Implements".-> Monitor
    Rep -."Keeps in sync".-> Primary
    Rep -."Keeps in sync".-> Standby
    
    Result["High Availability<br/><i>System survives failures</i>"]
    Primary & Standby & Monitor --> Result

High availability requires all three mechanisms working together. Redundancy provides backup components, replication keeps them synchronized, and fail-over automatically switches traffic when failures occur. Missing any one mechanism breaks the availability guarantee.
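A toy sketch of the three mechanisms working together (names and structure are illustrative, not any real database's API):

```python
class ReplicaPair:
    """Toy active-passive pair: redundancy (two nodes), replication
    (every write copied to the standby), and fail-over (promotion)."""

    def __init__(self):
        self.primary = {"name": "A", "data": []}
        self.standby = {"name": "B", "data": []}

    def write(self, record):
        self.primary["data"].append(record)
        # Synchronous replication: the standby gets the write before we ack.
        self.standby["data"].append(record)

    def fail_over(self):
        # A health monitor would trigger this after consecutive failed checks.
        self.primary, self.standby = self.standby, self.primary

pair = ReplicaPair()
pair.write("order-1")
pair.fail_over()  # primary dies; standby is promoted
print("order-1" in pair.primary["data"])  # → True: sync replication lost nothing
```

Remove any one piece and the guarantee breaks: without replication the promoted node is empty; without fail-over the redundant copy never takes traffic.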

Availability Pattern Decision Tree

flowchart TB
    Start["What's your availability SLA?"] --> SLA_Check{"SLA requirement?"}
    
    SLA_Check -->|"99%-99.9%<br/>(43 min/month)"| Cost_Check1{"Budget constrained?"}
    SLA_Check -->|"99.95%-99.99%<br/>(4 min/month)"| Consistency_Check
    SLA_Check -->|"99.999%+<br/>(26 sec/month)"| Multi_Region
    
    Cost_Check1 -->|"Yes"| AP_Single["Active-Passive<br/>Single Region<br/><i>Hot standby, 30s RTO</i>"]
    Cost_Check1 -->|"No"| AA_Single["Active-Active<br/>Single Region<br/><i>Multi-AZ, instant fail-over</i>"]
    
    Consistency_Check{"Need strong<br/>consistency?"} -->|"Yes<br/>(Financial, inventory)"| AP_Multi["Active-Passive<br/>Multi-Region<br/><i>Sync replication, 30s RTO</i>"]
    Consistency_Check -->|"No<br/>(Social, analytics)"| AA_Multi["Active-Active<br/>Multi-Region<br/><i>Eventual consistency, 0s RTO</i>"]
    
    Multi_Region["Active-Active<br/>Multi-Region + Multi-AZ<br/><i>Geographic redundancy</i>"]
    
    AP_Single -."Cost: $".-> Cost1["50% utilization<br/>Simple ops"]
    AA_Single -."Cost: $$".-> Cost2["100% utilization<br/>Complex LB"]
    AP_Multi -."Cost: $$$".-> Cost3["Cross-region bandwidth<br/>Sync replication latency"]
    AA_Multi -."Cost: $$$$".-> Cost4["Conflict resolution<br/>Global load balancing"]
    Multi_Region -."Cost: $$$$$".-> Cost5["Maximum redundancy<br/>Highest complexity"]

Choose availability patterns based on SLA requirements and constraints. Higher availability requires more redundancy and cost. Strong consistency requirements push toward active-passive patterns, while eventual consistency enables active-active. Most systems need 99.9%-99.95% availability, not 99.999%—avoid over-engineering.
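The downtime budgets in the decision tree fall out of one line of arithmetic (here using a 30-day month):

```python
def downtime_budget_minutes(sla: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime per period (default: a 30-day month) for a given SLA."""
    return (1 - sla) * period_minutes

for sla in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{sla:.3%} -> {downtime_budget_minutes(sla):.2f} min/month")
# 99.9% allows ~43 min/month; 99.99% only ~4.3; 99.999% about half a minute.
```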

Uber’s Hybrid Availability Architecture

graph TB
    subgraph "Trip Dispatch (Active-Active)"
        User["Rider/Driver<br/><i>Mobile App</i>"]
        GLB["Global Load Balancer<br/><i>Health-check based routing</i>"]
        
        subgraph "Availability Zone 1"
            Dispatch1["Dispatch Service"]
            DB1["(Trip DB Replica)"]
        end
        
        subgraph "Availability Zone 2"
            Dispatch2["Dispatch Service"]
            DB2["(Trip DB Replica)"]
        end
        
        subgraph "Availability Zone 3"
            Dispatch3["Dispatch Service"]
            DB3["(Trip DB Replica)"]
        end
        
        User --"1. Request ride"--> GLB
        GLB --"2a. Route (33%)"--> Dispatch1
        GLB --"2b. Route (33%)"--> Dispatch2
        GLB --"2c. Route (33%)"--> Dispatch3
        
        Dispatch1 <-."Read/Write".-> DB1
        Dispatch2 <-."Read/Write".-> DB2
        Dispatch3 <-."Read/Write".-> DB3
        
        DB1 <-."Async replication".-> DB2
        DB2 <-."Async replication".-> DB3
        DB3 <-."Async replication".-> DB1
    end
    
    subgraph "Payment System (Active-Passive)"
        Payment_LB["Payment Load Balancer"]
        Primary_Pay["Primary Payment Service<br/><i>Handles all writes</i>"]
        Standby_Pay["Standby Payment Service<br/><i>Ready for promotion</i>"]
        Primary_DB["(Payment DB Primary)"]
        Standby_DB["(Payment DB Standby)"]
        
        Dispatch1 & Dispatch2 & Dispatch3 --"3. Process payment"--> Payment_LB
        Payment_LB --"4. Route to primary"--> Primary_Pay
        Primary_Pay --"5. Write transaction"--> Primary_DB
        Primary_DB -."Sync replication<br/>(zero data loss)".-> Standby_DB
        Primary_Pay -."Heartbeat".-> Standby_Pay
        Standby_Pay -."Promote on failure<br/>(30s RTO)".-> Payment_LB
    end
    
    Note1["Why Active-Active?<br/>Downtime = lost trips<br/>2s fail-over acceptable"]-.->Dispatch2
    Note2["Why Active-Passive?<br/>Strong consistency required<br/>Prevent double-charging"]-.->Primary_Pay

Uber uses a hybrid approach: active-active for trip dispatch (eventual consistency acceptable, downtime very expensive) and active-passive for payments (strong consistency required, brief downtime acceptable). This optimizes for both availability and correctness based on component-specific requirements.

How It Works

Availability patterns work by combining three mechanisms in different configurations. First, redundancy means running multiple copies of every critical component—multiple web servers, multiple databases, multiple data centers. The redundancy level determines your availability ceiling: N+1 (one backup), N+2 (two backups), or geographic redundancy (entire regions as backups).

Second, fail-over is the process of detecting when a primary component fails and automatically routing traffic to a backup. See Fail-Over for detection mechanisms and switching strategies. The key insight is that fail-over introduces a recovery time objective (RTO)—the time between failure and restoration of service. Active-passive patterns have higher RTO (30 seconds to 5 minutes) because standbys need to warm up. Active-active patterns have near-zero RTO because backups are already serving traffic.
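Where active-passive RTO comes from can be made concrete. A sketch with hypothetical parameters (check interval, failure threshold, promotion and routing-update times are all illustrative):

```python
def active_passive_rto_seconds(check_interval_s: float,
                               failures_to_trip: int,
                               promotion_s: float,
                               routing_update_s: float = 0.0) -> float:
    """Worst-case RTO: time to detect the failure (consecutive failed
    health checks), promote the standby, and update traffic routing."""
    detection = check_interval_s * failures_to_trip
    return detection + promotion_s + routing_update_s

# Hypothetical: 5s checks, 3 strikes, 20s promotion, 10s routing update.
print(active_passive_rto_seconds(5, 3, 20, 10))  # → 45
```

Each term is a tuning knob: faster checks shrink detection time but risk false positives that trigger unnecessary fail-overs.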

Third, replication keeps redundant components synchronized so they can seamlessly take over. See Replication for synchronous versus asynchronous strategies. The replication strategy determines your recovery point objective (RPO)—how much data you can lose during a failure. Synchronous replication guarantees zero data loss but adds latency. Asynchronous replication is faster but risks losing recent writes.
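The RPO of asynchronous replication can be estimated the same way (numbers are illustrative):

```python
def writes_at_risk(writes_per_second: float, replication_lag_s: float) -> float:
    """With async replication, RPO is roughly the replication lag: any write
    not yet shipped to the standby is lost if the primary dies now."""
    return writes_per_second * replication_lag_s

# Hypothetical: 500 writes/s with 2s of lag → up to 1000 writes lost on fail-over.
print(writes_at_risk(500, 2.0))  # → 1000.0
```

Synchronous replication drives this number to zero, at the cost of adding a network round-trip to every write.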

These mechanisms combine into two fundamental patterns: active-passive (also called master-slave or primary-backup), where one component serves traffic while others wait on standby, and active-active (also called multi-master), where all components serve traffic simultaneously. The choice between them depends on your availability requirements, consistency needs, and budget constraints.

Active-Passive vs Active-Active Request Flow

graph LR
    subgraph "Active-Passive Pattern"
        Client1["Client"]
        LB1["Load Balancer"]
        Primary["Primary Server<br/><i>Handles all traffic</i>"]
        Standby["Standby Server<br/><i>Idle, ready to promote</i>"]
        DB1["(Database)"]
        
        Client1 --"1. Request"--> LB1
        LB1 --"2. Route to primary"--> Primary
        Primary --"3. Read/Write"--> DB1
        Primary -."Continuous replication".-> Standby
        Standby -."On failure: promote".-> LB1
    end
    
    subgraph "Active-Active Pattern"
        Client2["Client"]
        LB2["Load Balancer"]
        Server1["Server 1<br/><i>Serves traffic</i>"]
        Server2["Server 2<br/><i>Serves traffic</i>"]
        DB2["(Database)"]
        
        Client2 --"1. Request"--> LB2
        LB2 --"2a. Route (50%)"--> Server1
        LB2 --"2b. Route (50%)"--> Server2
        Server1 --"3. Read/Write"--> DB2
        Server2 --"3. Read/Write"--> DB2
    end
    
    Note1["RTO: 30s-5min<br/>Cost: 50% utilization<br/>Consistency: Simple"]-.->Standby
    Note2["RTO: Near-zero<br/>Cost: 100% utilization<br/>Consistency: Complex"]-.->Server2

Active-passive wastes 50% capacity but provides simple consistency (only one writer). Active-active uses all capacity and has instant fail-over, but requires conflict resolution when multiple nodes write simultaneously. The choice depends on whether you prioritize cost efficiency or maximum availability.

Key Principles

principle: Eliminate Single Points of Failure explanation: Every component that can fail must have a redundant backup. This includes not just servers, but load balancers, databases, network switches, and even entire data centers. The principle applies recursively: if your load balancer is a single point of failure, you need redundant load balancers with fail-over between them. example: Uber runs multiple instances of every microservice across multiple availability zones. If one zone loses power, traffic automatically routes to healthy zones. But they also discovered their DNS provider was a single point of failure—when it went down in 2016, Uber was unreachable globally despite having redundant infrastructure. They now use multiple DNS providers with automatic fail-over.

principle: Fail-Over Must Be Automatic and Fast explanation: Manual fail-over is too slow for high availability. If your on-call engineer needs 10 minutes to notice a failure and 20 minutes to switch to backup, a single incident burns 30 of the 43 minutes a 99.9% SLA allows per month. Automated fail-over with health checks and circuit breakers is non-negotiable for anything above 99% availability. example: Discord uses active-active database clusters with automatic fail-over. When a database node fails, the load balancer detects it within 2 seconds via health checks and stops routing queries to it. The remaining nodes handle the load with no user-visible downtime. Manual fail-over would mean 5-10 minutes of errors for millions of users.

principle: Redundancy Costs Scale with Availability Requirements explanation: Each additional nine of availability (99% → 99.9% → 99.99%) requires exponentially more redundancy and sophistication. You need to match your pattern to your actual business needs, not over-engineer for availability you don’t require. A 99.9% SLA allows 43 minutes of downtime per month; 99.99% allows only 4 minutes. example: Netflix targets 99.99% availability for streaming because downtime directly impacts revenue and subscriber satisfaction. They run active-active across three AWS regions with automated fail-over. In contrast, their internal analytics dashboards target 99% availability—occasional downtime is acceptable because it doesn’t affect customers, and the cost savings are significant.


Deep Dive

Types / Variants

The two fundamental availability patterns—active-passive and active-active—have several variants optimized for different scenarios.

Active-Passive Variants: The simplest form is hot standby, where the backup is running and ready but not serving traffic. Warm standby means the backup is provisioned but not fully running (e.g., database is installed but not started). Cold standby means the backup must be provisioned from scratch, which can take 10-30 minutes. Hot standby is most common for databases: the primary handles all writes, and the standby replicates data continuously. When the primary fails, the standby is promoted to primary. This pattern is simple and avoids consistency issues (only one writer), but wastes 50% of your infrastructure capacity and has 30-second to 5-minute RTO.

Active-Active Variants: The most common form is multi-master replication, where multiple nodes accept writes simultaneously. This requires conflict resolution when the same data is modified in different locations. Another variant is sharded active-active, where each node owns a subset of data (e.g., users A-M on node 1, N-Z on node 2). If one node fails, its shard becomes unavailable, but the system remains partially operational. Geographic active-active runs full replicas in different regions (e.g., US-East, US-West, EU) with global load balancing. Users connect to the nearest region for low latency, and all regions serve traffic simultaneously.
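The routing rule behind sharded active-active is a stable hash of the key. A minimal sketch (node names are hypothetical; production systems typically use consistent hashing so that adding a node doesn't remap most keys):

```python
import hashlib

def shard_for(user_id: str, nodes: list[str]) -> str:
    """Stable hash routing: each node owns a fixed slice of the key space,
    so the same user always lands on the same node."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]  # hypothetical shard owners
owner = shard_for("user-42", nodes)
# If `owner` fails, only users hashing to it are affected; the rest keep working.
print(owner in nodes)  # → True
```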

Hybrid Patterns: Many production systems combine patterns. For example, active-active for reads with active-passive for writes: multiple read replicas serve queries (active-active), but only one primary accepts writes (active-passive). This is common for read-heavy workloads like social media feeds. Another hybrid is active-active with quorum writes: writes must succeed on a majority of nodes before acknowledging to the client, providing both high availability and consistency.
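The quorum rule above reduces to a majority check. A sketch of the accounting (the replica counts are illustrative):

```python
def quorum(replicas: int) -> int:
    """Smallest majority of the replica set."""
    return replicas // 2 + 1

def write_commits(acks: int, replicas: int) -> bool:
    """A quorum write is acknowledged once a majority of replicas confirm it,
    so the write survives the failure of any minority of nodes."""
    return acks >= quorum(replicas)

print(write_commits(2, 3))  # → True  (tolerates one slow or dead replica)
print(write_commits(1, 3))  # → False (no majority; the client sees an error)
```

With quorum reads as well (W + R > N), every read set overlaps every write set, so a quorum read always includes at least one replica holding the latest committed write.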

Geographic Active-Active Architecture

graph TB
    subgraph "Global Load Balancing"
        DNS["Global DNS<br/><i>Routes to nearest region</i>"]
        User_US["User (US)"]
        User_EU["User (EU)"]
        User_ASIA["User (Asia)"]
    end
    
    subgraph "Region: US-East"
        LB_US["Regional LB"]
        App_US1["App Server 1"]
        App_US2["App Server 2"]
        DB_US["(Primary DB)<br/><i>Users A-M</i>"]
        Cache_US["(Redis Cache)"]
    end
    
    subgraph "Region: EU-West"
        LB_EU["Regional LB"]
        App_EU1["App Server 1"]
        App_EU2["App Server 2"]
        DB_EU["(Primary DB)<br/><i>Users N-Z</i>"]
        Cache_EU["(Redis Cache)"]
    end
    
    subgraph "Region: Asia-Pacific"
        LB_ASIA["Regional LB"]
        App_ASIA1["App Server 1"]
        DB_ASIA["(Read Replica)<br/><i>All users</i>"]
        Cache_ASIA["(Redis Cache)"]
    end
    
    User_US --"Routed to US"--> DNS
    User_EU --"Routed to EU"--> DNS
    User_ASIA --"Routed to Asia"--> DNS
    
    DNS --> LB_US
    DNS --> LB_EU
    DNS --> LB_ASIA
    
    LB_US --> App_US1 & App_US2
    LB_EU --> App_EU1 & App_EU2
    LB_ASIA --> App_ASIA1
    
    App_US1 & App_US2 --> DB_US
    App_US1 & App_US2 --> Cache_US
    App_EU1 & App_EU2 --> DB_EU
    App_EU1 & App_EU2 --> Cache_EU
    App_ASIA1 --> DB_ASIA
    App_ASIA1 --> Cache_ASIA
    
    DB_US -."Async replication".-> DB_ASIA
    DB_EU -."Async replication".-> DB_ASIA
    DB_US <-."Cross-region sync<br/>(50-200ms)".-> DB_EU

Geographic active-active distributes traffic across regions for low latency and high availability. Users connect to the nearest region (US, EU, or Asia). Each region serves traffic independently. If one region fails, DNS routes users to the next nearest region. This example uses sharded databases (US owns A-M, EU owns N-Z) to avoid write conflicts, with async replication to Asia for read scaling.

Trade-offs

dimension: Cost option_a: Active-Passive: 50% infrastructure utilization (standby is idle). Lower operational complexity means fewer engineers needed. option_b: Active-Active: 100% infrastructure utilization (all nodes serve traffic). Higher operational complexity requires more sophisticated monitoring and debugging. decision_framework: Choose active-passive when availability requirements are moderate (99%-99.9%) and budget is constrained. Choose active-active when downtime is extremely expensive and you need 99.95%+ availability.

dimension: Recovery Time option_a: Active-Passive: 30 seconds to 5 minutes RTO. Standby needs to warm up, load balancers need to detect failure, DNS may need to update. option_b: Active-Active: Near-zero RTO. Failed node is immediately removed from load balancer pool; remaining nodes already handle traffic. decision_framework: If your SLA allows 99.9% availability (43 minutes downtime/month), active-passive’s 2-minute fail-over is acceptable. For 99.99% (4 minutes/month), you need active-active’s instant fail-over.

dimension: Consistency option_a: Active-Passive: Strong consistency is straightforward. Only one node accepts writes, so no conflicts. Replication lag to standby doesn’t affect reads if all reads go to primary. option_b: Active-Active: Consistency is complex. Multiple nodes accept writes, creating potential conflicts. Requires conflict resolution (last-write-wins, CRDTs, manual resolution) or eventual consistency. decision_framework: Choose active-passive for financial systems, inventory management, or anything requiring strong consistency. Choose active-active for social feeds, analytics, or systems that can tolerate eventual consistency.
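Last-write-wins, the simplest conflict-resolution strategy mentioned above, can be sketched in a few lines (the record shape and timestamps are hypothetical):

```python
def last_write_wins(versions: list[dict]) -> dict:
    """Resolve concurrent writes by keeping the newest timestamp.
    Simple and fast, but the losing write is silently discarded."""
    return max(versions, key=lambda v: v["ts"])

# Hypothetical conflict: the same profile field updated in two regions.
us = {"value": "alice@old.example", "ts": 1_700_000_000}
eu = {"value": "alice@new.example", "ts": 1_700_000_005}
print(last_write_wins([us, eu])["value"])  # → alice@new.example
```

The silent data loss is exactly why active-active is a poor fit for financial or inventory data, where a discarded write is a lost transaction rather than a stale status update.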

dimension: Geographic Distribution option_a: Active-Passive: Primary in one region, standby in another. Cross-region replication adds 50-200ms latency. Users in distant regions experience high latency. option_b: Active-Active: Users connect to nearest region for low latency. Cross-region replication happens asynchronously in background. Better user experience globally. decision_framework: Choose active-passive for regional systems (e.g., US-only). Choose active-active for global systems with users worldwide (e.g., Netflix, Spotify).

Common Pitfalls

pitfall: Assuming Redundancy Equals Availability why_it_happens: Engineers add redundant components but forget to implement automatic fail-over. When the primary fails, the system goes down because nothing switches traffic to the backup. how_to_avoid: Always test fail-over in production-like environments. Use chaos engineering tools like Netflix’s Chaos Monkey to randomly kill components and verify automatic recovery. Measure actual RTO, not theoretical RTO.

pitfall: Ignoring Correlated Failures why_it_happens: Redundant components share dependencies (same data center, same network, same software version). When that dependency fails, all redundant components fail simultaneously. how_to_avoid: Use geographic redundancy across availability zones or regions. Stagger software deployments so all replicas aren’t running the same buggy version. Identify and eliminate shared dependencies like single DNS providers or single cloud regions.

pitfall: Over-Engineering for Availability why_it_happens: Engineers default to active-active multi-region because it sounds impressive, without calculating whether the business needs 99.99% availability or can tolerate 99.9%. how_to_avoid: Start with business requirements. Calculate the cost of downtime (lost revenue, customer churn, SLA penalties). Compare that to the cost of redundancy. See Availability in Numbers for SLA calculations. Often, active-passive in a single region is sufficient and saves millions in infrastructure costs.


Real-World Examples

company: Uber system: Trip Dispatch System usage_detail: Uber’s dispatch system matches riders with drivers in real-time. Downtime means lost trips, frustrated users, and drivers without earnings. They use active-active across multiple availability zones within each region. Each zone runs a complete copy of the dispatch service with its own database replica. When a rider requests a trip, the request goes to the nearest healthy zone. If one zone fails (power outage, network partition), the load balancer automatically routes all traffic to remaining zones within 2 seconds. The system tolerates losing an entire zone without user-visible downtime. However, they use active-passive for the payment system: only one primary database accepts payment writes to ensure strong consistency and prevent double-charging. The standby payment database replicates synchronously and can be promoted to primary in 30 seconds if needed. This hybrid approach balances availability (critical for dispatch) with consistency (critical for payments).

company: Discord system: Message Storage and Delivery usage_detail: Discord handles billions of messages daily across millions of servers. They use active-active Cassandra clusters for message storage, with replicas in multiple data centers. When a user sends a message, it’s written to three replicas simultaneously (quorum write). If one replica is slow or unavailable, the write succeeds as long as two replicas acknowledge it. Reads also use quorum: the system reads from two replicas and returns the most recent version, ensuring users see messages even if one replica is behind. This provides both high availability (system works with one replica down) and consistency (quorum ensures users see the latest data). For real-time message delivery, they use active-active WebSocket gateways across multiple regions. Users connect to the nearest gateway for low latency. If a gateway fails, clients automatically reconnect to another gateway within 5 seconds. The combination of active-active storage and active-active gateways achieves 99.99% availability for message delivery.


Interview Expectations

Mid-Level

Mid-level candidates should explain the difference between active-passive and active-active patterns with concrete examples. When designing a system, they should identify single points of failure and propose redundancy solutions. For example, “We need at least two web servers behind a load balancer so if one crashes, the other handles traffic.” They should understand that fail-over isn’t instant and estimate reasonable RTOs (“Promoting a standby database takes about 30 seconds”). They don’t need to design sophisticated fail-over mechanisms, but they should know that health checks and automatic switching are required. A common gap is forgetting to make the load balancer itself redundant—they solve one single point of failure but create another.

Senior

Senior candidates should choose availability patterns based on business requirements, not default to the most sophisticated option. They should ask clarifying questions: “What’s our availability SLA? 99.9% or 99.99%? What’s the cost of downtime?” Then they should justify their choice: “For 99.9% availability, active-passive with hot standby is sufficient and costs half as much as active-active. We can tolerate 30-second fail-over because that’s only a few minutes of downtime per year, well within our 43-minutes-per-month budget.” They should identify trade-offs explicitly: “Active-active gives us better availability but requires eventual consistency, which means users might briefly see stale data after a network partition. Is that acceptable for this use case?” They should also discuss correlated failures: “We need redundancy across availability zones, not just multiple servers in the same zone, because a zone-wide power outage would take down all our servers simultaneously.”

Staff+

Staff-plus candidates should design hybrid patterns optimized for specific components. For example, “We’ll use active-active for the API layer because it’s stateless and easy to scale horizontally. But we’ll use active-passive for the payment database because financial transactions require strong consistency, and multi-master replication would create conflict resolution complexity that’s not worth the marginal availability gain.” They should quantify trade-offs with calculations: “Active-active across three regions costs $500K/year in additional infrastructure but reduces our expected downtime from 40 minutes to 4 minutes annually. At $10K revenue per minute of downtime, that’s $360K a year in avoided losses, which doesn’t cover the $500K cost on its own, so we’d justify the spend only after adding SLA penalties and churn risk.” They should also discuss operational complexity: “Active-active requires sophisticated monitoring to detect split-brain scenarios where multiple nodes think they’re primary. We need automated fencing to prevent data corruption, which adds 3-6 months to our implementation timeline.” Finally, they should reference real-world examples: “Netflix uses active-active across three AWS regions with automated fail-over, but they also discovered that their DNS provider was a single point of failure. We should use multiple DNS providers with health-check-based fail-over.”

Common Interview Questions

How would you design a highly available system for [specific use case]? (Expect you to choose an availability pattern and justify it based on requirements)

What’s the difference between active-passive and active-active? When would you use each? (Testing whether you understand trade-offs, not just definitions)

If we need 99.99% availability, what does that mean for our architecture? (Expect you to calculate downtime budget and design redundancy accordingly)

How do you handle fail-over in an active-passive setup? (Testing whether you know about health checks, automatic switching, and RTO)

What are the consistency implications of active-active replication? (Testing whether you understand that multi-master creates conflicts)

Red Flags to Avoid

Defaulting to active-active for everything without considering cost or complexity. Shows lack of business judgment.

Not asking about availability requirements before proposing a solution. You can’t choose a pattern without knowing the SLA.

Claiming that redundancy alone provides high availability, without mentioning fail-over or health checks. Redundancy is necessary but not sufficient.

Ignoring consistency trade-offs in active-active patterns. Multi-master replication creates conflicts that must be resolved.

Not identifying single points of failure in your own design. If you propose a load balancer but don’t make it redundant, you’ve just moved the problem.


Key Takeaways

Availability patterns eliminate single points of failure through redundancy, fail-over, and replication. All three mechanisms are required—redundancy alone isn’t enough without automatic fail-over.

Active-passive patterns (one primary, one standby) are simpler and cheaper but have 30-second to 5-minute RTO. Active-active patterns (all nodes serve traffic) have near-zero RTO but require conflict resolution and cost twice as much.

Choose patterns based on business requirements, not technical sophistication. Calculate the cost of downtime versus the cost of redundancy. Often, active-passive is sufficient and saves millions in infrastructure costs.

Test fail-over regularly in production-like environments. Theoretical availability means nothing if fail-over doesn’t work when you need it. Use chaos engineering to verify automatic recovery.

Watch for correlated failures: redundant components that share dependencies (same data center, same network, same software version) can all fail simultaneously. Use geographic redundancy and stagger deployments.

Prerequisites

Availability vs Consistency - Understanding availability definitions and the CAP theorem context

Availability in Numbers - SLA calculations and downtime budgets that drive pattern selection

Next Steps

Fail-Over - Deep dive into detection mechanisms, switching strategies, and RTO optimization

Replication - Synchronous vs asynchronous replication, conflict resolution, and RPO trade-offs

Load Balancing - How load balancers enable active-active patterns and detect failed nodes

CAP Theorem - Theoretical foundation for availability-consistency trade-offs

Consistency Patterns - How consistency requirements influence availability pattern selection