High Availability: Design for 99.99% Uptime

intermediate 29 min read Updated 2026-02-11

TL;DR

High availability (HA) is the practice of designing systems to remain operational and accessible even when components fail. It’s measured in “nines” (99.9%, 99.99%, etc.) and achieved through redundancy, failover mechanisms, and eliminating single points of failure. The goal isn’t perfection—it’s maximizing uptime within acceptable cost and complexity tradeoffs.

Cheat Sheet: 99.9% = 8.76 hours downtime/year | 99.99% = 52.6 minutes/year | Key patterns: redundancy, health checks, graceful degradation, circuit breakers | Cost increases exponentially with each additional nine.

The Analogy

Think of high availability like a hospital’s emergency power system. When the main grid fails, backup generators kick in automatically so life-support machines never stop. The hospital doesn’t need 100% perfect power—that’s impossible—but it needs enough redundancy that patients never notice an outage. Similarly, HA systems use redundant components and automatic failover so users experience continuous service even when individual servers crash. Just as hospitals balance generator costs against criticality (ICU gets triple redundancy, storage closets don’t), you balance HA investment against business impact.

Why This Matters in Interviews

High availability comes up in almost every system design interview, especially for user-facing services. Interviewers want to see that you understand the difference between availability and reliability, can calculate SLA requirements, and know how to eliminate single points of failure. They’re testing whether you can make pragmatic tradeoffs—recognizing that 99.999% availability costs exponentially more than 99.9% and isn’t always worth it. Strong candidates connect HA to business requirements (“For a payment system, even 10 seconds of downtime loses revenue”) and demonstrate knowledge of real-world patterns like active-active deployments, health checks, and graceful degradation.


Core Concept

High availability is a system design characteristic that ensures services remain operational and accessible to users despite hardware failures, software bugs, network issues, or planned maintenance. Unlike reliability (which focuses on correctness) or fault tolerance (which aims for zero downtime), HA accepts that failures will occur but minimizes their impact through redundancy and rapid recovery.

The industry measures availability as uptime percentage over a time period, typically expressed in “nines.” A system with 99.9% availability (“three nines”) can be down for 8.76 hours per year, while 99.99% (“four nines”) allows only 52.6 minutes of downtime annually. Each additional nine roughly multiplies costs by 10x due to increased redundancy, monitoring complexity, and operational overhead. This exponential cost curve forces teams to align availability targets with actual business needs rather than pursuing perfection.

HA differs fundamentally from disaster recovery, which focuses on recovering from catastrophic failures over hours or days. High availability operates at the seconds-to-minutes scale, using automated failover to maintain service continuity. It’s also distinct from scalability—a highly available system might handle only modest traffic, while a scalable system might have single points of failure. The best production systems combine both properties, but they solve different problems and require different architectural patterns.

High Availability vs. Related Concepts

graph TB
    subgraph Concepts
        HA["High Availability<br/><i>Minimize downtime<br/>Fast recovery (seconds)</i>"]
        FT["Fault Tolerance<br/><i>Zero downtime<br/>Continuous operation</i>"]
        REL["Reliability<br/><i>Correctness<br/>Consistent results</i>"]
        DR["Disaster Recovery<br/><i>Catastrophic failures<br/>Recovery (hours/days)</i>"]
    end
    
    subgraph Characteristics
        HA --> C1["Accepts brief interruptions"]
        HA --> C2["Redundancy + Failover"]
        HA --> C3["99.9% - 99.99% typical"]
        
        FT --> C4["No interruptions"]
        FT --> C5["Hot standbys"]
        FT --> C6["99.999%+ target"]
        
        REL --> C7["Correct outputs"]
        REL --> C8["Data integrity"]
        
        DR --> C9["Regional failures"]
        DR --> C10["Backup restoration"]
    end

High availability focuses on minimizing downtime through fast recovery, distinct from fault tolerance (zero downtime), reliability (correctness), and disaster recovery (long-term restoration). Understanding these differences helps set appropriate availability targets.

How It Works

Step 1: Identify Single Points of Failure (SPOFs) Start by mapping every component in your system—load balancers, application servers, databases, caches, message queues—and identify which ones, if they failed, would take down the entire service. A single database instance is a classic SPOF. A load balancer without a backup is another. Even network switches and power supplies count. The goal is to eliminate or mitigate every SPOF through redundancy.

Step 2: Implement Redundancy at Each Layer For each identified SPOF, add redundant instances across failure domains. Run multiple application servers behind a load balancer. Deploy database replicas across availability zones. Use redundant load balancers with failover IPs. The key principle: failures should be isolated to the smallest possible blast radius. If one availability zone loses power, your other zones continue serving traffic.

Step 3: Add Health Checks and Monitoring Redundancy is useless if you can’t detect failures. Implement active health checks that continuously probe each component (HTTP endpoints, database connections, cache responsiveness). These checks should test actual functionality, not just “is the process running?” A database that accepts connections but can’t execute queries is effectively down. Health check failures trigger automatic remediation—removing bad instances from load balancer pools, promoting replicas, or spinning up replacements.
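A deep health check can be sketched as a handler that exercises real dependencies and reports per-component status. The `check_db` and `check_cache` probes below are hypothetical stand-ins for a trivial query and a cache round-trip; a load balancer treating 503 as "remove from pool" is the assumed convention.

```python
import time

# Hypothetical probes: in a real service each would exercise the actual
# dependency (e.g. run SELECT 1, round-trip a sentinel cache key).
def check_db() -> bool:
    return True

def check_cache() -> bool:
    return True

def health() -> tuple[int, dict]:
    """Deep health check: verify real functionality, not just liveness."""
    start = time.monotonic()
    checks = {"db": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503  # 503 => pull from rotation
    body = {**checks, "latency_ms": (time.monotonic() - start) * 1000}
    return status, body
```

Reporting per-check results (not just a status code) makes gray failures visible in monitoring.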

Step 4: Configure Automatic Failover When health checks detect failures, the system must automatically route traffic away from failed components without human intervention. Load balancers stop sending requests to unhealthy servers. Database clients automatically connect to replica instances. Message queue consumers rebalance across available brokers. The failover time—how long between failure detection and recovery—directly impacts your availability SLA. Sub-second failover requires aggressive health check intervals and fast decision-making.

Step 5: Implement Graceful Degradation Not all failures require complete service outage. Design systems to degrade gracefully when non-critical components fail. If your recommendation engine crashes, show popular items instead of erroring. If the cache is down, query the database directly (slower but functional). This “partial availability” often provides better user experience than hard failures and buys time for recovery.
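The fallback idea reduces to a try/except around the non-critical call. In this sketch the `personalized_recs` outage is simulated and the item names are invented:

```python
POPULAR_ITEMS = ["item-101", "item-202", "item-303"]  # precomputed fallback

def personalized_recs(user_id: str) -> list[str]:
    # Stand-in for a call to the recommendation engine; simulated as down.
    raise ConnectionError("recommendation engine unreachable")

def recommendations(user_id: str) -> list[str]:
    """Serve personalized results, degrading to popular items on failure."""
    try:
        return personalized_recs(user_id)
    except ConnectionError:
        return POPULAR_ITEMS  # degraded but functional, no 500 to the user
```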

Step 6: Test Failure Scenarios Continuously High availability only works if your failover mechanisms actually function under pressure. Practice chaos engineering—randomly kill servers, partition networks, fill disks—in production (carefully) or staging environments. Netflix’s Chaos Monkey pioneered this approach. Regular failure testing exposes hidden dependencies, misconfigured health checks, and race conditions that only appear during actual outages.

Six-Step High Availability Implementation Flow

graph TB
    Start(["Start: System Design"]) --> Step1
    
    Step1["Step 1: Identify SPOFs<br/><i>Map all components<br/>Find single points of failure</i>"] --> Step2
    
    Step2["Step 2: Add Redundancy<br/><i>Multiple instances<br/>Across failure domains</i>"] --> Step3
    
    Step3["Step 3: Health Checks<br/><i>Active monitoring<br/>Test actual functionality</i>"] --> Step4
    
    Step4["Step 4: Automatic Failover<br/><i>Route away from failures<br/>No manual intervention</i>"] --> Step5
    
    Step5["Step 5: Graceful Degradation<br/><i>Partial functionality<br/>Fallback strategies</i>"] --> Step6
    
    Step6["Step 6: Test Continuously<br/><i>Chaos engineering<br/>Validate failover works</i>"] --> End
    
    End(["High Availability Achieved"])
    
    Step1 -."Example: Single DB".-> E1["❌ SPOF Found"]
    Step2 -."Solution".-> E2["✓ DB + 2 Replicas"]
    Step3 -."Check".-> E3["✓ Query execution test"]
    Step4 -."Action".-> E4["✓ Auto-promote replica"]
    Step5 -."Fallback".-> E5["✓ Cache → DB direct"]
    Step6 -."Validate".-> E6["✓ Kill instances randomly"]

Implementing high availability follows a systematic six-step process from identifying single points of failure to continuous testing. Each step builds on the previous, with concrete examples showing how to eliminate SPOFs and validate recovery mechanisms.

Key Principles

Principle 1: Eliminate Single Points of Failure Every component that can fail will fail eventually. The fundamental HA principle is ensuring no single component failure takes down the entire system. This means redundancy at every layer—multiple servers, multiple databases, multiple availability zones, even multiple cloud providers for critical systems. Netflix runs across three AWS regions simultaneously, so even a complete regional outage doesn’t affect streaming. The tradeoff is complexity: more components mean more monitoring, more configuration, and more things that can go wrong in subtle ways.

Principle 2: Fail Fast and Recover Quickly When failures occur, detect them immediately and recover automatically. Slow failure detection (“is the server really down or just slow?”) extends outages unnecessarily. Set aggressive timeouts—if a database query takes more than 5 seconds, treat it as failed and try a replica. Use circuit breakers to stop sending traffic to failing components before they cascade failures to healthy ones. Stripe’s payment API fails requests to unhealthy database shards within 100ms and routes to healthy shards, maintaining 99.99% availability despite frequent individual shard issues.

Principle 3: Design for Failure Domains Failures rarely affect just one server—they often take down entire racks, availability zones, or regions. Design systems assuming correlated failures. Don’t put all your database replicas in the same datacenter. Don’t run all instances on the same hypervisor. AWS availability zones are physically separate datacenters with independent power and networking, so zone failures are uncorrelated. Uber distributes services across multiple zones and uses zone-aware load balancing to keep traffic local when possible but fail over across zones when necessary.

Principle 4: Avoid Synchronous Dependencies Every synchronous call to another service is a potential availability bottleneck. If Service A calls Service B synchronously, A’s availability can never exceed B’s availability (A_availability ≤ B_availability). For a system with three synchronous dependencies each at 99.9% availability, the overall availability is 0.999³ = 99.7%. Use asynchronous patterns (message queues, event streams) where possible, and implement timeouts and circuit breakers for necessary synchronous calls. Amazon’s checkout flow uses asynchronous order processing—placing an order succeeds immediately even if downstream fulfillment systems are temporarily unavailable.

Principle 5: Measure and Improve Continuously You can’t improve what you don’t measure. Track actual availability (uptime percentage), mean time between failures (MTBF), and mean time to recovery (MTTR). Set SLAs with error budgets—if you promise 99.9% availability, you have 43 minutes of downtime per month to “spend” on deployments, incidents, and experiments. Google’s SRE teams use error budgets to balance feature velocity against reliability: when the error budget is exhausted, all engineering effort shifts to reliability improvements until availability recovers.


Deep Dive

Types / Variants

Active-Active (Multi-Master) All redundant instances actively serve traffic simultaneously, with load distributed across them. If one instance fails, the others absorb its traffic with minimal impact. This provides the highest availability and best resource utilization since no capacity sits idle. However, it requires careful coordination to avoid conflicts—multiple instances writing to the same data need conflict resolution strategies. Cassandra uses active-active replication across datacenters, allowing writes to any node with eventual consistency. The tradeoff is complexity: you need distributed consensus, conflict resolution, and careful capacity planning to handle sudden traffic shifts during failures.

Active-Passive (Master-Standby) One instance (the active/master) handles all traffic while redundant instances (passive/standby) remain idle, ready to take over if the active fails. This is simpler to reason about—only one instance is authoritative—and works well for stateful systems like databases where coordination is expensive. Traditional MySQL replication uses active-passive: writes go to the master, reads can use replicas, and if the master fails, a replica is promoted. The downside is wasted capacity (standby instances sit idle) and failover time (promoting a standby takes seconds to minutes). It’s best for systems where consistency matters more than maximum availability.

N+1 Redundancy Run N instances to handle your load, plus 1 extra for redundancy. If any single instance fails, the remaining N instances can still handle full capacity. This is the minimum viable HA configuration and works well for stateless services. If you need 10 application servers for peak traffic, run 11. The limitation is that it only tolerates single failures—if two instances fail simultaneously, you’re under capacity. For higher availability, use N+2 or N+M redundancy. AWS Auto Scaling Groups commonly use N+1: if you need 5 instances, configure desired capacity to 6 so any single instance failure doesn’t impact performance.

Geographic Redundancy (Multi-Region) Deploy complete system replicas across geographically distributed regions (different cities or continents). This protects against datacenter fires, natural disasters, or regional network outages. Traffic is routed to the nearest healthy region using DNS-based or anycast routing. Netflix runs identical infrastructure in three AWS regions (us-east-1, us-west-2, eu-west-1), with each region capable of handling global traffic. The challenge is data consistency across regions—synchronous replication adds latency, while asynchronous replication risks data loss during regional failures. Use this for business-critical systems where even datacenter-level failures are unacceptable.

Availability Zones (Multi-AZ) A lighter-weight version of geographic redundancy, availability zones are isolated datacenters within the same metropolitan area. They provide independent power, cooling, and networking but low-latency connectivity (typically <2ms). This balances failure isolation with operational simplicity—you can run synchronous replication across zones without excessive latency. AWS RDS Multi-AZ deployments synchronously replicate to a standby in a different zone, providing automatic failover in 60-120 seconds. Use this as the default HA pattern for most production systems; it’s the sweet spot between cost and reliability.

High Availability Architecture Patterns Comparison

graph TB
    subgraph Active-Active
        AA_LB["Load Balancer"]
        AA_S1["Server 1<br/><i>Active</i>"]
        AA_S2["Server 2<br/><i>Active</i>"]
        AA_S3["Server 3<br/><i>Active</i>"]
        AA_DB[("Distributed DB<br/><i>Multi-master</i>")]
        
        AA_LB --"33% traffic"--> AA_S1
        AA_LB --"33% traffic"--> AA_S2
        AA_LB --"34% traffic"--> AA_S3
        AA_S1 & AA_S2 & AA_S3 --"Read/Write"--> AA_DB
    end
    
    subgraph Active-Passive
        AP_LB["Load Balancer"]
        AP_S1["Server 1<br/><i>Active</i>"]
        AP_S2["Server 2<br/><i>Standby</i>"]
        AP_DB_M[("Primary DB")]
        AP_DB_S[("Standby DB<br/><i>Idle</i>")]
        
        AP_LB --"100% traffic"--> AP_S1
        AP_LB -."Failover only".-> AP_S2
        AP_S1 --"Read/Write"--> AP_DB_M
        AP_DB_M --"Sync replication"--> AP_DB_S
    end
    
    subgraph Multi-Region
        MR_DNS["Global DNS<br/><i>GeoDNS routing</i>"]
        
        subgraph Region US-East
            MR_US["Complete Stack<br/><i>App + DB</i>"]
        end
        
        subgraph Region EU-West
            MR_EU["Complete Stack<br/><i>App + DB</i>"]
        end
        
        subgraph Region AP-South
            MR_AP["Complete Stack<br/><i>App + DB</i>"]
        end
        
        MR_DNS --"Route to nearest"--> MR_US
        MR_DNS --"Route to nearest"--> MR_EU
        MR_DNS --"Route to nearest"--> MR_AP
        MR_US <-."Async replication".-> MR_EU
        MR_EU <-."Async replication".-> MR_AP
    end
    
    AA_Note["✓ Best resource utilization<br/>✓ Highest availability<br/>❌ Complex coordination"] -.-> Active-Active
    AP_Note["✓ Simple to reason about<br/>✓ Strong consistency<br/>❌ Wasted capacity"] -.-> Active-Passive
    MR_Note["✓ Survives region failures<br/>✓ Low latency globally<br/>❌ Data consistency challenges"] -.-> Multi-Region

Three primary HA patterns serve different needs: Active-Active maximizes resource utilization with all instances serving traffic; Active-Passive provides simpler consistency with idle standbys; Multi-Region protects against datacenter failures but introduces data consistency challenges.

Trade-offs

Availability vs. Consistency The CAP theorem forces a fundamental tradeoff: during network partitions, you must choose between availability (accepting writes that might conflict) or consistency (rejecting writes to prevent conflicts). High availability systems often choose eventual consistency—accepting writes to any replica and reconciling conflicts later. DynamoDB uses this approach, allowing writes even when replicas can’t communicate, then using last-write-wins or custom conflict resolution. The alternative is strong consistency, where systems reject writes during partitions to prevent conflicts. This reduces availability but ensures all clients see the same data. Choose based on business requirements: shopping carts can tolerate eventual consistency (worst case: a duplicate item), but financial transactions need strong consistency.

Availability vs. Latency Higher availability often requires more network hops, health checks, and coordination, all of which add latency. Synchronous replication to three availability zones adds 2-5ms to every write. Cross-region replication adds 50-200ms. Aggressive health checks (every 100ms) consume network bandwidth and CPU. You must balance availability targets against latency requirements. For user-facing APIs, 99.9% availability with 50ms p99 latency might be better than 99.99% availability with 200ms p99 latency. Measure the business impact of both outages and slowness, then optimize for the metric that matters most.

Availability vs. Cost Each additional nine of availability roughly multiplies infrastructure costs by 10x. Going from 99.9% to 99.99% requires redundant components across multiple availability zones, more sophisticated monitoring, and larger on-call teams. Reaching 99.999% demands multi-region deployments, chaos engineering programs, and dedicated reliability engineers. For a startup’s internal admin tool, 99% availability (3.65 days downtime per year) might be perfectly acceptable. For Stripe’s payment API, 99.99% is table stakes. Calculate the cost of downtime (lost revenue, customer churn, SLA penalties) and invest in availability until the marginal cost exceeds the marginal benefit.

Automation vs. Control Automated failover provides fast recovery but can cause cascading failures if misconfigured. A bug in your health check logic might mark all instances as unhealthy, causing a complete outage. Aggressive auto-scaling during a traffic spike might overwhelm your database. Manual failover gives operators control but adds minutes to recovery time and requires 24/7 on-call staffing. The industry trend is toward automation with safety rails: automated failover for common scenarios (single instance failures) but manual approval for rare events (entire availability zone failures). Google’s SRE teams use automated playbooks for 90% of incidents, escalating only unusual situations to humans.

Simplicity vs. Resilience More redundancy means more components, more configuration, and more failure modes. A simple architecture with a single database is easy to understand and operate but has poor availability. A complex architecture with multi-region replication, automatic failover, and circuit breakers is highly available but harder to debug when things go wrong. The sweet spot depends on team maturity: a 5-person startup should prefer simple architectures with manual failover, while a 500-person company can invest in sophisticated automation. Start simple and add complexity only when downtime costs justify the operational overhead.

Common Pitfalls

Pitfall 1: Ignoring Correlated Failures Many teams achieve redundancy by running multiple instances but fail to consider correlated failures. If all your servers run the same buggy code, deploying to all instances simultaneously takes down the entire service. If all instances share the same database, a database failure affects everything. If all instances run in the same availability zone, a zone outage is a complete outage. This happens because teams test individual component failures (“what if one server crashes?”) but not systemic failures (“what if the deployment is bad?” or “what if the zone loses power?”). Avoid this by deploying in waves (canary 5%, then 50%, then 100%), using multiple availability zones, and practicing failure scenarios that affect multiple components simultaneously.

Pitfall 2: Insufficient Health Check Coverage A common mistake is implementing superficial health checks that don’t test actual functionality. A health check that returns 200 OK if the process is running doesn’t detect database connection pool exhaustion, memory leaks, or disk full conditions. The service appears healthy but can’t process requests. This causes “gray failures” where instances stay in the load balancer pool but fail most requests. Implement deep health checks that exercise critical paths: query the database, read from the cache, call downstream dependencies. But balance depth against overhead—health checks that are too expensive can cause cascading failures when traffic spikes. Netflix’s health checks test actual video streaming capabilities, not just “is the server responding?”.

Pitfall 3: Underprovisioning for Failover Teams often run just enough capacity for normal traffic, forgetting that during failures the remaining instances must handle the full load. If you run 10 instances at 80% CPU utilization and one fails, the remaining 9 run at about 89% CPU—still okay. But if two fail, the remaining 8 run at 100% CPU and start timing out, causing a cascading failure as health checks fail and more instances are removed from the pool. Provision N+1 or N+2 capacity: if you need 10 instances, run 12 so that losing two still leaves adequate headroom. AWS Auto Scaling helps but reacts slowly (minutes to spin up new instances), so you need buffer capacity for immediate failures.
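The arithmetic behind that example is worth having at hand (function name is illustrative):

```python
def util_after_failures(instances: int, utilization: float, failed: int) -> float:
    """Per-instance utilization after `failed` instances drop out,
    assuming the load spreads evenly over the survivors."""
    if failed >= instances:
        raise ValueError("no surviving instances")
    return utilization * instances / (instances - failed)

# 10 instances at 80% CPU:
one_down = util_after_failures(10, 0.80, 1)  # ~0.89 -- tight but workable
two_down = util_after_failures(10, 0.80, 2)  # 1.00 -- saturated, timeouts begin
```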

Pitfall 4: Forgetting About Stateful Components Stateless application servers are easy to make highly available—just run more of them. But stateful components like databases, caches, and message queues require careful coordination. A common mistake is treating database replicas as simple backups rather than active participants in HA. If your application always connects to the master and only uses replicas after manual failover, you haven’t achieved high availability—you’ve just made disaster recovery slightly faster. Implement automatic failover with tools like AWS RDS Multi-AZ, Patroni for PostgreSQL, or MySQL Group Replication. Test failover regularly; many teams discover their “automatic” failover doesn’t work during actual outages.

Pitfall 5: No Graceful Degradation Strategy When dependencies fail, many systems choose between full functionality or complete failure. A better approach is graceful degradation: continue serving core functionality even when non-critical components are down. If your recommendation engine fails, show popular items instead of returning 500 errors. If your cache is down, query the database directly (slower but functional). If payment processing is delayed, accept orders and process them asynchronously. Twitter’s “fail whale” was replaced by a degraded mode that shows cached timelines when the write path is overloaded. Design each feature with a fallback: what’s the minimum viable experience if this dependency is unavailable?


Math & Calculations

Availability Calculation

Availability is measured as the percentage of time a system is operational:

Availability = (Total Time - Downtime) / Total Time × 100%

For example, if a system is down for 8.76 hours in a year:

Availability = (8760 hours - 8.76 hours) / 8760 hours × 100% = 99.9%

Nines Table (Annual Downtime)

  • 90% (“one nine”): 36.5 days
  • 99% (“two nines”): 3.65 days
  • 99.9% (“three nines”): 8.76 hours
  • 99.99% (“four nines”): 52.6 minutes
  • 99.999% (“five nines”): 5.26 minutes
  • 99.9999% (“six nines”): 31.5 seconds
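The table can be reproduced with a few lines of arithmetic (using a 365-day year):

```python
def annual_downtime(availability: float) -> str:
    """Convert an availability fraction into human-readable annual downtime."""
    seconds = (1 - availability) * 365 * 24 * 3600
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.1f} seconds"

for target in [0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999]:
    print(f"{target:.4%}: {annual_downtime(target)}")
```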

Serial System Availability

When components are in series (one depends on the other), multiply their availabilities:

System_Availability = Component1_Availability × Component2_Availability × ... × ComponentN_Availability

Example: A web application with three synchronous dependencies:

  • Load balancer: 99.99% (four nines)
  • Application servers: 99.9% (three nines)
  • Database: 99.95%
System_Availability = 0.9999 × 0.999 × 0.9995 = 0.9984 = 99.84%

That works out to roughly 14 hours of downtime per year—worse than any individual component! This demonstrates why minimizing synchronous dependencies is critical.
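Applying the serial formula to the example above (function name is illustrative):

```python
import math

def serial_availability(*components: float) -> float:
    """A chain of synchronous dependencies is up only if every link is up."""
    return math.prod(components)

chain = serial_availability(0.9999, 0.999, 0.9995)  # LB x app servers x DB
# chain is about 0.9984, i.e. roughly 14 hours of downtime per year
```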

Parallel System Availability

When components are in parallel (redundant), calculate the probability that ALL fail:

System_Availability = 1 - (1 - Component1_Availability) × (1 - Component2_Availability) × ...

Example: Two redundant application servers, each with 99.9% availability:

System_Availability = 1 - (1 - 0.999) × (1 - 0.999)
                    = 1 - (0.001 × 0.001)
                    = 1 - 0.000001
                    = 0.999999 = 99.9999% (six nines)

Adding just one redundant component took availability from three nines to six nines! This is why redundancy is so powerful.
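And the parallel formula in code (function name is illustrative):

```python
def parallel_availability(*components: float) -> float:
    """Redundant components: the system is down only if ALL of them fail."""
    all_fail = 1.0
    for availability in components:
        all_fail *= 1.0 - availability
    return 1.0 - all_fail

pair = parallel_availability(0.999, 0.999)          # 0.999999 -- six nines
trio = parallel_availability(0.999, 0.999, 0.999)   # even higher
```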

MTBF and MTTR

Availability can also be expressed using Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR):

Availability = MTBF / (MTBF + MTTR)

Example: A system fails once per month (MTBF = 720 hours) and takes 1 hour to repair (MTTR = 1 hour):

Availability = 720 / (720 + 1) = 0.9986 = 99.86%

This formula shows two paths to higher availability:

  1. Increase MTBF (make failures less frequent) through better code quality, testing, and monitoring
  2. Decrease MTTR (recover faster) through automation, redundancy, and runbooks

In practice, decreasing MTTR through automated failover is often easier than preventing all failures.
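The formula makes the MTTR lever easy to see: with the same failure rate, cutting repair time from an hour to six minutes (e.g. via automated failover) adds roughly a nine.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

manual_failover = availability_from_mtbf(720, 1.0)  # about 99.86%
auto_failover = availability_from_mtbf(720, 0.1)    # about 99.986%
```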

Error Budget Calculation

If you promise 99.9% availability, you have an error budget—allowable downtime:

Error_Budget = (1 - Availability_Target) × Time_Period

For 99.9% availability over 30 days:

Error_Budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes per month

You can “spend” this budget on deployments, experiments, or incidents. If you use it up, freeze feature development and focus on reliability.
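The budget arithmetic in code (function name is illustrative):

```python
def error_budget_minutes(target: float, days: int = 30) -> float:
    """Allowable downtime, in minutes, for an availability target."""
    return (1 - target) * days * 24 * 60

monthly_budget = error_budget_minutes(0.999)            # 43.2 min per 30 days
yearly_budget = error_budget_minutes(0.9999, days=365)  # ~52.6 min per year
```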

Serial vs. Parallel Availability Calculation

graph TB
    subgraph Serial["Serial System - Dependencies Multiply"]
        S_Start(["User Request"])
        S_LB["Load Balancer<br/><i>99.99% available</i>"]
        S_App["App Server<br/><i>99.9% available</i>"]
        S_DB[("Database<br/><i>99.95% available</i>")]
        S_End(["Response"])
        
        S_Start --> S_LB
        S_LB --"Must succeed"--> S_App
        S_App --"Must succeed"--> S_DB
        S_DB --> S_End
        
        S_Calc["System Availability =<br/>0.9999 × 0.999 × 0.9995<br/>= 0.9984 = 99.84%<br/><br/>❌ Worse than any component!<br/>≈ 14 hours downtime/year"]
    end
    
    subgraph Parallel["Parallel System - Redundancy Improves"]
        P_Start(["User Request"])
        P_LB["Load Balancer"]
        P_App1["App Server 1<br/><i>99.9% available</i>"]
        P_App2["App Server 2<br/><i>99.9% available</i>"]
        P_End(["Response"])
        
        P_Start --> P_LB
        P_LB --"Route to either"--> P_App1
        P_LB --"Route to either"--> P_App2
        P_App1 --> P_End
        P_App2 --> P_End
        
        P_Calc["System Availability =<br/>1 - (1-0.999) × (1-0.999)<br/>= 1 - (0.001 × 0.001)<br/>= 1 - 0.000001<br/>= 0.999999 = 99.9999%<br/><br/>✓ Six nines from two three-nine components!<br/>Only 31.5 seconds downtime/year"]
    end
    
    S_Calc -.-> Serial
    P_Calc -.-> Parallel
    
    Key["Key Insight:<br/>Serial: Multiply availabilities (gets worse)<br/>Parallel: 1 - (multiply failures) (gets better)<br/><br/>Minimize serial dependencies!<br/>Maximize parallel redundancy!"]

Serial dependencies multiply availabilities (making systems less available), while parallel redundancy exponentially improves availability. Three 99.9% components in series achieve only 99.7%, but two 99.9% components in parallel achieve 99.9999%—demonstrating why redundancy is so powerful.


Real-World Examples

Netflix: Multi-Region Active-Active Architecture

Netflix streams to 200+ million subscribers with 99.99% availability despite operating entirely on AWS (no on-premise datacenters). They achieve this through aggressive redundancy and chaos engineering. Every service runs in at least three AWS availability zones within a region, with automatic failover if any zone becomes unhealthy. More critically, they run complete replicas across three AWS regions (US East, US West, Europe), with each region capable of handling 100% of global traffic.

Their most interesting HA innovation is Chaos Monkey and the Simian Army—tools that randomly kill production instances, entire services, and even simulate region failures. By continuously testing failure scenarios, they ensure their failover mechanisms actually work under pressure. When AWS US-East-1 had a major outage in 2021, Netflix users experienced no disruption because traffic automatically shifted to other regions. The lesson: high availability requires not just redundancy but continuous validation that redundancy works.

Stripe: Zone-Aware Database Sharding

Stripe processes billions of dollars in payments with 99.99% availability, where even seconds of downtime means lost revenue for merchants. Their database architecture uses zone-aware sharding: each shard (a subset of customer data) has a primary in one availability zone and synchronous replicas in two other zones. When a zone fails, replicas in other zones are automatically promoted within seconds.

The clever part is their “zone-aware routing”: application servers preferentially connect to database shards in the same availability zone to minimize latency, but automatically fail over to other zones when needed. This gives them single-digit millisecond latencies during normal operation and sub-second failover during zone failures. They also implement “read-your-writes” consistency: after writing to a primary, subsequent reads from replicas are guaranteed to see that write, even during failover. This demonstrates that high availability doesn’t require sacrificing consistency—careful engineering can provide both.

Amazon: Shuffle Sharding for Blast Radius Reduction

Amazon’s Route 53 DNS service offers a 100% availability SLA (its data plane has famously never had a complete outage), thanks in large part to a technique called shuffle sharding. Traditional sharding assigns each customer to specific servers, so a server failure affects every customer on that server. Shuffle sharding assigns each customer to a random subset of servers from a large pool. If you have 100 servers and assign each customer to 4 random servers, a single server failure touches only the roughly 4% of customers whose shard contains it, and those customers still have 3 healthy servers.

The math is powerful: with 100 servers and 4-server shards, there are 3.9 million possible combinations. The probability that two customers share the exact same 4 servers is negligible. This means failures are isolated to tiny fractions of customers, and cascading failures are nearly impossible. Route 53 uses this pattern with thousands of servers globally, making complete outages mathematically improbable. The lesson: sometimes the best HA strategy isn’t just redundancy but clever assignment algorithms that minimize blast radius.
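The numbers above fall out of basic combinatorics, which you can verify directly (the pool and shard sizes are the ones from the example, not Route 53's real values):

```python
from math import comb

POOL, SHARD = 100, 4

# Number of distinct 4-server shards drawable from a 100-server pool.
print(comb(POOL, SHARD))       # 3921225 (~3.9 million combinations)

# Expected fraction of customers affected by one server failure: each
# server appears in SHARD/POOL of randomly assigned shards on average.
print(SHARD / POOL)            # 0.04 -> 4% of customers lose one of four servers

# Probability that two customers land on the exact same shard.
print(1 / comb(POOL, SHARD))   # ~2.6e-07, effectively negligible
```

Scaling the pool into the thousands, as Route 53 reportedly does, makes full-overlap shards and correlated blast radii vanishingly rare.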

Netflix Multi-Region Active-Active Architecture

```mermaid
graph TB
    User["User<br/><i>Global audience</i>"]

    subgraph GlobalLayer["Global Layer"]
        DNS["Route 53 DNS<br/><i>GeoDNS routing</i>"]
        CDN["CloudFront CDN<br/><i>Edge caching</i>"]
    end

    subgraph USEast["Region: US-East-1"]
        subgraph AZ1a["AZ-1a"]
            USE_LB1["ELB"]
            USE_App1["Streaming Service"]
            USE_Chaos1["Chaos Monkey<br/><i>Random termination</i>"]
        end
        subgraph AZ1b["AZ-1b"]
            USE_App2["Streaming Service"]
        end
        subgraph AZ1c["AZ-1c"]
            USE_App3["Streaming Service"]
        end
        USE_Cassandra[("Cassandra<br/><i>Multi-master</i>")]
    end

    subgraph USWest["Region: US-West-2"]
        subgraph AZ2a["AZ-2a"]
            USW_LB["ELB"]
            USW_App1["Streaming Service"]
        end
        subgraph AZ2b["AZ-2b"]
            USW_App2["Streaming Service"]
        end
        USW_Cassandra[("Cassandra<br/><i>Multi-master</i>")]
    end

    subgraph EUWest["Region: EU-West-1"]
        subgraph AZ3a["AZ-3a"]
            EU_LB["ELB"]
            EU_App1["Streaming Service"]
        end
        subgraph AZ3b["AZ-3b"]
            EU_App2["Streaming Service"]
        end
        EU_Cassandra[("Cassandra<br/><i>Multi-master</i>")]
    end

    User --> DNS
    DNS --"Route to nearest healthy region"--> CDN
    CDN --> USE_LB1
    CDN --> USW_LB
    CDN --> EU_LB

    USE_LB1 --> USE_App1 & USE_App2 & USE_App3
    USW_LB --> USW_App1 & USW_App2
    EU_LB --> EU_App1 & EU_App2

    USE_App1 & USE_App2 & USE_App3 --> USE_Cassandra
    USW_App1 & USW_App2 --> USW_Cassandra
    EU_App1 & EU_App2 --> EU_Cassandra

    USE_Cassandra <-."Async replication".-> USW_Cassandra
    USW_Cassandra <-."Async replication".-> EU_Cassandra
    EU_Cassandra <-."Async replication".-> USE_Cassandra
```

Netflix’s multi-region active-active architecture runs complete service stacks in three AWS regions. Cassandra provides multi-master replication across regions with asynchronous replication, allowing any region to handle reads and writes. Chaos Monkey continuously terminates instances to validate that failover works under real conditions.


Interview Expectations

Mid-Level

What You Should Know:

At the mid-level, you should understand that high availability means minimizing downtime through redundancy and automatic failover. You should be able to explain the “nines” (99.9%, 99.99%, etc.) and calculate annual downtime for each level. You should know the basic patterns: running multiple instances behind a load balancer, using database replicas, and implementing health checks. You should understand that availability and reliability are different—availability is about uptime, reliability is about correctness.
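The downtime budget at each nines level is simple arithmetic you should be able to do on the spot; a quick sketch (using a 365-day year, matching the figures quoted in this article):

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def annual_downtime_minutes(availability):
    """Minutes of allowed downtime per year at a given availability."""
    return (1 - availability) * MIN_PER_YEAR

for a in (0.999, 0.9999, 0.99999):
    print(f"{a}: {annual_downtime_minutes(a):.2f} min/year")
# 0.999   -> 525.60 min (the 8.76 hours cited above)
# 0.9999  -> 52.56 min
# 0.99999 -> 5.26 min
```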

You should be able to identify obvious single points of failure in a system design and propose solutions. For example, if asked to design a URL shortener, you should recognize that a single database instance is a SPOF and suggest adding read replicas or using a managed database service with automatic failover. You should know that stateless services are easier to make highly available than stateful ones.

Bonus Points:

  • Mentioning specific AWS services like RDS Multi-AZ, ELB health checks, or Auto Scaling Groups
  • Understanding the difference between active-active and active-passive redundancy
  • Recognizing that health checks need to test actual functionality, not just “is the process running?”
  • Knowing that availability decreases multiplicatively when services depend on each other (serial system availability)
  • Discussing graceful degradation as an alternative to complete failure

Senior

What You Should Know:

Senior engineers should deeply understand the tradeoffs between availability, consistency, latency, and cost. You should be able to calculate system availability from component availabilities (both serial and parallel) and explain why minimizing synchronous dependencies is critical. You should know multiple HA patterns (active-active, active-passive, N+1, multi-region) and when to use each based on consistency requirements, budget, and latency constraints.

You should understand failure domains and design for uncorrelated failures. This means distributing components across availability zones, using different hardware vendors, and avoiding shared dependencies. You should know how to implement proper health checks that test critical paths without causing overhead. You should be familiar with circuit breakers, bulkheads, and other resilience patterns from the Netflix stack.

You should be able to design database HA strategies including replication topologies (primary-replica, multi-master), quorum-based systems, and the CAP theorem tradeoffs. You should understand that achieving 99.99% availability requires not just redundancy but also fast failure detection, automated remediation, and continuous testing.

Bonus Points:

  • Discussing chaos engineering and how to test failure scenarios in production
  • Explaining error budgets and how they balance feature velocity against reliability
  • Knowing specific failure modes (split-brain, thundering herd, cascading failures) and how to prevent them
  • Understanding the difference between MTBF and MTTR and which is easier to optimize
  • Citing real-world examples from companies like Netflix, Stripe, or Amazon
  • Discussing the operational complexity of multi-region deployments (data consistency, latency, cost)

Staff+

What You Should Know:

Staff+ engineers should be able to design HA systems that balance business requirements, cost constraints, and operational complexity. You should be able to justify availability targets based on business impact—calculating the cost of downtime and comparing it to the cost of additional nines. You should understand that availability is a spectrum, not a binary, and different parts of a system can have different availability targets (the payment path needs 99.99%, the admin dashboard can be 99.9%).

You should be able to design sophisticated HA patterns like shuffle sharding, zone-aware routing, and multi-region active-active with conflict resolution. You should understand the operational aspects: how to build runbooks, train on-call teams, and create feedback loops that improve reliability over time. You should know how to measure availability accurately (not just uptime, but user-perceived availability) and how to set SLAs with appropriate error budgets.

You should be able to discuss the organizational aspects of HA: how to build a culture of reliability, when to create dedicated SRE teams, and how to balance feature development against reliability work. You should understand that the hardest HA problems are often organizational (teams deploying without testing, poor communication during incidents) rather than technical.

Distinguishing Signals:

  • Proposing availability targets based on business impact analysis, not just “we need five nines”
  • Designing systems that degrade gracefully under partial failures rather than failing completely
  • Understanding the operational complexity of HA patterns and choosing simpler solutions when appropriate
  • Discussing the feedback loop between incidents and reliability improvements (blameless postmortems, error budgets)
  • Recognizing that availability is a team sport requiring coordination between product, engineering, and operations
  • Citing specific failure modes from real incidents (AWS outages, GitHub incidents) and how companies responded
  • Understanding the economics of HA: when to invest in reliability vs. when to accept downtime
  • Proposing organizational structures (SRE teams, on-call rotations, incident response processes) that support HA goals

Common Interview Questions

Question 1: “How would you design a system to achieve 99.99% availability?”

60-second answer: Start by eliminating single points of failure through redundancy at every layer. Run multiple application servers across at least two availability zones with a load balancer distributing traffic. Use a managed database service with automatic failover like RDS Multi-AZ. Implement health checks that test actual functionality and configure automatic failover when instances become unhealthy. Deploy in waves (canary deployments) to catch bad releases before they affect all instances. Monitor availability continuously and set up alerts for degradation.

2-minute answer: 99.99% availability means only 52.6 minutes of downtime per year, so you need both redundancy and fast recovery. First, identify all single points of failure—load balancers, application servers, databases, caches, external dependencies. For each component, add redundancy across failure domains (availability zones). Use active-active load balancing for stateless services and active-passive with automatic failover for stateful components like databases.

Implement deep health checks that test critical paths: can the server query the database? Can it read from the cache? Can it call downstream services? Configure aggressive timeouts (fail fast) and circuit breakers to prevent cascading failures. Deploy changes gradually—5% canary, then 50%, then 100%—with automatic rollback if error rates spike. Provision for N-1 capacity so a single instance failure doesn’t overload remaining instances. Finally, practice failure scenarios regularly through chaos engineering to validate that your failover mechanisms work under pressure. The key insight is that 99.99% requires not just redundancy but also fast failure detection (sub-second health checks) and automated recovery (no manual intervention).
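The "fail fast" circuit breaker mentioned above can be demonstrated in a few dozen lines. This is a minimal sketch with illustrative thresholds; production systems typically use a battle-tested library (e.g. resilience4j, Polly) rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors;
    allow a trial call again after a cooldown (half-open state)."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on timeouts, which is what prevents a slow downstream service from exhausting threads and cascading the failure upstream.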

Red flags: Saying “just add more servers” without discussing failure domains, health checks, or failover mechanisms. Claiming you can achieve 99.99% with a single availability zone. Not considering the cost implications (99.99% costs significantly more than 99.9%). Forgetting about stateful components like databases. Not mentioning deployment strategies or how to avoid taking down all instances simultaneously.

Question 2: “What’s the difference between high availability and fault tolerance?”

60-second answer: High availability accepts that failures will occur but minimizes their impact through redundancy and fast recovery. A highly available system might have brief interruptions (seconds) during failover but remains operational. Fault tolerance aims for zero downtime—the system continues operating without interruption even during failures. HA is typically achieved through redundancy and automatic failover, while fault tolerance requires more expensive techniques like synchronous replication and hot standbys. Most systems target HA (99.9-99.99%) because fault tolerance (99.999%+) is exponentially more expensive.

2-minute answer: High availability and fault tolerance represent different points on the reliability spectrum. HA systems accept brief interruptions during component failures—when a server crashes, there might be 1-2 seconds of failed requests while health checks detect the failure and the load balancer removes it from the pool. This is acceptable for most applications and achievable at reasonable cost through redundancy and automatic failover.

Fault tolerance aims for zero user-visible interruptions. This requires techniques like synchronous replication (every write goes to multiple locations before acknowledging), hot standbys (redundant components actively processing requests in parallel), and hardware redundancy (redundant power supplies, network cards, even CPUs). Aircraft control systems and financial trading platforms use fault tolerance because even milliseconds of downtime are unacceptable.

The cost difference is dramatic. HA might require 2x infrastructure (active-passive) or 1.5x (N+1 redundancy). Fault tolerance often requires 3-5x infrastructure plus specialized hardware and software. For most business applications, HA is the right choice—the cost of 99.99% availability (52 minutes downtime per year) is justified, but 99.999% (5 minutes per year) isn’t worth 10x the cost. Choose fault tolerance only when downtime costs exceed the infrastructure investment, like payment processing or life-critical systems.

Red flags: Using the terms interchangeably. Claiming you need fault tolerance without justifying the cost. Not understanding that fault tolerance requires synchronous replication and hot standbys. Thinking HA means “no failures” rather than “fast recovery from failures.”

Question 3: “How do you calculate the availability of a system with multiple components?”

60-second answer: For components in series (one depends on the other), multiply their availabilities: System = A × B × C. For example, if your load balancer is 99.99% available and your application servers are 99.9% available, the system is 0.9999 × 0.999 = 99.89% available. For components in parallel (redundant), calculate the probability that ALL fail: System = 1 - (1-A) × (1-B). Two servers at 99.9% availability give you 1 - (0.001 × 0.001) = 99.9999% availability. This shows why redundancy is so powerful—it improves availability exponentially.

2-minute answer: System availability depends on how components are connected. For serial dependencies (synchronous calls), availabilities multiply, which means the system can never be more available than its least available component. If your web app calls three services synchronously, each at 99.9% availability, your system is 0.999³ = 99.7%—worse than any individual component. This is why minimizing synchronous dependencies is critical for HA.

For parallel components (redundancy), you calculate the probability that all redundant instances fail simultaneously. With two redundant servers at 99.9% availability each, the probability both fail is 0.001 × 0.001 = 0.000001, so system availability is 1 - 0.000001 = 99.9999%. Adding just one redundant component improved availability by three nines! This assumes failures are independent—if both servers run the same buggy code or share the same power supply, they can fail together.

In practice, systems have both serial and parallel components. A typical web app has: load balancer (99.99%) → application servers in parallel (99.9% each, two instances) → database (99.95%). Calculate the parallel component first: app servers = 1 - (0.001)² = 99.9999%. Then multiply serially: 0.9999 × 0.999999 × 0.9995 = 99.94%. The database becomes the bottleneck, which is why database HA (replication, automatic failover) is so important.
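These formulas are easy to encode, and the worked example above can be checked directly (the component availabilities are the ones from the text, and independence of failures is assumed):

```python
from math import prod

def serial(*components):
    """Synchronous dependencies: availabilities multiply."""
    return prod(components)

def parallel(*replicas):
    """Redundant components with independent failures: 1 - P(all fail)."""
    return 1 - prod(1 - a for a in replicas)

app_tier = parallel(0.999, 0.999)          # two app servers -> 0.999999
system = serial(0.9999, app_tier, 0.9995)  # LB -> app tier -> database
print(f"{system:.4%}")                     # ~99.94%; the DB dominates
```

Swapping in a higher-availability database (say 0.9999 via Multi-AZ failover) and rerunning the calculation is a quick way to show an interviewer where the next nine should come from.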

Red flags: Not understanding the difference between serial and parallel availability. Claiming you can achieve 99.99% system availability with components that are only 99% available. Not recognizing that synchronous dependencies multiply (worsen) availability. Forgetting that redundancy only helps if failures are independent.

Question 4: “Your database is a single point of failure. How would you make it highly available?”

60-second answer: Implement database replication with automatic failover. Use a managed service like AWS RDS Multi-AZ, which synchronously replicates to a standby in a different availability zone and automatically promotes the standby if the primary fails (typically 60-120 seconds). For higher availability, use read replicas across multiple zones for read traffic and implement application-level failover logic that switches to a replica if the primary becomes unavailable. For the highest availability, consider multi-region replication, though this adds complexity around data consistency and conflict resolution.

2-minute answer: Database HA requires balancing consistency, availability, and operational complexity. The simplest approach is using a managed database service with built-in HA. AWS RDS Multi-AZ synchronously replicates every write to a standby instance in a different availability zone. If the primary fails (hardware failure, zone outage, or even planned maintenance), RDS automatically promotes the standby and updates DNS within 60-120 seconds. This provides 99.95% availability with zero operational overhead—you don’t manage the replication or failover.

For higher availability or more control, implement your own replication. PostgreSQL with Patroni or MySQL with Group Replication can achieve sub-second failover. Run a primary and at least two replicas across three availability zones. Use synchronous replication to at least one replica (for consistency) and asynchronous to others (for performance). Implement health checks that test actual query execution, not just connection success. When the primary fails, use a consensus system (etcd, Consul) to elect a new primary and reconfigure replicas.

For read-heavy workloads, use read replicas to distribute load and provide redundancy. Application servers can read from any replica and only write to the primary. If the primary fails, promote a replica. The challenge is ensuring “read-your-writes” consistency—after writing to the primary, subsequent reads from replicas must see that write. This requires either synchronous replication (adds latency) or application-level logic that routes reads to the primary for a short time after writes.
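The application-level "route reads to the primary for a short time after writes" idea can be sketched as a small session router. The sticky window and class names here are hypothetical; in practice the window should be tuned to your measured replication lag:

```python
import time

STICKY_SECONDS = 2.0  # assumed upper bound on replication lag (tune to reality)

class SessionRouter:
    """Send a session's reads to the primary briefly after its writes,
    so the session always sees its own writes (read-your-writes)."""

    def __init__(self):
        self.last_write = {}  # session_id -> monotonic timestamp of last write

    def record_write(self, session_id):
        self.last_write[session_id] = time.monotonic()

    def choose(self, session_id):
        wrote_at = self.last_write.get(session_id)
        if wrote_at and time.monotonic() - wrote_at < STICKY_SECONDS:
            return "primary"  # replicas may not have applied the write yet
        return "replica"      # stale-tolerant reads can use any replica
```

Sessions that haven't written recently still spread their reads across replicas, so the primary only absorbs the small fraction of reads that actually need freshness.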

Red flags: Suggesting “just add a backup database” without explaining replication or failover. Not understanding the difference between synchronous and asynchronous replication. Claiming you can achieve HA with a single database instance. Not considering the consistency implications of replication. Forgetting about connection string updates during failover (DNS, connection pooling).

Question 5: “How would you test that your high availability setup actually works?”

60-second answer: Practice chaos engineering—deliberately cause failures in production (carefully) or staging to validate that failover mechanisms work. Start by killing individual instances and verifying that traffic automatically shifts to healthy instances without user impact. Then test more severe failures: kill entire availability zones, fill disks, exhaust connection pools, introduce network latency. Automate these tests to run continuously (Netflix’s Chaos Monkey). Monitor error rates, latency, and availability during tests. The key is testing in production because staging environments often don’t replicate production complexity.

2-minute answer: High availability only works if your failover mechanisms function under pressure, and the only way to know is through continuous testing. Start with game days—scheduled exercises where you deliberately cause failures and validate recovery. Kill random application servers and verify that the load balancer removes them within seconds and traffic shifts to healthy instances. Promote a database replica and ensure applications automatically reconnect. Simulate an availability zone failure by blocking network traffic and verify that services fail over to other zones.

Graduate to automated chaos engineering. Netflix’s Chaos Monkey randomly terminates production instances during business hours, forcing teams to build resilient systems. Chaos Kong simulates entire region failures. Start conservatively—test in staging first, then production during low-traffic periods, then eventually during peak traffic. The goal is making failure testing routine rather than exceptional.
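The core of a Chaos-Monkey-style tool is just random termination with a safety floor. This is an illustrative sketch, not Netflix's implementation; the `terminate` callback would wrap your cloud provider's API, and the probability and floor are made-up values:

```python
import random

def chaos_round(instances, terminate, kill_probability=0.05, min_healthy=2):
    """Randomly terminate instances, but never drop below a safety floor
    so the experiment can't take down the service on its own."""
    alive = list(instances)
    for instance in instances:
        if len(alive) <= min_healthy:
            break  # safeguard: keep enough capacity to serve traffic
        if random.random() < kill_probability:
            terminate(instance)  # e.g. an EC2 TerminateInstances call
            alive.remove(instance)
    return alive
```

Run on a schedule during business hours (so engineers are around to observe), a loop like this forces every team to assume any instance can vanish at any moment.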

Monitor key metrics during tests: error rate (should spike briefly then recover), latency (might increase during failover), and availability (should remain above SLA). Set up alerts that fire if recovery takes too long. Document what you learn—if failover took 30 seconds instead of the expected 5 seconds, investigate why. Common discoveries: health checks were too slow, connection pools didn’t refresh, DNS caching prevented failover.

The most important insight: staging environments rarely replicate production complexity. You might have 10 instances in production but only 2 in staging. Production has complex networking, security groups, and dependencies that staging lacks. Testing in production (carefully, with safeguards) is the only way to validate that HA actually works.

Red flags: Saying “we’ll test in staging” without acknowledging staging limitations. Not having a plan for continuous testing (one-time tests aren’t enough). Not monitoring metrics during failure tests. Claiming you can’t test in production (many companies do it safely). Not learning from tests—if failover is slower than expected, that’s a bug to fix.

Red Flags to Avoid

Red Flag 1: “We’ll achieve 99.999% availability by adding more servers”

Why it’s wrong: Five nines (99.999%) means only 5.26 minutes of downtime per year—an extremely high bar that requires more than just redundancy. It demands multi-region deployments, sophisticated monitoring, automated remediation, chaos engineering, and dedicated reliability teams. The cost is exponentially higher than 99.9% or 99.99%, and most businesses can’t justify the investment. Simply adding servers doesn’t address correlated failures (bad deployments, zone outages, software bugs) that cause most outages.

What to say instead: “We should start by defining our availability target based on business impact. For most user-facing services, 99.9% or 99.99% is appropriate and achievable through redundancy across availability zones, automatic failover, and good deployment practices. If we truly need 99.999%, we’d need multi-region active-active, which adds significant complexity and cost. Let’s calculate the cost of downtime and compare it to the cost of additional nines to make an informed decision.”

Red Flag 2: “High availability means the system never goes down”

Why it’s wrong: This confuses high availability with fault tolerance or perfection. HA accepts that failures will occur but minimizes their impact through redundancy and fast recovery. Even systems with 99.99% availability have 52 minutes of downtime per year. The goal isn’t zero downtime (which is impossible and infinitely expensive) but rather acceptable downtime based on business requirements. This misconception leads to over-engineering and wasted resources.

What to say instead: “High availability means the system remains operational despite component failures, not that it never experiences any downtime. We should define an availability target based on business impact—how much does each hour of downtime cost in lost revenue, customer churn, or SLA penalties? Then we design for that target using redundancy, automatic failover, and fast recovery. For example, 99.9% availability allows 8.76 hours of downtime per year, which might be perfectly acceptable for an internal tool but unacceptable for a payment system.”

Red Flag 3: “We don’t need to test failover because we have redundancy”

Why it’s wrong: Redundancy without testing is just expensive infrastructure that might not work during actual failures. Many teams discover during real outages that their “automatic” failover doesn’t work—health checks were misconfigured, DNS caching prevented failover, or applications didn’t handle connection failures gracefully. The Netflix philosophy is “chaos engineering”—continuously testing failure scenarios in production to validate that redundancy actually works. Untested failover is a false sense of security.

What to say instead: “Redundancy is necessary but not sufficient for high availability. We need to continuously test that our failover mechanisms work under pressure. This means practicing chaos engineering—deliberately killing instances, simulating zone failures, and introducing network issues—to validate that the system recovers automatically within our SLA. We should start with game days in staging, then graduate to automated testing in production during low-traffic periods. The goal is making failure testing routine so we’re confident the system will recover during real incidents.”

Red Flag 4: “We’ll use synchronous replication for high availability”

Why it’s wrong: Synchronous replication adds latency (every write waits for acknowledgment from replicas) and can reduce availability during network partitions (if replicas are unreachable, writes fail). While it provides strong consistency, it’s often overkill for HA. Many highly available systems use asynchronous replication with eventual consistency, accepting the small risk of data loss during failures in exchange for better performance and availability. The choice depends on business requirements, not a blanket “synchronous is better.”

What to say instead: “The replication strategy depends on our consistency requirements. For financial transactions or inventory management, synchronous replication to at least one replica ensures we never lose committed data, though it adds 2-5ms latency. For user-generated content or analytics, asynchronous replication provides better performance and availability, accepting that we might lose a few seconds of data during failures. We can also use hybrid approaches—synchronous to one local replica for consistency, asynchronous to remote replicas for disaster recovery. The key is matching the replication strategy to business requirements.”

Red Flag 5: “Our health checks return 200 OK if the process is running”

Why it’s wrong: Superficial health checks don’t detect many failure modes. A process can be running but unable to process requests due to database connection pool exhaustion, memory leaks, disk full, or downstream dependency failures. This causes “gray failures” where instances appear healthy but fail most requests, leading to poor user experience and difficult-to-debug incidents. Health checks should test actual functionality, not just process liveness.

What to say instead: “Health checks should test critical paths, not just process liveness. For a web service, the health check should query the database, read from the cache, and verify that downstream dependencies are reachable. This ensures we detect failures that affect request processing, not just process crashes. However, we need to balance depth against overhead—health checks that are too expensive can cause problems during traffic spikes. A good approach is having both shallow checks (every second, just process liveness) and deep checks (every 10 seconds, test critical paths).”
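A deep health check of the kind described above might look like the sketch below. The probes are stand-ins (SQLite plays the role of your real database; `check_cache` would call something like `redis.ping()`), and the time budget is an assumed value:

```python
import sqlite3
import time

# Stand-in dependency probes; swap in your real DB/cache/downstream clients.
def check_database():
    sqlite3.connect(":memory:").execute("SELECT 1")  # proves a query can run

def check_cache():
    pass  # e.g. redis.ping() in a real service

SHALLOW = {"liveness": lambda: None}                  # cheap: is the process up?
DEEP = {"database": check_database, "cache": check_cache}

def health(checks, timeout_budget_ms=250):
    """Run each probe and report per-dependency status; the load balancer
    treats anything other than 200 as unhealthy."""
    start, results = time.monotonic(), {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
        if (time.monotonic() - start) * 1000 > timeout_budget_ms:
            break  # don't let the health check itself become a load source
    healthy = all(v == "ok" for v in results.values())
    return (200 if healthy else 503), results

print(health(DEEP))  # (200, {'database': 'ok', 'cache': 'ok'})
```

Wiring `SHALLOW` to a frequent liveness probe and `DEEP` to a slower readiness probe gives the two-tier scheme described above without hammering dependencies every second.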


Key Takeaways

  • High availability is about minimizing downtime through redundancy and fast recovery, not achieving perfection. Each additional nine of availability (99.9% → 99.99% → 99.999%) roughly multiplies costs by 10x, so align availability targets with actual business impact rather than pursuing maximum uptime.

  • Eliminate single points of failure by adding redundancy across failure domains. Run multiple instances across availability zones, use database replication with automatic failover, and ensure no single component failure takes down the entire system. Remember that failures are often correlated (bad deployments, zone outages), so design for systemic failures, not just individual component failures.

  • Availability decreases multiplicatively for serial dependencies and improves exponentially with parallel redundancy. A system with three synchronous dependencies at 99.9% each has only 99.7% availability (0.999³). Two redundant servers at 99.9% each provide 99.9999% availability (1 - 0.001²). Minimize synchronous dependencies and add redundancy to critical components.

  • Fast failure detection and automated recovery are as important as redundancy. Implement deep health checks that test actual functionality, not just process liveness. Configure aggressive timeouts and circuit breakers to fail fast. Provision for N-1 or N-2 capacity so remaining instances can handle the load during failures. The goal is sub-second failure detection and automatic recovery without human intervention.

  • Test failure scenarios continuously through chaos engineering. Redundancy only works if your failover mechanisms function under pressure. Practice killing instances, simulating zone failures, and introducing network issues—first in staging, then in production. Netflix’s Chaos Monkey pioneered this approach, and it’s now industry standard for achieving high availability. Untested failover is just expensive infrastructure that might not work when you need it.

Prerequisites:

  • Reliability vs Availability - Understanding the distinction between these concepts is fundamental before diving into HA patterns
  • Failure Modes - You need to understand what can fail before designing systems to handle failures
  • Load Balancing - Load balancers are the primary mechanism for distributing traffic across redundant instances

Related Patterns:

Implementation Details:

  • Database Replication - Achieving HA for stateful components requires understanding replication topologies
  • Multi-Region Deployments - Geographic redundancy for the highest availability requirements
  • Service Mesh - Modern approach to implementing HA patterns (health checks, circuit breakers, retries) at the infrastructure layer