Availability Overview
TL;DR
Availability measures the percentage of time a system is operational and accessible to users, typically expressed in “nines” (99.9%, 99.99%, etc.). It’s the foundation of reliability engineering, directly impacting user trust and business revenue. Cheat sheet: 99.9% = 43.8 min downtime/month, 99.99% = 4.38 min/month, 99.999% = 26 sec/month. Achieve through redundancy, failover, health checks, and graceful degradation.
The Analogy
Think of availability like a 24-hour convenience store. A store with 99% availability is closed 14.4 minutes per day—annoying but survivable. A store with 99.9% availability is only closed 1.4 minutes daily—you might not even notice. But a hospital emergency room needs 99.999% availability (5 nines) because even 5 minutes of downtime per year could cost lives. The “nines” you need depend on what’s at stake when your doors are closed.
Why This Matters in Interviews
Availability comes up in nearly every system design interview because it’s the first question users ask: “Will this work when I need it?” Interviewers want to see you translate business requirements (“We can’t afford downtime during Black Friday”) into concrete SLA targets, then design systems that meet those targets. They’re testing whether you understand the cost-availability tradeoff—that each additional nine costs exponentially more—and whether you can architect redundancy without creating new failure modes. Strong candidates quantify availability impact (“3 nines means 43 minutes downtime/month, which costs us $X in lost transactions”) and propose tiered strategies rather than blanket solutions.
Core Concept
Availability is the percentage of time a system successfully responds to requests over a given period. It’s measured as uptime divided by total time, typically expressed in “nines” notation: 99.9% (three nines), 99.99% (four nines), or 99.999% (five nines). This seemingly small difference has massive implications—going from three nines to five nines reduces acceptable monthly downtime from 43.8 minutes to just 26 seconds.
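The nines-to-downtime arithmetic above is easy to script. A small sketch, assuming an average month of 365.25/12 days (which is how figures like 43.8 minutes are derived):

```python
# Translate an availability percentage into allowed downtime.
# Assumes an average month of 365.25/12 days to match the figures above.

AVG_MONTH_SECONDS = 365.25 / 12 * 86_400
YEAR_SECONDS = 365.25 * 86_400

def downtime_allowed(availability_pct: float) -> tuple[float, float]:
    """Return (seconds/month, seconds/year) of allowed downtime."""
    unavailable = 1 - availability_pct / 100
    return AVG_MONTH_SECONDS * unavailable, YEAR_SECONDS * unavailable

for pct in (99.0, 99.9, 99.99, 99.999):
    month_s, year_s = downtime_allowed(pct)
    print(f"{pct}% -> {month_s / 60:.1f} min/month, {year_s / 60:.2f} min/year")
```

Running this reproduces the cheat sheet: three nines allows about 43.8 minutes per month, four nines about 4.38, five nines about 26 seconds.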
Availability differs from reliability in a subtle but important way. Reliability measures the probability a system performs correctly when called upon (MTBF/MTTF metrics), while availability measures whether the system is accessible right now. A system can be highly reliable (rarely fails) but have poor availability if repairs take a long time. Conversely, a system might fail frequently but maintain high availability through rapid automatic recovery. This distinction matters because different architectural patterns optimize for each.
In production systems, availability is formalized through Service Level Agreements (SLAs) with customers and Service Level Objectives (SLOs) for internal teams. An SLA might promise 99.95% availability with financial penalties for violations, while the internal SLO targets 99.99% to provide a safety buffer. The gap between SLO and SLA is your error budget—the acceptable amount of downtime you can “spend” on deployments, experiments, and incidents without breaching customer commitments.
Availability Nines: Downtime Translation
graph LR
subgraph Availability Levels
A["99%<br/>(Two Nines)"]
B["99.9%<br/>(Three Nines)"]
C["99.99%<br/>(Four Nines)"]
D["99.999%<br/>(Five Nines)"]
end
subgraph Monthly Downtime
A --> A1["7.31 hours<br/>~7 hours"]
B --> B1["43.8 minutes<br/>~44 min"]
C --> C1["4.38 minutes<br/>~4 min"]
D --> D1["26.3 seconds<br/>~26 sec"]
end
subgraph Use Cases
A1 -.-> A2["Internal Tools<br/>Dev Environments"]
B1 -.-> B2["Social Media<br/>Content Sites"]
C1 -.-> C2["E-commerce<br/>SaaS Platforms"]
D1 -.-> D2["Payment Systems<br/>Healthcare"]
end
Each additional nine of availability reduces acceptable downtime by 10x but costs exponentially more to achieve. The choice depends on business impact—social media can tolerate 44 minutes/month of downtime, but payment systems cannot.
How It Works
Step 1: Define the availability target. Start by translating business requirements into a concrete percentage. If you’re building a payment processor, regulators might mandate 99.99% availability. If you’re building a social media feed, 99.9% might suffice because users tolerate occasional hiccups. Calculate what this means in absolute time: 99.99% allows 4.38 minutes of downtime per month, 52.56 minutes per year. Verify this aligns with business impact—if each minute of downtime costs $10,000 in lost revenue, you can justify the infrastructure investment for higher availability.
Step 2: Identify failure modes and their probabilities. Map out everything that can fail: servers (hardware failure rate ~1-5% annually), network links (fiber cuts, switch failures), data centers (power outages, natural disasters), software bugs (deployment errors, memory leaks), and dependencies (third-party APIs, databases). For each component, estimate its individual availability. A single server might have 99.9% availability, a network link 99.95%, a data center 99.99%. These compound multiplicatively—if your system depends on all three, total availability is 0.999 × 0.9995 × 0.9999 = 99.84%.
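The serial compounding in this step can be sketched in a few lines; the component names and figures are taken from the example above:

```python
from functools import reduce

# Serial dependencies: the system works only if every component works,
# so availabilities multiply. Figures from the example above.
components = {
    "server": 0.999,         # ~99.9%
    "network link": 0.9995,  # ~99.95%
    "data center": 0.9999,   # ~99.99%
}

total = reduce(lambda acc, a: acc * a, components.values(), 1.0)
print(f"compound availability: {total:.4%}")  # roughly 99.84%
```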
Step 3: Design redundancy to eliminate single points of failure. The fundamental technique for high availability is redundancy—running multiple copies of every critical component so one can fail without taking down the system. Deploy servers across multiple availability zones (isolated failure domains within a region) or multiple regions (geographically separated data centers). Use load balancers to distribute traffic across healthy instances, automatically routing around failures. Replicate data across multiple databases with automatic failover. The math is powerful: two independent components each with 99.9% availability give you 1 - (0.001 × 0.001) = 99.9999% combined availability.
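The parallel case is the mirror image: the system is down only when every replica is down simultaneously. A minimal sketch:

```python
# Parallel redundancy: the system fails only if ALL replicas fail at once.

def parallel_availability(*replicas: float) -> float:
    p_all_down = 1.0
    for a in replicas:
        p_all_down *= (1 - a)  # probability this replica is down
    return 1 - p_all_down

# Two independent 99.9% servers, as in the text:
print(f"{parallel_availability(0.999, 0.999):.6f}")
```

This prints 0.999999, i.e. 99.9999% combined availability, but only if the replicas fail independently; see Principle 2 below on why shared infrastructure undermines this math.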
Step 4: Implement health checks and automatic failover. Redundancy only helps if you detect failures quickly and route around them. Deploy health check endpoints that verify not just “is the process running?” but “can this instance serve traffic successfully?” Check database connectivity, dependency availability, and resource utilization. Configure load balancers to remove unhealthy instances from rotation within seconds. For stateful systems like databases, implement automatic failover with leader election—when the primary fails, a replica promotes itself to primary within 10-30 seconds. The key metric is Mean Time To Detect (MTTD) + Mean Time To Recover (MTTR)—minimize both.
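A deep health check verifies real serving ability, not just process liveness. A minimal stdlib-only sketch; the dependency probes (`check_database`, `check_cache`) are placeholders you would replace with real checks:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    """Placeholder: e.g. run SELECT 1 against the primary with a short timeout."""
    return True

def check_cache() -> bool:
    """Placeholder: e.g. ping the cache cluster."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        checks = {"database": check_database(), "cache": check_cache()}
        healthy = all(checks.values())
        # 200 keeps this instance in the load balancer's rotation;
        # 503 tells it to pull the instance out after a few failed checks.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(checks).encode())

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```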
Step 5: Build graceful degradation for dependency failures. Even with perfect redundancy, external dependencies will fail. Design your system to degrade gracefully rather than let failures cascade. If the recommendation engine is down, show a static list instead of erroring. If the payment processor is slow, queue transactions for async processing. Use circuit breakers to stop calling failing dependencies, preventing resource exhaustion from retries. Cache aggressively so you can serve stale data when upstream systems are unavailable. Netflix’s approach is exemplary: when their personalization service fails, they fall back to generic popular content rather than showing users a blank screen.
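The circuit breaker pattern can be sketched in a few lines. This is illustrative, not a production library (real implementations add half-open trial limits, metrics, and thread safety), and the fallback names in the usage comment are hypothetical:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency; serve a fallback instead."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds the circuit stays open
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()        # open: fail fast, no retry storm
            self.opened_at = None        # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0            # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()

# Usage (hypothetical names):
#   breaker.call(fetch_recommendations, lambda: POPULAR_DEFAULTS)
```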
Step 6: Monitor, measure, and iterate. Instrument every component to track actual availability. Measure both “was the system up?” (binary uptime) and “was the system usable?” (request success rate, latency percentiles). A system that’s technically running but returning 50% errors or taking 30 seconds per request is effectively unavailable. Track your error budget consumption in real-time—if you’re burning through your monthly budget in the first week, you need to freeze risky changes. Run chaos engineering experiments (deliberately inject failures) to verify your redundancy actually works before you need it in production.
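“Was the system usable?” can be computed directly from request logs: count a request as available only if it succeeded within the latency threshold. A sketch with made-up sample data:

```python
# Availability as an SLI: the fraction of requests that succeeded
# (non-5xx) within the latency budget, not just "the process was up".

def availability_sli(requests, max_latency_ms: int = 200) -> float:
    """requests: iterable of (http_status, latency_ms) pairs."""
    ok = sum(1 for status, latency in requests
             if status < 500 and latency <= max_latency_ms)
    return ok / len(requests)

# Made-up sample: one 500 error and one 900ms timeout both count as "down".
requests = [(200, 50), (200, 120), (500, 30), (200, 900), (200, 80)]
print(f"{availability_sli(requests):.1%}")
```

Here the process was “up” for all five requests, but the SLI reports only 60% availability, which is the number users actually experienced.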
High Availability Architecture Pattern
graph LR
User["Users"] --"1. HTTP Request"--> DNS["DNS/Route53<br/><i>Global Load Balancing</i>"]
subgraph "Region: us-east-1"
DNS --"2. Route to healthy region"--> ALB1["Application<br/>Load Balancer"]
subgraph AZ-1a
ALB1 --"3. Health check passed"--> App1["App Server 1<br/><i>99.9% available</i>"]
App1 --"5. Read/Write"--> DB_Primary[("Primary DB<br/><i>Multi-AZ</i>")]
end
subgraph AZ-1b
ALB1 --"3. Health check passed"--> App2["App Server 2<br/><i>99.9% available</i>"]
App2 --"5. Read/Write"--> DB_Primary
DB_Standby[("Standby DB<br/><i>Sync Replication</i>")]
end
App1 & App2 --"4. Cache lookup"--> Cache["Redis Cache<br/><i>Reduces DB load</i>"]
DB_Primary -."Automatic failover<br/>30-60 sec".-> DB_Standby
end
subgraph "Region: us-west-2"
DNS -."Failover if us-east-1 down".-> ALB2["Backup Region<br/><i>Warm Standby</i>"]
end
Monitor["CloudWatch<br/>Monitoring"] -."Health checks every 5s".-> ALB1
Monitor -."Alert on failures".-> OnCall["On-Call Engineer"]
A typical high-availability architecture eliminates single points of failure through multi-AZ deployment, load balancing, database replication, and health checks. Two app servers with 99.9% availability each provide 99.9999% combined availability (1 - 0.001²).
Key Principles
Principle 1: Availability compounds multiplicatively, not additively. When components are in series (one depends on the other), their availabilities multiply. If your application (99.9% available) depends on a database (99.95% available) and a cache (99.99% available), total availability is 0.999 × 0.9995 × 0.9999 = 99.84%. This is why microservices architectures can struggle with availability—each additional dependency in the critical path reduces overall availability. Example: Uber’s early architecture had payment processing depend on 7 sequential services. Even if each had 99.9% availability, the chain was only 99.3% available (0.999^7), causing frequent payment failures. They redesigned to make services parallel and independent where possible.
Principle 2: Redundancy must be truly independent to be effective. Two servers in the same rack share a power supply and network switch—not independent. Two availability zones in the same region share some infrastructure—better, but not perfect. Two regions share the same cloud provider’s control plane—even better. True independence means different failure domains: separate power grids, network providers, even different cloud vendors for critical systems. Example: When AWS’s us-east-1 region had a major outage in 2017, companies with multi-region architectures stayed up while single-region deployments went dark. But companies that used us-east-1 for their control plane (DNS, load balancing) still failed because their failover mechanism itself was down.
Principle 3: Fast detection and recovery matter more than preventing all failures. You cannot prevent all failures—hardware fails, software has bugs, humans make mistakes. What matters is how quickly you detect and recover. A system that fails once per month but recovers in 10 seconds (~99.9996% availability) beats a system that fails once per year but takes 2 hours to recover (99.977% availability). Example: Google’s approach to server failures is instructive—they assume servers will fail constantly and design for rapid detection (health checks every 2 seconds) and instant failover (traffic reroutes in <10 seconds). This lets them use cheaper commodity hardware while maintaining higher availability than competitors using expensive “reliable” servers.
Principle 4: Availability is a spectrum, not a binary state. A system isn’t just “up” or “down”—it can be partially degraded. Maybe 90% of requests succeed, or latency is 10x normal, or only some features work. Measure availability as request success rate, not just process uptime. A server that’s running but returning errors is unavailable. Example: Twitter’s “fail whale” was a form of graceful degradation—when overloaded, they showed a friendly error page instead of timing out. This maintained some availability (users knew the service was temporarily down) rather than leaving them wondering if their internet was broken.
Principle 5: The last nine is exponentially harder than the first. Going from 99% to 99.9% availability might require adding a load balancer and a second server—relatively cheap. Going from 99.9% to 99.99% requires multi-AZ deployment, automated failover, and 24/7 on-call—much more expensive. Reaching 99.999% demands multi-region active-active architecture, chaos engineering, and a full SRE team—exponentially more expensive. Example: Stripe targets 99.99% availability for their API because payment processing demands high reliability, but they explicitly don’t target 99.999% because the cost (multi-region writes with strong consistency) would be prohibitive and unnecessary for their use case. They’ve calculated that 4.38 minutes of monthly downtime is acceptable given their error budget and customer expectations.
Deep Dive
Types / Variants
Active-Active High Availability: Multiple instances actively serve traffic simultaneously, with load distributed across all of them. When one fails, the others absorb its traffic with minimal disruption. This provides the highest availability and best resource utilization since all capacity is always in use. However, it requires sophisticated load balancing, session management (sticky sessions or stateless design), and data synchronization if state is involved. Use this for stateless services like web servers, API gateways, or read-heavy databases. Example: Netflix runs active-active across multiple AWS regions—traffic is always served from multiple regions simultaneously, so a region failure only reduces capacity, not availability.
Active-Passive (Hot Standby): A primary instance serves all traffic while one or more standby instances remain ready to take over immediately upon failure. The standby is “hot” (fully provisioned and synchronized) but idle. This is simpler than active-active because there’s no split-brain risk or data synchronization complexity during normal operation. However, you’re paying for idle capacity, and failover takes 10-60 seconds while health checks detect the failure and promote the standby. Use this for stateful systems where split-brain would be catastrophic, like databases or leader-election systems. Example: PostgreSQL with streaming replication—the primary handles all writes while replicas stay synchronized but idle, ready to promote if the primary fails.
Active-Passive (Warm Standby): Similar to hot standby, but the passive instance is only partially provisioned—maybe it’s running but with smaller instance sizes, or it’s in a different region with higher latency. Upon failure, the standby must scale up before taking traffic, adding 5-15 minutes to recovery time. This reduces costs compared to hot standby while maintaining better availability than cold standby. Use this when you can tolerate brief degraded performance during failover, like internal tools or batch processing systems. Example: Many companies run their production database in us-east-1 with a warm standby in us-west-2 at half the capacity, accepting that a failover means temporarily slower queries until they scale up.
Active-Passive (Cold Standby): The standby exists only as configuration and data backups—it must be provisioned from scratch during a failure. Recovery takes 30 minutes to several hours, making this suitable only for disaster recovery, not high availability. However, it’s the cheapest option since you only pay for storage, not compute. Use this for systems where extended downtime is acceptable, like analytics pipelines or development environments. Example: Many startups keep daily database snapshots in S3 as cold standby—if their primary database is destroyed, they can restore from backup, but they accept hours of downtime and some data loss.
Multi-Region Active-Active: The ultimate availability pattern—traffic is served from multiple geographic regions simultaneously, with data replicated across all regions. A region failure is invisible to users because traffic automatically routes to healthy regions. This provides the highest availability (can survive entire region outages) and best performance (users connect to the nearest region). However, it’s extremely complex and expensive, requiring global load balancing, cross-region data replication with conflict resolution, and careful management of data consistency vs. availability tradeoffs. Use this only for mission-critical systems where even minutes of downtime are unacceptable. Example: Amazon.com runs active-active across multiple regions—when us-east-1 went down, most users never noticed because their traffic was already being served from other regions.
Active-Active vs Active-Passive Deployment Patterns
graph TB
subgraph Active-Active Multi-Region
AA_User["Users"] --> AA_DNS["Global DNS<br/><i>Route53 Latency-Based</i>"]
subgraph US Region
AA_DNS --"50% traffic"--> AA_LB1["Load Balancer"]
AA_LB1 --> AA_App1["App Servers<br/><i>Serving traffic</i>"]
AA_App1 --> AA_DB1[("Database<br/><i>Multi-master replication</i>")]
end
subgraph EU Region
AA_DNS --"50% traffic"--> AA_LB2["Load Balancer"]
AA_LB2 --> AA_App2["App Servers<br/><i>Serving traffic</i>"]
AA_App2 --> AA_DB2[("Database<br/><i>Multi-master replication</i>")]
end
AA_DB1 <-."Bidirectional sync<br/>Conflict resolution".-> AA_DB2
AA_Benefit["✓ Highest availability<br/>✓ Best resource utilization<br/>✓ Low latency globally<br/>✗ Complex data sync<br/>✗ Expensive"]
end
subgraph Active-Passive Hot Standby
AP_User["Users"] --> AP_DNS["DNS<br/><i>Failover on health check</i>"]
subgraph Primary Region
AP_DNS --"100% traffic"--> AP_LB1["Load Balancer"]
AP_LB1 --> AP_App1["App Servers<br/><i>Serving all traffic</i>"]
AP_App1 --> AP_DB1[("Primary DB<br/><i>Handles all writes</i>")]
end
subgraph Standby Region
AP_LB2["Load Balancer<br/><i>Idle, ready</i>"]
AP_App2["App Servers<br/><i>Idle, ready</i>"]
AP_DB2[("Standby DB<br/><i>Streaming replication</i>")]
AP_LB2 -.-> AP_App2 -.-> AP_DB2
end
AP_DB1 --"Continuous replication"--> AP_DB2
AP_DNS -."Failover: 30-60 sec".-> AP_LB2
AP_Benefit["✓ Simpler than active-active<br/>✓ No split-brain risk<br/>✓ Fast failover<br/>✗ Wasted capacity<br/>✗ Brief downtime on failover"]
end
Active-active serves traffic from all regions simultaneously (highest availability, complex), while active-passive keeps standby resources idle until failover (simpler, brief downtime). Choose based on availability requirements and operational complexity tolerance.
Trade-offs
Availability vs. Consistency: The CAP theorem forces a choice during network partitions—you can have high availability (always accept writes) or strong consistency (reject writes that can’t be synchronized), but not both. High availability systems like Dynamo or Cassandra choose AP (available + partition-tolerant), accepting that different nodes might temporarily have different data. Banking systems choose CP (consistent + partition-tolerant), rejecting writes during partitions to prevent conflicting transactions. Decision framework: If stale data is dangerous (financial transactions, inventory counts), choose consistency. If downtime is dangerous (social media, monitoring), choose availability. Many systems use different choices for different data—user profiles can be eventually consistent, but payment balances must be strongly consistent.
Availability vs. Cost: Each additional nine of availability costs exponentially more. Going from 99% to 99.9% might cost 2x (add redundancy), 99.9% to 99.99% might cost 5x (multi-AZ, automation), and 99.99% to 99.999% might cost 10x (multi-region, dedicated SRE team). Calculate the business value of availability—if one hour of downtime costs $100K, then spending $1M/year to reduce downtime from 87 hours (99%) to 8.7 hours (99.9%) is justified. But if downtime only costs $1K/hour, that same investment isn’t worth it. Decision framework: Start with the minimum viable availability, measure actual impact of outages, then invest incrementally. Don’t over-engineer for theoretical scenarios.
Availability vs. Latency: Techniques that improve availability often hurt latency. Synchronous replication to multiple data centers ensures data isn’t lost but adds network round-trip time to every write. Health checks and retries add overhead to every request. Multi-region active-active means some users connect to distant regions. Decision framework: For read-heavy workloads, use async replication and caching to maintain both availability and latency. For write-heavy workloads, choose based on user expectations—social media posts can tolerate async replication (eventual consistency, low latency), but payment confirmations need sync replication (immediate consistency, higher latency).
Availability vs. Operational Complexity: Simple architectures are easier to operate but have lower availability. A single server is simple but has no redundancy. Adding load balancing, auto-scaling, multi-region failover, and chaos engineering dramatically increases operational complexity—more things to configure, monitor, and debug. Complex systems have more failure modes, and ironically, the complexity added to improve availability can itself cause outages. Decision framework: Increase complexity only when justified by measured availability needs. A startup with 100 users doesn’t need multi-region active-active. Use managed services (RDS, ELB, CloudFront) to get availability without operational burden. Automate everything—manual failover procedures will fail when you’re stressed at 3am.
Planned Downtime vs. Unplanned Downtime: Some availability strategies trade planned downtime (scheduled maintenance windows) for reduced unplanned downtime (surprise outages). Blue-green deployments eliminate deployment downtime but require 2x capacity. Rolling updates maintain availability during deploys but take longer and risk partial failures. Decision framework: Modern systems should target zero planned downtime—users don’t care if your outage was “scheduled.” Use techniques like blue-green deployments, feature flags, and database migrations with backward compatibility. Reserve maintenance windows only for truly unavoidable changes like data center moves.
Common Pitfalls
Pitfall 1: Measuring availability by server uptime instead of user experience. Teams often track “was the server process running?” but that’s not what users care about. A server that’s running but returning 500 errors, or responding in 30 seconds instead of 100ms, is effectively unavailable. Why it happens: Server uptime is easy to measure with simple health checks, while user experience requires instrumentation of actual requests. How to avoid: Measure availability as the percentage of requests that succeed within acceptable latency thresholds. Use synthetic monitoring to continuously test critical user flows. Track SLIs (Service Level Indicators) like “99th percentile latency < 200ms” and “error rate < 0.1%” rather than just “server is up.”
Pitfall 2: Assuming redundancy works without testing it. Many teams deploy redundant systems but never verify the failover actually works. Then during a real outage, they discover the standby database is weeks out of sync, or the load balancer’s health checks are misconfigured, or the failover script has a bug. Why it happens: Testing failover is disruptive and scary—what if the test itself causes an outage? So teams defer it indefinitely. How to avoid: Practice chaos engineering—regularly inject failures in production to verify recovery works. Start small (kill one server) and gradually increase scope (fail an entire AZ). Netflix’s Chaos Monkey randomly terminates instances in production, forcing teams to build resilient systems. Make failover testing part of your deployment pipeline, not a quarterly exercise.
Pitfall 3: Creating hidden dependencies that negate redundancy. You deploy your application across three availability zones for redundancy, but all three zones depend on a single database in one zone—you’ve created a single point of failure. Or you use multi-region deployment but your authentication service is single-region, so a region failure still takes down the whole system. Why it happens: Dependencies are often invisible until you map them explicitly. A service might not directly call the database but depend on a cache that depends on the database. How to avoid: Draw dependency graphs for critical paths and identify single points of failure. For each dependency, ask “what happens if this fails?” and “is this dependency itself redundant?” Use techniques like bulkheading to isolate failures—if the recommendation service is down, the checkout flow should still work.
Pitfall 4: Ignoring the availability of deployment and operational tools. Your application has 99.99% availability, but your deployment pipeline depends on a CI/CD system with 99% availability. When the CI/CD system is down, you can’t deploy fixes, turning a minor bug into a major outage. Or your monitoring system goes down during an incident, leaving you blind. Why it happens: Teams focus on application availability but treat operational tools as less critical. How to avoid: Apply the same availability standards to your operational tools as your application. Your monitoring system should be more available than what it monitors—use a separate monitoring stack or external service. Your deployment system should have redundancy and failover. Keep a “break glass” manual deployment procedure for when automation fails.
Pitfall 5: Optimizing for the wrong availability metric. A team targets 99.99% monthly availability, so they’re fine with one 4-minute outage per month. But if that outage happens during Black Friday, it costs millions. Or they achieve 99.99% average availability but have terrible tail latencies—99% of requests are fast but 1% take 10 seconds, making the system unusable for those users. Why it happens: Simple availability percentages hide important details about when failures occur and who they affect. How to avoid: Use time-windowed SLOs (99.99% availability in any 24-hour period, not just monthly average) to prevent clustering of downtime. Track availability per customer segment—if 1% of users see 50% error rates, your aggregate 99.5% availability metric hides a crisis. Consider business-critical time windows—an e-commerce site might need 99.99% availability during holiday shopping but can tolerate 99.9% in January.
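A rolling-window SLO can be checked with a simple sweep over uptime probes; the worst window is what matters, not the period average. A sketch with hypothetical hourly probe data:

```python
from collections import deque

def worst_window_availability(samples, window_s: int = 24 * 3600) -> float:
    """samples: time-ordered (timestamp_sec, success: bool) probes.
    Returns the lowest success rate seen over any rolling window."""
    recent = deque()
    successes = 0
    worst = 1.0
    for t, ok in samples:
        recent.append((t, ok))
        successes += ok
        while recent[0][0] <= t - window_s:  # drop probes older than the window
            _, old_ok = recent.popleft()
            successes -= old_ok
        worst = min(worst, successes / len(recent))
    return worst

# Hourly probes for 48h with a clustered 2-hour outage: the overall average
# looks tolerable, but the worst 24h window dips far below a daily SLO.
probes = [(h * 3600, h not in (5, 6)) for h in range(48)]
print(f"worst 24h window: {worst_window_availability(probes):.1%}")
```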
Hidden Single Points of Failure
graph TB
subgraph Apparent Redundancy - Looks Good
A_User["Users"] --> A_LB["Load Balancer<br/><i>Redundant</i>"]
subgraph AZ-1
A_LB --> A_App1["App Server 1"]
end
subgraph AZ-2
A_LB --> A_App2["App Server 2"]
end
subgraph AZ-3
A_LB --> A_App3["App Server 3"]
end
A_App1 & A_App2 & A_App3 --> A_SPOF["❌ Single Database<br/><i>in AZ-1 only</i>"]
A_Result["Problem: All 3 app servers<br/>depend on one database.<br/>AZ-1 failure = total outage"]
end
subgraph Hidden Dependencies - Worse
H_User["Users"] --> H_DNS["DNS<br/><i>in us-east-1</i>"]
subgraph US Region
H_DNS --> H_LB1["Load Balancer"]
H_LB1 --> H_App1["App Servers"]
H_App1 --> H_DB1[("Database")]
end
subgraph EU Region
H_DNS -."Failover route".-> H_LB2["Load Balancer"]
H_LB2 --> H_App2["App Servers"]
H_App2 --> H_DB2[("Database")]
end
H_App1 & H_App2 --> H_Auth["❌ Auth Service<br/><i>Single region</i>"]
H_Result["Problem: Multi-region deployment<br/>but auth service is single-region.<br/>Auth failure = global outage"]
end
subgraph True Redundancy - Fixed
T_User["Users"] --> T_DNS["Multi-region DNS<br/><i>Route53 + Cloudflare</i>"]
subgraph US Region
T_DNS --> T_LB1["Load Balancer"]
T_LB1 --> T_App1["App Servers"]
T_App1 --> T_DB1[("Primary DB")]
T_DB1_R[("Replica DB")]
T_DB1 -."Failover".-> T_DB1_R
T_App1 --> T_Auth1["Auth Service<br/><i>Regional</i>"]
end
subgraph EU Region
T_DNS --> T_LB2["Load Balancer"]
T_LB2 --> T_App2["App Servers"]
T_App2 --> T_DB2[("Primary DB")]
T_DB2_R[("Replica DB")]
T_DB2 -."Failover".-> T_DB2_R
T_App2 --> T_Auth2["Auth Service<br/><i>Regional</i>"]
end
T_Result["Fixed: every dependency—DNS,<br/>database, auth—is redundant,<br/>so no single failure is global"]
end
True redundancy means every dependency in the critical path—DNS, database, auth—is itself redundant. Mapping the dependency graph is the only way to find the hidden single points of failure that apparent redundancy leaves behind.
Math & Calculations
Calculating Availability from Uptime:
Availability = (Total Time - Downtime) / Total Time × 100%
Example: A system runs for 30 days (43,200 minutes) with 43.8 minutes of downtime. Availability = (43,200 - 43.8) / 43,200 × 100% = 99.9%
Converting Nines to Downtime:
- 90% (one nine): 36.5 days/year, 3.65 days/month, 16.8 hours/week
- 99% (two nines): 3.65 days/year, 7.31 hours/month, 1.68 hours/week
- 99.9% (three nines): 8.77 hours/year, 43.8 minutes/month, 10.1 minutes/week
- 99.99% (four nines): 52.6 minutes/year, 4.38 minutes/month, 1.01 minutes/week
- 99.999% (five nines): 5.26 minutes/year, 26.3 seconds/month, 6.05 seconds/week
- 99.9999% (six nines): 31.6 seconds/year, 2.63 seconds/month, 0.605 seconds/week
Calculating Compound Availability (Serial Dependencies):
When components are in series (one depends on another), multiply their availabilities:
Total Availability = A₁ × A₂ × A₃ × … × Aₙ
Example: An application (99.95%) calls an API gateway (99.99%) which calls a database (99.9%). Total = 0.9995 × 0.9999 × 0.999 = 0.9984 = 99.84%
This is why microservices can hurt availability—each additional hop reduces overall availability.
Calculating Compound Availability (Parallel Redundancy):
When components are in parallel (redundant), calculate the probability that ALL fail:
Total Availability = 1 - (1 - A₁) × (1 - A₂) × … × (1 - Aₙ)
Example: Two independent servers, each with 99.9% availability. Total = 1 - (1 - 0.999) × (1 - 0.999) = 1 - (0.001 × 0.001) = 1 - 0.000001 = 99.9999%
This shows the power of redundancy—two 99.9% components give you 99.9999% (six nines)!
Calculating Required Redundancy for Target Availability:
If you need availability A_target and each component has availability A_component, how many redundant components do you need?
Solve: A_target = 1 - (1 - A_component)ⁿ
Example: You need 99.99% availability using servers with 99% individual availability.
0.9999 = 1 - (1 - 0.99)ⁿ
0.9999 = 1 - (0.01)ⁿ
(0.01)ⁿ = 0.0001
n × log(0.01) = log(0.0001)
n = log(0.0001) / log(0.01) = 2
You need 2 redundant servers.
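The same calculation in code; a brute-force count is used instead of logarithms to sidestep floating-point round-off at exact boundaries:

```python
def replicas_needed(target: float, component: float) -> int:
    """Smallest n such that 1 - (1 - component)**n meets the target.
    The tiny tolerance guards against floating-point round-off at
    exact boundaries like 1 - 0.01**2 == 0.9999."""
    n = 1
    while 1 - (1 - component) ** n < target - 1e-12:
        n += 1
    return n

print(replicas_needed(0.9999, 0.99))   # matches the worked example: 2
print(replicas_needed(0.99999, 0.99))  # a fifth nine needs a third replica
```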
Calculating MTBF and MTTR Impact:
Availability = MTBF / (MTBF + MTTR)
Where:
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Repair
Example: A system fails once every 1000 hours (MTBF) and takes 1 hour to repair (MTTR). Availability = 1000 / (1000 + 1) = 0.999 = 99.9%
To improve to 99.99%, you could either:
- Increase MTBF to 10,000 hours (fail less often), or
- Decrease MTTR to 6 minutes (recover faster)
Reducing MTTR is often easier than increasing MTBF, which is why automated failover is so valuable.
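The MTBF/MTTR formula makes the leverage of fast recovery concrete; same failure rate, very different availability:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system failing every 1000 hours, with different recovery speeds:
print(f"2h manual recovery:  {availability(1000, 2):.4%}")
print(f"1h manual recovery:  {availability(1000, 1):.4%}")
print(f"6min auto-failover:  {availability(1000, 0.1):.4%}")
```

Cutting MTTR from 1 hour to 6 minutes moves the system from three nines to roughly four, with no change in how often it fails.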
Calculating Error Budget:
Error Budget = (1 - SLO) × Time Period
Example: 99.9% SLO for a 30-day month. Error Budget = (1 - 0.999) × 30 days = 0.001 × 43,200 minutes = 43.2 minutes
If you’ve had 20 minutes of downtime so far this month, you have 23.2 minutes remaining. If you burn through your error budget early, you should freeze risky changes (new features, large refactors) and focus on reliability improvements.
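Error-budget accounting is simple enough to automate as a dashboard or CI guardrail. A sketch assuming a 30-day month, as in the example above; the 25% freeze threshold is an illustrative policy choice, not a standard:

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Total allowed downtime for the period, in minutes."""
    return (1 - slo) * period_days * 24 * 60

def budget_remaining(slo: float, downtime_so_far_min: float) -> float:
    return error_budget_minutes(slo) - downtime_so_far_min

budget = error_budget_minutes(0.999)   # 43.2 min for a 99.9% SLO
left = budget_remaining(0.999, 20)     # 23.2 min after 20 min of downtime
print(f"budget {budget:.1f} min, remaining {left:.1f} min")
if left < 0.25 * budget:               # illustrative freeze policy
    print("freeze risky changes: error budget nearly exhausted")
```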
Serial vs Parallel Availability Calculation
```mermaid
graph TB
subgraph Serial Dependencies - Multiply
S1["Load Balancer<br/>99.99% available"]
S2["App Server<br/>99.95% available"]
S3["Database<br/>99.9% available"]
S1 --> S2 --> S3
S3 --> SResult["Total: 0.9999 × 0.9995 × 0.999<br/><b>= 99.84% available</b><br/><i>Each dependency reduces availability</i>"]
end
subgraph Parallel Redundancy - Compound
P1["Server A<br/>99% available"]
P2["Server B<br/>99% available"]
PCalc["Probability both fail:<br/>0.01 × 0.01 = 0.0001"]
P1 & P2 --> PCalc
PCalc --> PResult["Total: 1 - 0.0001<br/><b>= 99.99% available</b><br/><i>Redundancy dramatically improves availability</i>"]
end
subgraph Real System - Combined
R_LB["Load Balancer<br/>99.99%"]
R_App1["App 1<br/>99%"]
R_App2["App 2<br/>99%"]
R_DB1[("DB Primary<br/>99.9%")]
R_DB2[("DB Replica<br/>99.9%")]
R_LB --> R_App1 & R_App2
R_App1 & R_App2 --> R_DB1 & R_DB2
R_DB1 & R_DB2 --> RCalc1["DB: 1 - 0.001² = 99.9999%"]
R_App1 & R_App2 --> RCalc2["App: 1 - 0.01² = 99.99%"]
RCalc1 & RCalc2 --> RResult["Total: 0.9999 × 0.9999 × 0.999999<br/><b>= 99.98% available</b>"]
end
```
Serial dependencies multiply availabilities (reducing total availability), while parallel redundancy compounds them (dramatically improving availability). Real systems combine both patterns—redundant components in parallel, connected in series.
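Both composition rules fit in two one-liners. A minimal sketch (our own helper names, not a standard library) that reproduces the combined-system number from the diagram:

```python
from math import prod

def serial(*availabilities: float) -> float:
    """All components must work: availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """Any one component suffices: 1 minus the chance all fail."""
    return 1 - prod(1 - a for a in availabilities)

# The combined system: LB in series with a redundant app tier and DB tier.
app_tier = parallel(0.99, 0.99)     # ~99.99%
db_tier = parallel(0.999, 0.999)    # ~99.9999%
total = serial(0.9999, app_tier, db_tier)
print(f"{total:.4%}")               # ~99.98%
```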
Real-World Examples
Netflix: Multi-Region Active-Active with Chaos Engineering. Netflix runs one of the world’s most available streaming services, serving 230+ million subscribers with 99.99%+ availability. Their architecture is active-active across three AWS regions (us-east-1, us-west-2, eu-west-1), with each region capable of handling 100% of traffic. They use Route53 for global load balancing with health checks that route traffic away from degraded regions within seconds. What’s fascinating is their approach to testing—they created Chaos Monkey (randomly terminates instances), Chaos Kong (simulates entire region failures), and a suite of chaos engineering tools that continuously inject failures in production. This forces their engineers to build truly resilient systems because they know failures will happen. During the 2017 AWS us-east-1 outage that took down much of the internet, Netflix stayed up because their systems were already battle-tested against region failures. They’ve published their error budget approach: each team gets a budget of acceptable downtime, and if they exceed it, they must pause feature development to focus on reliability.
Stripe: 99.99% Availability Through Operational Excellence. Stripe’s payment API maintains 99.99% availability (4.38 minutes of downtime per month) despite processing billions of dollars in transactions. They achieve this not through exotic architecture but through operational discipline. Every API endpoint has a detailed SLO with latency and error rate targets. They use a sophisticated canary deployment pipeline that ships changes to 1% of servers, monitors error rates and latency, then gradually rolls out to 100% over several hours—any anomaly triggers automatic rollback. Their database architecture uses replication across multiple availability zones with automated, regularly tested failover. What’s interesting is their transparency—they publish real-time availability metrics and detailed postmortems for every incident. They’ve learned that most outages aren’t caused by infrastructure failures but by bad deploys, so they’ve invested heavily in deployment safety: feature flags, gradual rollouts, automated rollbacks, and extensive pre-production testing.
Amazon.com: Availability as a Competitive Advantage. Amazon’s retail site is legendary for availability—they’ve calculated that every 100ms of additional latency costs them 1% in sales, and downtime during Prime Day or Black Friday would cost millions per minute. They run active-active across multiple regions with sophisticated traffic management that routes users to the nearest healthy region. Their architecture uses extensive caching (CloudFront CDN, application-level caching, database query caching) so that even if backend services fail, users can still browse products and add to cart using cached data. They pioneered the concept of “static stability”—critical paths like checkout should work even if all dependencies are down, using cached data and queuing writes for later processing. During Prime Day 2018, when their internal systems were overloaded, most users never noticed because the customer-facing site degraded gracefully—recommendations were slower, but core shopping functionality remained fast. They’ve also invested in predictive scaling, using machine learning to pre-scale infrastructure before traffic spikes, ensuring they have capacity when it matters most. Their approach to availability is ruthlessly pragmatic: they measure the revenue impact of every minute of downtime, then invest in reliability improvements with the highest ROI.
Interview Expectations
Mid-Level
What you should know: Define availability in nines notation and convert to downtime (99.9% = 43.8 min/month). Explain the difference between availability and reliability. Describe basic redundancy patterns (active-active, active-passive) and when to use each. Understand that availability compounds multiplicatively for serial dependencies. Know how to use health checks and load balancers to route around failures. Explain the concept of an SLA and why it matters.
Bonus points: Calculate compound availability for a multi-tier system (web server → app server → database). Discuss the tradeoff between availability and cost—why you don’t always need five nines. Mention specific AWS services for high availability (ELB, Auto Scaling Groups, Multi-AZ RDS). Explain how caching improves availability by reducing dependency on backend services. Understand that deployment strategy affects availability (blue-green vs. rolling updates).
Senior
What you should know: Design a complete high-availability architecture for a specific system (e.g., payment processing, video streaming). Explain the CAP theorem and how it forces tradeoffs between availability and consistency. Calculate the number of redundant components needed to achieve a target availability. Discuss failure detection strategies (health checks, heartbeats, circuit breakers) and their tradeoffs. Design graceful degradation strategies for when dependencies fail. Understand error budgets and how they guide operational decisions. Explain why multi-region is necessary for five nines but expensive.
Bonus points: Discuss the operational complexity of high availability—monitoring, on-call, runbooks, chaos engineering. Explain how to measure availability from the user’s perspective, not just server uptime (request success rate, latency SLIs). Design for “static stability”—systems that work even when dependencies are down. Discuss the availability implications of different database architectures (single-leader, multi-leader, leaderless). Mention real-world examples (Netflix’s Chaos Monkey, Amazon’s Prime Day preparation). Explain how to balance availability with other concerns like security (patching requires downtime) and cost (redundancy is expensive).
Staff+
What you should know: Architect organization-wide availability strategies, including how to set SLOs for different service tiers and allocate error budgets. Design multi-region active-active systems with conflict resolution for writes. Explain the economics of availability—how to calculate the ROI of reliability investments and justify them to leadership. Design availability strategies that account for correlated failures (entire region outages, BGP hijacks, DNS failures). Understand the operational maturity required for high availability (SRE practices, incident management, blameless postmortems). Design systems that maintain availability during large-scale migrations or refactors.
Distinguishing signals: Discuss availability in terms of business impact, not just technical metrics—translate nines into revenue, user trust, and competitive advantage. Explain how to build a culture of reliability (error budgets, chaos engineering, production readiness reviews). Design availability strategies that evolve with the company—what works at 100 users doesn’t work at 100 million. Discuss the availability implications of organizational structure (Conway’s Law—your architecture reflects your org chart). Explain how to balance innovation velocity with reliability—too much focus on availability slows down feature development. Mention specific incidents you’ve managed and what you learned (“We had a 3-hour outage because our failover database was weeks out of sync—now we test failover weekly”). Discuss emerging availability challenges (multi-cloud, edge computing, regulatory requirements for data residency).
Common Interview Questions
Q1: How would you design a system with 99.99% availability?
60-second answer: I’d start by calculating that 99.99% allows 4.38 minutes of downtime per month. Deploy the application across multiple availability zones with a load balancer distributing traffic. Use auto-scaling to replace failed instances automatically. Deploy the database with Multi-AZ replication for automatic failover. Implement health checks that remove unhealthy instances from rotation within 10 seconds. Use caching aggressively to reduce database load and maintain availability if the database is slow. Monitor request success rate and latency, not just server uptime.
2-minute detailed answer: First, I’d identify all single points of failure and eliminate them through redundancy. Deploy application servers across at least two availability zones (separate failure domains) with an Application Load Balancer distributing traffic. Configure health checks every 5 seconds with a 10-second timeout—unhealthy instances are removed from rotation immediately. Use Auto Scaling Groups to maintain a minimum instance count and automatically replace failed instances. For the database, use RDS Multi-AZ with synchronous replication—if the primary fails, a replica promotes to primary within 30-60 seconds. Implement connection pooling and retries with exponential backoff to handle transient failures. Add Redis caching for frequently accessed data so the application can serve requests even if the database is slow. Use CloudFront CDN for static assets to reduce load on origin servers. Set up CloudWatch alarms for error rate and latency spikes. Implement circuit breakers for external dependencies—if a third-party API is failing, stop calling it and use cached data or graceful degradation. Finally, practice chaos engineering by randomly terminating instances to verify the system actually recovers as designed. The key is that availability comes from rapid detection and recovery, not preventing all failures.
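The retries-with-exponential-backoff piece of the answer above can be sketched in a few lines. This is an illustrative minimal version (the `call_with_retries` name, attempt count, and jitter strategy are assumptions, not a production library); retrying masks transient failures so users see availability rather than blips:

```python
import random
import time

def call_with_retries(fn, max_attempts: int = 4, base_delay: float = 0.1):
    """Retry a flaky call with exponential backoff plus full jitter.

    Sleeps a random amount in [0, base_delay * 2^attempt) between tries
    so that many retrying clients don't hammer a recovering dependency
    in lockstep (the 'thundering herd' problem).
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Example: a call that fails twice with transient errors, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(call_with_retries(flaky, base_delay=0.001))  # "ok" on the third attempt
```

In a real system you would retry only on errors known to be transient (timeouts, 503s), and pair this with a circuit breaker so a hard-down dependency isn’t retried forever.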
Red flags: Saying “use a bigger server” (doesn’t help with availability), claiming you can achieve 99.99% with a single instance (impossible—hardware fails), ignoring the database (often the hardest component to make highly available), not mentioning health checks or monitoring (how do you know it’s working?), or designing for perfect reliability instead of rapid recovery.
Q2: What’s the difference between availability and reliability?
60-second answer: Reliability is the probability a system performs correctly when called upon, typically measured as Mean Time Between Failures (MTBF). Availability is the percentage of time a system is operational, calculated as uptime / total time. A system can be reliable but have poor availability if it takes a long time to repair when it fails. Conversely, a system might fail frequently but maintain high availability through rapid automatic recovery. In practice, availability = MTBF / (MTBF + MTTR).
2-minute detailed answer: Reliability measures how often a system fails—a reliable system has a high MTBF (Mean Time Between Failures). A server that fails once per year is more reliable than one that fails once per month. Availability measures what percentage of time the system is accessible to users. It’s calculated as (Total Time - Downtime) / Total Time. The key insight is that availability depends on both how often you fail (MTBF) and how quickly you recover (MTTR—Mean Time To Repair). The formula is Availability = MTBF / (MTBF + MTTR). Example: System A fails once per 1000 hours but takes 10 hours to repair. Availability = 1000 / (1000 + 10) = 99.01%. System B fails once per 100 hours but has automatic failover that recovers in 1 minute. Availability = 100 / (100 + 0.017) = 99.98%. System B is less reliable (fails 10x more often) but more available (recovers 600x faster). In modern systems, we optimize for availability through rapid recovery rather than trying to prevent all failures. This is why Netflix uses cheap commodity servers that fail frequently but has automated failover—it’s more cost-effective than buying expensive “reliable” hardware.
Red flags: Using the terms interchangeably, claiming they’re the same thing, not understanding MTBF and MTTR, or not recognizing that rapid recovery can compensate for frequent failures.
Q3: How do you calculate the availability of a system with multiple dependencies?
60-second answer: For serial dependencies (one depends on another), multiply their availabilities. If an app (99.9%) calls a database (99.95%), total availability is 0.999 × 0.9995 = 99.85%. For parallel redundancy (multiple independent components), calculate the probability all fail: 1 - (1 - A₁) × (1 - A₂). Two servers with 99% availability each give 1 - (0.01 × 0.01) = 99.99% combined. This is why microservices can hurt availability—each additional dependency reduces overall availability unless you add redundancy.
2-minute detailed answer: There are two cases. For serial dependencies where one component depends on another, availabilities multiply. Example: A user request hits a load balancer (99.99%), then an application server (99.9%), then a database (99.95%). Total availability = 0.9999 × 0.999 × 0.9995 = 99.84%. Each additional dependency in the critical path reduces overall availability. This is the “microservices availability problem”—if you have 10 services in a chain, each with 99.9% availability, total availability is only 0.999^10 = 99.0%. For parallel redundancy where you have multiple independent components and only one needs to work, calculate the probability that all fail, then subtract from 1. Example: Two independent database replicas, each 99% available. Probability both fail = 0.01 × 0.01 = 0.0001. Availability = 1 - 0.0001 = 99.99%. This shows the power of redundancy. In real systems, you have both patterns. A typical web app might have: Load balancer (99.99%) → [App Server 1 (99%) OR App Server 2 (99%)] → [Database Primary (99.9%) OR Database Replica (99.9%)]. Calculate the parallel components first: App servers = 1 - (0.01 × 0.01) = 99.99%. Database = 1 - (0.001 × 0.001) = 99.9999%. Then multiply: 0.9999 × 0.9999 × 0.999999 = 99.98%. The key insight is that redundancy dramatically improves availability, but only if the redundant components have truly independent failure modes.
Red flags: Adding availabilities instead of multiplying them, not understanding the difference between serial and parallel dependencies, claiming you can achieve higher availability than your least available component without redundancy, or not recognizing that dependencies must be truly independent for redundancy to work.
Q4: What’s an error budget and how do you use it?
60-second answer: An error budget is the acceptable amount of downtime or errors allowed while still meeting your SLO. If your SLO is 99.9% availability, your error budget is 0.1% of time—43.8 minutes per month. Teams can “spend” this budget on risky deploys, experiments, or planned maintenance. If you burn through your budget early, you freeze risky changes and focus on reliability. It balances innovation velocity with reliability—you don’t need to be perfect, just meet your SLO.
2-minute detailed answer: An error budget is the difference between 100% and your SLO, representing the acceptable amount of failure. If your SLO is 99.9% availability, your error budget is 0.1%—about 43.8 minutes of downtime per month. This budget can be “spent” on anything that might cause downtime: deploying new features, running experiments, planned maintenance, or incidents. The key insight is that 100% reliability is impossible and trying to achieve it slows down innovation. Error budgets formalize the tradeoff: as long as you’re meeting your SLO, you have budget to spend on innovation. If you exceed your error budget (say, you’ve had 50 minutes of downtime this month), you’ve broken your SLO and must freeze risky changes until next month. In practice, teams track error budget consumption in real-time. If you’re burning 50% of your monthly budget in the first week, that’s a signal to slow down deploys and investigate reliability issues. Google’s SRE teams use this to balance feature velocity with reliability—product managers want to ship fast, SREs want stability, and the error budget is the objective metric that settles disputes. If you have error budget remaining, ship that risky feature. If you’re out of budget, focus on reliability improvements. Some teams even allocate error budget across different types of changes—50% for planned deploys, 30% for experiments, 20% reserved for unexpected incidents. This prevents teams from spending their entire budget on planned changes and having nothing left when an incident occurs.
Red flags: Not understanding the relationship between SLO and error budget, claiming you should never have any downtime (unrealistic and stifles innovation), not tracking error budget consumption, or treating all downtime equally (1 minute during Black Friday is not the same as 1 minute at 3am on a Tuesday).
Red Flags to Avoid
Red Flag 1: “We’ll achieve 99.999% availability by using really reliable servers.” Why it’s wrong: No hardware is that reliable. Even enterprise-grade servers fail—hard drives crash, memory modules fail, power supplies die. Individual server availability is typically 99-99.9% at best. You cannot achieve five nines through hardware reliability alone. What to say instead: “We’ll achieve 99.999% availability through redundancy and rapid failover. We’ll deploy across multiple availability zones with load balancing, so when a server fails—and it will—traffic automatically routes to healthy instances within seconds. The math works: two independent servers with 99.9% availability each give us 1 - (0.001 × 0.001) = 99.9999% combined availability.”
Red Flag 2: “Availability and uptime are the same thing.” Why it’s wrong: Uptime measures whether the server process is running. Availability measures whether users can successfully use the system. A server can be “up” but returning errors, or responding so slowly that it’s unusable. Measuring uptime gives you false confidence. What to say instead: “We measure availability from the user’s perspective—what percentage of requests succeed within acceptable latency thresholds. Our SLI is ‘99.9% of requests return 2xx status codes within 200ms.’ A server that’s running but returning 500 errors or taking 10 seconds to respond is unavailable from the user’s perspective, even though it’s technically ‘up.’”
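Measuring that kind of user-perspective SLI is mostly counting. A hedged sketch (the `availability_sli` helper and the sample data are ours, purely for illustration) of the “2xx within 200ms” definition:

```python
def availability_sli(requests, latency_slo_ms: float = 200) -> float:
    """Fraction of requests that are 'good': 2xx status AND within
    the latency SLO. A server that is up but erroring or slow counts
    as unavailable from the user's perspective."""
    good = sum(1 for status, latency_ms in requests
               if 200 <= status < 300 and latency_ms <= latency_slo_ms)
    return good / len(requests)

# (status_code, latency_ms) pairs: one error, one too-slow success.
sample = [(200, 50), (200, 180), (500, 40), (200, 950)]
print(availability_sli(sample))  # 0.5 — only two of four requests were 'good'
```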
Red Flag 3: “We have a backup server, so we’re highly available.” Why it’s wrong: Having a backup doesn’t guarantee high availability unless you’ve tested failover, automated it, and minimized MTTR. Many teams discover during a real outage that their backup is out of sync, or failover takes 2 hours of manual work, or the backup can’t handle production load. Untested redundancy is worthless. What to say instead: “We have automated failover with health checks every 5 seconds. When the primary fails, the load balancer removes it from rotation within 10 seconds and traffic shifts to the backup. We test this monthly using chaos engineering—we deliberately kill the primary and verify that users experience no errors. We’ve measured our MTTR at 15 seconds, which gives us the availability we need.”
Red Flag 4: “We’ll just add more servers to improve availability.” Why it’s wrong: Adding servers improves capacity and redundancy, but doesn’t address other availability risks like database failures, network partitions, bad deploys, or correlated failures (all servers running the same buggy code). You can have 100 servers and still have zero availability if your single database fails. What to say instead: “We need to identify and eliminate single points of failure across the entire stack. Yes, we’ll add redundant application servers, but we also need Multi-AZ database replication, redundant network paths, and deployment strategies that prevent bad code from taking down all instances simultaneously. We’ll use gradual rollouts and automated rollbacks to catch bugs before they affect all servers.”
Red Flag 5: “Our SLA is 99.9%, so occasional downtime is fine.” Why it’s wrong: While technically true, this ignores the business impact of when downtime occurs. One hour of downtime during Black Friday costs far more than one hour at 3am on a random Tuesday. Users don’t care about monthly averages—they care about whether the system works when they need it. What to say instead: “Our SLA is 99.9% monthly availability, but we set our internal SLO at 99.95% to provide a buffer. We also track availability during business-critical time windows separately—we need 99.99% during peak shopping hours. We use time-windowed SLOs (99.9% in any 24-hour period) to prevent clustering all our downtime on one bad day. And we track availability per customer segment—if 1% of users see 50% errors, our aggregate 99.5% metric hides a serious problem.”
Key Takeaways
- Availability measures the percentage of time a system is operational and accessible, typically expressed in “nines” (99.9%, 99.99%, etc.). Each additional nine costs exponentially more and reduces acceptable downtime dramatically—99.9% allows 43.8 minutes/month, while 99.99% allows just 4.38 minutes/month.
- Achieve high availability through redundancy and rapid recovery, not by preventing all failures. Deploy across multiple availability zones, use load balancers to route around failures, implement health checks for fast detection, and automate failover to minimize MTTR. The formula Availability = MTBF / (MTBF + MTTR) shows that reducing recovery time is often more effective than reducing failure frequency.
- Availability compounds multiplicatively for serial dependencies and improves dramatically with parallel redundancy. A system with three serial dependencies (99.9% each) has only 99.7% total availability, while two parallel redundant components (99% each) achieve 99.99% combined. This is why microservices can hurt availability and why redundancy is so powerful.
- Measure availability from the user’s perspective, not server uptime. Track request success rate and latency percentiles, not just “is the process running?” A server returning errors or responding slowly is effectively unavailable. Use SLIs like “99.9% of requests succeed within 200ms” rather than simple uptime metrics.
- Use error budgets to balance reliability with innovation velocity. If your SLO is 99.9%, you have 0.1% error budget (43.8 min/month) to spend on risky deploys and experiments. Track consumption in real-time—if you exceed your budget, freeze risky changes and focus on reliability improvements. This creates an objective framework for reliability vs. velocity tradeoffs.
Related Topics
Prerequisites: Reliability Patterns Overview • Fault Tolerance Basics • Load Balancing • Health Checks
Related Concepts: CAP Theorem • Redundancy Patterns • Failover Strategies • Circuit Breakers • Graceful Degradation
Next Steps: SLA vs SLO vs SLI • Multi-Region Architecture • Chaos Engineering • Disaster Recovery • Monitoring and Observability