Availability in Numbers: SLA & Nines Explained
TL;DR
Availability quantifies system uptime as a percentage, commonly expressed in “nines” (99.9%, 99.99%, etc.). Each additional nine represents 10x less downtime but exponentially higher cost and complexity. Understanding availability math—including how components combine in sequence (multiply) versus parallel (redundancy)—is essential for setting realistic SLAs and making informed architectural trade-offs.
Cheat Sheet: 99.9% = 8.7h/year downtime, 99.99% = 52min/year, 99.999% = 5.3min/year. Sequential components multiply availability (worse), parallel components add redundancy (better).
The Analogy
Think of availability like airline on-time performance. If an airline claims “99% on-time,” that sounds great until you realize it means 3-4 delayed flights per year if you fly daily. Now imagine your connecting flight also has 99% reliability—suddenly your end-to-end trip reliability drops to 98% (99% × 99%). But if the airline has two planes ready for your route (parallel redundancy), your chances of getting a working plane jump to 99.99%. The math of nines works exactly like this: each dependency multiplies your risk, while each backup option shrinks your failure probability dramatically.
Why This Matters in Interviews
Availability numbers come up when discussing SLAs, incident response, and architecture decisions. Interviewers want to see that you understand the business impact of downtime (not just the technical definition), can calculate composite availability across dependencies, and recognize when pursuing additional nines isn’t worth the cost. Strong candidates translate percentages into user-facing impact (“99.9% means 40 users see errors per hour at our scale”) and connect availability targets to architectural patterns like redundancy, failover, and graceful degradation. This topic often bridges into discussions about monitoring, incident response, and cost-benefit analysis.
Core Concept
Availability measures the proportion of time a system successfully responds to requests under normal conditions. It’s expressed as a percentage calculated by dividing uptime by total time: Availability = Uptime / (Uptime + Downtime). In practice, availability is described using “nines notation”—a system with 99.9% availability has “three nines,” while 99.99% has “four nines.” This seemingly small difference in percentages translates to massive differences in acceptable downtime: three nines allows 8.7 hours of downtime per year, while four nines permits only 52 minutes.
The reason we obsess over these decimal places is that availability directly impacts user trust, revenue, and regulatory compliance. For a payment processor like Stripe, every minute of downtime means millions in lost transaction volume and damaged merchant relationships. For a social media platform, downtime means users migrating to competitors. The challenge isn’t just achieving high availability—it’s doing so cost-effectively, because each additional nine typically requires 10x more investment in redundancy, monitoring, and operational complexity.
Availability isn’t a single system property—it’s an emergent characteristic of how components combine. A system with ten microservices, each at 99.9% availability, doesn’t achieve 99.9% end-to-end availability if those services are called sequentially. Understanding availability math helps you make architectural decisions: should you add redundancy, reduce dependencies, or accept lower availability for non-critical paths? These calculations turn abstract percentages into concrete design constraints.
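The nines-to-downtime arithmetic above fits in a few lines. A minimal Python sketch, using the article's cheat-sheet targets and an 8,760-hour year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def annual_downtime_hours(availability: float) -> float:
    """Downtime permitted per year at a given availability (as a decimal)."""
    return HOURS_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    h = annual_downtime_hours(target)
    print(f"{target:.3%}: {h:.2f} h/year ({h * 60:.1f} min)")
```

Running this reproduces the cheat sheet: three nines allows about 8.76 hours per year, four nines about 52.6 minutes, five nines about 5.3 minutes.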
Availability Nines: Downtime Translation
graph LR
subgraph Availability Levels
A["99%<br/><i>Two Nines</i>"]
B["99.9%<br/><i>Three Nines</i>"]
C["99.99%<br/><i>Four Nines</i>"]
D["99.999%<br/><i>Five Nines</i>"]
end
subgraph Annual Downtime
A1["87.6 hours/year<br/><i>3.65 days</i>"]
B1["8.7 hours/year<br/><i>~1 workday</i>"]
C1["52 minutes/year<br/><i>~1 hour</i>"]
D1["5.3 minutes/year<br/><i>~5 minutes</i>"]
end
subgraph Cost & Complexity
A2["Baseline<br/><i>Single region</i>"]
B2["10x Cost<br/><i>Multi-AZ</i>"]
C2["100x Cost<br/><i>Multi-region</i>"]
D2["1000x Cost<br/><i>Active-active global</i>"]
end
A --> A1
B --> B1
C --> C1
D --> D1
A1 --> A2
B1 --> B2
C1 --> C2
D1 --> D2
Each additional nine reduces downtime by 10x but increases cost exponentially. The jump from 99.9% to 99.99% means going from 8.7 hours to 52 minutes of annual downtime, but requires multi-region deployment and significantly higher operational complexity.
How It Works
Step 1: Calculate Individual Component Availability
Start by measuring or estimating each component’s availability using historical data. If your database had 2 hours of downtime last year, its availability was (8760 - 2) / 8760 = 99.977%. Track both planned maintenance windows and unplanned outages separately—some SLAs exclude scheduled maintenance from availability calculations, while others don’t. Use monitoring systems to collect uptime data automatically rather than relying on manual incident reports, which often miss brief outages.
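Step 1 is a simple ratio. A sketch in Python, using the database example above:

```python
def measured_availability(total_hours: float, downtime_hours: float) -> float:
    """Availability over a window: uptime divided by total time."""
    return (total_hours - downtime_hours) / total_hours

# Database with 2 hours of downtime over an 8760-hour year:
print(f"{measured_availability(8760, 2):.3%}")  # ~99.977%
```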
Step 2: Model Sequential Dependencies
When components are in sequence (request must pass through A, then B, then C), multiply their availability percentages. If your API gateway (99.95%), application server (99.9%), and database (99.99%) are sequential, total availability is 0.9995 × 0.999 × 0.9999 = 99.84%. This is worse than any individual component—a critical insight that explains why microservice architectures can paradoxically reduce availability despite each service being highly available. Every additional hop in the request path multiplies risk.
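Step 2 is a straight product. A minimal sketch using the gateway/server/database numbers above:

```python
from math import prod

def sequential_availability(components: list[float]) -> float:
    """Every request must traverse all components, so availabilities multiply."""
    return prod(components)

# API gateway -> application server -> database:
print(f"{sequential_availability([0.9995, 0.999, 0.9999]):.4%}")  # ~99.84%
```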
Step 3: Model Parallel Redundancy
When components are in parallel (system works if ANY component succeeds), use the formula: Availability_total = 1 - (1 - A₁) × (1 - A₂) × ... × (1 - Aₙ). If you have two load balancers, each with 99.9% availability, total availability is 1 - (0.001 × 0.001) = 99.9999%. This dramatic improvement is why redundancy is the primary tool for achieving high availability. However, true parallel redundancy requires that failures are independent—if both load balancers share the same power supply or network switch, they’re not truly parallel.
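Step 3's redundancy formula in the same style (failure independence is assumed, as the step notes):

```python
from math import prod

def parallel_availability(components: list[float]) -> float:
    """System is up if ANY redundant component is up (independent failures)."""
    return 1 - prod(1 - a for a in components)

# Two load balancers at 99.9% each:
print(f"{parallel_availability([0.999, 0.999]):.4%}")  # 99.9999%
```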
Step 4: Calculate Composite System Availability
Map your entire request path, identifying which components are sequential and which have redundancy. For example, a typical web request might be: Load Balancer (parallel) → API Gateway (sequential) → Service A (parallel) → Database (parallel with read replicas). Calculate each layer’s availability, then multiply sequential layers together. This reveals your weakest links—often the components without redundancy.
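The layered calculation generalizes naturally: model the system as sequential layers, where each layer is a list of redundant components. A hypothetical sketch (the layer values are illustrative; independence of failures is assumed):

```python
from math import prod

def layer_availability(components: list[float]) -> float:
    """Redundancy formula within one layer."""
    return 1 - prod(1 - a for a in components)

def system_availability(layers: list[list[float]]) -> float:
    """Sequential layers multiply; each layer may have internal redundancy."""
    return prod(layer_availability(layer) for layer in layers)

# Toy path: redundant load balancers -> single gateway -> redundant DB replicas.
layers = [[0.999, 0.999], [0.9995], [0.9999, 0.9999]]
print(f"{system_availability(layers):.4%}")
```

The result lands just under the single gateway's 99.95%, which immediately flags it as the weakest link.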
Step 5: Translate to Business Impact
Convert availability percentages to downtime windows and user impact. If you serve 10,000 requests per second at 99.9% availability, roughly 10 requests per second fail on average over time (in practice concentrated into outage windows). At 99.99%, that average drops to 1 request per second. Multiply by your average transaction value to estimate revenue impact. This translation from percentages to dollars is what justifies infrastructure investments to non-technical stakeholders.
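A small sketch of this translation (the request rate is the one used above; treat it as illustrative):

```python
def avg_failed_rps(rps: float, availability: float) -> float:
    """Average failed requests/second implied by an availability level."""
    return rps * (1 - availability)

for a in (0.999, 0.9999):
    failures = avg_failed_rps(10_000, a)
    print(f"{a:.2%}: ~{failures:.0f} failed requests/second on average")
```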
Step 6: Set Realistic SLA Targets
Your SLA should be lower than your measured availability to provide a buffer for unexpected incidents. If your system achieves 99.95% availability, commit to a 99.9% SLA. This gives you room for maintenance and incident response without breaching contracts. Also consider that your SLA can’t exceed the availability of your dependencies—if you rely on a third-party API with 99.9% availability, you can’t promise customers 99.99%.
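Step 6 amounts to an error-budget calculation. A sketch using the article's 730-hour month:

```python
MINUTES_PER_MONTH = 730 * 60

def monthly_error_budget_min(sla: float) -> float:
    """Unplanned-downtime budget per month before the SLA is breached."""
    return MINUTES_PER_MONTH * (1 - sla)

print(f"99.9%  SLA: {monthly_error_budget_min(0.999):.1f} min/month")
print(f"99.95% SLA: {monthly_error_budget_min(0.9995):.1f} min/month")
```

If your system measures 99.95% and you promise 99.9%, the gap between the two budgets (about 22 minutes) is your buffer for maintenance and incident response.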
Sequential vs Parallel Availability Calculation
graph TB
subgraph Sequential Components - Multiply Availability
S1["Load Balancer<br/>99.95%"]
S2["API Gateway<br/>99.9%"]
S3["Service A<br/>99.9%"]
S4["Database<br/>99.99%"]
S1 --"Request flow"-->S2
S2 -->S3
S3 -->S4
Result1["Total: 0.9995 × 0.999 × 0.999 × 0.9999<br/>=<b>99.74%</b><br/><i>Worse than any component!</i>"]
S4 -.->Result1
end
subgraph Parallel Components - Redundancy Formula
P1A["Database Replica 1<br/>99.9%"]
P1B["Database Replica 2<br/>99.9%"]
Client["Client Request"]
Client --"Route to any"-->P1A
Client --"Route to any"-->P1B
Result2["Total: 1 - (0.001 × 0.001)<br/>=<b>99.9999%</b><br/><i>Dramatic improvement!</i>"]
P1A -.->Result2
P1B -.->Result2
end
Sequential components multiply availability (making it worse), while parallel redundancy uses the formula 1 - (1-A₁)×(1-A₂) to achieve dramatic improvements. This explains why long synchronous chains of microservices can reduce end-to-end availability and why redundancy is essential.
Key Principles
Principle 1: Availability Compounds Multiplicatively in Sequential Systems
When components are chained together, their availability percentages multiply, and the result is always worse than the weakest link. A system with five microservices, each at 99.9% availability, achieves only 99.5% end-to-end availability (0.999^5). This principle explains why monolithic architectures often have higher availability than microservices despite being less resilient to individual component failures—they have fewer sequential dependencies. In practice, this means you should minimize the number of synchronous hops in critical request paths. Netflix learned this lesson when they found that adding more microservices actually decreased availability until they implemented aggressive circuit breaking and fallback mechanisms.
Principle 2: Redundancy Provides Exponential Availability Gains
Adding a second redundant component doesn’t just double your availability—it squares your reliability. Two components at 99% availability in parallel achieve 99.99% (1 - 0.01²). Three components reach 99.9999%. This non-linear improvement is why high-availability systems always use redundancy: multiple availability zones, multiple data centers, multiple load balancers. However, redundancy only works if failures are truly independent. Google’s SRE teams discovered that seemingly independent systems often share hidden dependencies (same power grid, same software version, same configuration management system), which can cause correlated failures that defeat redundancy.
Principle 3: Each Additional Nine Costs 10x More
Moving from 99% to 99.9% availability might require adding database replication. Moving from 99.9% to 99.99% might require multi-region deployment. Moving from 99.99% to 99.999% might require active-active data centers with automated failover and extensive chaos engineering. The cost—in infrastructure, operational complexity, and engineering time—grows exponentially. Amazon’s approach is to ask: “What’s the business value of this nine?” For their retail site, 99.99% is critical because downtime directly impacts revenue. For their internal admin tools, 99.9% is acceptable because the cost of additional nines exceeds the productivity loss from occasional downtime.
Principle 4: Availability Must Be Measured End-to-End from User Perspective
Your database might have 99.99% availability, but if your monitoring system is down, users experience outages you don’t even detect. True availability is measured by successful user requests, not component uptime. This is why companies like Stripe use synthetic monitoring—automated scripts that simulate real user behavior and measure end-to-end availability from multiple geographic locations. They’ve found that component-level metrics can show 100% uptime while users experience failures due to network issues, DNS problems, or CDN misconfigurations. The SLA should reflect what users actually experience, not what your internal dashboards report.
Principle 5: Planned Maintenance Counts Toward Availability Unless Explicitly Excluded
Many teams are surprised when their SLA calculations include scheduled maintenance windows. If you take your system down for 4 hours every month for database upgrades, that’s roughly 99.45% availability—not the 99.9% you promised customers. This is why high-availability systems invest in zero-downtime deployment strategies: blue-green deployments, rolling updates, database migration tools that work on live systems. GitHub learned this lesson when they realized their maintenance windows were causing them to miss their 99.95% SLA target. They invested heavily in online schema migrations and hot-swappable components, which allowed them to eliminate maintenance windows entirely and actually exceed their SLA.
Composite System Availability: Real Architecture Example
graph TB
subgraph SG1["Layer 1: Load Balancing - Parallel"]
LB1["Load Balancer 1<br/>99.9%"]
LB2["Load Balancer 2<br/>99.9%"]
L1Result["Layer 1 Total:<br/>99.9999%"]
LB1 & LB2 -.->L1Result
end
subgraph SG2["Layer 2: API Gateway - Single Point"]
API["API Gateway<br/>99.95%<br/><i>BOTTLENECK</i>"]
L2Result["Layer 2 Total:<br/>99.95%"]
API -.->L2Result
end
subgraph SG3["Layer 3: App Servers - Parallel"]
APP1["App Server 1<br/>99.9%"]
APP2["App Server 2<br/>99.9%"]
APP3["App Server 3<br/>99.9%"]
L3Result["Layer 3 Total:<br/>99.9999999%"]
APP1 & APP2 & APP3 -.->L3Result
end
subgraph SG4["Layer 4: Database - Parallel"]
DB1[("Primary DB<br/>99.99%")]
DB2[("Read Replica<br/>99.99%")]
L4Result["Layer 4 Total:<br/>99.999999%"]
DB1 & DB2 -.->L4Result
end
User["User Request"] --"1"-->LB1
User --"1"-->LB2
LB1 & LB2 --"2"-->API
API --"3"-->APP1
API --"3"-->APP2
API --"3"-->APP3
APP1 & APP2 & APP3 --"4"-->DB1
APP1 & APP2 & APP3 --"4"-->DB2
FinalCalc["<b>System Total Availability:</b><br/>0.999999 × 0.9995 × 0.999999999 × 0.99999999<br/>=<b>99.95%</b><br/><br/><i>Bottleneck: API Gateway (single component)<br/>Adding redundancy here would improve to 99.9999%</i>"]
L1Result & L2Result & L3Result & L4Result -."Multiply sequential layers".->FinalCalc
Real systems combine sequential and parallel components. Calculate each layer’s availability (parallel components use redundancy formula), then multiply sequential layers. The single API Gateway at 99.95% becomes the bottleneck—no amount of redundancy in other layers can overcome it.
Deep Dive
Types / Variants
Variant 1: Service-Level Availability (Single Component)
This is the simplest form: measuring uptime for a single service or component. You track when the service is responding to health checks and serving requests successfully. Most monitoring tools (Datadog, New Relic, Prometheus) calculate this automatically by tracking HTTP 200 responses versus 5xx errors. Use this for: individual microservices, databases, load balancers, or any component with a clear success/failure signal. Pros: Easy to measure, clear ownership, directly actionable. Cons: Doesn’t reflect user experience if the component is part of a larger system. Example: Your authentication service has 99.95% availability, but if it’s called by ten other services, each of those services’ availability is capped at 99.95%.
Variant 2: End-to-End Availability (User Journey)
This measures availability from the user’s perspective: can they complete their intended action? For an e-commerce site, this means: load homepage → search for product → add to cart → checkout → payment confirmation. You multiply the availability of each step in the journey. Use this for: customer-facing SLAs, business-critical workflows, and understanding real user impact. Pros: Reflects actual user experience, aligns with business metrics, reveals hidden dependencies. Cons: Complex to measure, requires synthetic monitoring, harder to debug when it degrades. Example: Uber measures end-to-end availability for “request ride → driver accepts → complete trip” because that’s what matters to users, not whether individual microservices are up.
Variant 3: Regional Availability (Geographic Distribution)
This tracks availability separately for different geographic regions or availability zones. A system might have 99.99% availability in US-East but only 99.9% in EU-West due to infrastructure differences. Use this for: multi-region deployments, compliance requirements (GDPR, data residency), and understanding geographic failure patterns. Pros: Reveals regional weaknesses, helps with capacity planning, supports regional SLAs. Cons: Requires sophisticated monitoring infrastructure, complicates SLA calculations, may hide global issues. Example: Netflix tracks availability per AWS region and can shift traffic away from degraded regions automatically, maintaining overall availability even when individual regions have issues.
Variant 4: Weighted Availability (Business-Critical vs Best-Effort)
Not all features are equally important. Weighted availability assigns different importance to different endpoints or features. A payment API might target 99.99%, while a recommendation engine targets 99.9%. Use this for: systems with mixed criticality, cost optimization, and realistic SLA negotiations. Pros: Focuses investment on what matters, allows graceful degradation, more honest about system capabilities. Cons: Requires clear business prioritization, complex to communicate, may create internal conflicts about what’s “critical.” Example: Twitter’s core timeline must be highly available, but their trending topics feature can degrade without violating SLAs because it’s not business-critical.
Variant 5: Scheduled vs Unscheduled Availability
Some SLAs exclude planned maintenance windows from availability calculations, while others include them. Scheduled availability only counts unplanned outages, while unscheduled availability counts everything. Use this for: SLA negotiations, maintenance planning, and setting realistic expectations with stakeholders. Pros: Allows for necessary maintenance, more forgiving SLA targets, aligns with industry standards. Cons: Can be gamed by declaring outages “scheduled,” requires advance notice to customers, doesn’t reflect user experience during maintenance. Example: AWS’s SLA for EC2 is 99.99% but excludes scheduled maintenance with 14 days notice, while Google Cloud’s SLA includes all downtime, forcing them to invest more in zero-downtime maintenance procedures.
Trade-offs
Dimension 1: Higher Availability vs Cost
Option A (Higher Availability): Pursue 99.99% or 99.999% availability through multi-region redundancy, automated failover, extensive monitoring, and 24/7 on-call teams. Option B (Lower Cost): Accept 99.9% availability with single-region deployment, manual failover, and business-hours support. Decision Framework: Calculate the cost of downtime in lost revenue, customer churn, and brand damage. If one hour of downtime costs $100K but achieving an additional nine costs $50K/month, the math favors higher availability. However, if you’re a B2B SaaS tool with forgiving customers and downtime costs $5K/hour, 99.9% is probably sufficient. Stripe chose Option A because payment processing downtime directly impacts merchant revenue. A personal blog should choose Option B because the cost of redundancy far exceeds the impact of occasional downtime.
Dimension 2: Redundancy vs Complexity
Option A (More Redundancy): Deploy across multiple availability zones, use active-active databases, implement automatic failover, and maintain hot standbys. Option B (Simpler Architecture): Single-zone deployment with cold backups and manual failover procedures. Decision Framework: Redundancy improves availability but introduces operational complexity, configuration drift, and new failure modes (split-brain scenarios, replication lag, failover bugs). Choose Option A when downtime is unacceptable and you have the operational maturity to manage complexity. Choose Option B when you’re a small team, your system isn’t business-critical, or you’re still iterating on product-market fit. Instagram initially chose Option B (single MySQL instance) to move fast, then migrated to Option A as they scaled. Premature redundancy would have slowed their early growth.
Dimension 3: Synchronous vs Asynchronous Dependencies
Option A (Synchronous): Every request waits for all dependencies to respond before returning to the user. Availability is the product of all dependencies. Option B (Asynchronous): Use message queues, eventual consistency, and background jobs to decouple dependencies. Availability is higher because failures don’t block user requests. Decision Framework: Synchronous is simpler to reason about and provides immediate consistency, but tanks availability. Asynchronous is more complex but allows graceful degradation. Choose Option A for financial transactions, inventory management, or any workflow requiring strong consistency. Choose Option B for analytics, notifications, recommendations, or features where eventual consistency is acceptable. Amazon’s checkout is synchronous (must verify inventory and charge card), but their product recommendations are asynchronous (can show stale data without impacting purchase flow).
Dimension 4: Tight SLA vs Loose SLA
Option A (Tight SLA): Promise 99.99% availability with financial penalties for breaches. Option B (Loose SLA): Promise 99.9% availability or “best effort” with no penalties. Decision Framework: Tight SLAs command premium pricing and build customer trust but require significant infrastructure investment and create legal/financial risk. Loose SLAs are cheaper to deliver but may lose enterprise customers who need guarantees. Choose Option A when selling to enterprises, competing on reliability, or operating in regulated industries. Choose Option B when serving startups, offering freemium products, or still proving product-market fit. Salesforce built their brand on tight SLAs (“trust” is literally their domain name), while many developer tools offer loose SLAs because their customers value features over uptime guarantees.
Dimension 5: Preventive Maintenance vs Reactive Fixes
Option A (Preventive): Regular maintenance windows, proactive upgrades, chaos engineering, and extensive testing. Causes short, planned outages but prevents longer unplanned ones. Option B (Reactive): Run systems until they break, then fix issues as they arise. No planned downtime but higher risk of catastrophic failures. Decision Framework: Preventive maintenance improves long-term availability but requires downtime or zero-downtime deployment capabilities. Reactive approaches maximize short-term uptime but accumulate technical debt that eventually causes major incidents. Choose Option A for mature systems with predictable load and high availability requirements. Choose Option B for rapidly evolving systems where the cost of maintenance exceeds the risk of outages. Google’s SRE teams practice Option A with extensive chaos engineering, while many startups practice Option B until a major incident forces them to invest in preventive measures.
Availability vs Cost Decision Framework
graph TB
Start["New Feature:<br/>Payment Processing API"]
Q1{"Business Impact<br/>of Downtime?"}
Start --> Q1
Q1 --"High: $100K+/hour<br/>lost revenue"-->Target99_99["Target: 99.99%<br/><i>52 min/year downtime</i>"]
Q1 --"Medium: $10K/hour<br/>customer support"-->Target99_9["Target: 99.9%<br/><i>8.7 hours/year downtime</i>"]
Q1 --"Low: Internal tool<br/>minimal impact"-->Target99["Target: 99%<br/><i>3.6 days/year downtime</i>"]
Target99_99 --> Arch99_99["Architecture:<br/>• Multi-region active-active<br/>• Auto-failover databases<br/>• 24/7 on-call team<br/>• Chaos engineering<br/><br/><b>Cost: $500K/year</b>"]
Target99_9 --> Arch99_9["Architecture:<br/>• Multi-AZ deployment<br/>• Database replication<br/>• Business hours support<br/>• Basic monitoring<br/><br/><b>Cost: $50K/year</b>"]
Target99 --> Arch99["Architecture:<br/>• Single-zone deployment<br/>• Daily backups<br/>• Manual failover<br/>• Email alerts<br/><br/><b>Cost: $5K/year</b>"]
Arch99_99 --> ROI99_99["ROI Analysis:<br/>Downtime cost: $100K/hour<br/>52 min/year = $87K saved<br/>vs 99.9% (8.7 hours = $870K)<br/><br/><b>Savings: $783K/year</b><br/><i>Investment justified ✓</i>"]
Arch99_9 --> ROI99_9["ROI Analysis:<br/>Downtime cost: $10K/hour<br/>8.7 hours/year = $87K<br/>vs 99% (87 hours = $870K)<br/><br/><b>Savings: $783K/year</b><br/><i>Investment justified ✓</i>"]
Arch99 --> ROI99["ROI Analysis:<br/>Downtime cost: $1K/hour<br/>87 hours/year = $87K<br/>Additional investment not justified<br/><br/><b>Accept 99% availability</b><br/><i>Cost-effective choice ✓</i>"]
Availability targets should be driven by business impact, not engineering perfectionism. Calculate the cost of downtime versus the cost of achieving each availability level. Payment processing justifies 99.99%, while internal tools may be fine with 99%.
Common Pitfalls
Pitfall 1: Measuring Component Uptime Instead of User-Facing Availability
Why it happens: It’s easier to monitor individual services (“database is up”) than to measure end-to-end user journeys (“users can complete checkout”). Teams celebrate 99.99% database uptime while users experience 99.5% checkout success rates due to network issues, API gateway failures, or client-side errors. How to avoid it: Implement synthetic monitoring that simulates real user behavior from multiple geographic locations. Track business metrics (successful transactions, completed signups) alongside technical metrics (service uptime). Netflix’s approach is to measure “stream starts per second” as their primary availability metric because that’s what users care about, not whether individual microservices are responding to health checks. Set up alerts based on user-facing metrics, not just component health.
Pitfall 2: Ignoring Correlated Failures in Redundant Systems
Why it happens: Teams deploy redundant components and assume they’re truly independent, but they share hidden dependencies: same software version, same configuration, same network switch, same power supply, or same deployment pipeline. When that shared dependency fails, all “redundant” components fail simultaneously, and your calculated 99.9999% availability becomes 99% reality. How to avoid it: Map all shared dependencies explicitly. Use different software versions across redundant components (canary deployments). Deploy across multiple availability zones and regions. Test failure scenarios regularly with chaos engineering—actually kill components and verify that redundancy works. Google’s SRE teams discovered that their “redundant” load balancers all used the same configuration management system, which became a single point of failure. They now use different configuration systems for different redundancy layers.
Pitfall 3: Setting Availability Targets Without Considering Cost
Why it happens: Product managers demand “five nines” without understanding the exponential cost curve. Engineering teams either over-invest in availability for non-critical systems or under-invest and miss SLA targets. How to avoid it: Calculate the business cost of downtime (lost revenue, customer churn, support costs) and compare it to the infrastructure cost of achieving each availability level. Create a cost-benefit matrix showing that 99.9% costs $X, 99.99% costs $10X, and 99.999% costs $100X. Then ask: “What’s the ROI of each additional nine?” Amazon’s approach is to assign different availability targets to different services based on business impact. Their retail site needs 99.99%, but their internal HR system is fine with 99.9% because the cost of higher availability exceeds the productivity loss from occasional downtime.
Pitfall 4: Forgetting That Availability Is a Rate, Not a State
Why it happens: Teams think “we have 99.9% availability” as a permanent characteristic, but availability is measured over time windows. You might have 99.99% availability over a year but 95% availability during Black Friday when traffic spikes. How to avoid it: Measure availability across multiple time windows (hourly, daily, weekly, monthly, yearly) and identify patterns. Set different SLA targets for peak vs off-peak hours if appropriate. Use percentile-based metrics (p50, p99, p99.9) to understand the distribution of availability, not just the average. Stripe discovered that their yearly availability was 99.95%, but during payment processing peaks (end of month), it dropped to 99.8%. They now provision capacity based on peak load and measure availability separately for high-traffic periods.
Pitfall 5: Treating All Downtime Equally
Why it happens: Availability calculations treat a 5-minute outage at 3 AM the same as a 5-minute outage during peak business hours, but the user impact is vastly different. A 1-hour outage during Black Friday might cost more than 10 hours of downtime spread across quiet periods. How to avoid it: Implement weighted availability metrics that account for traffic volume and business impact. Track “user-minutes of downtime” (downtime × concurrent users) rather than just clock time. Set stricter SLA targets for peak hours. Example: Uber’s availability SLA is stricter during Friday/Saturday nights (peak ride demand) than Tuesday mornings. They provision extra capacity and have more on-call engineers during high-impact windows. This approach aligns technical metrics with business reality and helps prioritize incident response.
Component Uptime vs User-Facing Availability
graph TB
subgraph Internal Monitoring View
DB[("Database<br/><b>99.99% uptime</b><br/><i>Health checks passing</i>")]
API["API Server<br/><b>99.95% uptime</b><br/><i>Responding to pings</i>"]
LB["Load Balancer<br/><b>100% uptime</b><br/><i>All instances healthy</i>"]
Dashboard["Internal Dashboard:<br/><b>System Status: GREEN ✓</b><br/>All components operational"]
DB & API & LB -.->Dashboard
end
subgraph User Experience Reality
User1["User in US<br/><i>Success: 99.5%</i>"]
User2["User in EU<br/><i>Success: 98.2%</i>"]
User3["User in Asia<br/><i>Success: 97.8%</i>"]
Issues["Hidden Failure Points:<br/>• CDN serving stale content<br/>• DNS resolution failures<br/>• Network routing issues<br/>• Client-side JS errors<br/>• API rate limiting<br/>• Slow queries timing out<br/>• Cross-region latency"]
User1 & User2 & User3 -."Experience failures".->Issues
end
Synthetic["Synthetic Monitoring<br/><i>Simulates real user journeys</i>"]
Synthetic --"1. Load homepage"-->LB
LB --"2. API call"-->API
API --"3. Query data"-->DB
DB --"4. Return response"-->API
API --"5. Render page"-->LB
LB --"6. Measure success"-->Synthetic
RealMetric["<b>Actual Availability: 99.1%</b><br/><i>Measured end-to-end from user perspective</i><br/><br/>Gap: 0.89% = 78 hours/year<br/>of failures invisible to internal monitoring"]
Synthetic -."Reveals true availability".->RealMetric
Issues -."Explains gap".->RealMetric
Component uptime (database responding to health checks) doesn’t equal user-facing availability. Users experience failures from network issues, CDN problems, DNS failures, and client-side errors that internal monitoring misses entirely.
Math & Calculations
Formula 1: Basic Availability Calculation
Availability = Uptime / (Uptime + Downtime)
Variables:
- Uptime: Time the system was operational and serving requests successfully
- Downtime: Time the system was unavailable or failing requests
- Both measured in the same units (hours, minutes, seconds)
Worked Example: Your service had 3 hours of downtime last month (720 hours total). Availability = 717 / 720 = 0.99583 = 99.583% (approximately “two and a half nines”). This translates to 36.5 hours of downtime per year if the pattern continues.
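Checking the worked example in code:

```python
uptime_hours, total_hours = 717, 720  # 3 hours of downtime in a 720-hour month
availability = uptime_hours / total_hours
projected_annual_downtime = 8760 * (1 - availability)
print(f"{availability:.3%} -> {projected_annual_downtime:.1f} h/year if sustained")
```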
Formula 2: Sequential Component Availability (Multiplication)
Availability_total = A₁ × A₂ × A₃ × ... × Aₙ
Variables:
- A₁, A₂, A₃, Aₙ: Availability of each component in the request path (expressed as decimals, not percentages)
- Components must be in sequence (request must pass through all of them)
Worked Example: Your request path is Load Balancer (99.95%) → API Gateway (99.9%) → Service A (99.9%) → Database (99.99%). Total availability = 0.9995 × 0.999 × 0.999 × 0.9999 = 0.9974 = 99.74%. Notice this is worse than any individual component—you’ve lost 0.26 percentage points (about 23 hours per year) just from having four sequential hops. This is why microservice architectures often have lower availability than monoliths despite each service being highly available.
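Verifying Formula 2 with the example's path, and converting the loss into hours:

```python
from math import prod

path = [0.9995, 0.999, 0.999, 0.9999]  # LB -> gateway -> Service A -> DB
total = prod(path)
print(f"End-to-end: {total:.4%} ({8760 * (1 - total):.1f} h/year of downtime)")
```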
Formula 3: Parallel Component Availability (Redundancy)
Availability_total = 1 - (1 - A₁) × (1 - A₂) × ... × (1 - Aₙ)
Variables:
- A₁, A₂, Aₙ: Availability of each redundant component (expressed as decimals)
- Components must be truly parallel (system works if ANY component succeeds)
- Assumes failures are independent
Worked Example: You have two database replicas, each with 99.9% availability. If failures are independent, total availability = 1 - (1 - 0.999) × (1 - 0.999) = 1 - (0.001 × 0.001) = 1 - 0.000001 = 0.999999 = 99.9999% (“five nines”). Adding a third replica: 1 - (0.001)³ = 99.9999999% (“nine nines”). This dramatic improvement is why redundancy is the primary tool for high availability. However, if both replicas share the same power supply (correlated failure), the actual availability is much lower.
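A minimal sketch of the redundancy formula, assuming independent failures (the key caveat from the example above):

```python
def parallel_availability(components: list[float]) -> float:
    """System is up if ANY redundant component is up; only works if failures are independent."""
    p_all_fail = 1.0
    for a in components:
        p_all_fail *= (1 - a)       # probability that this component is ALSO down
    return 1 - p_all_fail

two_replicas = parallel_availability([0.999, 0.999])
print(f"{two_replicas:.6f}")        # 0.999999 -> "five nines"

three_replicas = parallel_availability([0.999] * 3)
print(f"{three_replicas:.9f}")      # 0.999999999 -> "nine nines" (on paper)
```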
Formula 4: Converting Availability Percentage to Downtime
Downtime = Total_Time × (1 - Availability)
Variables:
- Total_Time: The time period you’re measuring (year = 8760 hours, month = 730 hours, week = 168 hours)
- Availability: Expressed as a decimal (99.9% = 0.999)
Worked Example: You promise 99.95% availability. Annual downtime = 8760 hours × (1 - 0.9995) = 8760 × 0.0005 = 4.38 hours per year = 262.8 minutes per year. Monthly downtime = 730 × 0.0005 = 0.365 hours = 21.9 minutes per month. This helps you set realistic incident response targets: if you have 99.95% availability and experience an outage, you have about 22 minutes per month of “budget” before breaching your SLA.
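The same conversion as a tiny helper, useful for turning any SLA into an incident-response “budget” (constants match the example above):

```python
def downtime_hours(total_hours: float, availability: float) -> float:
    """Downtime = Total_Time x (1 - Availability)."""
    return total_hours * (1 - availability)

sla = 0.9995  # promised availability
print(f"{downtime_hours(8760, sla):.2f} h/year")        # 4.38
print(f"{downtime_hours(730, sla) * 60:.1f} min/month") # 21.9
```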
Formula 5: Composite System Availability (Mixed Sequential and Parallel)
Availability_total = A_layer1 × A_layer2 × ... × A_layerN
where each layer might be:
- Single component: A_layer = A_component
- Parallel components: A_layer = 1 - ∏(1 - A_i)
Worked Example: Your system has:
- Layer 1: Two load balancers in parallel, each 99.9% available
- Layer 2: API Gateway (single component), 99.95% available
- Layer 3: Three application servers in parallel, each 99.9% available
- Layer 4: Two database replicas in parallel, each 99.99% available
Calculate each layer:
- Layer 1: 1 - (0.001)² = 99.9999%
- Layer 2: 99.95% (single component)
- Layer 3: 1 - (0.001)³ = 99.9999999%
- Layer 4: 1 - (0.0001)² = 99.999999%
Total: 0.999999 × 0.9995 × 0.999999999 × 0.99999999 = 0.9995 = 99.95%
Notice that Layer 2 (the single API Gateway at 99.95%) is the bottleneck. No amount of redundancy in other layers can overcome this single point of failure. This calculation reveals where to invest in redundancy: adding a second API Gateway in parallel would lift total availability to roughly 99.9999%, while adding more application servers or databases has negligible impact.
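The layer-by-layer calculation above can be scripted directly, which makes the bottleneck obvious when you print each layer (a sketch; the layer composition mirrors the worked example):

```python
import math

def layer(availabilities: list[float]) -> float:
    """One layer of parallel, independent components; a single component is a list of one."""
    return 1 - math.prod(1 - a for a in availabilities)

layers = [
    layer([0.999, 0.999]),        # Layer 1: two load balancers
    layer([0.9995]),              # Layer 2: single API gateway (the bottleneck)
    layer([0.999] * 3),           # Layer 3: three app servers
    layer([0.9999, 0.9999]),      # Layer 4: two DB replicas
]
total = math.prod(layers)
print(f"{total:.4%}")             # ~99.9499%, dominated by the single gateway
```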
Formula 6: SLA Buffer Calculation
SLA_target = Measured_Availability - Safety_Buffer (both in percentage points)
Buffer_Downtime = (Measured_Availability - SLA_target) × Total_Time
Worked Example: Your system achieves 99.97% availability (measured over 6 months). You want to set an SLA with a safety buffer for unexpected incidents. If you set SLA at 99.95%, your buffer is 0.02% = 0.0002 × 8760 hours = 1.75 hours per year. This means you can have 1.75 hours of additional downtime beyond your typical performance before breaching the SLA. This buffer accounts for black swan events, maintenance windows, and measurement uncertainty. AWS typically sets SLA targets 0.01-0.05% below their measured availability to provide this cushion.
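A one-function sketch of the buffer calculation, using the example’s numbers (the function name is illustrative):

```python
def buffer_downtime_hours(measured: float, sla_target: float,
                          total_hours: float = 8760) -> float:
    """Extra downtime you can absorb beyond typical performance before breaching the SLA."""
    return (measured - sla_target) * total_hours

# Measured 99.97%, promising 99.95%: a 0.02-point buffer.
print(f"{buffer_downtime_hours(0.9997, 0.9995):.2f} h/year")  # ~1.75
```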
Real-World Examples
Example 1: Amazon’s 99.99% S3 Availability SLA
Amazon S3 is designed for 99.99% availability (52 minutes of downtime per year), with an SLA backed by financial credits if availability drops below the committed threshold. To achieve this, they replicate data across multiple availability zones within a region, use erasure coding to tolerate disk failures, and implement automated failover at multiple layers. The interesting detail: their internal target is actually 99.995% (26 minutes per year) to provide a buffer for unexpected incidents. They measure availability per-region and track it using synthetic monitoring that continuously uploads and downloads objects from multiple locations. When they experienced a major outage in 2017 that lasted 4 hours, they paid out millions in SLA credits and published a detailed post-mortem. This incident taught them that even with extensive redundancy, correlated failures (in this case, a typo in a maintenance command that took down more servers than intended) can defeat availability guarantees. They now use gradual rollouts for all maintenance operations and have stricter limits on how many servers can be affected by a single command.
Example 2: Netflix’s Regional Failover Strategy
Netflix targets 99.99% availability for their streaming service, which serves 200+ million subscribers globally. They achieve this through active-active deployment across multiple AWS regions. Each region can handle 100% of traffic, and they continuously shift small percentages of traffic between regions to test failover mechanisms. Their availability calculation is complex: they measure “stream starts per second” as the primary metric because that’s what users care about. If a user can’t start a stream, that’s downtime, even if all backend services are “up.” The interesting detail: they discovered that their calculated availability (based on component uptime) was 99.99%, but user-experienced availability was only 99.95% due to client-side issues, CDN problems, and ISP routing failures. This led them to implement extensive client-side monitoring and to measure availability from the user’s perspective, not just from their data centers. They also practice chaos engineering extensively—randomly killing services in production to verify that redundancy works. This revealed that some of their “redundant” components shared hidden dependencies that caused correlated failures.
Example 3: Stripe’s Tiered Availability Approach
Stripe, as a payment processor, has different availability targets for different API endpoints. Their core payment processing API targets 99.99% availability (52 minutes per year) because downtime directly impacts merchant revenue. However, their reporting and analytics APIs target 99.9% (8.7 hours per year) because merchants can tolerate occasional delays in viewing reports. This tiered approach allows them to optimize costs—the payment API runs in multiple regions with active-active replication and automated failover, while the reporting API runs in a single region with daily backups. The interesting detail: they calculate availability separately for different geographic regions and different customer tiers. Enterprise customers get 99.995% availability (26 minutes per year) through dedicated infrastructure, while standard customers get 99.99%. They also exclude planned maintenance from their SLA calculations but provide 14 days notice and perform maintenance during low-traffic windows. When they experienced a 2-hour outage in 2019, they published a detailed incident report explaining that a database migration script had a bug that caused cascading failures across multiple services. This taught them to test all migration scripts in production-like environments and to implement better circuit breakers to prevent cascading failures. They now measure availability using synthetic transactions that simulate real payment flows every 30 seconds from 20+ global locations.
Interview Expectations
Mid-Level
What you should know: Understand the basic availability formula (uptime / total time) and be able to convert percentages to downtime (99.9% = 8.7 hours/year). Explain how sequential components multiply availability (making it worse) and how parallel redundancy improves it. Recognize that 99.9% vs 99.99% isn’t just a 0.09% difference—it’s 10x less downtime. Be able to discuss why high availability matters for business-critical systems and give examples of when you’d choose 99.9% vs 99.99% based on cost-benefit analysis.
Bonus points: Calculate composite availability for a simple system with 3-4 components. Explain the difference between component uptime and user-facing availability. Discuss how you’d measure availability in practice (monitoring, synthetic tests, SLA tracking). Mention that redundancy only works if failures are independent and give an example of correlated failures.
Senior
What you should know: Everything from mid-level, plus: Calculate availability for complex systems with mixed sequential and parallel components. Explain the exponential cost curve of achieving additional nines and make data-driven decisions about when to invest in higher availability. Discuss trade-offs between tight SLAs (99.99% with penalties) vs loose SLAs (99.9% best-effort). Understand how to set SLA targets with safety buffers below measured availability. Explain different types of availability (service-level, end-to-end, regional, weighted) and when to use each. Discuss how availability interacts with other system properties (latency, consistency, cost).
Bonus points: Share real examples of availability incidents you’ve handled and how you improved system availability afterward. Explain chaos engineering and how you’d test that redundancy actually works. Discuss the difference between planned and unplanned downtime and strategies for zero-downtime deployments. Calculate the business impact of downtime (lost revenue, customer churn) and use it to justify infrastructure investments. Explain how microservices can paradoxically reduce availability despite each service being highly available, and mitigation strategies (circuit breakers, fallbacks, async communication).
Staff+
What you should know: Everything from senior, plus: Design availability strategies for entire organizations, not just individual systems. Set company-wide availability standards and SLA frameworks. Make strategic decisions about when to pursue additional nines and when to accept lower availability. Design systems that gracefully degrade under partial failures rather than failing completely. Understand the relationship between availability, incident response, and organizational culture (blameless post-mortems, on-call rotations, SRE practices). Explain how to measure and improve availability across distributed teams and multiple services. Discuss the economics of availability: how to calculate ROI of reliability investments and communicate trade-offs to executives.
Distinguishing signals: You’ve designed availability strategies for systems at scale (millions of users, billions of requests). You can explain how companies like Netflix, Amazon, and Google achieve high availability and the specific techniques they use. You’ve made hard decisions about accepting lower availability to ship faster or investing heavily in reliability for business-critical systems. You understand the organizational aspects: how to build a culture of reliability, how to structure on-call rotations, how to balance feature development with reliability work. You can design SLA frameworks that align technical metrics with business outcomes and negotiate SLAs with customers or partners. You’ve handled major incidents and led post-mortem processes that resulted in systemic improvements.
Common Interview Questions
Question 1: “How would you design a system to achieve 99.99% availability?”
Concise answer (60 seconds): I’d start by eliminating single points of failure through redundancy at every layer: multiple availability zones, load balancers, application servers, and database replicas. Implement automated health checks and failover so the system can recover from failures without manual intervention. Use circuit breakers to prevent cascading failures. Deploy across multiple regions if the budget allows. Set up comprehensive monitoring to detect issues before users notice them. Calculate the availability of each component and ensure the composite availability meets the target—if any component is below 99.99%, it needs redundancy.
Detailed answer (2 minutes): First, I’d map the entire request path and identify all sequential dependencies. Each dependency multiplies risk, so I’d minimize synchronous calls—use async communication where possible. For critical path components, I’d implement redundancy: deploy across at least two availability zones with automatic failover. Use managed services (like RDS Multi-AZ) where available because they handle failover automatically. Implement health checks at every layer and use load balancers that automatically route around unhealthy instances. Add circuit breakers to prevent cascading failures—if a dependency is slow or failing, fail fast rather than waiting for timeouts. Set up synthetic monitoring that simulates real user journeys every minute from multiple locations to measure end-to-end availability. Create runbooks for common failure scenarios so on-call engineers can respond quickly. Most importantly, practice chaos engineering—regularly test that your redundancy actually works by killing components in production. I’d also set the internal availability target at 99.995% to provide a buffer for unexpected incidents, and measure availability from the user’s perspective (successful transactions) not just component uptime.
Red flags: Focusing only on infrastructure without discussing monitoring, incident response, or testing. Claiming you can achieve 99.99% without redundancy or multi-AZ deployment. Not considering the cost implications or discussing trade-offs. Ignoring the difference between component availability and end-to-end user experience.
Question 2: “Your system has three microservices in sequence, each with 99.9% availability. What’s the total availability?”
Concise answer (60 seconds): The total availability is 99.9% × 99.9% × 99.9% = 99.7%. When components are in sequence, you multiply their availability percentages, which always makes things worse. This is a key insight about microservices—adding more services actually reduces availability unless you add redundancy or use async communication. To improve this, I’d add redundancy to each service (multiple instances across availability zones) or redesign the architecture to reduce sequential dependencies.
Detailed answer (2 minutes): Total availability is 0.999³ = 0.997 = 99.7%. This means about 26 hours of downtime per year instead of 8.7 hours if it were a single service. This is a fundamental problem with microservices—each additional hop in the request path multiplies risk. To improve this, I have several options: First, add redundancy to each service. If each service has two instances in parallel, each at 99.9%, the per-service availability becomes 99.9999%, and total availability becomes 99.9997%. Second, reduce sequential dependencies by using async communication where possible—if Service C doesn’t need to respond synchronously, use a message queue. Third, implement circuit breakers and fallbacks so that if Service B fails, Service A can return a degraded response rather than failing completely. Fourth, consider whether you really need three separate services—sometimes a monolith has better availability than microservices because it has fewer network hops. The key insight is that availability compounds multiplicatively in sequential systems, so you need to be very intentional about your architecture.
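A two-line sanity check of both numbers in this answer (illustrative Python, nothing beyond the formulas already introduced):

```python
# Three services in sequence, each 99.9% available.
seq = 0.999 ** 3
print(f"{seq:.3%}")                    # ~99.700%

# Same chain, but each service backed by two redundant instances.
per_service = 1 - (1 - 0.999) ** 2     # 99.9999% per service
print(f"{per_service ** 3:.4%}")       # ~99.9997%
```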
Red flags: Getting the math wrong (adding instead of multiplying, or not converting percentages to decimals). Not recognizing that this is worse than any individual service. Not discussing solutions to improve availability. Claiming microservices always improve availability without acknowledging this trade-off.
Question 3: “How do you decide between 99.9% and 99.99% availability for a new feature?”
Concise answer (60 seconds): I’d calculate the business cost of downtime versus the infrastructure cost of achieving each level. For 99.9%, you get 8.7 hours of downtime per year; for 99.99%, only 52 minutes. If the feature is business-critical (payment processing, core user flows), the cost of downtime likely justifies 99.99%. If it’s a nice-to-have feature (analytics, recommendations), 99.9% is probably sufficient. I’d also consider operational complexity—99.99% requires multi-AZ deployment, automated failover, and 24/7 on-call, which might not be worth it for a non-critical feature.
Detailed answer (2 minutes): I’d start by calculating the business impact of downtime. If the feature processes $1M per hour in transactions, 8 hours of downtime costs $8M per year. If achieving 99.99% instead of 99.9% costs $500K/year in infrastructure and engineering time, the ROI is clear. But if the feature is an internal admin tool used by 10 employees, the productivity loss from 8 hours of downtime is maybe $5K, so investing $500K in higher availability makes no sense. I’d also consider user expectations—payment processing must be highly available because merchants depend on it, but a recommendation engine can degrade gracefully. Next, I’d look at dependencies—if the feature relies on a third-party API with 99.9% availability, I can’t promise 99.99% no matter how much I invest. I’d also consider the team’s operational maturity—achieving 99.99% requires sophisticated monitoring, automated failover, chaos engineering, and 24/7 on-call. If we don’t have those capabilities, we’ll fail to meet the target and damage customer trust. Finally, I’d set the SLA slightly below our target (if we achieve 99.99%, promise 99.95%) to provide a buffer for unexpected incidents. The key is aligning technical decisions with business priorities and being honest about what we can realistically deliver.
Red flags: Not considering cost or business impact—just saying “higher is always better.” Not recognizing that availability depends on dependencies. Not discussing operational requirements (monitoring, on-call, incident response). Promising availability targets without understanding what it takes to achieve them.
Question 4: “Your database has 99.99% availability, but users are experiencing 99.5% availability. What could be wrong?”
Concise answer (60 seconds): The database is just one component in the request path. Users experience end-to-end availability, which is the product of all components: load balancer, API gateway, application servers, database, network, CDN, and even client-side code. If any of these has lower availability or if there are network issues, DNS problems, or client-side errors, users experience failures even though the database is up. I’d implement synthetic monitoring that simulates real user journeys to identify where failures are occurring.
Detailed answer (2 minutes): This is a classic example of measuring component uptime instead of user-facing availability. The database might be responding to health checks, but users could be experiencing failures at multiple other points: The load balancer might be dropping connections under high load. The API gateway might have rate limiting that’s too aggressive. Application servers might be timing out due to slow database queries (the database is “up” but slow). Network issues between availability zones could cause intermittent failures. DNS resolution could be failing for some users. The CDN might be serving stale or corrupted content. Client-side JavaScript errors could prevent requests from being sent. To diagnose this, I’d implement synthetic monitoring that simulates complete user journeys (load page → make API call → process response) from multiple geographic locations. I’d track the success rate at each layer to identify where failures occur. I’d also look at error logs, latency percentiles (p99, p99.9), and user-reported issues. Often, the problem is that we’re measuring availability from our data center’s perspective (“is the database responding to health checks?”) rather than from the user’s perspective (“can users complete their intended action?”). The fix is to measure what users actually experience and to track availability end-to-end, not just component by component.
Red flags: Assuming the database is the problem without investigating other components. Not understanding the difference between component uptime and end-to-end availability. Not mentioning monitoring or diagnostics. Focusing only on backend systems without considering network, CDN, or client-side issues.
Question 5: “How would you improve availability without increasing costs?”
Concise answer (60 seconds): Focus on reducing sequential dependencies and improving operational practices rather than adding hardware. Use async communication where possible to decouple services. Implement circuit breakers and fallbacks so failures don’t cascade. Improve monitoring and alerting to catch issues faster. Create runbooks for common incidents to reduce mean time to recovery. Practice chaos engineering to find weaknesses before they cause outages. Optimize slow database queries to reduce timeouts. These operational improvements can significantly boost availability without major infrastructure investments.
Detailed answer (2 minutes): There are several ways to improve availability without adding redundancy or infrastructure. First, reduce sequential dependencies—if you have five microservices in sequence, each at 99.9%, you get 99.5% total availability. Redesign to use async communication where possible, or combine services to reduce network hops. Second, implement circuit breakers and fallbacks. If a dependency fails, fail fast and return a degraded response rather than waiting for timeouts. This prevents cascading failures and improves user-perceived availability. Third, improve operational practices: better monitoring catches issues before users notice them, runbooks reduce mean time to recovery, and blameless post-mortems prevent repeat incidents. Fourth, optimize performance—slow queries that timeout look like availability issues to users. If you reduce p99 latency from 5 seconds to 500ms, you’ll see fewer timeouts and higher availability. Fifth, practice chaos engineering to find weaknesses in your system. Randomly kill services in staging to verify that your error handling works. Sixth, implement graceful degradation—if the recommendation engine fails, show a default list rather than an error page. Finally, review your SLA—if you’re promising 99.99% but only achieving 99.95%, either improve the system or adjust the SLA to match reality. Sometimes the cheapest way to improve availability is to set more realistic expectations.
Red flags: Only suggesting adding more servers or redundancy (the question specifically asks about not increasing costs). Not considering operational improvements or architectural changes. Not mentioning monitoring, incident response, or chaos engineering. Focusing only on infrastructure without considering software improvements.
Red Flags to Avoid
Red Flag 1: “We have 99.99% availability because our database has 99.99% uptime.”
Why it’s wrong: Component uptime doesn’t equal system availability. Even if your database is perfect, users experience failures from load balancers, application servers, network issues, DNS problems, CDN failures, and client-side errors. Availability is measured end-to-end from the user’s perspective, not component by component. What to say instead: “Our database has 99.99% uptime, but we measure system availability end-to-end by tracking successful user transactions. We use synthetic monitoring to simulate real user journeys and measure availability from multiple geographic locations. Our current end-to-end availability is 99.95%, which accounts for all components in the request path.”
Red Flag 2: “Adding more microservices will improve availability because each service can scale independently.”
Why it’s wrong: More microservices means more sequential dependencies, which multiplies failure risk. If you have five services in sequence, each at 99.9% availability, total availability is 99.5%—worse than a monolith at 99.9%. Microservices can improve scalability and development velocity, but they often reduce availability unless you add significant redundancy and implement circuit breakers. What to say instead: “Microservices can improve scalability, but they reduce availability because each service adds a sequential dependency. To maintain high availability with microservices, we need redundancy at each layer, circuit breakers to prevent cascading failures, and async communication where possible to decouple dependencies. Sometimes a monolith actually has better availability than microservices because it has fewer network hops.”
Red Flag 3: “We’ll just add redundancy everywhere to achieve 99.999% availability.”
Why it’s wrong: Redundancy only works if failures are independent. If your “redundant” components share a power supply, network switch, software version, or configuration system, they can fail simultaneously. Also, redundancy adds operational complexity—more components to monitor, more failure modes to handle, more configuration drift. And each additional nine costs exponentially more. What to say instead: “Redundancy improves availability, but only if failures are truly independent. We need to map all shared dependencies—power, network, software versions, configuration systems—and ensure they’re not single points of failure. We also need to practice chaos engineering to verify that redundancy actually works. And we need to consider the cost—each additional nine typically costs 10x more in infrastructure and operational complexity. We should target the availability level that makes business sense, not just maximize nines.”
Red Flag 4: “Availability is an infrastructure problem—the ops team will handle it.”
Why it’s wrong: Availability is a system-wide property that depends on architecture, code quality, operational practices, and organizational culture. Developers who write code that doesn’t handle errors gracefully, or who deploy without testing, or who create tight coupling between services, are creating availability problems that no amount of infrastructure can fix. What to say instead: “Availability is everyone’s responsibility. Developers need to write code that handles failures gracefully, implements circuit breakers and timeouts, and degrades gracefully under load. Architects need to design systems with minimal sequential dependencies and appropriate redundancy. Ops teams need to implement monitoring, automated failover, and incident response processes. And the organization needs a culture of reliability—blameless post-mortems, chaos engineering, and balancing feature development with reliability work.”
Red Flag 5: “We achieved 99.99% availability last month, so we can promise that to customers.”
Why it’s wrong: One month of data isn’t enough to set an SLA. You might have gotten lucky with no major incidents. Also, your SLA should be lower than your measured availability to provide a buffer for unexpected incidents. If you promise exactly what you’ve achieved, the first incident will breach your SLA and trigger penalties. What to say instead: “We’ve measured 99.99% availability over the past six months, which gives us confidence in our system’s reliability. However, we should set our SLA at 99.95% to provide a buffer for unexpected incidents, seasonal traffic spikes, and maintenance windows. This buffer protects both us and our customers—we’re less likely to breach the SLA, and customers get a realistic expectation of what we can deliver. We can always exceed our SLA, which builds trust, but we should never promise more than we can consistently deliver.”
Key Takeaways
- Availability is measured in nines, and each nine matters exponentially: 99.9% allows 8.7 hours of downtime per year, 99.99% allows 52 minutes, and 99.999% allows 5.3 minutes. The difference between three nines and five nines is 100x less downtime, but typically 100x more cost in infrastructure and operational complexity.
- Sequential components multiply availability (making it worse), while parallel redundancy improves it: If three services in sequence each have 99.9% availability, total availability is 99.7%. But if two redundant components each have 99.9% availability, total availability is 99.9999%. This math explains why microservices can reduce availability and why redundancy is essential for high availability.
- Measure availability from the user’s perspective, not component uptime: Your database might have 99.99% uptime, but users could experience 99.5% availability due to network issues, API gateway failures, or client-side errors. Use synthetic monitoring to simulate real user journeys and measure end-to-end success rates.
- Set SLA targets below your measured availability to provide a buffer: If your system achieves 99.99% availability, promise 99.95% to customers. This buffer accounts for unexpected incidents, seasonal traffic spikes, and maintenance windows. You can always exceed your SLA (which builds trust), but breaching it damages customer relationships and triggers financial penalties.
- Each additional nine costs exponentially more—make data-driven decisions about when it’s worth it: Calculate the business cost of downtime (lost revenue, customer churn, brand damage) and compare it to the infrastructure cost of achieving each availability level. Sometimes 99.9% is sufficient, and investing in 99.99% would cost more than the downtime it prevents. Align availability targets with business priorities, not engineering perfectionism.
Related Topics
Prerequisites: Before diving into availability numbers, you should understand Availability Patterns for the foundational concepts of redundancy, failover, and replication strategies. Also review Reliability vs Availability to understand how these related but distinct concepts interact.
Related Concepts: After mastering availability calculations, explore Load Balancing to understand how traffic distribution affects availability, and Health Checks and Monitoring to learn how to measure and track availability in practice. Also see Circuit Breakers for preventing cascading failures that tank availability.
Advanced Topics: Once you’re comfortable with availability math, study Chaos Engineering to learn how to test that your redundancy actually works, and SLA/SLO/SLI to understand how to formalize availability commitments. For organizational aspects, see Incident Response and On-Call Best Practices.