Availability Monitoring: Uptime & SLA Tracking
TL;DR
Availability monitoring tracks whether your system is operational and accessible to users over time, measuring uptime percentages (99.9%, 99.99%) and generating historical statistics. Unlike health checks that show current status, availability monitoring answers “how reliable has this system been?” by aggregating uptime data, detecting outages, and calculating SLA compliance. Cheat sheet: Availability = Uptime / (Uptime + Downtime). Monitor from multiple locations, track per-component availability, alert on SLA violations, and maintain historical trends for capacity planning and incident post-mortems.
The Analogy
Think of availability monitoring like a security camera system for a retail store. Health monitoring is the live feed showing what’s happening right now—is the door open? Are lights on? Availability monitoring is the recorded footage and analytics: “The store was open 99.5% of business hours last month, with three outages totaling 4 hours.” You use the recordings to identify patterns (“We always go down during deployment windows”), prove compliance (“We met our 99.9% SLA”), and investigate incidents (“What exactly failed at 3 AM last Tuesday?”). Just as store managers review footage to improve operations, engineering teams use availability data to make systems more reliable.
Why This Matters in Interviews
Availability monitoring comes up when discussing production operations, SLAs, or incident response. Interviewers want to see that you understand the difference between real-time health checks and historical availability tracking, know how to measure and report uptime, and can design monitoring that catches outages quickly while minimizing false positives. Senior candidates should discuss multi-region monitoring strategies, composite availability calculations, and how availability data drives architectural decisions. This topic often bridges into reliability engineering, alerting strategies, and post-incident analysis.
Core Concept
Availability monitoring is the practice of continuously tracking whether your system and its components are operational and accessible to users, then aggregating this data into meaningful uptime statistics. While health monitoring tells you the current state (“Is the API responding right now?”), availability monitoring answers historical questions: “What percentage of time was the API available last month?” and “Did we meet our 99.9% SLA?”
This distinction matters because availability is a promise you make to users and a metric you report to stakeholders. When Stripe promises 99.99% API availability, they’re committing to no more than about 53 minutes of downtime per year (52.56 minutes, precisely). Availability monitoring is the system that measures whether they kept that promise. It combines synthetic monitoring (probing your system from external locations), log analysis (tracking successful vs failed requests), and incident tracking (recording when and why outages occurred).
The data from availability monitoring serves multiple purposes: proving SLA compliance to customers, identifying reliability trends for capacity planning, providing context during incident post-mortems, and informing architectural decisions. If your database has 99.5% availability but your API has 99.9%, the database is your reliability bottleneck. This insight—visible only through historical availability tracking—drives investment in database redundancy.
Availability Monitoring vs Health Monitoring
```mermaid
graph LR
    subgraph "Health Monitoring"
        HC[Health Check]
        Status[Current Status]
        LB[Load Balancer]
        HC -->|Every 5s| Status
        Status -->|Healthy/Unhealthy| LB
    end
    subgraph "Availability Monitoring"
        Synthetic[Synthetic Monitors]
        RUM[Real User Monitoring]
        Aggregator[Data Aggregator]
        Metrics[(Time Series DB)]
        Dashboard[SLA Dashboard]
        Synthetic -->|Success/Fail| Aggregator
        RUM -->|Request Metrics| Aggregator
        Aggregator -->|Store| Metrics
        Metrics -->|99.9% uptime| Dashboard
    end
    HC -.->|Feeds into| Synthetic
```
Health monitoring provides real-time status for operational decisions (routing, alerting), while availability monitoring aggregates historical data to measure reliability over time (SLA compliance, trend analysis). Both use similar data sources but serve different purposes.
How It Works
Step 1: Define what “available” means for each component. For an API, availability might mean “responds to health check within 500ms with HTTP 200.” For a database, it might be “accepts connections and executes simple query within 1 second.” For a payment processor, it could be “successfully processes test transaction.” These definitions must be precise because they determine what counts as uptime vs downtime. Netflix defines availability as “can a user press play and start streaming within 3 seconds?”—a user-centric definition that encompasses their entire stack.
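A precise availability definition is easiest to reason about as a predicate. Here is a minimal Python sketch of the API example above (the `CheckResult` type and its field names are illustrative assumptions, not any particular tool's API); note that a slow success counts as downtime:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """One observation from a monitor (hypothetical shape)."""
    status_code: int
    latency_ms: float

def is_available(result: CheckResult, max_latency_ms: float = 500) -> bool:
    """'Available' = successful AND fast enough, per the definition above."""
    return result.status_code == 200 and result.latency_ms <= max_latency_ms

print(is_available(CheckResult(200, 120)))   # fast success  -> True
print(is_available(CheckResult(200, 2300)))  # slow success  -> False (counts as downtime)
print(is_available(CheckResult(503, 40)))    # fast error    -> False
```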
Step 2: Deploy synthetic monitors from multiple locations. Run automated checks every 30-60 seconds from geographically distributed locations (US-East, US-West, EU, Asia). Multiple locations prevent false positives from network issues and reveal regional outages. Each monitor attempts the availability check and records success/failure with timestamp. Datadog and Pingdom use this approach, maintaining monitoring infrastructure in 20+ global locations.
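A single synthetic probe can be sketched with only the Python standard library. In production this would run on a 30-60 second schedule from each monitoring region; the URL and result dictionary here are illustrative assumptions:

```python
import time
import urllib.request

def probe(url: str, timeout_s: float = 5.0) -> dict:
    """One synthetic availability check: success requires HTTP 200 within the timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = (resp.status == 200)
    except Exception:  # timeout, DNS failure, connection refused, non-2xx, ...
        ok = False
    return {"url": url, "ok": ok, "latency_s": time.monotonic() - start}

# Nothing listens on this port, so the check fails fast and records downtime:
print(probe("http://localhost:1", timeout_s=0.5)["ok"])  # False
```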
Step 3: Aggregate real user traffic data. Synthetic monitors show external availability, but real user requests reveal the actual experience. Track the ratio of successful requests to total requests (success rate) and the percentage of requests that complete within your SLA threshold (e.g., < 500ms). A 99.9% synthetic availability but 95% real-user success rate indicates your monitors aren’t catching real problems.
Step 4: Calculate availability percentages over time windows. For each component, compute: Availability = (Total Time - Downtime) / Total Time. Track this across multiple windows: last hour, last 24 hours, last 7 days, last 30 days, last 90 days. Store these as time-series data. When a monitor detects failure, start a downtime period. When it recovers, end the period and add the duration to your downtime counter.
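The uptime calculation from polled checks can be sketched as follows. Each failed check is charged one polling interval of downtime, which is the resolution limit of synthetic monitoring (the numbers below are illustrative):

```python
def availability(results, period_s: float, interval_s: float) -> float:
    """Uptime fraction from polled checks: each failed check is counted
    as one polling interval of downtime."""
    downtime_s = sum(interval_s for ok in results if not ok)
    return (period_s - downtime_s) / period_s

# 30-day month polled every 30s; 86 failed checks ~= 43 minutes of downtime
period_s = 30 * 24 * 3600
checks = [True] * 86_314 + [False] * 86
print(f"{availability(checks, period_s, 30):.4%}")  # ~99.9005%
```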
Step 5: Implement composite availability tracking. Your overall system availability depends on multiple components. If your API depends on a database, cache, and authentication service, your effective availability is the product of their individual availabilities (assuming serial dependency). If Database = 99.9%, Cache = 99.95%, Auth = 99.99%, then API ≤ 99.84%. This math explains why distributed systems struggle to achieve high availability—every dependency reduces your ceiling.
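The serial-dependency math above is a one-liner; this sketch reproduces the Database/Cache/Auth example from the text:

```python
from math import prod

def serial_availability(deps: dict) -> float:
    """Serial dependencies multiply: each one is a ceiling on the caller's availability."""
    return prod(deps.values())

deps = {"database": 0.999, "cache": 0.9995, "auth": 0.9999}
print(f"API availability ceiling: {serial_availability(deps):.4%}")  # ~99.84%
```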
Step 6: Generate alerts on SLA violations and trends. Alert when current availability drops below SLA thresholds (“We’re at 99.85% this month, need 99.9%”) or when downtime accumulates too quickly (“We’ve used 50% of our monthly error budget in one week”). Trend-based alerts catch degradation before it becomes critical: “Availability has decreased 0.1% each week for three weeks.”
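The error-budget alert from this step can be sketched as below; the 50% burn threshold is an illustrative choice, not a standard:

```python
def budget_status(slo: float, period_min: float, downtime_min: float) -> dict:
    """Error-budget accounting for an availability SLO over a period."""
    budget = period_min * (1 - slo)   # total allowed downtime in minutes
    consumed = downtime_min / budget  # fraction of budget already burned
    return {"budget_min": budget, "consumed": consumed,
            "alert": consumed >= 0.5}  # e.g. escalate at 50% burn (hypothetical policy)

# 99.9% SLO over a 30-day month, 20 minutes of downtime so far:
print(budget_status(0.999, 30 * 24 * 60, 20))  # ~46% of budget consumed, no alert yet
```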
Step 7: Correlate availability data with incidents and deployments. When availability drops, automatically link to incident reports, deployment logs, and infrastructure changes. This correlation turns raw availability data into actionable insights. If every availability dip correlates with a deployment, you have a deployment process problem. If dips correlate with traffic spikes, you have a scaling problem.
Multi-Location Availability Monitoring Flow
```mermaid
graph TB
    subgraph "Geographic Monitoring Locations"
        US_East["US-East Monitor<br/><i>Every 30s</i>"]
        US_West["US-West Monitor<br/><i>Every 30s</i>"]
        EU["Europe Monitor<br/><i>Every 30s</i>"]
        Asia["Asia Monitor<br/><i>Every 30s</i>"]
    end
    subgraph "Target System"
        LB[Load Balancer]
        API1[API Server 1]
        API2[API Server 2]
        DB[(Database)]
        LB --> API1
        LB --> API2
        API1 & API2 --> DB
    end
    subgraph "Availability Calculation"
        Collector[Metrics Collector]
        Calculator[Availability Calculator]
        Alerts[Alert Manager]
        Collector -->|Aggregate checks| Calculator
        Calculator -->|"Uptime: 99.87%<br/>Target: 99.9%"| Alerts
    end
    US_East & US_West & EU & Asia -->|"1. HTTP GET /health"| LB
    LB -->|"2. 200 OK (or timeout)"| US_East & US_West & EU & Asia
    US_East & US_West & EU & Asia -->|"3. Report result"| Collector
    Alerts -.->|"4. Page on-call<br/>if < 99.9%"| OnCall[On-Call Engineer]
```
Availability monitoring deploys synthetic checks from multiple geographic locations to detect regional outages and prevent false positives from network issues. Each location independently probes the system, and results are aggregated to calculate uptime percentages and trigger alerts when availability drops below SLA thresholds.
Key Principles
Principle 1: Monitor from the user’s perspective, not just internal health. Internal health checks might show all servers responding, but if your load balancer is misconfigured, users can’t reach them. Availability monitoring must test the complete user path from external networks. Google’s Site Reliability Engineering team uses “black-box monitoring” that treats the system as users see it, complementing internal “white-box monitoring.” Example: Shopify monitors checkout availability by running actual test purchases every minute from multiple countries, not just pinging their API.
Principle 2: Availability is multiplicative across dependencies. When components depend on each other serially, their availabilities multiply. A system with five 99.9% components has at most 99.5% availability (0.999^5). This math drives architectural decisions: reduce dependencies, add redundancy, or accept lower availability. Example: AWS designs services to minimize cross-service dependencies. S3 doesn’t depend on EC2, so S3 availability isn’t reduced by EC2 issues. When dependencies are unavoidable, they over-provision to maintain high composite availability.
Principle 3: Distinguish between planned and unplanned downtime. A 99.9% SLA typically excludes planned maintenance windows, but your monitoring should track both. Unplanned downtime indicates reliability problems; planned downtime indicates operational maturity (or lack thereof). Example: Stripe reports “API availability excluding scheduled maintenance” and “total availability including maintenance” separately. If you’re taking weekly maintenance windows, your architecture may need improvement to support zero-downtime deployments.
Principle 4: Availability monitoring must be independent of the system being monitored. If your monitoring runs on the same infrastructure as your application, both fail together and you lose visibility during outages. Example: Netflix runs their monitoring infrastructure (Atlas) on separate AWS accounts and regions from their streaming infrastructure. When their main region has issues, monitoring continues operating and alerting. Many companies learned this lesson the hard way when their monitoring system went down with their application.
Principle 5: Track availability per customer segment and geography. Aggregate availability numbers hide regional outages and customer-specific issues. A 99.9% global availability might mask 95% availability in Asia or 90% availability for enterprise customers. Example: Cloudflare tracks availability separately for each of their 300+ data centers and for different customer tiers (free, pro, enterprise). This granularity reveals that a DDoS attack affected only free-tier customers in Europe, allowing targeted response and accurate customer communication.
Composite Availability Calculation
```mermaid
graph LR
    User[User Request]
    subgraph "Serial Dependencies"
        Gateway["API Gateway<br/>99.95% available"]
        Auth["Auth Service<br/>99.9% available"]
        Business["Business Logic<br/>99.9% available"]
        DB[("Database<br/>99.95% available")]
    end
    Result["End-to-End<br/>Availability"]
    Calculation["0.9995 × 0.999 × 0.999 × 0.9995<br/>= 0.9970 = 99.7%"]
    User -->|"1. Request"| Gateway
    Gateway -->|"2. Authenticate"| Auth
    Auth -->|"3. Process"| Business
    Business -->|"4. Query"| DB
    DB -->|"5. Response"| Result
    Gateway & Auth & Business & DB -.->|"Multiply availabilities"| Calculation
    Calculation -.->|"System ceiling"| Result
    subgraph "Parallel Redundancy"
        DB1[("DB Instance 1<br/>99.9%")]
        DB2[("DB Instance 2<br/>99.9%")]
        Combined["Combined: 99.9999%<br/>1 - (0.001 × 0.001)"]
        DB1 & DB2 -.-> Combined
    end
```
Availability is multiplicative across serial dependencies—each component in the request path reduces overall availability. Five 99.9% services yield only 99.5% end-to-end availability. Parallel redundancy improves availability: two 99.9% instances in active-active configuration achieve 99.9999% combined availability.
Deep Dive
Types / Variants
Synthetic monitoring (active probing) involves running automated checks against your system from external locations at regular intervals. You simulate user actions—API calls, page loads, transaction flows—and measure success/failure. When to use: Essential for measuring external availability and catching issues before users report them. Pros: Detects problems 24/7 even during low traffic, tests specific user flows, provides consistent baseline. Cons: Doesn’t reflect real user experience, can miss issues that only appear under load, costs money for monitoring infrastructure. Example: Pingdom checks your homepage every 60 seconds from 100+ locations worldwide, alerting if response time exceeds 2 seconds or HTTP status isn’t 200.
Real User Monitoring (RUM) tracks actual user requests and calculates availability from real traffic patterns. You instrument your application to report success/failure for every request, then aggregate into availability percentages. When to use: Provides ground truth about user experience, essential for understanding actual availability vs synthetic. Pros: Shows real user experience, catches issues synthetic monitoring misses, no false positives from monitoring infrastructure. Cons: Requires traffic to detect issues, can’t test specific flows, delayed detection during low-traffic periods. Example: Datadog RUM tracks every request to Shopify’s checkout API, calculating that 99.97% of real user requests succeeded last month, even though synthetic monitoring showed 99.99% (synthetic monitors missed a subtle bug affecting only mobile users).
Composite availability monitoring tracks how component availability combines into system-level availability. You model dependencies between services and calculate effective availability based on architecture. When to use: For distributed systems where multiple services must work together, essential for understanding true system reliability. Pros: Reveals reliability bottlenecks, guides architectural decisions, explains why system availability is lower than component availability. Cons: Requires accurate dependency modeling, complex to maintain as architecture evolves. Example: Uber models their ride request flow as: App → API Gateway → Matching Service → Driver Service → Database. If each has 99.9% availability, the end-to-end flow has at most 99.5% availability, explaining why they invest heavily in redundancy.
SLA-based monitoring focuses specifically on contractual availability commitments, often with different thresholds for different customer tiers. You track availability against specific SLA definitions and alert when at risk of violation. When to use: When you have formal SLAs with customers, especially in B2B SaaS. Pros: Directly measures business commitments, triggers appropriate escalation, provides data for SLA credits. Cons: May not reflect full user experience, can create perverse incentives (gaming the metrics). Example: AWS tracks EC2 instance availability against their 99.99% SLA, automatically issuing service credits when monthly availability drops below threshold. They define availability as “instance running and reachable,” excluding customer-initiated stops.
Multi-dimensional availability tracking breaks down availability by region, customer tier, feature, or time period to reveal patterns hidden in aggregate numbers. When to use: For global services with diverse customers, essential for identifying targeted issues. Pros: Reveals regional outages, customer-specific problems, and feature-level reliability. Cons: Increases monitoring complexity and data volume, requires careful aggregation to avoid alert fatigue. Example: Cloudflare tracks availability separately for each PoP (point of presence), customer tier, and protocol (HTTP, DNS, CDN). During a recent incident, they identified that only HTTP/2 traffic in their London PoP was affected—aggregate metrics showed 99.9% availability, but London HTTP/2 customers saw 90%.
Trade-offs
Monitoring frequency: High-frequency (every 10-30 seconds) vs Low-frequency (every 5-10 minutes). High-frequency monitoring detects outages faster and provides more accurate availability calculations (a 30-second outage registers in high-frequency but might be missed by low-frequency). However, it generates more monitoring traffic, costs more, and can trigger false positives from transient network issues. Low-frequency monitoring reduces costs and false positives but delays detection and loses granularity. Decision framework: Use high-frequency for critical user-facing services with strict SLAs (APIs, payment processing). Use low-frequency for internal services, batch jobs, or systems with looser SLAs. Stripe monitors their payment API every 10 seconds but their internal analytics pipeline every 5 minutes.
Monitoring scope: Comprehensive (every component) vs Critical-path (user-facing only). Comprehensive monitoring tracks every service, database, queue, and cache, providing complete visibility but generating massive data volume and alert noise. Critical-path monitoring focuses only on components directly affecting users, reducing noise but potentially missing root causes. Decision framework: Start with critical-path monitoring for user-facing availability, then add comprehensive monitoring for internal debugging. Use different alert thresholds: page for critical-path issues, ticket for comprehensive monitoring. Netflix monitors their streaming path (play button → video delivery) with strict SLAs, but monitors internal recommendation systems with looser thresholds.
Availability calculation: Uptime-based vs Request-based. Uptime-based availability measures time: “Was the service up for 99.9% of the month?” Request-based availability measures requests: “Did 99.9% of requests succeed?” Uptime-based is simpler and matches traditional SLAs, but a 1-minute outage during peak traffic affects more users than a 1-minute outage at 3 AM. Request-based reflects actual user impact but requires more instrumentation. Decision framework: Use uptime-based for infrastructure components (databases, load balancers) and external SLA reporting. Use request-based for application-level availability and internal SLOs. Google uses request-based availability for their SLOs because it better reflects user experience.
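The difference between the two calculations is easy to see in code. This sketch (with illustrative traffic volumes) shows the same one-minute outage producing very different request-based availability depending on when it hits, while uptime-based availability is identical:

```python
def request_availability(failed: int, total: int) -> float:
    """Request-based availability: the share of real requests that succeeded."""
    return (total - failed) / total

daily_total = 1_000_000  # hypothetical daily request volume
# The same 1-minute outage, weighted by actual traffic:
print(f"peak hour (5,000 req/min lost): {request_availability(5_000, daily_total):.3%}")  # 99.500%
print(f"3 AM (50 req/min lost):         {request_availability(50, daily_total):.3%}")     # 99.995%
# Uptime-based availability is identical in both cases: 1,439/1,440 ~= 99.93%
```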
Alerting strategy: Threshold-based vs Trend-based. Threshold-based alerts fire when availability drops below a fixed number (“Alert when < 99.9%”). Trend-based alerts fire when availability degrades over time (“Alert when availability decreases 0.1% per day for 3 days”). Threshold alerts catch acute problems but miss slow degradation. Trend alerts catch degradation early but can false-positive on normal variance. Decision framework: Use threshold alerts for SLA violations and critical outages. Use trend alerts for capacity planning and gradual degradation. Combine both: threshold alerts page on-call, trend alerts create tickets for investigation. Datadog’s anomaly detection uses both: immediate alerts for drops below 99% and warnings for unusual downward trends.
Downtime attribution: Automatic vs Manual. Automatic attribution uses heuristics to classify downtime causes (deployment, infrastructure failure, dependency outage). Manual attribution requires engineers to categorize each incident during post-mortems. Automatic is faster and consistent but less accurate. Manual is accurate but time-consuming and inconsistent. Decision framework: Use automatic attribution for real-time dashboards and quick analysis. Require manual attribution for significant outages and SLA reporting. Reconcile monthly: automatic attribution for operational metrics, manual review for business reporting. PagerDuty automatically tags incidents with suspected causes but requires manual confirmation for post-incident reviews.
Monitoring Frequency Tradeoff Analysis
```mermaid
graph TB
    subgraph "High-Frequency Monitoring: Every 10-30s"
        HF_Pro1["✓ Detects outages faster<br/>(30s vs 5min)"]
        HF_Pro2["✓ More accurate availability<br/>calculation"]
        HF_Pro3["✓ Catches brief outages"]
        HF_Con1["✗ Higher monitoring costs"]
        HF_Con2["✗ More false positives<br/>from transient issues"]
        HF_Con3["✗ Increased load on<br/>monitored systems"]
    end
    subgraph "Low-Frequency Monitoring: Every 5-10min"
        LF_Pro1["✓ Lower costs"]
        LF_Pro2["✓ Fewer false positives"]
        LF_Pro3["✓ Minimal system load"]
        LF_Con1["✗ Delayed detection<br/>(5-10min lag)"]
        LF_Con2["✗ Misses brief outages"]
        LF_Con3["✗ Less accurate<br/>availability data"]
    end
    Decision{"Decision<br/>Framework"}
    Critical["Critical Path Services<br/><i>Payment API, Auth</i><br/>→ High-Frequency"]
    Internal["Internal Services<br/><i>Analytics, Batch Jobs</i><br/>→ Low-Frequency"]
    Decision --> Critical
    Decision --> Internal
    Example["Example: Stripe<br/>Payment API: every 10s<br/>Analytics: every 5min"]
    Critical & Internal -.-> Example
```
High-frequency monitoring (10-30s) detects outages faster and provides accurate availability data but costs more and generates false positives. Low-frequency monitoring (5-10min) reduces costs and noise but delays detection. Use high-frequency for critical user-facing services with strict SLAs, low-frequency for internal services with looser requirements.
Common Pitfalls
Pitfall 1: Monitoring only from a single location or region. Teams deploy monitoring from their primary datacenter or a single cloud region, then miss regional outages or network partitions. A DNS issue affecting Asia goes undetected because all monitors run in US-East. Why it happens: Single-location monitoring is simpler to set up and cheaper to operate. Teams don’t consider that users are globally distributed. How to avoid: Deploy synthetic monitors from at least 3-5 geographically diverse locations matching your user distribution. If 30% of users are in Europe, run monitors from European locations. Use third-party monitoring services (Pingdom, Datadog) that provide global monitoring infrastructure. Cloudflare learned this lesson when a BGP issue affected only Asian traffic—their US-based monitors showed 100% availability while Asian users couldn’t connect.
Pitfall 2: Confusing availability with latency or performance. A service that responds slowly (2-second response times) but successfully might show 100% availability in monitoring, even though users consider it “down.” Availability monitoring that only checks HTTP 200 status misses performance degradation. Why it happens: Traditional availability definitions focus on binary up/down states. Teams separate availability monitoring from performance monitoring. How to avoid: Include latency thresholds in availability definitions. A request that takes >5 seconds counts as unavailable even if it eventually succeeds. Track “good requests” (successful AND fast) vs “total requests.” Google’s SRE book recommends defining availability as “the proportion of requests that are successful AND meet latency SLOs.” Shopify considers checkout “unavailable” if page load exceeds 3 seconds, even with HTTP 200.
Pitfall 3: Not accounting for scheduled maintenance in availability calculations. Teams report 99.5% availability without noting that 0.3% was planned maintenance, making the system look less reliable than it is. Conversely, teams exclude all maintenance from calculations, hiding the operational burden of frequent maintenance windows. Why it happens: Ambiguity about whether SLAs include or exclude maintenance. Desire to show better availability numbers. How to avoid: Clearly define and document whether your SLA includes maintenance. Report both numbers: “99.7% availability including scheduled maintenance, 99.9% excluding maintenance.” Work toward zero-downtime deployments to eliminate maintenance windows entirely. AWS explicitly excludes “scheduled maintenance” from SLA calculations but limits it to 4 hours per year. If you’re taking weekly maintenance windows, your architecture needs improvement.
Pitfall 4: Monitoring infrastructure sharing fate with the application. Your monitoring system runs on the same servers, network, or cloud account as your application. When the application fails, monitoring fails too, leaving you blind during outages. Why it happens: Simpler to deploy monitoring alongside the application. Cost savings from shared infrastructure. Lack of consideration for correlated failures. How to avoid: Run monitoring infrastructure separately from application infrastructure—different servers, different network paths, ideally different cloud accounts or providers. Use external monitoring services as a backup. Implement “dead man’s switch” alerts that fire if monitoring stops reporting. Netflix runs their monitoring (Atlas) in separate AWS accounts from their streaming infrastructure. When their main region has issues, monitoring continues operating.
Pitfall 5: Alert fatigue from false positives and transient failures. Monitoring alerts on every brief blip—a single failed health check, a momentary network hiccup—training engineers to ignore alerts. When a real outage occurs, the alert gets lost in noise. Why it happens: Overly sensitive thresholds, single-check alerting (one failure triggers alert), monitoring infrastructure issues. How to avoid: Require multiple consecutive failures before alerting (“Alert after 3 failed checks in 90 seconds”). Use percentage-based thresholds (“Alert when <95% of checks succeed over 5 minutes”) rather than single-failure alerts. Implement alert suppression during known maintenance. Regularly review and tune alert thresholds based on false positive rates. PagerDuty recommends aiming for <5% false positive rate on critical alerts. Stripe requires 3 consecutive failures from 2 different monitoring locations before paging on-call.
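The "N consecutive failures" rule above can be sketched as a small stateful gate (the class name and three-failure threshold are illustrative; a production system would also require failures from multiple locations):

```python
class AlertGate:
    """Page only after N consecutive failures, filtering transient blips."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0  # current run of consecutive failures

    def record(self, ok: bool) -> bool:
        """Feed one check result; returns True exactly when an alert should fire."""
        self.streak = 0 if ok else self.streak + 1
        return self.streak == self.threshold  # fire once, at the threshold

gate = AlertGate(3)
checks = [True, False, True, False, False, False, False]
fired = [gate.record(ok) for ok in checks]
print(fired)  # [False, False, False, False, False, True, False]
```

Isolated blips reset the streak; only the third consecutive failure pages, and the fourth does not re-page.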
Math & Calculations
Availability percentage calculation: The fundamental formula is Availability = Uptime / (Uptime + Downtime), expressed as a percentage. For a 30-day month (43,200 minutes), if your service was down for 43 minutes: Availability = (43,200 - 43) / 43,200 = 43,157 / 43,200 ≈ 0.999 = 99.9%.
Downtime allowance from availability targets: Each “nine” in your availability target corresponds to a maximum downtime budget. For 99.9% availability (“three nines”), you can have at most 0.1% downtime. Over different time periods:
- Per year: 365 days × 24 hours × 60 minutes × 0.001 = 525.6 minutes = 8.76 hours
- Per month: 30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes
- Per week: 7 days × 24 hours × 60 minutes × 0.001 = 10.08 minutes
For 99.99% (“four nines”): 52.56 minutes per year, 4.32 minutes per month. For 99.999% (“five nines”): 5.26 minutes per year, 26 seconds per month. This math explains why each additional nine is exponentially harder—you’re reducing your error budget by 10x.
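These downtime budgets fall straight out of the formula; a short sketch reproduces the table above:

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_min(target: float, period_min: float = MIN_PER_YEAR) -> float:
    """Allowed downtime (in minutes) for an availability target over a period."""
    return period_min * (1 - target)

for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{target:.3%} ({label}): {downtime_budget_min(target):.2f} min/year")
# three nines -> 525.60 min/year (8.76 h); four -> 52.56; five -> 5.26
```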
Composite availability calculation: When services depend on each other serially, multiply their availabilities. If Service A (99.9%) calls Service B (99.95%) which calls Service C (99.99%), the end-to-end availability is: 0.999 × 0.9995 × 0.9999 = 0.9984 = 99.84%. This is why distributed systems struggle with high availability—every dependency reduces your ceiling.
For parallel redundancy (active-active), availability improves: If you have two independent instances each with 99.9% availability, the probability both fail is 0.001 × 0.001 = 0.000001, so availability is 1 - 0.000001 = 99.9999%. This math drives redundancy decisions: two 99.9% components in parallel achieve higher availability than one 99.99% component.
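The parallel case generalizes to any number of replicas; this sketch assumes failures are independent, which real correlated failures (shared power, shared software bugs) will violate:

```python
def parallel_availability(availabilities) -> float:
    """Active-active redundancy: the system is down only if ALL replicas are down.
    Assumes independent failures, which correlated outages can break."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1 - a)
    return 1 - p_all_down

print(f"{parallel_availability([0.999, 0.999]):.6f}")  # 0.999999
```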
Error budget calculation: If your SLA is 99.9% availability, your error budget is 0.1% of time. For a 30-day month: 43,200 minutes × 0.001 = 43.2 minutes. If you’ve had 20 minutes of downtime so far this month, you’ve consumed 20/43.2 = 46% of your error budget with 54% remaining. Google’s SRE teams use error budgets to balance reliability and feature velocity: when error budget is healthy, ship faster; when depleted, focus on reliability.
Worked example—Uber ride request flow: Consider Uber’s ride request path with these component availabilities:
- Mobile app → API Gateway: 99.95%
- API Gateway → Matching Service: 99.9%
- Matching Service → Driver Service: 99.9%
- Driver Service → Database: 99.95%
End-to-end availability: 0.9995 × 0.999 × 0.999 × 0.9995 = 0.9970 = 99.7%. Over a 30-day month, this allows 43,200 × 0.003 = 129.6 minutes of downtime. If Uber promises 99.9% availability (43.2 minutes), they’re missing their SLA by 86.4 minutes per month. This math explains why they invest heavily in redundancy and why their architecture minimizes serial dependencies.
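The worked example above can be checked in a few lines (the prose rounds the end-to-end figure to 99.7% before computing downtime, so its 129.6/86.4 minutes differ slightly from the unrounded 129.5/86.3):

```python
from math import prod

# app->gateway, gateway->matching, matching->driver, driver->DB
hops = [0.9995, 0.999, 0.999, 0.9995]
e2e = prod(hops)
month_min = 30 * 24 * 60                       # 43,200 minutes
allowed = month_min * (1 - e2e)                # ~129.5 min/month of downtime
sla_gap = allowed - month_min * (1 - 0.999)    # shortfall vs a 99.9% promise
print(f"e2e={e2e:.4f}  allowed={allowed:.1f} min  gap={sla_gap:.1f} min")
```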
Availability Targets and Downtime Budgets
```mermaid
graph TB
    subgraph "Availability Targets"
        Target1["99.9% - Three Nines<br/><i>Standard SLA</i>"]
        Target2["99.99% - Four Nines<br/><i>High Availability</i>"]
        Target3["99.999% - Five Nines<br/><i>Mission Critical</i>"]
    end
    subgraph "Monthly Downtime Budget: 30 days"
        Budget1["43.2 minutes<br/>0.1% of 43,200 min"]
        Budget2["4.32 minutes<br/>0.01% of 43,200 min"]
        Budget3["26 seconds<br/>0.001% of 43,200 min"]
    end
    subgraph "Annual Downtime Budget"
        Annual1["8.76 hours/year"]
        Annual2["52.6 minutes/year"]
        Annual3["5.26 minutes/year"]
    end
    Target1 --> Budget1 --> Annual1
    Target2 --> Budget2 --> Annual2
    Target3 --> Budget3 --> Annual3
    Cost["Cost & Complexity<br/>increases ~10x<br/>per nine"]
    Target1 & Target2 & Target3 -.-> Cost
    Example["Example: Current Month<br/>20 min downtime used<br/>43.2 min budget<br/>= 46% consumed<br/>54% remaining"]
    Budget1 -.-> Example
```
Each additional ‘nine’ in availability reduces allowable downtime by 10x: 99.9% allows 43 minutes/month, 99.99% allows 4.3 minutes, 99.999% allows 26 seconds. The cost and architectural complexity increase exponentially. Error budgets track consumed vs remaining downtime to balance reliability investment with feature velocity.
Real-World Examples
Netflix: Multi-layered availability monitoring for global streaming. Netflix monitors availability at multiple layers: CDN edge servers, API services, recommendation engines, and the complete “press play to streaming” user flow. They run synthetic monitors from 50+ global locations every 30 seconds, simulating the complete user experience from login to video playback. Their monitoring revealed that aggregate 99.9% availability masked regional issues—Europe had 99.5% availability during peak hours due to CDN capacity constraints. This insight drove investment in European CDN infrastructure. Netflix also pioneered “chaos engineering” (intentionally breaking components) to verify that their availability monitoring accurately detects failures. They discovered their monitoring missed certain failure modes, leading to improved health check designs. Their approach: measure availability from the user’s perspective (“Can I watch a show?”) rather than component perspective (“Is the API responding?”).
Stripe: Request-based availability for payment processing SLAs. Stripe promises 99.99% API availability, which allows only 4.32 minutes of downtime per month. They monitor availability using both synthetic checks (automated API calls every 10 seconds from 20+ locations) and real request tracking (measuring success rate of actual customer API calls). Their monitoring distinguishes between different types of failures: network errors, timeout errors, and application errors, each counted differently in availability calculations. Stripe’s most interesting practice is “availability per customer”—they track whether each customer experienced the promised 99.99% availability, not just aggregate availability. During one incident, aggregate availability was 99.97% (within SLA), but 5% of customers experienced 95% availability due to a database shard issue. This granular monitoring revealed that aggregate metrics can hide customer-specific problems. Stripe now reports both aggregate and per-customer availability, alerting when any significant customer segment falls below SLA.
AWS: Availability zones and composite availability tracking. AWS designs its infrastructure around availability zones (AZs)—isolated datacenters within a region. They monitor availability at multiple levels: individual server, AZ, region, and service. Their EC2 SLA promises 99.99% availability for instances deployed across multiple AZs, but only 99.5% for single-AZ deployments. This 0.49-point difference (roughly 53 minutes vs 44 hours of downtime per year) drives customer architecture decisions. AWS’s monitoring revealed an interesting pattern: most customer-impacting outages weren’t complete service failures but partial degradations—one AZ failing while others continued operating. This insight led to improved monitoring that tracks “blast radius” (what percentage of customers are affected) rather than just binary up/down status. During the 2021 US-East-1 outage, AWS’s monitoring showed that while the region had 95% availability, services using multiple AZs maintained 99.9% availability, validating their architecture recommendations.
Interview Expectations
Mid-Level
What you should know: Explain the difference between health monitoring (current status) and availability monitoring (historical uptime). Describe how to calculate availability percentages and what different “nines” mean (99.9% = 43 minutes downtime/month). Discuss basic monitoring approaches: synthetic checks from external locations and tracking real request success rates. Explain why you need to monitor from multiple geographic locations. Understand that availability is multiplicative across dependencies—if Service A depends on Service B, overall availability is A × B.
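The downtime budget behind each “nine” and the multiplicative rule can be sketched in a few lines. This is an illustrative back-of-the-envelope helper (the function names are invented, and it assumes a 30-day month and independent failures):

```python
# Illustrative availability math; assumes a 30-day month and independent failures.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200


def downtime_per_month(availability: float) -> float:
    """Minutes of allowed downtime per month for a given availability target."""
    return (1 - availability) * MINUTES_PER_MONTH


def serial_availability(*availabilities: float) -> float:
    """End-to-end availability of services chained in series (multiplicative)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


print(f"99.9% allows {downtime_per_month(0.999):.1f} min/month of downtime")
print(f"five 99.9% services in series: {serial_availability(*[0.999] * 5):.4%}")
```

The serial product is why a deep dependency chain quietly erodes availability even when every individual service looks healthy.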
Bonus points: Discuss the difference between uptime-based and request-based availability calculations, and when each is appropriate. Explain error budgets and how they’re calculated from SLA targets. Describe how to set appropriate alert thresholds to avoid false positives (multiple consecutive failures, percentage-based thresholds). Mention that monitoring infrastructure should be independent from the application being monitored. Give examples of how availability monitoring revealed issues in systems you’ve worked on.
Senior
What you should know: Everything from mid-level, plus: Design a complete availability monitoring system including synthetic monitoring, RUM, and composite availability tracking. Explain how to model dependencies and calculate end-to-end availability for complex distributed systems. Discuss strategies for achieving high availability (99.99%+): redundancy, failover, graceful degradation. Describe how to track availability per customer segment, region, and feature to reveal issues hidden in aggregate metrics. Explain the relationship between availability monitoring and SLA management, including how to generate SLA reports and handle SLA credits.
Bonus points: Discuss advanced monitoring strategies like anomaly detection and trend-based alerting. Explain how to balance monitoring costs with coverage—where to invest in high-frequency monitoring vs lower-frequency. Describe how availability data drives architectural decisions (“Our database is the reliability bottleneck at 99.5%, limiting overall system to 99.7%”). Discuss the operational challenges of maintaining high availability: deployment strategies, change management, incident response. Give examples of how you’ve used availability data to improve system reliability or justify infrastructure investments. Explain how to handle the tradeoff between availability and consistency in distributed systems.
Staff+
What you should know: Everything from senior, plus: Design availability monitoring for multi-region, globally distributed systems with complex dependencies. Explain how to establish availability targets (SLOs/SLAs) based on business requirements and technical constraints. Discuss the organizational aspects: how availability monitoring integrates with incident management, on-call rotations, and post-mortem processes. Describe how to build monitoring that scales to thousands of services without overwhelming teams with alerts. Explain strategies for improving availability: chaos engineering, game days, progressive rollouts, automated remediation.
Distinguishing signals: Discuss the economic tradeoffs of availability—why 99.99% might not be worth the cost compared to 99.9% for certain services. Explain how to design monitoring that detects novel failure modes, not just known issues. Describe how to build a culture of reliability using availability data: error budgets, blameless post-mortems, reliability reviews. Discuss the limitations of availability as a metric—how it can be gamed, what it doesn’t measure (correctness, data loss), and what complementary metrics are needed. Give examples of how you’ve designed monitoring strategies that evolved with system complexity, or how you’ve used availability data to drive organizational change. Explain how to balance reliability investment with feature development velocity.
Common Interview Questions
Q1: How would you design availability monitoring for a new API service?
60-second answer: Deploy synthetic monitors from 3-5 geographic locations matching user distribution, checking the API every 30-60 seconds. Define “available” as “responds within 500ms with HTTP 200.” Instrument the API to track real request success rates. Calculate availability as successful requests / total requests over rolling time windows (hour, day, week, month). Alert when availability drops below SLA threshold (e.g., 99.9%) or when downtime accumulates too quickly. Store availability data in time-series database for historical analysis and dashboards.
2-minute detailed answer: Start by defining availability from the user’s perspective—not just “API responds” but “API responds successfully within acceptable latency.” Deploy synthetic monitors from multiple locations (US-East, US-West, Europe, Asia) to catch regional issues. Each monitor hits a representative endpoint every 30 seconds, recording success/failure and latency. For real user monitoring, instrument the API to emit metrics for every request: success/failure, latency, error type. Calculate availability using both uptime (was the service reachable?) and request-based (what percentage of requests succeeded?) methods. Track availability over multiple time windows: real-time (last 5 minutes), short-term (last 24 hours), and SLA period (last 30 days). Set up alerting with multiple thresholds: immediate page if availability drops below 95%, warning if trending toward SLA violation (e.g., 99.85% with 99.9% target). Ensure monitoring infrastructure is independent—run monitors from different cloud accounts or use third-party services. Build dashboards showing current availability, historical trends, and error budget consumption. Integrate with incident management to automatically create incidents when availability drops.
Red flags: Only monitoring from a single location. Defining availability purely as “service responds” without latency requirements. Not distinguishing between planned and unplanned downtime. Running monitoring on the same infrastructure as the application. Alerting on single failed checks rather than sustained degradation.
Q2: Your system has 99.5% availability but the SLA requires 99.9%. How do you improve it?
60-second answer: First, analyze availability data to identify the reliability bottleneck—which component has the lowest availability? If it’s a single database, add redundancy (primary-replica, multi-AZ). If it’s deployment-related downtime, implement zero-downtime deployments (blue-green, rolling updates). If it’s dependency failures, add circuit breakers and graceful degradation. Calculate the impact of each improvement: moving the database from 99.5% to 99.9% might lift overall system availability to 99.8%. Prioritize changes with the biggest availability impact.
2-minute detailed answer: Start with root cause analysis using availability monitoring data. Break down the 0.4% gap (173 minutes per month) by incident type: How much is deployment-related? Dependency failures? Infrastructure issues? Application bugs? For each category, calculate the potential improvement. If 100 minutes are deployment-related, implementing zero-downtime deployments could recover most of that. If 50 minutes are database failures, adding multi-AZ redundancy (improving database availability from 99.5% to 99.95%) would help. For dependency failures, implement circuit breakers, retries with exponential backoff, and graceful degradation—if a recommendation service fails, show generic recommendations rather than failing the entire request. For infrastructure issues, move to multi-region active-active architecture. Calculate composite availability: if you improve the database from 99.5% to 99.95% and eliminate deployment downtime, what’s the new end-to-end availability? Use error budget math to prioritize: which improvements give the most availability gain for the least effort? Implement changes incrementally, measuring availability improvement after each change. Consider whether 99.9% is the right target—maybe 99.95% is needed for buffer, or maybe 99.8% is acceptable with SLA renegotiation.
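The prioritization step above is arithmetic, and it helps to show it. The sketch below uses the example figures from this answer plus an invented "other" bucket to make the categories sum to the 216 minutes of monthly downtime implied by 99.5%; all numbers are illustrative assumptions:

```python
MINUTES_PER_MONTH = 30 * 24 * 60          # 43,200
budget_999 = 0.001 * MINUTES_PER_MONTH    # 43.2 min allowed at 99.9%

# Assumed monthly downtime by cause at the current 99.5% (216 min total).
downtime = {"deployments": 100, "database": 50, "dependencies": 40, "other": 26}


def availability_after(fixes: dict[str, float]) -> float:
    """Availability if each cause's downtime is cut by the given fraction (0..1)."""
    remaining = sum(minutes * (1 - fixes.get(cause, 0.0))
                    for cause, minutes in downtime.items())
    return 1 - remaining / MINUTES_PER_MONTH


# Zero-downtime deploys (eliminate) plus multi-AZ database (cut 90%):
improved = availability_after({"deployments": 1.0, "database": 0.9})
print(f"improved availability: {improved:.4%}")
print(f"within 99.9% budget: {(1 - improved) * MINUTES_PER_MONTH <= budget_999}")
```

Running the what-if for each candidate fix makes the tradeoff concrete: the two biggest fixes alone still leave you short of budget, which tells you dependency hardening is also required before committing to the SLA.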
Red flags: Trying to improve availability without understanding root causes. Assuming you need 99.99% components to achieve 99.9% system availability (over-engineering). Not considering the cost-benefit tradeoff of availability improvements. Ignoring operational practices (deployment process, change management) and only focusing on technical architecture.
Q3: How do you prevent false positives in availability monitoring?
60-second answer: Require multiple consecutive failures before alerting—three failed checks over 90 seconds, not a single failure. Monitor from multiple locations and require failures from at least two locations (prevents network-specific issues from triggering alerts). Use percentage-based thresholds over time windows (“alert when <95% of checks succeed over 5 minutes”) rather than single-check alerts. Implement alert suppression during known maintenance windows. Regularly review false positive rates and tune thresholds.
2-minute detailed answer: False positives erode trust in monitoring and lead to alert fatigue, so preventing them is critical. First, implement multi-check confirmation: require 3 consecutive failures over 90 seconds before alerting. This filters transient network blips and brief service restarts. Second, use geographic redundancy: require failures from at least 2 of your 5 monitoring locations before alerting. A single location failure might be a network issue, but multiple locations failing indicates a real problem. Third, use statistical thresholds: instead of alerting on any failure, alert when success rate drops below 95% over a 5-minute window. This tolerates occasional failures while catching sustained degradation. Fourth, implement smart alert suppression: automatically suppress alerts during scheduled maintenance, known deployment windows, or when upstream dependencies are already alerting (no need to alert on 50 services if the root cause database is already alerting). Fifth, use anomaly detection for trend-based alerts: alert when availability deviates significantly from historical patterns, not just when it crosses a fixed threshold. Sixth, implement alert escalation: first alert creates a ticket, second alert (sustained issue) pages on-call. Finally, regularly review false positive rates—aim for <5% false positives on critical alerts—and tune thresholds based on actual data. PagerDuty’s research shows that teams with >20% false positive rates start ignoring alerts, leading to missed real incidents.
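The first three ideas (consecutive failures, multi-location confirmation, and a success-rate floor) compose naturally into a single alert gate. This is a hedged sketch; the function name, thresholds, and data shape are assumptions, not a standard API:

```python
# Alert gate combining three filters: N consecutive failures, at least M
# failing locations, and a global success-rate floor over the window.
# All constants are illustrative defaults.

def should_alert(recent_checks: dict[str, list[bool]],
                 consecutive: int = 3,
                 min_locations: int = 2,
                 success_floor: float = 0.95) -> bool:
    """recent_checks maps monitoring location -> check results (oldest first)."""
    failing_locations = sum(
        1 for results in recent_checks.values()
        if len(results) >= consecutive and not any(results[-consecutive:])
    )
    all_results = [r for results in recent_checks.values() for r in results]
    success_rate = sum(all_results) / len(all_results) if all_results else 1.0
    return failing_locations >= min_locations and success_rate < success_floor


checks = {
    "us-east": [True, False, False, False],  # 3 consecutive failures
    "eu-west": [True, False, False, False],  # 3 consecutive failures
    "ap-se":   [True, True, True, True],     # still healthy
}
print(should_alert(checks))  # True: two locations confirm a sustained failure
```

A single location flapping, or one isolated failed check, passes through this gate silently, which is exactly the false-positive behavior you want to filter.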
Red flags: Alerting on single failed health checks. Not accounting for transient network issues. Running all monitoring from a single location. Not distinguishing between severity levels (everything pages on-call). Never reviewing or tuning alert thresholds after initial setup.
Q4: How does availability monitoring differ from health monitoring?
60-second answer: Health monitoring shows current status (“Is the service up right now?”) while availability monitoring tracks historical uptime (“What percentage of time was the service up this month?”). Health monitoring is real-time and used for immediate alerting and load balancing decisions. Availability monitoring aggregates data over time and is used for SLA reporting, trend analysis, and capacity planning. You need both: health monitoring to detect and respond to issues, availability monitoring to measure reliability and drive improvements.
2-minute detailed answer: Health monitoring is about the present moment—it answers “Can I route traffic to this server right now?” Load balancers use health checks every few seconds to decide which servers receive requests. If a health check fails, the server is immediately removed from rotation. Health monitoring is binary (healthy/unhealthy) and focused on operational decisions. Availability monitoring is about historical patterns—it answers “How reliable has this service been over time?” It aggregates health check results, real user requests, and incident data into uptime percentages over hours, days, and months. Availability monitoring is used for SLA compliance reporting, identifying reliability trends (“Availability has decreased 0.1% each month for three months”), capacity planning (“We need more redundancy to achieve 99.99%”), and post-incident analysis. The data sources overlap but serve different purposes: the same health check that tells a load balancer “this server is unhealthy” also contributes to availability calculations showing “the service was down for 5 minutes this hour.” In practice, you need both: health monitoring for immediate operational decisions (routing, alerting, auto-scaling) and availability monitoring for strategic decisions (architecture improvements, SLA negotiations, reliability investments). Google’s SRE practices distinguish between “monitoring” (real-time health) and “observability” (historical analysis including availability), emphasizing that both are essential for reliable systems.
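The aggregation relationship described above (the same check results serving both purposes) can be made concrete with a small sketch. The function name and the simplifying assumption that each failed check represents one full check interval of downtime are both illustrative:

```python
# Roll per-check health results up into an uptime percentage.
# Assumption: each failed check counts as one full interval of downtime.

def uptime_percentage(check_results: list[bool], interval_seconds: int = 30) -> float:
    if not check_results:
        return 100.0
    down_seconds = check_results.count(False) * interval_seconds
    total_seconds = len(check_results) * interval_seconds
    return 100 * (1 - down_seconds / total_seconds)


# 2,880 checks/day at 30s intervals; 10 failed checks ~ 5 minutes of downtime.
day = [True] * 2870 + [False] * 10
print(f"{uptime_percentage(day):.3f}% uptime for the day")
```

The load balancer consumed those same booleans one at a time to make routing decisions; the availability system consumes them in bulk to answer the historical question.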
Red flags: Thinking they’re the same thing. Using only health monitoring without tracking historical availability. Not understanding that availability monitoring requires aggregation over time. Believing you can achieve high availability just by having good health checks.
Q5: Your availability monitoring shows 99.9% but customers are complaining about downtime. What’s wrong?
60-second answer: Several possibilities: (1) Monitoring only checks from your datacenter, missing regional outages affecting customers. (2) Monitoring checks a simple health endpoint, missing complex user flows that are failing. (3) Monitoring doesn’t account for latency—slow responses count as “up” but feel like downtime to users. (4) Aggregate availability hides customer-specific issues—overall 99.9% might mask 95% for some customers. (5) Monitoring misses partial degradation—service is “up” but returning errors for some requests. Solution: Add real user monitoring, track availability per customer segment, include latency in availability definition.
2-minute detailed answer: This discrepancy between monitoring and user experience is common and reveals monitoring blind spots. First, check geographic coverage: if you’re monitoring only from US-East but have global customers, you’re missing regional issues. A DNS problem affecting Asia won’t show in US-based monitors. Add monitoring from all regions where you have users. Second, examine what you’re monitoring: a simple health check endpoint might return 200 OK while complex user flows (checkout, search, video playback) are failing. Implement synthetic monitoring that tests complete user journeys, not just server liveness. Third, review your availability definition: if you’re only checking HTTP status codes, slow responses (2-5 seconds) count as “available” even though users experience them as failures. Redefine availability to include latency: “successful response within 500ms.” Fourth, implement real user monitoring (RUM) to track actual user requests, not just synthetic checks. RUM often reveals issues synthetic monitoring misses—mobile-specific bugs, authentication problems, edge cases. Fifth, break down aggregate availability by customer segment: per-customer, per-region, per-feature. Overall 99.9% might hide that 5% of customers experienced 95% availability due to a database shard issue. Sixth, check for partial degradation: your service might be “up” (responding to health checks) but returning errors for 10% of requests. Track success rate, not just uptime. Finally, correlate monitoring data with customer complaints: when customers report issues, what did monitoring show? This analysis reveals monitoring gaps and drives improvements.
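The per-segment breakdown in the fifth point is simple to implement and is often the single most revealing fix. In this sketch, the shard names and request counts are invented to reproduce the pattern described (a healthy aggregate hiding one badly degraded segment):

```python
from collections import defaultdict

# Each record: (customer_shard, request_succeeded). Figures are invented to
# show an aggregate that looks fine while one shard is far below SLA.
requests = ([("shard-a", True)] * 950 + [("shard-a", False)] * 1
            + [("shard-b", True)] * 40 + [("shard-b", False)] * 9)


def availability_by_segment(segment_requests):
    totals, ok = defaultdict(int), defaultdict(int)
    for segment, success in segment_requests:
        totals[segment] += 1
        ok[segment] += success
    return {seg: ok[seg] / totals[seg] for seg in totals}


overall = sum(success for _, success in requests) / len(requests)
print(f"overall: {overall:.3%}")  # looks acceptable in aggregate
for segment, availability in availability_by_segment(requests).items():
    print(f"{segment}: {availability:.3%}")  # shard-b is the hidden outage
```

The same grouping works for region, API endpoint, or client platform; the key is that the group-by dimension matches how customers actually experience the service.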
Red flags: Dismissing customer complaints because monitoring shows good availability. Not monitoring from customer locations. Defining availability without considering user experience. Not tracking real user requests, only synthetic checks. Not breaking down aggregate metrics to find customer-specific issues.
Red Flags to Avoid
Red flag 1: “Availability and uptime are the same thing.” While related, they’re not identical. Uptime measures whether a service is running; availability measures whether it’s accessible and functional for users. A service can have 100% uptime (servers running) but 95% availability (network issues preventing user access). What to say instead: “Availability measures user-facing accessibility over time, calculated as successful requests divided by total requests, or uptime divided by total time. It’s broader than uptime because it accounts for network issues, performance degradation, and partial failures that don’t show up in simple uptime metrics.”
Red flag 2: “We achieve 99.99% availability by having really good servers.” High availability comes from architecture (redundancy, failover, graceful degradation), not just reliable hardware. Even the most reliable server eventually fails; high availability requires surviving those failures. What to say instead: “High availability requires architectural patterns like redundancy (multiple servers), geographic distribution (multi-region), automatic failover, circuit breakers, and graceful degradation. Even with 99.99% reliable servers, you need multiple servers in parallel to achieve system-level 99.99% availability. The math is: two 99.9% servers in active-active configuration give 99.9999% availability (1 - 0.001²).”
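The redundancy math in that answer generalizes to N replicas: an active-active system fails only when every replica fails at the same time. A minimal sketch, assuming independent replica failures (the function name is invented):

```python
# Parallel redundancy: the system is down only if all replicas are down
# simultaneously. Assumes independent failures, which real correlated
# outages (shared network, bad deploy to all replicas) can violate.

def parallel_availability(replica_availability: float, replicas: int) -> float:
    joint_failure = (1 - replica_availability) ** replicas
    return 1 - joint_failure


print(f"one 99.9% server:     {parallel_availability(0.999, 1):.4%}")
print(f"two in active-active: {parallel_availability(0.999, 2):.4%}")
```

The independence caveat matters in interviews: a bad config pushed to both replicas takes them down together, so the real-world gain from redundancy is smaller than the formula suggests.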
Red flag 3: “We don’t need availability monitoring because we have health checks.” Health checks show current status; availability monitoring tracks historical reliability. You need both. Health checks enable operational decisions (routing, alerting); availability monitoring enables strategic decisions (SLA compliance, reliability investment). What to say instead: “Health checks and availability monitoring serve different purposes. Health checks provide real-time status for operational decisions—load balancers use them to route traffic, alerting systems use them to page on-call. Availability monitoring aggregates historical data to measure reliability over time, track SLA compliance, identify trends, and guide architectural improvements. We need both: health checks for immediate response, availability monitoring for long-term reliability.”
Red flag 4: “Our monitoring shows 100% availability, so we’re perfectly reliable.” 100% availability in monitoring often means your monitoring has blind spots, not that your system is perfect. Real systems have failures; monitoring should catch them. What to say instead: “If monitoring shows 100% availability but we know systems have failures, it suggests monitoring gaps. We should validate monitoring by: (1) Checking if it catches known issues from past incidents. (2) Running chaos engineering experiments to verify monitoring detects injected failures. (3) Comparing synthetic monitoring with real user monitoring—if synthetic shows 100% but RUM shows 99.5%, we’re missing real issues. (4) Reviewing customer complaints to see if users report problems monitoring missed. Perfect availability is suspicious; we should investigate whether monitoring is comprehensive enough.”
Red flag 5: “We’ll just add more nines to the SLA—99.999% sounds better than 99.9%.” Each additional nine requires exponentially more effort and cost. 99.9% allows 43 minutes downtime per month; 99.99% allows 4.3 minutes; 99.999% allows 26 seconds. The architectural complexity and operational burden increase dramatically. What to say instead: “Availability targets should be based on business requirements and cost-benefit analysis, not arbitrary numbers. Each nine costs roughly 10x more to achieve: 99.9% might require basic redundancy, 99.99% requires multi-AZ architecture, 99.999% requires multi-region active-active with sophisticated failover. We should determine what availability users actually need—for many services, 99.9% is sufficient and 99.99% isn’t worth the cost. Google’s SRE book recommends setting SLOs based on user pain: what availability level causes users to complain or leave? That’s your target, not the highest number possible.”
Key Takeaways
- Availability monitoring tracks historical uptime and accessibility, answering “How reliable has this system been?” by aggregating data into percentages (99.9%, 99.99%) over time periods. It differs from health monitoring (current status) by providing long-term reliability metrics for SLA compliance, trend analysis, and architectural decisions.
- Monitor from the user’s perspective using both synthetic checks and real user monitoring (RUM). Synthetic monitors probe your system from multiple geographic locations to catch issues proactively. RUM tracks actual user requests to reveal real-world availability. Combine both: synthetic for consistent baseline and early detection, RUM for ground truth.
- Availability is multiplicative across dependencies: When services depend on each other serially, multiply their availabilities. Five 99.9% services in a chain yield about 99.5% end-to-end availability (0.999^5). This math drives architectural decisions—reduce dependencies, add redundancy, or accept lower availability. Use composite availability tracking to identify reliability bottlenecks.
- Each “nine” in availability requires exponentially more effort: 99.9% allows 43 minutes downtime per month, 99.99% allows 4.3 minutes, 99.999% allows 26 seconds. The cost and complexity increase roughly 10x per nine. Set availability targets based on business requirements and user pain, not arbitrary numbers. Use error budgets to balance reliability investment with feature velocity.
- Prevent false positives through multi-check confirmation, geographic redundancy, and statistical thresholds. Require multiple consecutive failures from multiple locations before alerting. Use percentage-based thresholds over time windows (“alert when <95% success over 5 minutes”) rather than single-failure alerts. Regularly review false positive rates and tune thresholds—aim for <5% false positives on critical alerts to maintain trust in monitoring.
Related Topics
Prerequisites: Health Checks (understand real-time status monitoring before historical availability tracking), Metrics and Logging (availability monitoring relies on metrics collection and aggregation), SLAs and SLOs (availability targets are defined in SLAs/SLOs)
Related concepts: Alerting Strategies (availability monitoring drives alerts for SLA violations and degradation), Incident Management (availability data is used during incident response and post-mortems), Observability (availability monitoring is one pillar of observability)
Next steps: Reliability Engineering (use availability data to improve system reliability), Chaos Engineering (validate that availability monitoring catches failures), Multi-Region Architecture (achieve high availability through geographic redundancy)