Health Endpoint Monitoring for High Availability

intermediate 25 min read Updated 2026-02-11

TL;DR

Health endpoint monitoring exposes HTTP endpoints that external systems can poll to verify application health. Unlike passive monitoring that waits for failures, health checks actively probe dependencies (databases, caches, queues) to detect degradation before users notice. Cheat sheet: Implement /health (liveness) and /ready (readiness) endpoints. Liveness = “is the process alive?”, readiness = “can it handle traffic?”. Check critical dependencies with timeouts, return 200/503 with JSON details, cache results to avoid cascading failures.

The Analogy

Think of health endpoints like the diagnostic port mechanics plug into your car. Instead of waiting for the engine to die on the highway, the mechanic can connect a scanner and see “transmission fluid low, battery at 60%, oxygen sensor failing” before you leave the lot. Your load balancer is the mechanic, your service is the car, and the health endpoint is that diagnostic port—it tells the mechanic whether to send you onto the highway or keep you in the garage for repairs.

Why This Matters in Interviews

Health endpoint monitoring comes up in almost every system design interview when discussing high availability, load balancing, or deployment strategies. Interviewers want to see that you understand the difference between liveness and readiness, can design checks that don’t create cascading failures, and know how to integrate health checks with orchestration systems like Kubernetes. Mid-level engineers should explain basic implementation; senior engineers should discuss trade-offs around check depth and frequency; staff+ engineers should address organizational patterns like centralized health dashboards and SLO integration.


Core Concept

Health endpoint monitoring is a reliability pattern where services expose HTTP endpoints that external monitoring systems can query to determine operational status. Unlike traditional monitoring that observes metrics after the fact, health checks provide real-time, actionable signals that automation can use immediately—load balancers remove unhealthy instances, orchestrators restart failing containers, deployment systems halt rollouts.

The pattern emerged from the shift to microservices and container orchestration. When Netflix moved from monolithic deployments to hundreds of microservices in 2011, they needed a way for their Eureka service registry to know which instances could handle traffic. Simply checking if a process was running wasn’t enough—a service might be alive but unable to reach its database. Health endpoints solved this by letting each service self-report its ability to serve requests, considering all its dependencies.

Modern implementations distinguish between two types of health checks: liveness (is the process fundamentally broken and needs restart?) and readiness (is the process temporarily unable to serve traffic but will recover?). This distinction prevents orchestrators from thrashing—repeatedly killing and restarting services that are temporarily degraded but will self-heal. Kubernetes popularized this two-endpoint model, and it’s now considered best practice across cloud-native systems.

How It Works

Step 1: Service exposes health endpoints. The application creates HTTP endpoints (typically /health/live and /health/ready) that return HTTP 200 for healthy, 503 for unhealthy, with optional JSON body containing check details. The liveness endpoint performs minimal checks—can the process allocate memory, are critical threads running? The readiness endpoint performs deeper checks—can we reach the database, is the cache responsive, is the message queue accepting connections?
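A minimal, framework-agnostic sketch of the two handlers in Python (the check payloads and names are illustrative, not a specific library's API):

```python
import json

def liveness() -> tuple[int, str]:
    # Liveness: minimal, no dependency checks. If we can run this
    # function, the process is alive and there is no reason to restart.
    return 200, json.dumps({"status": "healthy"})

def readiness(checks: dict) -> tuple[int, str]:
    # Readiness: aggregate dependency check results. Any failing check
    # means "do not route traffic here", signaled by HTTP 503.
    healthy = all(c["status"] == "healthy" for c in checks.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": checks}
    return (200 if healthy else 503), json.dumps(body)
```

For example, `readiness({"postgres": {"status": "healthy", "latency_ms": 12}})` returns a 200 with details, while any unhealthy entry flips the status code to 503 so load balancers react even if they never parse the body.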

Step 2: Health check logic executes. When a request hits the endpoint, the service runs a series of checks against its dependencies. Each check has a timeout (typically 1-5 seconds) to prevent hanging. For example, a readiness check might execute SELECT 1 against the primary database with a 2-second timeout, ping Redis with a 1-second timeout, and verify the message queue connection. Checks run in parallel when possible to minimize total latency.
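The parallel-with-timeouts step can be sketched like this (a simplified model: `fut.result(timeout=...)` only bounds how long we wait, so real dependency clients must also enforce their own connect/read timeouts):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_checks(checks: dict, timeout_s: float = 2.0) -> dict:
    # `checks` maps a check name to a zero-argument callable that raises
    # on failure. All checks start in parallel; the total wait is bounded
    # by one shared deadline rather than timeout_s per check.
    def timed(fn):
        def run():
            t0 = time.monotonic()
            fn()
            return round((time.monotonic() - t0) * 1000)
        return run

    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(timed(fn)) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, fut in futures.items():
        try:
            # Bounds only how long we *wait* for the result; a hung check's
            # thread keeps running until its own client timeout fires.
            latency = fut.result(timeout=max(0.0, deadline - time.monotonic()))
            results[name] = {"status": "healthy", "latency_ms": latency}
        except FutureTimeout:
            results[name] = {"status": "unhealthy", "error": "timeout"}
        except Exception as exc:
            results[name] = {"status": "unhealthy", "error": str(exc)}
    pool.shutdown(wait=False)
    return results
```

Running the checks concurrently means the endpoint's latency approaches the slowest single check rather than the sum of all of them.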

Step 3: Results are aggregated and cached. The service combines individual check results into an overall health status. Critically, results are cached for 5-30 seconds to prevent health check storms—if 100 load balancers each poll every second, you don’t want 100 database queries per second just for health checks. The cache ensures that even under heavy monitoring, the health check overhead remains bounded.
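The caching step reduces to a small TTL wrapper (single-threaded sketch for clarity; a production version needs a lock so that several pollers arriving at expiry don't each trigger a real check run):

```python
import time

class CachedHealth:
    """Serve a cached health result; re-run real checks at most once per TTL."""

    def __init__(self, run_checks, ttl_s: float = 30.0):
        self._run_checks = run_checks   # callable returning (status_code, body)
        self._ttl = ttl_s
        self._cached = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires:
            # Only this call touches real dependencies; every other poll
            # inside the TTL window is served from memory.
            self._cached = self._run_checks()
            self._expires = now + self._ttl
        return self._cached
```

With this in place, 100 load balancers polling every second still generate at most one real dependency check per TTL window per instance.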

Step 4: External systems consume health signals. Load balancers poll the readiness endpoint every 5-10 seconds. If a service returns 503, the load balancer removes it from rotation but keeps polling. When it returns 200 again, traffic resumes. Orchestrators like Kubernetes use liveness checks to decide when to kill and restart containers—three consecutive liveness failures typically trigger a restart. Deployment systems check health during rollouts, halting if new instances fail health checks.

Step 5: Monitoring systems aggregate health data. Centralized monitoring pulls health status from all services, creating dashboards that show system-wide health. When health degrades, alerts fire before user-facing errors occur. This proactive detection is the pattern’s key value—you learn about database connection pool exhaustion from health checks, not from customer complaints.

Health Check Request Flow with Caching and Circuit Breaker

graph LR
    LB["Load Balancer"]
    HealthEndpoint["Health Endpoint<br/>/health/ready"]
    Cache["Result Cache<br/>TTL: 30s"]
    CB["Circuit Breaker"]
    DB[("Database")]
    Redis[("Redis Cache")]
    Queue["Message Queue"]
    
    LB --"1. GET /health/ready<br/>every 5s"--> HealthEndpoint
    HealthEndpoint --"2. Check cache"--> Cache
    Cache --"3a. Cache hit<br/>return cached result"--> HealthEndpoint
    Cache --"3b. Cache miss<br/>run checks"--> CB
    CB --"4a. Circuit open<br/>return unhealthy"--> HealthEndpoint
    CB --"4b. Circuit closed<br/>execute checks"--> DB
    CB --"parallel check"--> Redis
    CB --"parallel check"--> Queue
    DB --"5. SELECT 1<br/>timeout: 2s"--> CB
    Redis --"PING<br/>timeout: 1s"--> CB
    Queue --"connection test<br/>timeout: 2s"--> CB
    CB --"6. Aggregate results<br/>store in cache"--> Cache
    HealthEndpoint --"7. HTTP 200/503<br/>+ JSON details"--> LB

Health check flow showing how caching and circuit breakers prevent dependency overload. The first request executes actual checks, subsequent requests within 30 seconds serve cached results. Circuit breakers stop checking dependencies after repeated failures, preventing health check storms during incidents.

Key Principles

Principle 1: Separate liveness from readiness. Liveness checks answer “should we restart this process?” and should only fail for unrecoverable conditions like memory corruption or deadlock. Readiness checks answer “should we send traffic here?” and can fail for transient issues like database connection pool exhaustion. Mixing these causes restart loops—if your liveness check fails because the database is slow, Kubernetes kills your pod, which doesn’t fix the database. Netflix learned this the hard way in 2012 when aggressive liveness checks caused cascading restarts during a database incident, making the outage worse. Keep liveness checks minimal (process alive, critical threads running) and put dependency checks in readiness.

Principle 2: Implement timeouts and circuit breakers in health checks. A health check that hangs waiting for a dead database is worse than no health check—it ties up threads and prevents the monitoring system from getting any signal. Every dependency check needs a timeout (1-5 seconds maximum). Better yet, use circuit breakers: after three consecutive timeouts to the database, stop checking it for 30 seconds and immediately return unhealthy. This prevents health check storms during incidents. Stripe’s payment API implements this pattern—their health checks use 2-second timeouts and circuit breakers that open after 5 failures, preventing health check traffic from overwhelming already-struggling dependencies.
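One way to sketch a circuit breaker around a single dependency check (the threshold and cooldown values are illustrative, not any particular company's configuration):

```python
import time

class HealthCheckBreaker:
    """Stop probing a dependency after repeated failures.

    After `threshold` consecutive failures the circuit opens for
    `cooldown_s` seconds: the check returns unhealthy immediately,
    so health traffic stops hammering a struggling dependency.
    """

    def __init__(self, check, threshold: int = 3, cooldown_s: float = 30.0):
        self._check = check          # zero-arg callable, raises on failure
        self._threshold = threshold
        self._cooldown = cooldown_s
        self._failures = 0
        self._open_until = 0.0

    def probe(self) -> dict:
        if time.monotonic() < self._open_until:
            return {"status": "unhealthy", "error": "circuit open"}
        try:
            self._check()
        except Exception as exc:
            self._failures += 1
            if self._failures >= self._threshold:
                self._open_until = time.monotonic() + self._cooldown
            return {"status": "unhealthy", "error": str(exc)}
        self._failures = 0               # any success resets the count
        return {"status": "healthy"}
```

While the circuit is open, the endpoint still answers instantly with an unhealthy verdict—monitoring keeps getting a signal, but the dependency gets zero extra load.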

Principle 3: Cache health check results aggressively. If your service receives 1000 requests per second and your load balancer polls health every second, you don’t want health checks to add 1000 database queries per second. Cache the health check result for 10-30 seconds and serve cached responses to all health check requests during that window. The trade-off is slightly stale health information, but the alternative is health checks becoming a significant source of load. Google’s SRE book recommends caching health results for at least 10 seconds, noting that faster detection rarely matters—if a service is failing, 10 seconds to remove it from rotation is acceptable.

Principle 4: Make health checks observable and debuggable. Return detailed JSON with individual check results, not just a status code. Include timestamps, check durations, and error messages. When a service is unhealthy, operators need to know why—is it the database, the cache, or the message queue? Uber’s services return health responses like {"status": "unhealthy", "checks": {"postgres": {"status": "healthy", "latency_ms": 12}, "redis": {"status": "unhealthy", "error": "connection timeout", "latency_ms": 5000}}}. This detail makes debugging production incidents 10x faster.

Principle 5: Align health checks with SLOs. Your health check should fail when your service can’t meet its SLO, not before and not after. If your API promises 99.9% availability with p99 latency under 500ms, your readiness check should fail when you can’t deliver that. This means checking not just “is the database reachable” but “can we query the database fast enough to meet our latency SLO.” Twitter’s health checks include latency percentile checks—if p99 query latency exceeds 200ms, the service reports unhealthy even though it’s technically functional. This prevents degraded instances from staying in rotation and dragging down overall performance.
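The SLO-aligned idea can be sketched as a readiness verdict over recent query latencies (the simple percentile estimator and the 200 ms budget are illustrative):

```python
def p99(samples_ms):
    # Nearest-rank percentile estimate over a recent latency window.
    s = sorted(samples_ms)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def slo_readiness(query_latencies_ms, p99_budget_ms: float = 200.0) -> dict:
    # Fail readiness when the service *cannot meet its SLO*, not merely
    # when a dependency is unreachable: a reachable-but-slow database
    # still breaks the latency promise.
    observed = p99(query_latencies_ms)
    healthy = observed <= p99_budget_ms
    return {"status": "healthy" if healthy else "unhealthy",
            "p99_ms": observed, "budget_ms": p99_budget_ms}
```

An instance whose queries succeed but run at 900 ms p99 reports unhealthy and leaves rotation, instead of dragging down fleet-wide latency.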

Liveness vs Readiness Check Decision Tree

flowchart TB
    Start["Health Check Failure Detected"]
    Question1{"Is this an<br/>unrecoverable<br/>process failure?"}
    Question2{"Can service<br/>recover without<br/>restart?"}
    Question3{"Can service<br/>meet SLO without<br/>this dependency?"}
    
    Liveness["LIVENESS CHECK<br/>Return 503<br/>→ Orchestrator restarts pod"]
    Readiness["READINESS CHECK<br/>Return 503<br/>→ Remove from load balancer"]
    Degraded["DEGRADED STATE<br/>Return 200<br/>→ Keep in rotation<br/>with reduced weight"]
    NoAction["DON'T FAIL<br/>Return 200<br/>→ Service continues normally"]
    
    Start --> Question1
    Question1 --"Yes<br/>(deadlock, memory corruption)"--> Liveness
    Question1 --"No"--> Question2
    Question2 --"No<br/>(database down)"--> Readiness
    Question2 --"Yes<br/>(transient network blip)"--> Question3
    Question3 --"No<br/>(critical dependency)"--> Readiness
    Question3 --"Yes, with degradation<br/>(cache down, can use DB)"--> Degraded
    Question3 --"Yes, no degradation<br/>(non-critical service)"--> NoAction

Decision tree for determining whether a failure should trigger liveness checks (restart), readiness checks (traffic removal), degraded state (reduced capacity), or no action. Liveness failures are rare and indicate unrecoverable process issues; readiness failures are common and indicate temporary inability to serve traffic.


Deep Dive

Types / Variants

Shallow health checks verify only that the process is running and can respond to HTTP requests. They don’t check dependencies. Use these for liveness checks where you want to detect process crashes or deadlocks but not transient dependency issues. Kubernetes liveness probes typically use shallow checks—they just hit an endpoint that returns 200 if the web server can handle requests. The advantage is zero dependency on external systems, so they never fail due to cascading issues. The disadvantage is they don’t catch the most common failure mode: the service is alive but can’t reach its database.

Deep health checks verify all critical dependencies: databases, caches, message queues, downstream APIs. They execute actual operations (SELECT 1, Redis PING, queue connection test) to verify functionality. Use these for readiness checks where you want to remove instances from load balancer rotation when dependencies fail. Netflix’s Eureka clients use deep readiness checks that verify connectivity to all backing services. The advantage is accurate detection of inability to serve traffic. The disadvantage is complexity—you need timeouts, circuit breakers, and caching to prevent health checks from causing incidents.

Synthetic transaction health checks execute a real user workflow end-to-end. Instead of checking “can we reach the database,” they execute “can we fetch a user profile and return it.” Stripe’s payment API health checks actually process a test payment through the full stack, verifying that the entire pipeline works. Use these for critical user journeys where you want to detect subtle integration issues. The advantage is catching problems that component checks miss—maybe the database is up but a schema migration broke your queries. The disadvantage is high cost and complexity—synthetic transactions consume real resources and can be slow.

Dependency health aggregation is a pattern where services expose not just their own health but the health of their dependencies. When Service A calls Service B, Service A’s health check includes Service B’s health. This creates a health dependency tree. Use this when you want a single health check to represent an entire subsystem. Google’s services often implement this—checking the frontend health automatically checks all backend services it depends on. The advantage is simplified monitoring—one check tells you if the whole stack is healthy. The disadvantage is cascading health failures—if a leaf service fails, every service that depends on it reports unhealthy, making it hard to identify the root cause.

Startup probes are a Kubernetes-specific variant that handles slow-starting applications. Some services take 60+ seconds to initialize (loading large datasets into memory, warming caches). During startup, liveness checks would fail and trigger restarts, creating a restart loop. Startup probes run during initialization with longer timeouts and more retries. Once the startup probe succeeds, Kubernetes switches to regular liveness checks. Use these for services with long initialization times. The advantage is preventing restart loops for legitimately slow-starting services. The disadvantage is added complexity in your health check configuration.

Trade-offs

Check depth vs. overhead: Shallow checks (just return 200) have near-zero overhead but miss most failure modes. Deep checks (verify all dependencies) catch real problems but consume resources and can cause cascading failures if not implemented carefully. The decision framework: use shallow checks for liveness (restart decisions), deep checks for readiness (traffic routing decisions). For readiness, check only dependencies that would prevent serving traffic—if your service can handle 80% of requests without the cache, don’t fail health checks when Redis is down. Netflix uses this approach: their readiness checks verify the database (critical) but not the recommendation service (degraded experience, not complete failure).

Synchronous vs. asynchronous health checks: Synchronous checks run when the health endpoint is called, executing all dependency checks in the request path. This is simple but means health check latency equals the sum of all dependency check latencies. Asynchronous checks run in a background thread every N seconds, updating a cached result that the health endpoint returns immediately. The decision framework: use asynchronous checks for services with many dependencies or slow checks. The trade-off is stale health information (up to N seconds old) versus blocking health check requests. Uber’s services use asynchronous checks with 10-second refresh intervals—the health endpoint returns in <1ms by serving cached results, but health status can be up to 10 seconds stale.
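The asynchronous variant can be sketched with a background refresher thread (the interval is illustrative; the endpoint handler just reads `status`, which stays at most one interval stale):

```python
import threading
import time

class AsyncHealth:
    """Refresh health in a background thread; the endpoint never blocks."""

    def __init__(self, run_checks, interval_s: float = 10.0):
        self._run_checks = run_checks      # callable returning a status dict
        self._interval = interval_s
        self.status = {"status": "unknown"}   # until the first refresh lands
        self._stop = threading.Event()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while not self._stop.is_set():
            try:
                # Replacing the whole dict reference keeps reads consistent
                # without a lock (reference assignment is atomic in CPython).
                self.status = self._run_checks()
            except Exception as exc:
                self.status = {"status": "unhealthy", "error": str(exc)}
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()
```

The health endpoint simply returns `self.status`, so its latency is independent of how many dependencies exist or how slow they are.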

Fail open vs. fail closed: When a health check times out or errors, should you return healthy (fail open) or unhealthy (fail closed)? Fail open keeps the service in rotation despite check failures, risking sending traffic to broken instances. Fail closed removes the service from rotation, risking removing healthy instances due to transient check failures. The decision framework: fail closed for checks that rarely fail (database connectivity—if this times out, something is seriously wrong). Fail open for checks that might have false positives (checking a non-critical downstream service that’s flaky). Google’s SRE guidance recommends fail closed by default, with explicit fail-open configuration for non-critical checks.
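A sketch of making the fail-open/fail-closed choice explicit per check at aggregation time (check names are illustrative):

```python
def aggregate(check_results: dict, fail_open=()) -> dict:
    """Fold per-dependency results into one readiness verdict.

    Checks listed in `fail_open` are reported in the detail body but
    never flip overall health (non-critical or flaky dependencies);
    everything else fails closed, the recommended default.
    """
    critical_ok = all(r["status"] == "healthy"
                      for name, r in check_results.items()
                      if name not in fail_open)
    return {"status": "healthy" if critical_ok else "unhealthy",
            "checks": check_results}
```

This keeps the policy in one place: operators can see a flaky downstream service flagged in the JSON while the instance stays in rotation.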

Individual vs. aggregated health endpoints: Should you expose one endpoint that returns overall health, or separate endpoints for each dependency? One endpoint (/health) is simpler for consumers but makes debugging harder—you know the service is unhealthy but not why. Multiple endpoints (/health/db, /health/cache, /health/queue) provide granular visibility but require consumers to check multiple URLs. The decision framework: expose one aggregated endpoint for automated systems (load balancers, orchestrators) and return detailed JSON with per-dependency status for debugging. This gives you both simplicity for automation and detail for operators. Note that load balancers such as AWS’s Application Load Balancer evaluate only the HTTP status code of the single endpoint they poll—the per-dependency JSON detail exists for operators and monitoring systems, not for the load balancer itself.

Static vs. dynamic health criteria: Should health checks use fixed thresholds (database query must complete in <100ms) or dynamic thresholds (database query must complete in <p99 latency from the last hour)? Static thresholds are simple but can cause false positives during legitimate load spikes. Dynamic thresholds adapt to current conditions but can hide gradual degradation. The decision framework: use static thresholds for hard limits (connection timeouts, error rates) and dynamic thresholds for latency checks. Twitter’s services use static thresholds for error rates (>1% errors = unhealthy) but dynamic thresholds for latency (>2x recent p99 = unhealthy), preventing false positives during traffic spikes while catching real degradation.

Common Pitfalls

Pitfall 1: Health checks that cause cascading failures. You implement deep health checks that query the database on every request. During a database slowdown, health checks pile up, consuming all database connections, making the problem worse. This happened to a major e-commerce site during Black Friday 2019—their health checks consumed 40% of database capacity during an incident, preventing recovery. Why it happens: Not caching health check results and not implementing timeouts. How to avoid: Cache health results for 10-30 seconds, use aggressive timeouts (1-2 seconds), and implement circuit breakers that stop checking a dependency after repeated failures.

Pitfall 2: Liveness checks that fail for transient issues. Your liveness check verifies database connectivity. During a brief network hiccup, the check fails, Kubernetes kills your pod, which doesn’t fix the network. The new pod starts, hits the same network issue, gets killed, creating a restart loop. Why it happens: Confusing liveness (should we restart?) with readiness (should we send traffic?). How to avoid: Keep liveness checks minimal—only check for unrecoverable process-level failures like deadlocks or memory corruption. Put all dependency checks in readiness probes. If your service can recover from a transient issue without restarting, don’t put that check in liveness.

Pitfall 3: Health checks that don’t reflect actual service capability. Your health check verifies the database is reachable but doesn’t check if your connection pool is exhausted. The service reports healthy while rejecting all requests with “no available connections.” Why it happens: Checking infrastructure (is the database up?) instead of capability (can I serve requests?). How to avoid: Health checks should verify the service’s ability to fulfill its SLO, not just the availability of dependencies. Check connection pool availability, queue depths, circuit breaker states—anything that would prevent serving traffic even if infrastructure is healthy.

Pitfall 4: Returning 200 with an error message in JSON. Your health check returns HTTP 200 with {"status": "unhealthy"} in the body because you want to provide details. Load balancers see the 200 status code and keep sending traffic to the unhealthy instance. Why it happens: Misunderstanding how load balancers interpret health checks—they typically only look at HTTP status codes, not response bodies. How to avoid: Return HTTP 503 for unhealthy, HTTP 200 for healthy. Put details in the JSON body for human debugging, but make sure the status code reflects health. Some load balancers can be configured to inspect response bodies, but don’t rely on it—status codes are the universal signal.

Pitfall 5: No health check at all, relying on error rates. You skip implementing health checks, figuring your monitoring will alert on elevated error rates. During a database failover, your service sends 30 seconds of errors to users before monitoring detects the issue and alerts. Why it happens: Underestimating the value of proactive health checks versus reactive error monitoring. How to avoid: Implement health checks even if you have comprehensive monitoring. Health checks enable automation (load balancers, orchestrators) to react in seconds, not minutes. They’re the difference between 30 seconds of errors (no health checks) and 5 seconds (health checks trigger immediate traffic removal).

Health Check Storm During Database Incident

sequenceDiagram
    participant LB1 as Load Balancer 1
    participant LB2 as Load Balancer 2
    participant Mon as Monitoring System
    participant Svc as Service Instance
    participant DB as Database<br/>(degraded)
    
    Note over DB: Database latency spikes to 5s
    
    rect rgb(255, 235, 238)
    Note over LB1,DB: WITHOUT caching/circuit breaker
    LB1->>Svc: Health check (t=0s)
    LB2->>Svc: Health check (t=0s)
    Mon->>Svc: Health check (t=0s)
    Svc->>DB: SELECT 1 (query 1)
    Svc->>DB: SELECT 1 (query 2)
    Svc->>DB: SELECT 1 (query 3)
    Note over DB: 3 concurrent queries, each taking 5s
    LB1->>Svc: Health check (t=5s)
    LB2->>Svc: Health check (t=5s)
    Svc->>DB: SELECT 1 (query 4)
    Svc->>DB: SELECT 1 (query 5)
    Note over DB: 5 concurrent queries<br/>consuming connection pool
    Note over Svc,DB: Health checks make incident WORSE
    end
    
    rect rgb(232, 245, 233)
    Note over LB1,DB: WITH caching + circuit breaker
    LB1->>Svc: Health check (t=0s)
    Svc->>DB: SELECT 1 (query 1)
    Note over Svc: Cache result for 30s<br/>Open circuit breaker
    LB2->>Svc: Health check (t=0s)
    Svc-->>LB2: Return cached unhealthy
    Mon->>Svc: Health check (t=1s)
    Svc-->>Mon: Return cached unhealthy
    LB1->>Svc: Health check (t=5s)
    Svc-->>LB1: Return cached unhealthy
    Note over DB: Only 1 query in 30s<br/>Circuit breaker prevents storm
    end

Comparison of health check behavior during a database incident. Without caching and circuit breakers (top), each health check queries the database, creating a storm of concurrent queries that worsens the incident. With caching and circuit breakers (bottom), only the first check queries the database, subsequent checks serve cached results, preventing overload.


Math & Calculations

Health check overhead calculation: Suppose your service handles 10,000 requests per second and you have 20 instances behind a load balancer. Each load balancer polls health every 5 seconds. Without caching, each health check queries the database. How much database load comes from health checks?

Without caching, every poll executes real checks:

Health check queries per second = instances × checkers × (1 / check_interval)

With one load balancer:
= 20 instances × 1 load_balancer × (1 check / 5 seconds)
= 4 health checks/second

If each health check queries the database:
= 4 database queries/second from health checks
= 0.04% of total load (4 / 10,000)

Seems negligible, but now add 3 load balancers and 5 monitoring systems:

Total health checkers = 3 load_balancers + 5 monitoring_systems = 8
Health check queries = 20 instances × 8 checkers × (1 / 5 seconds)
= 32 queries/second
= 0.32% of total load

Now consider a database incident where query latency spikes to 5 seconds. Health checks start timing out and retrying:

With 5-second health check timeout and 3 retries:
Concurrent health check queries = 32 queries/second × 5 seconds × 3 retries
= 480 concurrent queries

If your database connection pool has 500 connections:
Health checks consume 96% of connections during the incident

With 30-second caching, each instance runs at most one real check per cache window, no matter how many systems poll it:

Health check queries = instances / cache_TTL
= 20 / 30 ≈ 0.67 queries/second

During incident with 5-second timeout:
Concurrent queries = 0.67 × 5 × 3 = 10 concurrent queries
= 2% of connection pool
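Under the model used in this section—a TTL cache bounds real checks to at most one per instance per cache window—the arithmetic can be verified in a few lines (function names are illustrative):

```python
def health_check_qps(instances, checkers, interval_s, cache_ttl_s=0.0):
    """Real dependency queries per second generated by health checks."""
    if cache_ttl_s <= 0:
        # No caching: every poll from every checker hits the dependency.
        return instances * checkers / interval_s
    # With caching, pollers hit the cache; each instance re-runs real
    # checks at most once per TTL window.
    return instances / cache_ttl_s

def incident_concurrency(qps, timeout_s, retries):
    # Little's-law style estimate: in-flight = arrival rate x duration,
    # multiplied by retry attempts during the incident.
    return qps * timeout_s * retries
```

For the worked example: `health_check_qps(20, 8, 5)` gives 32 checks/second uncached and `incident_concurrency(32, 5, 3)` gives 480 in-flight queries; a 30-second TTL cuts the rate to 20/30 ≈ 0.67/second.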

Availability impact calculation: How much does health check frequency affect detection time and availability? Assume a service fails at time T, load balancer polls every N seconds, and requires M consecutive failures before removing from rotation.

Detection time = (M × N) seconds

With N=10 seconds, M=3:
Detection time = 30 seconds

If service handles 1000 req/sec and fails completely:
Failed requests = 1000 req/sec × 30 sec = 30,000 requests

With N=5 seconds, M=2:
Detection time = 10 seconds
Failed requests = 1000 × 10 = 10,000 requests

Availability improvement = (30,000 - 10,000) / 30,000 = 67% fewer errors

But faster polling increases load:

Health check load = instances × checkers × (1/N)

N=10: 20 × 8 × 0.1 = 16 checks/second
N=5:  20 × 8 × 0.2 = 32 checks/second (2x load)
N=1:  20 × 8 × 1.0 = 160 checks/second (10x load)

The sweet spot is typically N=5-10 seconds with M=2-3 failures, balancing detection speed against overhead.
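The detection-time and load formulas above reduce to three one-liners, which makes it easy to tabulate candidate (N, M) settings:

```python
def detection_time_s(interval_s, consecutive_failures):
    # Simple model from the text: M consecutive failed probes, N seconds apart.
    return consecutive_failures * interval_s

def failed_requests(rps, interval_s, consecutive_failures):
    # Requests that fail before the instance is removed from rotation.
    return rps * detection_time_s(interval_s, consecutive_failures)

def health_check_load(instances, checkers, interval_s):
    # Fleet-wide checks per second generated at poll interval N.
    return instances * checkers / interval_s
```

Sweeping N over {10, 5, 1} with 20 instances and 8 checkers reproduces the 16/32/160 checks-per-second figures, making the detection-speed vs. overhead trade-off concrete.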

Health Check Overhead Calculation

graph TB
    subgraph Scenario["System Configuration"]
        I["20 Service Instances"]
        C["8 Health Checkers<br/>(3 LBs + 5 monitors)"]
        F["Check Frequency: 5s"]
    end
    
    subgraph Without["WITHOUT Caching"]
        Calc1["Checks/sec = 20 × 8 × (1/5)<br/>= 32 checks/sec"]
        Calc2["During incident<br/>(5s timeout, 3 retries):<br/>Concurrent = 32 × 5 × 3<br/>= 480 queries"]
        Impact1["96% of 500-connection<br/>pool consumed by<br/>health checks"]
    end
    
    subgraph With["WITH 30s Caching"]
        Calc3["Checks/sec = 20 / 30<br/>= 0.67 checks/sec"]
        Calc4["During incident:<br/>Concurrent = 0.67 × 5 × 3<br/>= 10 queries"]
        Impact2["2% of connection<br/>pool consumed<br/>(98% reduction)"]
    end
    
    Scenario --> Without
    Scenario --> With
    Calc1 --> Calc2 --> Impact1
    Calc3 --> Calc4 --> Impact2

Mathematical comparison of health check overhead with and without caching. Without caching, health checks consume 96% of database connections during an incident. With 30-second caching, overhead drops to about 2%, preventing health checks from worsening the incident.


Real-World Examples

Netflix’s Eureka and health checks: Netflix’s Eureka service registry uses health checks to manage their microservices architecture of 700+ services. Each service instance sends a heartbeat to Eureka every 30 seconds (liveness) and exposes a /health endpoint that Eureka polls every 30 seconds (readiness). The interesting detail: Netflix uses a “self-preservation mode” where if more than 15% of instances fail health checks simultaneously, Eureka assumes it’s a network partition, not mass service failure, and stops removing instances from the registry. This prevented a cascading failure in 2014 when an AWS availability zone had network issues—without self-preservation, Eureka would have deregistered all instances in that AZ, causing clients to overload the remaining AZs. The health check implementation includes circuit breakers with 20-second timeouts and exponential backoff, ensuring that health checks never consume more than 5% of system resources even during incidents.

Kubernetes health checks at Spotify: Spotify runs 1,500+ microservices on Kubernetes, using liveness and readiness probes extensively. Their platform team mandates that all services implement both probe types with specific requirements: liveness probes must respond in <100ms and check only process health (no dependency checks), while readiness probes can take up to 5 seconds and must verify all critical dependencies. The interesting detail: Spotify’s readiness checks include a “startup grace period” where new pods report ready immediately for the first 60 seconds, even if dependencies aren’t fully healthy. This prevents cascading failures during cluster-wide restarts—if every service waited for downstream dependencies to be ready before reporting ready itself, you’d have a deadlock where nothing ever becomes ready. After the grace period, normal dependency checks kick in. This pattern reduced their deployment times by 70% and eliminated restart loops during infrastructure maintenance.
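The startup-grace idea can be sketched as a wrapper around the real readiness check (a hypothetical illustration of the pattern described above, not Spotify's actual code; the 60-second window is the figure from the text):

```python
import time

class StartupGatedReadiness:
    """Report ready during an initial grace window, then apply real checks.

    Kubernetes expresses a related idea natively with startupProbe, which
    defers liveness checking; this sketch instead defers dependency-based
    readiness failures to avoid cluster-wide restart deadlocks.
    """

    def __init__(self, real_check, grace_s: float = 60.0):
        self._real_check = real_check      # returns True when deps are up
        self._deadline = time.monotonic() + grace_s

    def ready(self) -> bool:
        if time.monotonic() < self._deadline:
            return True                    # grace period: don't fail on cold deps
        return self._real_check()
```

During a cluster-wide restart every pod reports ready immediately, traffic starts flowing, and dependency checks only begin gating readiness after the window closes.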

Stripe’s payment API health checks: Stripe’s payment processing API uses sophisticated health checks that go beyond simple dependency pings. Their /health endpoint executes a synthetic transaction: creating a test payment, authorizing it, and capturing it, all against a test merchant account. This end-to-end check catches subtle issues like database schema migrations that break specific queries or configuration changes that affect payment routing. The interesting detail: Stripe runs these synthetic checks every 10 seconds but caches results for 30 seconds to prevent overload. They also implement “degraded” health status (HTTP 200 but with "status": "degraded" in JSON) for scenarios where the service is functional but operating at reduced capacity—for example, if the primary database is healthy but a read replica is down. Load balancers keep degraded instances in rotation but with reduced weight, allowing Stripe to serve traffic during partial outages while signaling to monitoring systems that something needs attention. This three-state health model (healthy/degraded/unhealthy) reduced their false positive alerts by 80% while maintaining 99.99% availability.

Stripe’s Three-State Health Model

graph LR
    subgraph Health States
        Healthy["HEALTHY<br/>HTTP 200<br/>status: healthy<br/><br/>All dependencies up<br/>Full capacity"]
        Degraded["DEGRADED<br/>HTTP 200<br/>status: degraded<br/><br/>Core deps up<br/>Non-critical deps down<br/>Reduced capacity"]
        Unhealthy["UNHEALTHY<br/>HTTP 503<br/>status: unhealthy<br/><br/>Critical deps down<br/>Cannot serve traffic"]
    end
    
    subgraph Dependencies
        PrimaryDB[("Primary DB<br/>CRITICAL")]
        ReplicaDB[("Read Replica<br/>NON-CRITICAL")]
        Cache[("Redis Cache<br/>NON-CRITICAL")]
    end
    
    subgraph Load Balancer Actions
        Full["Full Weight<br/>100% traffic"]
        Reduced["Reduced Weight<br/>50% traffic<br/>+ Alert"]
        Remove["Remove from Pool<br/>0% traffic<br/>+ Page"]
    end
    
    PrimaryDB --"UP"--> Healthy
    ReplicaDB --"UP"--> Healthy
    Cache --"UP"--> Healthy
    
    PrimaryDB --"UP"--> Degraded
    ReplicaDB --"DOWN"--> Degraded
    Cache --"DOWN"--> Degraded
    
    PrimaryDB --"DOWN"--> Unhealthy
    
    Healthy --> Full
    Degraded --> Reduced
    Unhealthy --> Remove

Stripe’s three-state health model distinguishes between healthy (all dependencies up), degraded (non-critical dependencies down), and unhealthy (critical dependencies down). Load balancers keep degraded instances in rotation with reduced weight, preventing cascading failures while signaling issues to monitoring systems. This reduced false positive alerts by 80% while maintaining 99.99% availability.


Interview Expectations

Mid-Level

What you should know: Explain the difference between liveness and readiness checks clearly—liveness determines restart decisions, readiness determines traffic routing. Describe a basic implementation: expose /health/live and /health/ready endpoints, return 200 for healthy and 503 for unhealthy, include JSON with check details. Understand that health checks should verify critical dependencies (database, cache) with timeouts to prevent hanging. Know how load balancers use health checks to remove unhealthy instances from rotation.

Bonus points: Mention caching health check results to prevent overhead. Discuss the importance of timeouts on dependency checks (1-5 seconds). Explain that liveness checks should be minimal (process alive, critical threads running) while readiness checks can be deeper (verify dependencies). Reference how Kubernetes uses these probes. Show awareness that health checks themselves can cause problems if not implemented carefully (health check storms during incidents).

Senior

What you should know: Everything from mid-level plus trade-offs around check depth, frequency, and failure thresholds. Explain when to use shallow vs. deep health checks and the implications of each. Discuss how to prevent health checks from causing cascading failures (caching, circuit breakers, aggressive timeouts). Describe the relationship between health checks and SLOs—health should fail when you can’t meet your SLO. Explain patterns like startup probes for slow-starting services and degraded health states for partial outages.

Bonus points: Discuss the math behind health check overhead and how to calculate the impact on your system. Explain Netflix’s self-preservation mode or similar patterns that prevent health checks from making incidents worse. Describe how to implement health checks in a way that’s debuggable (detailed JSON responses with per-dependency status and latency). Talk about the trade-off between detection speed and false positives—faster polling catches failures sooner but increases false positive rate. Show understanding of how health checks integrate with deployment systems (canary deployments halt if new instances fail health checks).

Staff+

What you should know: Everything from senior plus organizational and architectural patterns. Discuss how to design health check standards across a large microservices architecture—what should be centralized vs. service-specific. Explain how to build health check aggregation systems that provide system-wide visibility without creating tight coupling. Describe patterns for handling health check versioning and evolution as services change. Talk about the relationship between health checks, SLOs, and incident response—how health check failures should trigger automated remediation.

Distinguishing signals: Propose a health check framework for a company with 500+ microservices that balances standardization with flexibility. Discuss the organizational challenges: how do you enforce health check standards without slowing down teams? How do you prevent teams from gaming health checks (always returning healthy to avoid being paged)? Describe advanced patterns like health check composition (aggregating health across service boundaries), multi-level health (healthy/degraded/unhealthy), and integration with chaos engineering (using health checks to verify resilience during fault injection). Reference specific incidents where health check design prevented or caused outages, showing deep understanding of the pattern’s impact on system reliability.

Common Interview Questions

Q1: How would you implement health checks for a service that depends on a database and a cache?

60-second answer: Expose two endpoints: /health/live (liveness) and /health/ready (readiness). Liveness just returns 200 if the process is running—no dependency checks. Readiness checks both database and cache with 2-second timeouts: execute SELECT 1 against the database and PING against Redis. If either fails, return 503. Cache the result for 30 seconds to prevent health check storms. Return JSON with per-dependency status for debugging.

2-minute answer: Start with the liveness endpoint—this should be minimal, just verifying the process can handle HTTP requests. No dependency checks here because you don’t want to restart the service just because the database is temporarily slow. For readiness, implement checks for both dependencies but with important safeguards. First, run checks in parallel with 2-second timeouts—if the database hangs, you don’t want the health check to hang. Second, cache results for 30 seconds and serve cached responses to subsequent health check requests. This prevents the scenario where 10 load balancers polling every 5 seconds create 120 database queries per minute just for health checks. Third, implement circuit breakers: if the database check fails 5 times in a row, stop checking it for 60 seconds and immediately return unhealthy. This prevents health check traffic from overwhelming an already-struggling database. Finally, return detailed JSON: {"status": "healthy", "checks": {"database": {"status": "healthy", "latency_ms": 15}, "cache": {"status": "healthy", "latency_ms": 3}}}. When debugging a production incident, this detail is invaluable.
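A minimal sketch of that readiness endpoint, assuming stand-in check functions in place of real SELECT 1 and PING calls. The framework-agnostic readiness() returns a status code and a JSON-ready dict that a real HTTP handler would serialize; the timeout and TTL values mirror the numbers above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CHECK_TIMEOUT = 2.0   # seconds allowed per dependency check
CACHE_TTL = 30.0      # seconds to serve a cached result

_cache = {"expires": 0.0, "result": None}
# max_workers=4 leaves room for the nested submits below (2 outer + 2 inner)
_pool = ThreadPoolExecutor(max_workers=4)

def check_database():
    return True  # stand-in for: cursor.execute("SELECT 1")

def check_cache():
    return True  # stand-in for: redis_client.ping()

def _timed_check(fn):
    start = time.monotonic()
    try:
        ok = bool(_pool.submit(fn).result(timeout=CHECK_TIMEOUT))
    except Exception:   # timeout and check failure both count as unhealthy
        ok = False
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"status": "healthy" if ok else "unhealthy", "latency_ms": latency_ms}

def readiness():
    now = time.monotonic()
    if _cache["result"] is not None and now < _cache["expires"]:
        return _cache["result"]   # cached result: no dependency traffic
    futures = {name: _pool.submit(_timed_check, fn)   # checks run in parallel
               for name, fn in (("database", check_database), ("cache", check_cache))}
    checks = {name: f.result() for name, f in futures.items()}
    healthy = all(c["status"] == "healthy" for c in checks.values())
    result = (200 if healthy else 503,
              {"status": "healthy" if healthy else "unhealthy", "checks": checks})
    _cache["result"] = result
    _cache["expires"] = now + CACHE_TTL
    return result
```

The cache here is process-local and deliberately simple; the key property is that within any 30-second window, dependency checks run at most once no matter how many load balancers poll the endpoint.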

Red flags: Saying you’d check dependencies in the liveness probe (causes restart loops). Not mentioning timeouts (health checks can hang). Checking dependencies synchronously without caching (creates overhead). Returning 200 with "status": "unhealthy" in JSON (load balancers only look at status codes).

Q2: Your health checks are causing 10% of your database load during an incident. What went wrong and how do you fix it?

60-second answer: The health checks aren’t caching results and are overwhelming the database. Fix it by caching health check results for 30-60 seconds, so repeated health check requests serve cached responses instead of hitting the database every time. Also implement circuit breakers that stop checking the database after repeated failures, preventing health check traffic from making the incident worse.

2-minute answer: This is a classic health check storm. Here’s what happened: you have 50 service instances, 3 load balancers, and 5 monitoring systems, each polling health every 5 seconds. That’s 50 × 8 × (1/5) = 80 health checks per second. Each health check queries the database, so you’re adding 80 queries/second just for health checks. During the incident, database queries are slow (5 seconds instead of 50ms), so health checks start timing out and retrying. With 3 retries, the rate triples to 240 health check queries per second, and because each query now takes 5 seconds, hundreds of them are in flight at once, consuming a significant portion of your database connection pool. The fix has three parts. First, cache health check results for 30-60 seconds—the first health check hits the database, subsequent checks serve the cached result. This reduces 80 queries/second to roughly 1-2 queries/second (one uncached check per instance per TTL window). Second, implement circuit breakers: after 5 consecutive database check failures, stop checking for 60 seconds and immediately return unhealthy. This caps health check traffic to the struggling database during incidents. Third, reduce health check frequency during incidents—if your monitoring system detects elevated error rates, automatically back off health check frequency from every 5 seconds to every 30 seconds. These changes together reduce health check load by 95%+ during incidents.
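The circuit-breaker safeguard from this answer can be sketched as a small wrapper around a dependency check. The thresholds mirror the numbers above (5 consecutive failures, 60-second cooldown), but the class and its names are illustrative, not from any specific library.

```python
import time

class HealthCheckBreaker:
    """Wraps a dependency check; after repeated failures, fail fast
    for a cooldown period instead of hitting the dependency again."""

    def __init__(self, check_fn, failure_threshold=5, cooldown=60.0):
        self.check_fn = check_fn
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0   # while open, skip real checks entirely

    def check(self):
        now = time.monotonic()
        if now < self.open_until:
            return False         # breaker open: no traffic to the dependency
        try:
            ok = bool(self.check_fn())
        except Exception:
            ok = False
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = now + self.cooldown
                self.failures = 0
        return ok
```

During an incident this is the piece that breaks the feedback loop: once the breaker opens, health check polls keep getting an immediate unhealthy answer while the database sees zero additional load.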

Red flags: Not recognizing this as a health check storm. Suggesting to just increase database capacity (doesn’t address root cause). Not mentioning caching or circuit breakers. Proposing to disable health checks entirely (loses the benefit of automated traffic management).

Q3: Should health checks fail if a non-critical dependency like a recommendation service is down?

60-second answer: No. Health checks should only fail when the service can’t fulfill its core SLO. If your service can serve 95% of requests without the recommendation service (just with degraded experience), keep reporting healthy. Use a separate “degraded” status or metrics to signal the issue without removing the service from rotation. This prevents cascading failures where one non-critical service failure takes down everything.

2-minute answer: This depends on your SLO and the dependency’s criticality. If your service promises “return user profile data” and the recommendation service just adds “suggested products” to the response, then no—health checks should not fail when recommendations are down. You can still fulfill your core SLO (return profile data), just with a degraded experience. Failing health checks would remove your service from rotation unnecessarily, reducing overall availability. Instead, implement a three-state health model: healthy (all dependencies up), degraded (core dependencies up, non-critical dependencies down), unhealthy (can’t serve traffic). Load balancers keep degraded instances in rotation, possibly with reduced weight. Monitoring systems alert on degraded status but don’t page on-call engineers immediately. However, if the recommendation service is critical to your SLO—say you promise “return profile with personalized recommendations”—then yes, fail health checks when it’s down. The decision framework: ask “can we meet our SLO without this dependency?” If yes, don’t fail health checks. If no, fail them. Stripe uses this pattern: their payment API reports degraded when read replicas are down (can still process payments, just with reduced capacity) but unhealthy when the primary database is down (can’t process payments at all).

Red flags: Saying all dependencies should cause health check failures (causes cascading outages). Not distinguishing between critical and non-critical dependencies. Not mentioning SLOs as the decision framework. Suggesting to remove health checks entirely for non-critical dependencies (loses visibility).

Q4: How do you prevent health check false positives during legitimate traffic spikes?

60-second answer: Use dynamic thresholds based on recent performance instead of static thresholds. For example, fail health checks if latency exceeds 2x the p99 from the last 10 minutes, not if it exceeds a fixed 100ms. This adapts to current load. Also implement percentile-based checks—fail if p99 latency is high, not if average latency is high, since averages can look good even when some requests are timing out.

2-minute answer: False positives during traffic spikes happen when health checks use static thresholds that don’t account for legitimate load increases. For example, if your health check fails when database query latency exceeds 100ms, you’ll get false positives during Black Friday when latency legitimately increases to 150ms due to higher load. The solution is dynamic thresholds. Instead of “fail if latency > 100ms,” use “fail if latency > 2x recent p99.” Calculate the p99 latency over the last 10 minutes and fail health checks only if current latency exceeds 2x that value. During traffic spikes, the baseline increases, so you don’t get false positives. Also use percentile-based checks, not averages. A service might have average latency of 50ms (looks healthy) but p99 latency of 5 seconds (actually unhealthy). Check p99 or p95 latency, not average. Finally, require multiple consecutive failures before reporting unhealthy. A single slow health check might be a transient blip; three consecutive slow checks indicate a real problem. Twitter uses this approach: their health checks fail only if p99 latency exceeds 2x the recent baseline for 3 consecutive checks. This reduced false positives by 90% while still catching real degradation.

Red flags: Only suggesting to increase static thresholds (doesn’t solve the root problem). Not mentioning percentiles (averages hide problems). Suggesting to disable health checks during known traffic spikes (loses protection during the most critical time).

Q5: How would you design health checks for a system with 500 microservices?

60-second answer: Create a standard health check library that all services use, providing consistent implementation of liveness/readiness endpoints, caching, timeouts, and circuit breakers. Build a centralized health dashboard that aggregates health status across all services. Enforce standards through CI/CD—deployments fail if health checks aren’t implemented correctly. Provide service-specific extension points for custom dependency checks while keeping the framework consistent.

2-minute answer: At this scale, you need organizational patterns, not just technical ones. Start with a standard health check library (in each language your company uses) that implements best practices: liveness/readiness endpoints, 30-second result caching, 2-second timeouts, circuit breakers, detailed JSON responses. Services import this library and just configure which dependencies to check—the library handles all the complexity. Enforce this through CI/CD: your deployment pipeline runs automated tests that verify health endpoints exist, return proper status codes, and respond within SLA. Services that don’t pass these tests can’t deploy. Build a centralized health dashboard that polls all services and shows system-wide health, with drill-down to individual services and dependencies. This gives you a single pane of glass for health across 500 services. Implement health check composition: services can include their dependencies’ health in their own health response, creating a health dependency tree. This helps identify root causes—if 50 services report unhealthy, you can trace back to the one database that’s actually down. Finally, integrate health checks with your incident response system: when health checks fail, automatically create incidents, page on-call engineers, and trigger runbooks. The key is making health checks so easy to implement correctly that teams have no excuse not to do it. Netflix and Google both use this pattern—centralized framework, decentralized implementation, automated enforcement.
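One way the shared library’s extension point might look, as a hypothetical sketch: the framework owns the status logic and response format, and each team only registers its own dependency checks. The API shown is invented for illustration; a production version would also bake in the caching, timeouts, and circuit breakers discussed above.

```python
class HealthFramework:
    """Shared framework: owns status logic, teams plug in checks."""

    def __init__(self):
        self.checks = {}

    def register(self, name, critical=True):
        """Decorator teams use to declare a dependency check."""
        def wrap(fn):
            self.checks[name] = (critical, fn)
            return fn
        return wrap

    def readiness(self):
        results, status = {}, "healthy"
        for name, (critical, fn) in self.checks.items():
            try:
                up = bool(fn())
            except Exception:
                up = False
            results[name] = "up" if up else "down"
            if not up:
                if critical:
                    status = "unhealthy"
                elif status == "healthy":
                    status = "degraded"
        code = 503 if status == "unhealthy" else 200
        return code, {"status": status, "checks": results}

# A team's service then only declares its own dependencies:
health = HealthFramework()

@health.register("database", critical=True)
def db_check():
    return True   # stand-in for a real SELECT 1

@health.register("recommendations", critical=False)
def rec_check():
    return False  # simulate the non-critical dependency being down
```

The point of the decorator shape is that the easy path is the correct path: teams never touch status codes or response formats, so the contract the CI/CD pipeline enforces stays uniform across all 500 services.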

Red flags: Suggesting each team implement health checks from scratch (leads to inconsistency). Not mentioning enforcement mechanisms (standards without enforcement don’t work). Proposing a centralized team that implements health checks for everyone (doesn’t scale). Not discussing how to handle the organizational challenge of getting 500 teams to adopt standards.

Red Flags to Avoid

Red flag 1: “Health checks should query the database on every request to ensure it’s working.” Why it’s wrong: This creates massive overhead and can cause cascading failures during incidents. If you have 100 load balancers polling every second, that’s 100 database queries per second just for health checks. During a database slowdown, health checks pile up and make the problem worse. What to say instead: “Health checks should cache results for 30 seconds and use aggressive timeouts with circuit breakers. The first health check queries the database, subsequent checks serve cached results. This bounds overhead even during incidents.”

Red flag 2: “Liveness and readiness checks should be the same—just check if all dependencies are working.” Why it’s wrong: This causes restart loops. If your liveness check fails because the database is slow, Kubernetes kills your pod, which doesn’t fix the database. The new pod starts, hits the same slow database, gets killed again. What to say instead: “Liveness checks should be minimal—only check for unrecoverable process-level failures like deadlocks. Readiness checks verify dependencies. This separation prevents restart loops during transient dependency issues.”

Red flag 3: “Health checks should always return 200 OK and put the actual status in the JSON body.” Why it’s wrong: Most load balancers only look at HTTP status codes, not response bodies. If you return 200 with {"status": "unhealthy"}, the load balancer sees 200 and keeps sending traffic to the unhealthy instance. What to say instead: “Health checks should return HTTP 503 for unhealthy, HTTP 200 for healthy. Put detailed status in the JSON body for human debugging, but make sure the status code reflects health. Status codes are the universal signal that all load balancers understand.”

Red flag 4: “We don’t need health checks because we have monitoring and alerting.” Why it’s wrong: Monitoring is reactive—it detects problems after they’ve affected users and alerts humans who take minutes to respond. Health checks are proactive—they detect problems before users notice and enable automation (load balancers, orchestrators) to react in seconds. What to say instead: “Health checks and monitoring serve different purposes. Health checks enable automated remediation—load balancers remove unhealthy instances in seconds. Monitoring provides visibility and alerts humans. You need both. Health checks are the difference between 5 seconds of errors and 5 minutes.”

Red flag 5: “Health checks should test every possible failure mode to be thorough.” Why it’s wrong: Overly thorough health checks are slow, expensive, and create false positives. If your health check tests 20 different failure modes, it takes 10+ seconds to run, consumes significant resources, and fails for transient issues that don’t actually prevent serving traffic. What to say instead: “Health checks should verify the service’s ability to meet its SLO, not test every possible failure. Check critical dependencies that would prevent serving traffic, use aggressive timeouts, and align health with your SLO. If you can serve traffic without the cache, don’t fail health checks when Redis is down.”


Key Takeaways

  • Separate liveness from readiness: Liveness checks determine restart decisions and should only fail for unrecoverable process failures. Readiness checks determine traffic routing and should verify critical dependencies. Mixing these causes restart loops during transient issues.

  • Cache aggressively and use timeouts: Cache health check results for 30 seconds to prevent health check storms. Every dependency check needs a 1-5 second timeout and circuit breakers that stop checking after repeated failures. Health checks should never consume more than 5% of system resources, even during incidents.

  • Align health with SLOs: Health checks should fail when your service can’t meet its SLO, not before and not after. If you can serve 80% of requests without the cache, don’t fail health checks when Redis is down. Use three-state health (healthy/degraded/unhealthy) to signal partial outages without removing instances from rotation.

  • Make health checks observable: Return detailed JSON with per-dependency status, latency, and error messages. When a service is unhealthy, operators need to know why. Return HTTP 503 for unhealthy (status codes are universal), but include rich details in the response body for debugging.

  • Prevent health checks from causing incidents: Use circuit breakers, caching, and aggressive timeouts to ensure health checks never make incidents worse. During a database slowdown, health check traffic should automatically back off, not pile up and consume all connections. Design health checks to be safe even when the system is failing.

Prerequisites: Understanding Load Balancing is essential since health checks are primarily consumed by load balancers to make traffic routing decisions. Familiarity with Circuit Breaker Pattern helps understand how to prevent health checks from causing cascading failures. Knowledge of SLOs and SLIs is important for aligning health checks with service level objectives.

Related patterns: Heartbeat Pattern is closely related—heartbeats are push-based (service sends “I’m alive” signals) while health checks are pull-based (monitor queries service). Bulkhead Pattern complements health checks by isolating failures. Graceful Degradation explains how to handle partial failures that health checks detect.

Next steps: After understanding health checks, explore Chaos Engineering to learn how to verify that health checks correctly detect failures during fault injection. Study Observability to understand how health checks fit into broader monitoring strategies. Learn about Blue-Green Deployment and Canary Deployment to see how health checks enable safe deployment strategies.