Health Monitoring: Liveness & Readiness Probes

intermediate 30 min read Updated 2026-02-11

TL;DR

Health monitoring continuously verifies that system components are alive and functioning correctly by checking their ability to process requests. Unlike metrics that measure performance, health checks answer a binary question: “Is this service ready to handle traffic?” This determines routing decisions, auto-scaling actions, and incident alerts.

Cheat Sheet: Liveness (is it running?) vs Readiness (can it serve traffic?). Push-based heartbeats for distributed coordination, pull-based checks for load balancers. Shallow checks (<100ms) for routing, deep checks for diagnostics. Always include dependency health with circuit breakers to prevent cascading failures.

The Analogy

Think of health monitoring like a hospital’s triage system. When you arrive at an emergency room, the nurse doesn’t run a full battery of tests—they quickly check your pulse, breathing, and consciousness to decide if you need immediate attention or can wait. That’s a shallow health check. If you’re critical, doctors perform deeper diagnostics checking organ function, blood chemistry, and vital signs—that’s a deep health check. The triage nurse (load balancer) uses the quick check to route patients (requests) only to doctors (servers) who are ready to help. If a doctor is overwhelmed or their equipment is broken, they’re marked “not ready” even though they’re still alive.

Why This Matters in Interviews

Health monitoring comes up in almost every system design interview when discussing reliability, load balancing, or auto-scaling. Interviewers want to see that you understand the difference between a service being alive versus being ready to serve traffic—many candidates conflate these. They’re looking for you to proactively discuss health check design when you introduce load balancers, service meshes, or orchestration layers. Strong candidates explain the tradeoffs between check frequency, depth, and false positive rates, and connect health monitoring to circuit breakers and graceful degradation. This topic separates engineers who’ve operated production systems from those who’ve only built features.


Core Concept

Health monitoring is the practice of continuously verifying that system components can successfully process requests. At its core, it answers a deceptively simple question: “Should this component receive traffic right now?” This differs fundamentally from metrics monitoring, which tracks how well a system performs (latency, throughput, error rates). Health monitoring makes binary decisions that directly control traffic routing, auto-scaling, and alerting.

The challenge is that “healthy” is context-dependent. A web server might be running (process alive) but unable to serve requests because its database connection pool is exhausted. A cache node might be functional but evicting entries so aggressively due to memory pressure that it’s worse than useless. A payment service might be operational but deliberately rejecting traffic because an upstream fraud detection service is down. Effective health monitoring must capture these nuances while remaining fast enough to make real-time routing decisions.

Modern distributed systems typically implement multiple layers of health checking: shallow checks for immediate routing decisions (“Can this instance handle the next request?”), deep checks for diagnostic purposes (“Why is this service degraded?”), and synthetic checks that simulate real user workflows (“Can a user actually complete a purchase end-to-end?”). Each layer serves different consumers—load balancers, orchestrators, on-call engineers—and requires different tradeoffs between accuracy, latency, and resource consumption.

How It Works

Step 1: Health Check Endpoint Registration Each service exposes a dedicated health check endpoint (commonly /health, /healthz, or /ready) that returns HTTP 200 for healthy and 503 for unhealthy. The service registers this endpoint with its orchestrator (Kubernetes, ECS) and load balancer during startup. The endpoint is typically unauthenticated and exempt from application-level rate limiting, so that health checks keep responding even when application traffic is throttled or saturated.
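As a hedged sketch (standard-library Python only, no particular framework's API), a service exposing separate liveness and readiness endpoints might look like:

```python
# Minimal sketch of Step 1: a /healthz (liveness) and /ready (readiness)
# endpoint. HealthState and serve() are illustrative names, not from any
# specific product.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthState:
    """Process-wide readiness flag, flipped by startup/shutdown hooks."""
    ready = False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: if this handler runs at all, the process is alive.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: 200 only once dependencies are confirmed available.
            self.send_response(200 if HealthState.ready else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of the access log

def serve(port: int = 0) -> HTTPServer:
    """Start the health server on a background thread; port 0 = ephemeral."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A startup hook would set `HealthState.ready = True` only after connection pools are warmed, so the instance joins rotation ready to serve.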

Step 2: Periodic Probing The health checker (load balancer, orchestrator, or dedicated monitoring service) polls the endpoint at a configured interval—typically every 5-30 seconds for production systems. The checker maintains a sliding window of recent results (e.g., last 3 checks) to avoid flapping from transient failures. For example, AWS Application Load Balancers default to 5 consecutive successful checks to mark a target healthy and 2 consecutive failures to mark it unhealthy (both thresholds are configurable).

Step 3: Health Evaluation Logic The service’s health endpoint executes a series of checks, each with a timeout (typically 50-200ms total). A shallow check verifies the process is alive and can handle basic operations—checking that the HTTP server responds, critical threads are running, and memory isn’t exhausted. A readiness check additionally verifies dependencies: database connections are available, cache is responsive, required downstream services are reachable. The endpoint returns unhealthy if any critical check fails.
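A sketch of Step 3's aggregation logic, assuming each check is a zero-argument callable and using the 100ms budget from the text (the check names and `evaluate_readiness` helper are illustrative):

```python
# Hedged sketch: run each readiness check with a shared per-check time
# budget and aggregate into one verdict. Any exception, False return, or
# budget overrun counts as a failed check.
from concurrent.futures import ThreadPoolExecutor

def evaluate_readiness(checks, timeout_s=0.1):
    """checks: dict mapping check name -> zero-arg callable returning bool."""
    pool = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    results = {}
    for name, future in futures.items():
        try:
            results[name] = bool(future.result(timeout=timeout_s))
        except Exception:  # timeout, connection error, etc.
            results[name] = False
    pool.shutdown(wait=False)  # never block the probe on a hung check
    return all(results.values()), results
```

An HTTP handler would map the boolean verdict to 200/503; the per-check results can go in the response body for diagnostics.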

Step 4: Routing Decision Based on the health status, the load balancer or service mesh updates its routing table. Unhealthy instances are immediately removed from the active pool and stop receiving new requests. In-flight requests may be allowed to complete (graceful) or terminated immediately (aggressive), depending on configuration. The orchestrator may attempt to restart unhealthy instances or scale up replacements.

Step 5: Recovery and Re-registration Once an instance recovers and passes the required number of consecutive health checks, it’s added back to the active pool. Some systems implement a “warming” period where recovered instances receive gradually increasing traffic to avoid overwhelming them. Throughout this process, health check results are logged and emitted as metrics for alerting and post-incident analysis.

Health Check Flow: From Probe to Routing Decision

sequenceDiagram
    participant LB as Load Balancer
    participant Inst as Service Instance
    participant DB as Database
    participant Pool as Connection Pool
    
    Note over LB,DB: Every 10 seconds
    LB->>Inst: 1. GET /ready
    activate Inst
    Inst->>Pool: 2. Acquire connection
    Pool-->>Inst: Connection available
    Inst->>DB: 3. SELECT 1 (timeout: 100ms)
    DB-->>Inst: Success
    Inst->>Pool: 4. Release connection
    Inst-->>LB: 5. HTTP 200 OK
    deactivate Inst
    
    Note over LB: Mark healthy after<br/>2 consecutive successes
    
    LB->>Inst: 6. GET /ready (next check)
    activate Inst
    Inst->>Pool: 7. Acquire connection
    Pool-->>Inst: Pool exhausted!
    Inst-->>LB: 8. HTTP 503 Service Unavailable
    deactivate Inst
    
    Note over LB: Mark unhealthy after<br/>2 consecutive failures
    Note over LB: Remove from rotation<br/>Stop sending traffic

A typical health check sequence showing how load balancers probe service instances, how instances verify dependency health (database connectivity), and how consecutive failures trigger removal from rotation. Note the hysteresis: 2 successes to mark healthy, 2 failures to mark unhealthy.

Key Principles

Principle 1: Separate Liveness from Readiness Liveness checks answer “Is the process alive?” while readiness checks answer “Can it serve traffic?” A service can be alive but not ready—for example, during startup while loading configuration, or when dependencies are unavailable. Kubernetes distinguishes these explicitly: failed liveness checks trigger container restarts, while failed readiness checks only remove the pod from service endpoints. Conflating these leads to restart loops when a service is alive but waiting for a database to recover. Netflix’s Eureka service registry maintains this distinction, allowing services to remain registered (alive) while marking themselves OUT_OF_SERVICE (not ready) during maintenance windows.

Principle 2: Health Checks Must Be Faster Than Timeouts A health check that takes 5 seconds is useless for a load balancer with a 3-second request timeout. The check must complete in a fraction of your SLA—typically under 100ms for user-facing services. This means shallow checks only: verify the process responds, check critical thread pools aren’t deadlocked, confirm memory isn’t exhausted. Deep diagnostics (query database, call downstream services) belong in separate endpoints used for debugging, not routing. Google’s SRE practices recommend health checks consume less than 1% of system resources and complete in under 50ms for services with sub-second latency requirements.

Principle 3: Include Dependency Health with Circuit Breakers A service is only as healthy as its critical dependencies. If your API depends on a database, the health check should verify the database is reachable—but with a circuit breaker to prevent cascading failures. If the database is down, the first few health checks will fail slowly (timing out), causing all instances to be marked unhealthy simultaneously. A circuit breaker opens after N consecutive failures, causing subsequent checks to fail fast without actually calling the database. This prevents a database outage from taking down your entire service fleet. Stripe’s payment API implements this pattern: when their fraud detection service is unhealthy, payment services remain healthy but operate in a degraded mode with simplified fraud rules.
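A minimal sketch of that circuit breaker, with illustrative thresholds (3 consecutive failures to open, 30 seconds before a half-open retry); `probe` stands in for whatever dependency call the health check makes:

```python
# Hedged sketch of a circuit breaker around a dependency check. While the
# breaker is open, check() fails fast without touching the dependency.
import time

class HealthCheckBreaker:
    def __init__(self, probe, threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.probe = probe            # zero-arg callable hitting the dependency
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def check(self) -> bool:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return False          # open: fail fast, no dependency call
            self.opened_at = None     # half-open: allow one trial probe
        try:
            ok = self.probe()
        except Exception:
            ok = False
        if ok:
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # trip the breaker
        return False
```

A failed half-open trial trips the breaker again immediately, which is the conventional behavior: one slow probe per cooldown window, not one per health check.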

Principle 4: Avoid False Positives Through Hysteresis A single failed health check shouldn’t remove an instance from rotation—transient network blips, garbage collection pauses, or brief CPU spikes happen. Implement hysteresis by requiring N consecutive failures (typically 2-3) before marking unhealthy, and M consecutive successes (typically 2-5) before marking healthy again. The asymmetry (M > N) prevents flapping: it’s better to be conservative about adding instances back than to rapidly oscillate. AWS ELB uses 2 failures / 10 successes by default. The failure threshold should be low enough to detect real problems quickly (under 30 seconds) but high enough to tolerate transient issues.
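The hysteresis rule can be sketched as a tiny state machine; the 2-failure / 5-success thresholds below are illustrative values from the ranges above:

```python
# Sketch of hysteresis: N consecutive failures to mark unhealthy,
# M consecutive successes (M > N) to mark healthy again.
class HysteresisTracker:
    def __init__(self, fail_threshold=2, pass_threshold=5):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.healthy = True
        self.streak = 0  # consecutive results disagreeing with current state

    def record(self, check_passed: bool) -> bool:
        """Feed one probe result; returns the (possibly updated) state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state: reset
        else:
            self.streak += 1
            needed = (self.fail_threshold if self.healthy
                      else self.pass_threshold)
            if self.streak >= needed:
                self.healthy = check_passed
                self.streak = 0
        return self.healthy
```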

Principle 5: Health Checks Are Not Metrics Health checks drive binary routing decisions; metrics measure system behavior. Don’t try to make health checks do both. A service experiencing elevated latency (P99 = 2s instead of 500ms) might still be healthy enough to serve traffic—removing it from rotation could make things worse by overloading remaining instances. Instead, health checks should only fail when the service cannot fulfill requests (database unreachable, out of memory, deadlocked). Use metrics and alerts to detect degradation, use health checks to prevent routing traffic to broken instances. Twitter’s approach: health checks verify “can respond,” while their observability stack monitors “responding well.”

Liveness vs Readiness: Separate Concerns

graph TB
    subgraph LivenessCheck["Liveness Check (/healthz)"]
        L1["Process Running?"] --> L2["HTTP Server Responsive?"]
        L2 --> L3["Critical Threads Alive?"]
        L3 --> L4["Memory Not Exhausted?"]
        L4 --> LResult{All Pass?}
    end
    
    subgraph ReadinessCheck["Readiness Check (/ready)"]
        R1["Liveness Checks"] --> R2["Database Reachable?"]
        R2 --> R3["Cache Responsive?"]
        R3 --> R4["Connection Pool Available?"]
        R4 --> RResult{All Pass?}
    end
    
    LResult -->|Yes| LiveAction["Return 200<br/>Process is alive"]
    LResult -->|No| LiveFail["Return 503<br/>⚠️ Restart container"]
    
    RResult -->|Yes| ReadyAction["Return 200<br/>Route traffic here"]
    RResult -->|No| ReadyFail["Return 503<br/>Remove from rotation<br/>(but don't restart)"]

Liveness checks verify the process is alive and should almost never fail—failures trigger container restarts. Readiness checks additionally verify dependencies and determine traffic routing. A service can be alive but not ready (e.g., during database outage), preventing restart loops while correctly removing it from rotation.


Deep Dive

Types / Variants

Shallow Health Checks (Liveness Probes) Shallow checks verify the process is alive and the HTTP server is responsive, typically by returning a hardcoded 200 OK response. These execute in under 10ms and consume minimal resources. Use shallow checks for liveness probes in orchestrators—they determine whether to restart a container. Kubernetes liveness probes commonly hit /healthz endpoints that do nothing but return success. The advantage is speed and reliability: they can’t fail due to external dependencies. The disadvantage is they don’t catch many real failure modes—a service with a deadlocked thread pool or exhausted database connections will still pass. Example: A Node.js service that returns res.status(200).send('OK') without any logic.

Deep Health Checks (Readiness Probes) Deep checks verify the service can actually fulfill requests by testing critical dependencies: database connectivity, cache availability, downstream service reachability. These might take 50-200ms and execute actual queries (e.g., SELECT 1 against the database). Use deep checks for readiness probes that control load balancer routing. They catch real failure modes: if your database connection pool is exhausted, the health check will fail and traffic will be routed elsewhere. The tradeoff is complexity and potential for cascading failures—if your health check calls a downstream service that’s slow, all your instances might be marked unhealthy simultaneously. Example: Uber’s services check that they can acquire a database connection from the pool and that their Redis cache responds to a PING within 100ms.

Synthetic Health Checks (End-to-End Probes) Synthetic checks simulate real user workflows by executing complete transactions: creating a test order, processing a dummy payment, querying for a specific record. These run less frequently (every 1-5 minutes) and are typically executed by external monitoring services, not load balancers. They catch issues that component-level checks miss: misconfigured routing, broken authentication flows, data corruption. The downside is they’re slow (seconds), expensive (consume real resources), and can generate false positives from test data issues. Example: Stripe runs synthetic transactions through their entire payment pipeline every minute, creating test charges that exercise card validation, fraud detection, and settlement systems.

Heartbeat-Based Health (Push Model) Instead of being polled, services actively send heartbeats to a central coordinator (e.g., every 5 seconds). If heartbeats stop, the service is presumed dead. This is common in distributed coordination systems like Apache ZooKeeper or Consul, where services maintain ephemeral nodes that disappear if heartbeats cease. The advantage is immediate detection of network partitions and process crashes—no waiting for the next poll interval. The disadvantage is the service must remain active enough to send heartbeats, and the coordinator becomes a critical dependency. Example: Netflix Eureka uses heartbeats where services send renewal requests every 30 seconds; if 3 consecutive renewals are missed (90 seconds), the instance is evicted from the registry.
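The push model reduces to a registry that tracks last-seen timestamps and evicts silent instances. A hedged sketch (the 90s TTL mirrors the Eureka example; the `HeartbeatRegistry` API itself is illustrative):

```python
# Sketch of push-based health: services call heartbeat(); the registry
# evicts anything silent for longer than `ttl` seconds.
import time

class HeartbeatRegistry:
    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.last_seen = {}  # instance id -> timestamp of last heartbeat

    def heartbeat(self, instance_id: str):
        self.last_seen[instance_id] = self.clock()

    def live_instances(self):
        """Evict silent instances, then return those still registered."""
        now = self.clock()
        for instance in [i for i, t in self.last_seen.items()
                         if now - t > self.ttl]:
            del self.last_seen[instance]
        return sorted(self.last_seen)
```

A production registry would also need Eureka-style self-preservation: if most instances go silent at once, suspect the network, not the fleet, and stop evicting.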

Passive Health Checks (Failure Detection) Rather than actively probing, passive checks infer health from actual request outcomes. If a backend returns 5xx errors or times out repeatedly, it’s marked unhealthy without explicit health checks. This is common in service meshes like Envoy and Linkerd, which track error rates and latencies for each upstream instance. The advantage is zero overhead—you’re already processing requests. The disadvantage is you need traffic to detect failures, and you might route several failed requests before marking an instance unhealthy. Example: Envoy’s outlier detection marks an upstream host unhealthy if it returns 5 consecutive 5xx errors, then ejects it from the load balancing pool for 30 seconds before retrying.
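A sketch of passive failure detection in the spirit of the consecutive-5xx rule described above (the thresholds and `OutlierDetector` API are illustrative, not Envoy's actual configuration surface):

```python
# Hedged sketch of outlier detection: after N consecutive 5xx responses
# a host is ejected from routing for a cooldown period, then readmitted.
import time

class OutlierDetector:
    def __init__(self, consecutive_5xx=5, ejection_s=30.0,
                 clock=time.monotonic):
        self.limit = consecutive_5xx
        self.ejection_s = ejection_s
        self.clock = clock
        self.errors = {}      # host -> consecutive 5xx count
        self.ejected = {}     # host -> ejection timestamp

    def record(self, host: str, status: int):
        """Feed the outcome of a real request routed to `host`."""
        if 500 <= status < 600:
            self.errors[host] = self.errors.get(host, 0) + 1
            if self.errors[host] >= self.limit:
                self.ejected[host] = self.clock()
        else:
            self.errors[host] = 0  # any success resets the streak

    def is_routable(self, host: str) -> bool:
        ejected_at = self.ejected.get(host)
        if ejected_at is None:
            return True
        if self.clock() - ejected_at >= self.ejection_s:
            del self.ejected[host]  # cooldown over: readmit the host
            self.errors[host] = 0
            return True
        return False
```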

Health Check Types: Push vs Pull Models

graph LR
    subgraph PullBased["Pull-Based (Load Balancer Polling)"]
        LB["Load Balancer"] -->|"1. Poll every 10s"| S1["Service A"]
        LB -->|"2. Poll every 10s"| S2["Service B"]
        LB -->|"3. Poll every 10s"| S3["Service C"]
        S1 & S2 & S3 -.->|"HTTP 200/503"| LB
    end
    
    subgraph PushBased["Push-Based (Heartbeat)"]
        Svc1["Service X"] -->|"1. Heartbeat every 30s"| Registry["Service Registry<br/>(Eureka/Consul)"]
        Svc2["Service Y"] -->|"2. Heartbeat every 30s"| Registry
        Svc3["Service Z"] -->|"3. Heartbeat every 30s"| Registry
        Registry -.->|"No heartbeat for 90s<br/>= evict from registry"| Svc1
    end
    
    subgraph Passive ["Passive (Failure Detection)"]
        Client["Client"] -->|"Real requests"| Proxy["Service Mesh<br/>(Envoy)"]
        Proxy -->|"Route traffic"| Backend1["Backend 1"]
        Proxy -->|"Route traffic"| Backend2["Backend 2"]
        Backend2 -.->|"5 consecutive 5xx errors"| Proxy
        Proxy -.->|"Mark unhealthy<br/>Eject for 30s"| Backend2
    end

Three health check models: Pull-based (load balancer polls services) scales well and is simple but has detection latency. Push-based (services send heartbeats) enables immediate failure detection but requires services to remain active. Passive (infer from actual requests) has zero overhead but needs traffic to detect failures.

Trade-offs

Check Frequency: Fast Detection vs Resource Overhead Checking every 1 second enables sub-second failure detection but generates significant load: 1000 instances × 1 check/sec = 1000 QPS of health-check traffic across the fleet. Checking every 30 seconds reduces overhead but means up to 30 seconds of failed requests before an instance is removed. The decision framework: For user-facing services with strict SLAs, check every 5-10 seconds and accept the overhead. For internal services or batch workloads, 30-60 seconds is acceptable. For systems with thousands of instances, consider hierarchical checking where a subset of instances are checked frequently and outliers trigger deeper investigation. Netflix checks every 30 seconds for most services but every 5 seconds for critical path services like their API gateway.

Check Depth: Accuracy vs Latency Shallow checks (process alive?) complete in 10ms but miss most real failures. Deep checks (dependencies healthy?) catch real issues but take 100-200ms and can cause cascading failures if dependencies are slow. The decision framework: Use shallow checks for liveness probes and initial routing decisions. Use deep checks for readiness probes but implement aggressive timeouts (50-100ms) and circuit breakers. For critical dependencies, check connectivity but not full functionality—verify you can acquire a database connection, not that you can execute a complex query. Google’s approach: health checks test that the service can attempt to fulfill a request (connections available, threads not deadlocked) but don’t actually execute business logic.

Failure Threshold: False Positives vs Detection Speed Marking unhealthy after 1 failed check enables 5-second detection (with 5s intervals) but causes flapping from transient issues. Requiring 3 consecutive failures prevents flapping but delays detection to 15 seconds. The decision framework: Set the threshold based on your tolerance for false positives. For stateless services that scale horizontally, be aggressive (2 failures) since removing a healthy instance temporarily is low cost. For stateful services or those that are expensive to restart, be conservative (3-5 failures). Always use a higher threshold for marking healthy again (5-10 successes) to prevent oscillation. AWS’s recommendation: 2 failures for marking unhealthy, 10 successes for marking healthy.

Push vs Pull: Immediate Detection vs Scalability Push-based heartbeats (services send “I’m alive” messages) enable immediate detection when heartbeats stop but require services to remain active and create a central coordinator bottleneck. Pull-based checks (load balancer polls services) scale better and don’t require services to know about the checker, but have detection latency equal to the poll interval. The decision framework: Use push for distributed coordination where services need to know about each other (service discovery, leader election). Use pull for load balancing and routing decisions where the checker (load balancer) is already centralized. For very large fleets (10,000+ instances), consider hybrid approaches where services push to regional coordinators that are polled by global systems.

Dependency Inclusion: Accuracy vs Blast Radius Including dependency health in checks (“I’m unhealthy if my database is down”) accurately reflects ability to serve traffic but risks cascading failures—one slow dependency can mark your entire fleet unhealthy. Excluding dependencies (“I’m healthy if my process runs”) prevents cascading failures but allows routing traffic to instances that will fail requests. The decision framework: Include critical dependencies (database, cache) but with circuit breakers that fail fast after N consecutive failures. Exclude optional dependencies or implement graceful degradation—if the recommendation service is down, the product page is still healthy but shows fewer recommendations. Stripe’s pattern: payment services check that fraud detection is reachable but remain healthy even if it’s down, falling back to simpler fraud rules.
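The critical-vs-optional split can be sketched as a small readiness report, echoing the degraded-mode JSON shown in the Uber example later in this article (the dependency names and payload shape here are illustrative):

```python
# Hedged sketch: only critical dependencies flip the verdict; optional
# ones are reported as "degraded" but leave the instance in rotation.
def readiness_report(critical: dict, optional: dict):
    """Each dict maps dependency name -> bool (reachable?).
    Returns (HTTP status code, JSON-serializable body)."""
    degraded = sorted(name for name, up in optional.items() if not up)
    healthy = all(critical.values())
    status_code = 200 if healthy else 503
    body = {"status": "healthy" if healthy else "unhealthy",
            "degraded": degraded}
    return status_code, body
```

The load balancer only looks at the status code; the `degraded` list exists for monitoring and alerting, not routing.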

Common Pitfalls

Pitfall 1: Health Checks That Call Downstream Services Without Circuit Breakers Your API’s health check calls a downstream recommendation service to verify it’s available. The recommendation service becomes slow (5s response times). Now all your API instances’ health checks timeout, marking them unhealthy simultaneously. Your entire API fleet is removed from rotation, causing a complete outage—even though your API could have served traffic without recommendations. This happens because engineers want “accurate” health checks that verify all dependencies, but don’t account for cascading failures. How to avoid: Implement circuit breakers in health check logic. After 3 consecutive failures calling the recommendation service, open the circuit and fail fast without actually calling it. Better yet, make the dependency optional—mark yourself healthy but operate in degraded mode. The health check should verify you can serve traffic, not that you can serve it perfectly.

Pitfall 2: Expensive Health Checks That Consume Significant Resources Your health check executes a complex database query to verify the database is functioning: SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY. With 100 instances checking every 10 seconds, that’s 10 QPS of expensive queries. During a database incident, these health checks contribute to the overload, making recovery harder. This happens because engineers treat health checks like integration tests, verifying full functionality rather than basic connectivity. How to avoid: Health checks should be trivial operations: SELECT 1, PING, acquiring a connection from the pool without executing a query. The check verifies the dependency is reachable, not that it’s performant. Reserve complex queries for synthetic monitoring that runs every few minutes, not health checks that run every few seconds. Netflix’s rule: health checks should consume less than 1% of system capacity.

Pitfall 3: Using the Same Endpoint for Liveness and Readiness Your Kubernetes deployment uses /health for both liveness and readiness probes. The health check verifies database connectivity. During a database outage, all pods fail health checks. Kubernetes interprets this as pods being dead (liveness failure) and restarts them. The new pods also can’t reach the database, so they’re immediately restarted. You’re now in a restart loop, and your service never recovers even after the database comes back. This happens because engineers don’t understand the distinction between “process is alive” and “ready to serve traffic.” How to avoid: Expose separate endpoints: /healthz for liveness (always returns 200 unless the process is truly broken) and /ready for readiness (checks dependencies). Configure Kubernetes liveness probes to hit /healthz and readiness probes to hit /ready. The liveness check should almost never fail—only for unrecoverable errors like out-of-memory or deadlocked threads.

Pitfall 4: No Hysteresis, Causing Flapping Your load balancer checks health every 5 seconds and marks instances unhealthy after a single failure. An instance experiences a brief GC pause (2 seconds), fails one health check, and is removed from rotation. 5 seconds later it passes the check and is added back. This happens repeatedly, causing the instance to flap in and out of rotation, creating inconsistent user experiences and making debugging impossible. This happens because engineers don’t account for transient failures—they assume health is binary and stable. How to avoid: Require multiple consecutive failures (typically 2-3) before marking unhealthy, and multiple consecutive successes (typically 5-10) before marking healthy. The asymmetry is intentional: be quick to remove unhealthy instances but slow to add them back. This prevents flapping while still detecting real failures within 15-30 seconds. Monitor your health check state transitions—if you see frequent flapping, increase your thresholds.

Pitfall 5: Health Checks That Don’t Match Actual Request Paths Your health check verifies the database is reachable by connecting to a read replica. Your application traffic goes to the primary database. The primary fails, but health checks keep passing because replicas are still up. Traffic continues routing to your instances, which fail all requests. This happens because engineers take shortcuts in health checks, testing a “similar” dependency rather than the actual one. How to avoid: Health checks should exercise the same code paths as real requests. If your app reads from the primary, health checks should too. If your app requires both database and cache, health checks should verify both. The check doesn’t need to execute full business logic, but it should touch the same infrastructure. Uber’s pattern: health checks use the same connection pools, client libraries, and configuration as application code, just with trivial operations (SELECT 1 instead of complex queries).

Cascading Failure: Health Checks Without Circuit Breakers

sequenceDiagram
    participant LB as Load Balancer
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant DB as Database<br/>(Slow/Down)
    
    Note over DB: Database becomes slow<br/>(5s response time)
    
    par Health Check Round 1
        LB->>I1: GET /ready
        LB->>I2: GET /ready
        LB->>I3: GET /ready
    end
    
    par All instances check DB
        I1->>DB: SELECT 1
        I2->>DB: SELECT 1
        I3->>DB: SELECT 1
    end
    
    Note over I1,I3: All health checks timeout<br/>after 5 seconds
    
    par All return unhealthy
        I1-->>LB: 503 (timeout)
        I2-->>LB: 503 (timeout)
        I3-->>LB: 503 (timeout)
    end
    
    Note over LB: ❌ ALL instances unhealthy<br/>Complete service outage!
    
    rect rgb(255, 200, 200)
        Note over LB,DB: Without circuit breakers,<br/>one slow dependency takes down<br/>the entire service fleet
    end

Without circuit breakers, a slow database causes all health checks to timeout simultaneously, marking all instances unhealthy and causing a complete outage. The solution: after N consecutive failures, open a circuit breaker that fails fast without calling the database, preventing cascading failures.


Math & Calculations

Health Check Overhead Calculation

Given:

  • N = number of service instances
  • F = health check frequency (checks per second)
  • L = latency of each health check (seconds)
  • C = number of checkers (load balancers, orchestrators)

Total QPS to health endpoints: QPS = N × F × C

For a service with 1000 instances, checked every 10 seconds (F = 0.1) by 3 load balancers: QPS = 1000 × 0.1 × 3 = 300 health checks/second

If each check takes 50ms (L = 0.05), total CPU time consumed: CPU time = 300 × 0.05 = 15 seconds of CPU per second

This means you need at least 15 CPU cores dedicated to handling health checks, or 1.5% of a 1000-core fleet. This is acceptable overhead. However, if checks take 200ms: CPU time = 300 × 0.2 = 60 seconds of CPU per second = 60 cores = 6% overhead

This is excessive. The rule of thumb: health check overhead should be under 1-2% of total system capacity.
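The overhead formulas above condense into a small helper (latency taken in milliseconds so the arithmetic stays exact; the function name is illustrative):

```python
# QPS = N x F x C; CPU cores consumed = QPS x per-check latency.
def health_check_overhead(instances, interval_s, checkers, check_latency_ms):
    """Returns (QPS hitting health endpoints, CPU cores consumed)."""
    qps = instances * checkers / interval_s
    cpu_cores = qps * check_latency_ms / 1000  # CPU-seconds per second
    return qps, cpu_cores
```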

Failure Detection Time Calculation

Given:

  • I = check interval (seconds)
  • T = failure threshold (consecutive failures required)
  • P = processing time to update routing tables (seconds)

Worst-case detection time: Detection = (T × I) + P

For a system with 10-second intervals, 3-failure threshold, and 2-second routing update: Detection = (3 × 10) + 2 = 32 seconds

This means up to 32 seconds of failed requests before the instance is removed. If your SLA is 99.9% (about 43 minutes, or roughly 2,580 seconds, of downtime/month) and you spread that budget evenly across 100 instances, each instance can be down for about 26 seconds/month. A 32-second detection time exceeds an instance's entire monthly error budget in a single failure.

To achieve 10-second detection:

  • Option 1: Reduce interval to 3 seconds: (3 × 3) + 2 = 11 seconds
  • Option 2: Reduce threshold to 1: (1 × 10) + 2 = 12 seconds (but increases false positives)
  • Option 3: Reduce both: (2 × 4) + 2 = 10 seconds
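The detection formula and the three options above, as code (function name illustrative):

```python
# Worst-case detection time: Detection = (T x I) + P.
def worst_case_detection_s(interval_s, failure_threshold, routing_update_s):
    return failure_threshold * interval_s + routing_update_s
```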

False Positive Rate Estimation

If your health check has a 1% chance of transient failure (network blip, GC pause) and you require 3 consecutive failures: False positive rate = 0.01³ = 0.000001 = 0.0001%

With 1000 instances checked every 10 seconds (8,640 checks/day per instance): False positives = 1000 × 8640 × 0.000001 = 8.64 false positives/day

This is acceptable. But with only 1 failure required: False positive rate = 0.01 = 1%

False positives = 1000 × 8640 × 0.01 = 86,400/day

This would cause constant flapping. The math shows why hysteresis (multiple failures required) is critical for stability.
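The same estimate as a helper (the 1% transient failure rate is the text's assumption; the function name is illustrative):

```python
# False positives/day = instances x daily checks x p^T, where p is the
# transient failure probability and T the consecutive-failure threshold.
def false_positives_per_day(instances, interval_s, transient_p, threshold):
    checks_per_day = 86_400 / interval_s   # seconds per day / interval
    fp_rate = transient_p ** threshold      # needs T consecutive blips
    return instances * checks_per_day * fp_rate
```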


Real-World Examples

Netflix: Eureka Service Registry with Heartbeats

Netflix’s Eureka service registry uses a push-based heartbeat model where services send renewal requests every 30 seconds. If Eureka doesn’t receive 3 consecutive renewals (90 seconds), the instance is evicted from the registry. Interestingly, Eureka implements “self-preservation mode”—if more than 15% of instances fail to renew within a time window, Eureka assumes there’s a network partition rather than mass service failure and stops evicting instances. This prevents cascading failures where a network issue causes Eureka to evict healthy instances, which then can’t re-register, causing a complete outage. Netflix learned this lesson the hard way during an AWS availability zone failure in 2012, where Eureka’s aggressive eviction made the incident worse. The self-preservation mode is a sophisticated circuit breaker at the service discovery layer, recognizing that mass simultaneous failures are more likely infrastructure issues than application problems.

Uber: Dependency Health with Graceful Degradation

Uber’s services implement multi-tiered health checks that distinguish between critical and optional dependencies. A ride-matching service’s health check verifies it can connect to the database (critical) and cache (critical), but doesn’t fail if the surge pricing service is unavailable (optional). When surge pricing is down, the service remains healthy and continues matching rides, just without dynamic pricing—falling back to standard rates. This is implemented through a health check that returns 200 with a JSON body indicating degraded mode: {"status": "healthy", "degraded": ["surge-pricing"]}. The load balancer routes traffic normally, but monitoring alerts fire to notify engineers of degraded operation. This pattern allows Uber to maintain core functionality (matching riders with drivers) even when auxiliary services fail, preventing cascading outages. During a major incident in 2018, this design kept 80% of Uber’s functionality operational when several microservices failed.

Google: Shallow Health Checks with Separate Diagnostics

Google’s production services use extremely lightweight health checks for load balancing decisions—typically just verifying the HTTP server responds within 50ms. Deeper diagnostics (database connectivity, downstream service health, resource utilization) are exposed through separate endpoints such as /healthz/ready, consumed by monitoring systems rather than load balancers, while the shallow /healthz/live endpoint backs restart decisions. This separation prevents slow dependencies from affecting routing decisions. Google’s SRE teams discovered that including database checks in load balancer health probes caused cascading failures: when a database became slow, all instances failed health checks simultaneously, causing complete service outages even though instances could have served cached data or degraded responses. The solution was to make load balancer health checks test only “can this instance accept a connection and start processing a request,” while separate monitoring tracks “is this instance performing well.” This pattern is now codified in Kubernetes, which distinguishes liveness probes (shallow) from readiness probes (deeper) and startup probes (for slow-starting applications).
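The split can be sketched as two small handlers. The warmup window, the caller-supplied `check_db` probe, and the route paths are illustrative assumptions, not Google’s implementation:

```python
import time

START_TIME = time.monotonic()
WARMUP_SECONDS = 10  # assumed warmup before this instance should take traffic

def liveness():
    """Shallow probe for routing/restart decisions: no dependency calls,
    fails only if the process itself cannot respond."""
    return 200, "ok"

def readiness(check_db):
    """Deeper probe for monitoring and traffic admission: verifies warmup has
    finished and that a caller-supplied check_db() probe succeeds."""
    if time.monotonic() - START_TIME < WARMUP_SECONDS:
        return 503, "warming up"
    if not check_db():
        return 503, "database unreachable"
    return 200, "ready"

# Route table a minimal HTTP layer could dispatch on.
ROUTES = {"/healthz/live": liveness, "/healthz/ready": readiness}
```

The key property is that `liveness` can never be slowed down by a dependency, so routing and restart decisions stay fast even when the database is struggling.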

Netflix Eureka: Self-Preservation Mode

stateDiagram-v2
    [*] --> Normal: Eureka starts
    
    Normal --> Evaluating: <85% instances<br/>sending heartbeats
    Evaluating --> SelfPreservation: Condition persists<br/>for 15 minutes
    Evaluating --> Normal: Heartbeats resume
    
    SelfPreservation --> Evaluating: >85% instances<br/>sending heartbeats
    
    state Normal {
        [*] --> ReceivingHeartbeats
        ReceivingHeartbeats --> EvictInstance: No heartbeat<br/>for 90s
        EvictInstance --> ReceivingHeartbeats: New instance<br/>registers
    }
    
    state SelfPreservation {
        [*] --> PreservingInstances
        PreservingInstances --> PreservingInstances: Ignore missing<br/>heartbeats
        note right of PreservingInstances
            Assume network partition,
            not mass service failure.
            Keep all instances registered.
        end note
    }
    
    note right of Normal
        Normal operation:
        Evict instances that
        miss 3 consecutive
        heartbeats (90s)
    end note
    
    note right of SelfPreservation
        Self-preservation:
        Stop evicting instances
        to prevent cascading
        failures during network
        partitions
    end note

Netflix Eureka’s self-preservation mode prevents cascading failures during network partitions. If more than 15% of instances stop sending heartbeats, Eureka assumes a network issue rather than mass service failure and stops evicting instances. This sophisticated circuit breaker at the service discovery layer prevents infrastructure issues from causing complete outages.


Interview Expectations

Mid-Level

What You Should Know: Explain the difference between liveness and readiness checks with concrete examples (process alive vs. can serve traffic). Describe basic health check implementation: HTTP endpoint returning 200/503, polled by load balancer every 10-30 seconds, requires 2-3 consecutive failures before marking unhealthy. Understand why health checks should be fast (<100ms) and what they should verify (process responsive, critical resources available). Explain how health checks integrate with load balancers to remove unhealthy instances from rotation.

Bonus Points: Discuss hysteresis (different thresholds for marking unhealthy vs. healthy) and why it prevents flapping. Mention that health checks should include critical dependencies like databases but with timeouts. Explain the tradeoff between check frequency and overhead. Reference Kubernetes liveness/readiness probes or AWS ELB health checks to show familiarity with real systems.

Senior

What You Should Know: Design a complete health monitoring system including multiple check types (shallow for routing, deep for diagnostics, synthetic for end-to-end validation). Explain when to use push-based heartbeats vs. pull-based checks and the tradeoffs. Discuss circuit breakers in health check logic to prevent cascading failures when dependencies are slow. Calculate health check overhead and failure detection time given instance count, check frequency, and thresholds. Describe graceful degradation patterns where services remain healthy but operate in reduced functionality mode when optional dependencies fail.

Bonus Points: Explain passive health checking (inferring health from actual request outcomes) and when it’s preferable to active checks. Discuss self-preservation modes (like Netflix Eureka) that prevent mass evictions during network partitions. Describe how health checks interact with auto-scaling: unhealthy instances trigger scale-up, but you need hysteresis to prevent oscillation. Explain the difference between health checks for stateless vs. stateful services (stateful services need longer grace periods and careful coordination during restarts). Reference specific production incidents where health check design prevented or caused outages.

Staff+

What You Should Know: Architect health monitoring for a multi-region, multi-tenant system with thousands of microservices. Discuss hierarchical health checking strategies that scale to large fleets without overwhelming monitoring infrastructure. Design health check systems that handle partial failures gracefully: some instances unhealthy, some dependencies degraded, some regions unavailable. Explain how to prevent thundering herds when recovering from mass failures (staggered health check schedules, gradual traffic ramping). Describe sophisticated patterns like health-aware load balancing (routing based on instance health score, not just binary healthy/unhealthy) and predictive health monitoring (marking instances unhealthy before they fail based on leading indicators like memory growth or increasing latency).

Distinguishing Signals: Discuss the organizational and operational aspects: how to prevent teams from gaming health checks (marking services healthy to avoid alerts), how to enforce health check standards across hundreds of services, how to balance autonomy (teams define their own health criteria) with consistency (platform provides guardrails). Explain how health monitoring integrates with incident response: automatic mitigation (removing unhealthy instances), escalation (alerting when too many instances are unhealthy), and diagnostics (health check results inform debugging). Describe the evolution of health monitoring as systems scale: simple HTTP checks → dependency-aware checks → circuit breakers → graceful degradation → predictive health. Reference how companies like Google, Netflix, or Uber evolved their health monitoring practices through production incidents.

Common Interview Questions

Q1: How would you design health checks for a service that depends on a database?

60-second answer: Implement separate liveness and readiness checks. Liveness just verifies the process is running—returns 200 always unless truly broken. Readiness checks database connectivity by acquiring a connection from the pool and executing SELECT 1 with a 100ms timeout. Use a circuit breaker: after 3 consecutive database check failures, open the circuit and fail fast without calling the database. This prevents all instances from being marked unhealthy simultaneously during a database outage. Configure the load balancer to require 2 consecutive readiness failures before removing an instance, and 5 consecutive successes before adding it back.

2-minute detailed answer: Start by distinguishing what “healthy” means for this service. Liveness means the process is alive and not deadlocked—this should almost never fail, so implement it as a simple HTTP handler that returns 200. Readiness means the service can fulfill requests, which requires database connectivity. Implement readiness by attempting to acquire a connection from the pool (this verifies the pool isn’t exhausted) and executing a trivial query like SELECT 1 with a strict timeout (50-100ms). Don’t execute complex queries—you’re testing connectivity, not performance. Wrap the database check in a circuit breaker: track the last N check results (e.g., 10), and if more than 3 failed, open the circuit and fail fast without actually calling the database. This prevents cascading failures where a slow database causes all health checks to timeout, marking all instances unhealthy simultaneously. Configure your load balancer with hysteresis: require 2 consecutive failures to mark unhealthy (detects real issues within 20 seconds with 10s intervals) but 5 consecutive successes to mark healthy (prevents flapping). Monitor health check state transitions—if you see frequent flapping, increase thresholds. Finally, consider graceful degradation: if the database is unhealthy but you have a cache, maybe the service should remain healthy but operate in read-only mode.
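A minimal readiness probe along these lines, using sqlite3 as a stand-in for a real connection pool and a worker thread to enforce the 100ms budget. This is a sketch under those assumptions, not a production recipe:

```python
import concurrent.futures
import sqlite3

CHECK_TIMEOUT = 0.1  # 100 ms budget for the whole probe

def db_ready(connect):
    """Readiness probe: acquire a connection and run SELECT 1 under a strict
    timeout. `connect` is any zero-arg callable returning a DB-API connection;
    sqlite3 stands in for a real connection pool here."""
    def probe():
        conn = connect()                    # verifies a connection can be acquired
        try:
            cur = conn.execute("SELECT 1")  # trivial query: tests connectivity, not performance
            return cur.fetchone()[0] == 1
        finally:
            conn.close()

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(probe).result(timeout=CHECK_TIMEOUT)
    except Exception:                       # timeout or database error: not ready
        return False
    finally:
        pool.shutdown(wait=False)           # never let a hung probe block the endpoint
```

Any failure mode—pool exhaustion, a connection error, or the query exceeding the budget—collapses to a single “not ready” answer, which is all the load balancer needs.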

Red flags: Saying health checks should execute complex queries or business logic. Not mentioning circuit breakers when checking dependencies. Using the same endpoint for liveness and readiness. Not considering the impact of health check failures on the entire fleet (cascading failures).

Q2: Your health checks are causing 5% CPU overhead. How do you reduce this?

60-second answer: First, profile the health check to identify what’s expensive—likely database queries or downstream service calls. Replace complex operations with trivial ones: SELECT 1 instead of COUNT(*), connection pool acquisition instead of actual queries, PING instead of GET requests. Reduce check frequency: if checking every 5 seconds, try 15 seconds—this 3x reduction cuts overhead to 1.7%. Implement hierarchical checking: have load balancers check a subset of instances (10%) frequently, and only check all instances if the subset shows problems. Finally, consider passive health checking: infer health from actual request outcomes (error rates, latencies) rather than active probing.

2-minute detailed answer: Start by measuring: instrument your health check endpoint to track execution time and resource consumption. Break down the check into components (database, cache, downstream services) and identify the expensive operations. Common culprits: complex database queries (COUNT(*)), calling downstream services without timeouts, checking multiple dependencies sequentially instead of in parallel. Optimize each component: replace SELECT COUNT(*) FROM large_table with SELECT 1, replace actual downstream service calls with connection checks, add aggressive timeouts (50-100ms). If optimization isn’t enough, reduce frequency: calculate your failure detection time requirement (typically 30-60 seconds is acceptable) and set the interval accordingly. Going from 5-second to 15-second intervals reduces overhead by 67%. For very large fleets (1000+ instances), implement hierarchical checking: the load balancer actively checks 10% of instances every 5 seconds and the rest every 30 seconds. If the frequently-checked subset shows elevated failures, increase checking frequency for all instances. Alternatively, implement passive health checking: track error rates and latencies from actual requests, and mark instances unhealthy if they exceed thresholds (e.g., >5% error rate or P99 > 2x normal). This has zero overhead since you’re already processing requests. The tradeoff is you need traffic to detect failures, so combine passive checking with infrequent active checks (every 60 seconds) as a backstop.
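Passive health checking can be as simple as a sliding window over real request outcomes. A sketch, with the 5% error-rate threshold from above as an assumed default:

```python
from collections import deque

class PassiveHealth:
    """Infer instance health from actual request outcomes instead of active
    probes, so the check itself adds zero overhead."""
    def __init__(self, window=100, max_error_rate=0.05):
        self.outcomes = deque(maxlen=window)   # True = success, False = error
        self.max_error_rate = max_error_rate

    def record(self, success):
        """Call from the request path with each request's outcome."""
        self.outcomes.append(success)

    def healthy(self):
        if not self.outcomes:
            return True                        # no traffic yet: assume healthy
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) <= self.max_error_rate
```

The tradeoff noted above is visible in the code: with an empty window there is nothing to infer from, which is why passive checking needs infrequent active probes as a backstop.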

Red flags: Not measuring before optimizing. Suggesting to just “check less frequently” without calculating failure detection time. Not considering passive health checking. Optimizing the wrong thing (e.g., making the HTTP response smaller when the database query is the bottleneck).

Q3: How do you prevent cascading failures when a dependency becomes slow?

60-second answer: Implement circuit breakers in your health check logic. Track the last N dependency checks (e.g., 10), and if more than 3 failed or timed out, open the circuit. When open, health checks fail fast without calling the dependency. This prevents all instances from timing out simultaneously and being marked unhealthy. Additionally, make dependencies optional when possible: if the recommendation service is slow, remain healthy but operate in degraded mode without recommendations. Use aggressive timeouts (50-100ms) on dependency checks so slow dependencies fail fast rather than causing health check timeouts.

2-minute detailed answer: Cascading failures happen when a slow dependency causes all your health checks to timeout, marking all instances unhealthy simultaneously, which removes your entire service from rotation. The solution is circuit breakers at the health check level. Implement a sliding window tracking the last 10-20 dependency checks. If more than 30% failed or timed out, open the circuit—subsequent health checks immediately fail without calling the dependency. Keep the circuit open for 30-60 seconds, then try a single check (half-open state). If it succeeds, close the circuit; if it fails, reopen for another 60 seconds. This prevents your health checks from contributing to the dependency’s overload. Second, distinguish critical from optional dependencies. Database connectivity is critical—if it’s down, you genuinely can’t serve traffic. But the recommendation service might be optional—you can serve product pages without recommendations. For optional dependencies, don’t include them in health checks at all, or mark yourself as “healthy but degraded” when they’re unavailable. Third, use aggressive timeouts: if your normal request timeout is 3 seconds, health check timeouts should be 50-100ms. A slow dependency should fail the health check quickly rather than causing the check itself to timeout. Fourth, consider the blast radius: if checking a dependency could mark all instances unhealthy, maybe don’t check it in health checks—use separate monitoring to alert on dependency issues. Finally, implement graceful degradation: when a dependency is unhealthy, remain in rotation but operate in reduced functionality mode, and emit metrics indicating degraded operation.
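The sliding-window breaker with a half-open trial might look like this sketch; window size, failure ratio, and cooldown are the illustrative values from the answer above:

```python
import time
from collections import deque

class HealthCheckBreaker:
    """Circuit breaker around a dependency probe: opens when more than
    failure_ratio of the last `window` checks failed, then allows a single
    half-open trial after `cooldown` seconds."""
    def __init__(self, window=10, failure_ratio=0.3, cooldown=60, clock=time.monotonic):
        self.results = deque(maxlen=window)
        self.failure_ratio = failure_ratio
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None                    # None means the circuit is closed

    def check(self, probe):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return False                     # open: fail fast, skip the dependency
            if self._run(probe):                 # half-open: one trial probe
                self.opened_at = None            # trial succeeded: close the circuit
                self.results.clear()
                return True
            self.opened_at = self.clock()        # trial failed: restart the cooldown
            return False
        ok = self._run(probe)
        self.results.append(ok)
        if (len(self.results) == self.results.maxlen
                and self.results.count(False) / len(self.results) > self.failure_ratio):
            self.opened_at = self.clock()        # too many recent failures: open
        return ok

    def _run(self, probe):
        try:
            return bool(probe())
        except Exception:
            return False
```

While the circuit is open the dependency is never called, which is exactly what stops health-check traffic from piling onto an already overloaded service.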

Red flags: Not mentioning circuit breakers. Suggesting to just “increase timeouts” (this makes the problem worse). Not distinguishing critical from optional dependencies. Assuming all dependencies must be healthy for the service to be healthy.

Q4: How would you design health checks for a stateful service like a database?

60-second answer: Stateful services need different health check strategies than stateless services. Liveness checks verify the process is running and not deadlocked—check that the database can accept connections and respond to simple queries like SELECT 1. Readiness checks verify the database can serve production traffic—check replication lag (for replicas), disk space, and connection pool availability. Use longer grace periods: a database starting up might take 30-60 seconds to replay WAL logs, so don’t mark it unhealthy immediately. Implement separate checks for different roles: a primary database might be healthy but not ready to accept writes if it’s in the process of failing over.

2-minute detailed answer: Stateful services like databases have unique health check requirements because they can’t be trivially replaced—restarting a database or removing it from rotation has significant consequences. First, distinguish operational states: a database can be starting (replaying WAL logs), running (accepting connections), primary (accepting writes), replica (accepting reads), or failing over (transitioning between states). Each state needs different health criteria. Liveness checks should verify the process is alive and not deadlocked: can it accept a connection and respond to SELECT 1 within 100ms? This should almost never fail—only for unrecoverable errors like out-of-disk or corrupted data files. Readiness checks verify the database can serve production traffic: for replicas, check replication lag (unhealthy if >30 seconds behind primary), for primaries, check connection pool availability and disk space (unhealthy if >90% full). Use longer grace periods: configure startup probes with 60-second timeouts and allow 5-10 minutes for the database to become ready after starting. This accommodates WAL replay and cache warming. Implement role-specific checks: a primary database should verify it holds the write lock (in systems like Patroni or MySQL Group Replication), while replicas should verify they’re connected to a primary. During failovers, the old primary should mark itself unhealthy immediately (to prevent split-brain), while the new primary should only mark itself healthy after confirming it has the latest data. Finally, coordinate health checks with orchestration: when scaling down, the orchestrator should drain connections gracefully (stop accepting new connections, wait for existing ones to complete) before marking the database unhealthy.
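A sketch of a replica readiness function using the lag and disk thresholds mentioned above; the inputs would come from whatever collectors the deployment already runs, and the thresholds would be tuned per system:

```python
def replica_ready(lag_seconds, disk_used_fraction,
                  max_lag=30.0, max_disk=0.90):
    """Readiness for a read replica: serve reads only if replication lag and
    disk usage are within bounds. A lag of None means the replica is not
    connected to a primary at all."""
    if lag_seconds is None:
        return False                   # disconnected from the primary
    if lag_seconds > max_lag:
        return False                   # too stale to serve reads
    if disk_used_fraction > max_disk:
        return False                   # about to run out of disk
    return True
```

Note that this is a readiness signal only—failing it should pull the replica from the read pool, never trigger a restart.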

Red flags: Treating stateful services like stateless services (aggressive restarts, short grace periods). Not considering replication lag or failover states. Not coordinating health checks with graceful shutdown. Suggesting to restart databases when health checks fail.

Q5: How do health checks integrate with auto-scaling?

60-second answer: Health checks inform auto-scaling decisions by distinguishing between “need more capacity” and “instances are broken.” If instances are healthy but CPU/memory is high, scale up. If instances are unhealthy, investigate and fix rather than scaling. Implement hysteresis: don’t scale up immediately when one instance fails health checks (might be transient), but do scale up if multiple instances are unhealthy for >5 minutes (might indicate insufficient capacity causing overload). When scaling down, mark instances unhealthy first to drain traffic, then terminate. Health checks prevent auto-scaling from making problems worse—without them, an incident causing unhealthy instances might trigger scale-up, adding more broken instances.

2-minute detailed answer: Health checks and auto-scaling must work together to distinguish between capacity problems (need more instances) and operational problems (instances are broken). The integration works like this: auto-scaling monitors both health check status and resource utilization (CPU, memory, request rate). If instances are healthy but CPU is consistently >70%, scale up—you need more capacity. If instances are unhealthy, don’t scale up automatically—instead, alert engineers to investigate. Scaling up when instances are unhealthy often makes problems worse: if instances are unhealthy because of a bad deployment, adding more instances just spreads the bad code. If they’re unhealthy because a dependency is down, adding instances increases load on the dependency. Implement hysteresis in both directions: don’t scale up immediately when one instance fails (might be transient), but do scale up if >20% of instances are unhealthy for >5 minutes (might indicate the remaining instances are overloaded). When scaling down, coordinate with health checks: first mark the instance unhealthy (or set it to “draining” mode), wait for existing connections to complete (typically 30-60 seconds), then terminate. This prevents dropping in-flight requests. For stateful services, health checks should prevent auto-scaling from terminating instances with important data—mark them as “unhealthy for new requests” but keep them alive to serve existing sessions. Finally, implement safeguards: if >50% of instances are unhealthy, disable auto-scaling entirely and page on-call—this is likely a major incident, not a capacity issue. Netflix’s approach: auto-scaling only adds capacity when instances are healthy but overloaded, and only removes capacity when instances are healthy and underutilized. Unhealthy instances trigger alerts, not auto-scaling.
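The decision policy above can be condensed into one function. All thresholds are the illustrative values from the answer, not recommendations:

```python
def scaling_action(healthy, total, avg_cpu,
                   cpu_high=0.70, cpu_low=0.30,
                   unhealthy_incident=0.50, unhealthy_capacity=0.20):
    """Combine fleet health with utilization to pick an auto-scaling action."""
    unhealthy_frac = 1 - healthy / total
    if unhealthy_frac > unhealthy_incident:
        return "disable-autoscaling-and-page"   # likely a major incident, not capacity
    if unhealthy_frac > unhealthy_capacity:
        return "scale-up"                       # survivors may be overloaded
    if avg_cpu > cpu_high:
        return "scale-up"                       # healthy but overloaded
    if avg_cpu < cpu_low:
        return "scale-down"                     # healthy and underutilized
    return "hold"
```

The ordering matters: the incident check runs first so that a mass failure can never be misread as a capacity problem.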

Red flags: Not distinguishing between capacity problems and operational problems. Suggesting to scale up when instances are unhealthy. Not coordinating scale-down with graceful connection draining. Not implementing safeguards to prevent auto-scaling during major incidents.

Red Flags to Avoid

Red Flag 1: “Health checks should verify the service is working correctly by executing real business logic”

Why it’s wrong: Health checks are for routing decisions, not integration testing. Executing business logic (processing an order, running a recommendation algorithm) is too slow (seconds) and expensive for checks that run every 5-10 seconds across hundreds of instances. It also creates side effects—you don’t want health checks creating test data in production databases. Health checks should verify the service can process requests (dependencies available, resources not exhausted), not that it correctly processes them.

What to say instead: “Health checks should be fast (<100ms) and side-effect-free, verifying only that the service can accept and begin processing requests. They should check that critical dependencies are reachable (database connection available, cache responds to PING) but not execute actual business logic. We use separate synthetic monitoring to verify end-to-end functionality, running complete workflows every few minutes rather than every few seconds.”

Red Flag 2: “If a health check fails once, immediately remove the instance from rotation”

Why it’s wrong: Transient failures happen constantly in distributed systems: brief network blips, garbage collection pauses, momentary CPU spikes. Removing instances after a single failure causes flapping—instances rapidly oscillating between healthy and unhealthy, creating inconsistent user experiences and making debugging impossible. It also increases the risk of cascading failures: one transient issue could remove multiple instances simultaneously, overloading the remaining ones.

What to say instead: “Implement hysteresis by requiring multiple consecutive failures (typically 2-3) before marking an instance unhealthy. This filters out transient issues while still detecting real problems within 20-30 seconds. Use an even higher threshold for marking healthy again (5-10 consecutive successes) to prevent rapid oscillation. Monitor health check state transitions—if you see frequent flapping, increase your thresholds. The goal is stability: it’s better to leave a marginally unhealthy instance in rotation temporarily than to constantly churn the active instance pool.”
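A hysteresis tracker with those asymmetric thresholds is only a few lines. This is a sketch of the idea, not a drop-in implementation:

```python
class HysteresisTracker:
    """Asymmetric thresholds: fail_threshold consecutive failures to mark
    unhealthy, success_threshold consecutive successes to mark healthy again."""
    def __init__(self, fail_threshold=2, success_threshold=5):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._streak = 0          # positive = success streak, negative = failure streak

    def observe(self, check_passed):
        """Feed each health check result; returns current health state."""
        if check_passed:
            self._streak = max(self._streak, 0) + 1
            if not self.healthy and self._streak >= self.success_threshold:
                self.healthy = True
        else:
            self._streak = min(self._streak, 0) - 1
            if self.healthy and -self._streak >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

Because a single success resets the failure streak (and vice versa), an instance must prove itself consistently in one direction before its state flips, which is what damps the flapping.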

Red Flag 3: “Health checks should call downstream services to verify they’re available”

Why it’s wrong: Without circuit breakers, this causes cascading failures. If a downstream service becomes slow, all your health checks timeout, marking all your instances unhealthy simultaneously. Your entire service is removed from rotation, causing a complete outage—even though you could have served traffic with degraded functionality. Additionally, health check traffic contributes to the downstream service’s overload, making recovery harder.

What to say instead: “When health checks include downstream dependencies, implement circuit breakers that fail fast after N consecutive failures. This prevents all instances from timing out simultaneously. Better yet, distinguish critical from optional dependencies: if the database is down, we genuinely can’t serve traffic, so include it in health checks. But if the recommendation service is down, we can still serve product pages without recommendations—so either exclude it from health checks or mark ourselves as ‘healthy but degraded.’ Use aggressive timeouts (50-100ms) so slow dependencies fail fast rather than causing health check timeouts.”

Red Flag 4: “Use the same health check endpoint for both liveness and readiness”

Why it’s wrong: Liveness checks determine whether to restart a container, while readiness checks determine whether to route traffic. If you use the same endpoint that checks dependencies, a database outage will cause liveness checks to fail, triggering container restarts. The new containers also can’t reach the database, so they’re immediately restarted. You’re now in a restart loop, and your service never recovers even after the database comes back online.

What to say instead: “Expose separate endpoints for liveness and readiness. Liveness checks (e.g., /healthz) should verify only that the process is alive and not deadlocked—they should almost never fail, returning 200 unless there’s an unrecoverable error like out-of-memory. Readiness checks (e.g., /ready) verify the service can fulfill requests, including dependency health. Configure orchestrators to restart on liveness failures but only remove from rotation on readiness failures. This prevents restart loops during dependency outages while still detecting genuinely broken processes.”

Red Flag 5: “Health checks aren’t important for small systems—just use metrics and alerts”

Why it’s wrong: Even small systems need health checks for load balancer integration and graceful deployments. Without health checks, load balancers route traffic to instances that are starting up, shutting down, or broken, causing user-facing errors. During deployments, new instances receive traffic before they’re ready, causing elevated error rates. Health checks are the mechanism that tells infrastructure “this instance is ready for traffic”—without them, you’re flying blind.

What to say instead: “Health checks are foundational infrastructure, not a scaling concern. Even a single-instance service behind a load balancer needs health checks to handle restarts gracefully—the load balancer should stop routing traffic before the instance shuts down. For multi-instance services, health checks enable zero-downtime deployments: new instances don’t receive traffic until they pass health checks, and old instances drain connections before terminating. Health checks are the contract between your application and the infrastructure layer, communicating ‘I’m ready’ or ‘I’m not ready.’ Metrics and alerts tell you how well the system is performing; health checks tell the infrastructure whether to route traffic. They serve different purposes.”


Key Takeaways

  • Health checks answer a binary question—“Should this instance receive traffic?”—not “How well is it performing?” They drive routing decisions (load balancers), orchestration actions (container restarts), and auto-scaling. This is fundamentally different from metrics, which measure system behavior.

  • Separate liveness (is the process alive?) from readiness (can it serve traffic?). A service can be alive but not ready—during startup, when dependencies are unavailable, or during graceful shutdown. Conflating these leads to restart loops and cascading failures.

  • Implement circuit breakers when health checks include dependencies. Without them, a slow dependency causes all health checks to timeout, marking all instances unhealthy simultaneously and causing complete outages. Circuit breakers fail fast after N consecutive failures, preventing cascading failures.

  • Use hysteresis to prevent flapping: require multiple consecutive failures to mark unhealthy (2-3) and even more consecutive successes to mark healthy (5-10). Transient failures are common in distributed systems; removing instances after a single failure causes instability. The asymmetry prevents rapid oscillation.

  • Health checks must be fast (<100ms) and lightweight (<1-2% of system capacity). They run every few seconds across all instances, so expensive checks (complex queries, downstream service calls) create significant overhead and can contribute to incidents. Check connectivity, not functionality—SELECT 1, not COUNT(*).

Prerequisites:

  • Load Balancing - Health checks integrate with load balancers to route traffic only to healthy instances
  • Circuit Breakers - Essential for preventing cascading failures in health check logic
  • Service Discovery - Health checks inform service registries about instance availability

Related Topics:

  • Metrics and Monitoring - Complements health checks by measuring system performance
  • Alerting - Health check failures trigger alerts for on-call engineers
  • Graceful Degradation - Health checks enable services to remain operational with reduced functionality

Next Steps: