Health Endpoint Monitoring: /health API Guide
After this topic, you will be able to:
- Implement health check endpoints with appropriate dependency checks and timeout handling
- Design health check hierarchies distinguishing liveness, readiness, and startup probes
- Configure health check intervals and failure thresholds for load balancer integration
TL;DR
Health endpoint monitoring exposes standardized HTTP endpoints that external systems query to verify service health, enabling automated detection of failures and intelligent traffic routing. Services implement liveness checks (process alive?), readiness checks (can accept traffic?), and startup checks (finished initialization?) to integrate with load balancers, orchestrators, and monitoring systems. The pattern transforms opaque service failures into actionable signals that drive automated remediation.
Cheat Sheet: /health/live for process health, /health/ready for traffic eligibility, /health/startup for initialization; aggregate dependency health with circuit breakers; set timeouts < 1s; fail fast on critical dependencies; return 200/503 with JSON details; configure LB thresholds (3-5 failures before removal).
The Problem It Solves
Distributed systems face a fundamental observability problem: how do external systems know if a service is healthy enough to receive traffic? Traditional approaches like TCP port checks only verify that a process is listening, not that it can successfully handle requests. A service might pass a port check while its database connection pool is exhausted, its cache is unavailable, or it’s experiencing memory pressure that causes 10-second response times. Load balancers continue routing traffic to degraded instances, causing failures to cascade across the system.
The problem intensifies in orchestrated environments like Kubernetes where services restart frequently. Without structured health signals, orchestrators can’t distinguish between a service that’s still initializing (needs more time) and one that’s genuinely failed (needs restart). They either wait too long to restart failed services or prematurely kill services during slow startups. Manual health verification doesn’t scale—engineers can’t SSH into hundreds of instances to check logs. The system needs a standardized, machine-readable way for services to communicate their health status that drives automated decisions about traffic routing and instance lifecycle management.
Solution Overview
Health endpoint monitoring implements HTTP endpoints that return structured health status, enabling external systems to make intelligent routing and lifecycle decisions. Services expose multiple endpoints with different semantics: liveness probes answer “is the process alive?”, readiness probes answer “can this instance handle traffic?”, and startup probes answer “has initialization completed?”. Each endpoint performs appropriate health checks—from simple process verification to deep dependency validation—and returns HTTP status codes (200 for healthy, 503 for unhealthy) with JSON payloads containing diagnostic details.
The pattern integrates with infrastructure components through a polling model. Load balancers query readiness endpoints every 5-10 seconds, removing instances that fail consecutive checks and restoring them when they recover. Orchestrators use liveness checks to detect deadlocked processes that need restart and startup checks to give slow-starting services adequate initialization time. Monitoring systems scrape health endpoints to track service degradation trends and trigger alerts before complete failures. This creates a closed feedback loop where services self-report their health, and infrastructure automatically responds—removing unhealthy instances from rotation, restarting failed processes, and routing traffic only to capable instances.
Health Endpoint Monitoring Architecture
graph LR
subgraph Service Instance
App["Application<br/><i>Service Process</i>"]
Live["/health/live<br/><i>Liveness Probe</i>"]
Ready["/health/ready<br/><i>Readiness Probe</i>"]
Startup["/health/startup<br/><i>Startup Probe</i>"]
App --> Live
App --> Ready
App --> Startup
end
subgraph Dependencies
DB[("Database<br/><i>PostgreSQL</i>")]
Cache[("Cache<br/><i>Redis</i>")]
Queue["Message Queue<br/><i>Kafka</i>"]
end
subgraph Infrastructure
LB["Load Balancer<br/><i>ALB/NLB</i>"]
Orch["Orchestrator<br/><i>Kubernetes</i>"]
Monitor["Monitoring<br/><i>Prometheus</i>"]
end
Ready --"Validates"--> DB
Ready --"Validates"--> Cache
Ready --"Validates"--> Queue
LB --"1. Poll every 5-10s"--> Ready
Orch --"2. Poll every 10s"--> Live
Orch --"3. Poll during startup"--> Startup
Monitor --"4. Scrape metrics"--> Ready
LB --"5. Route traffic"--> App
Health endpoint monitoring exposes three distinct endpoints that infrastructure components poll to make automated decisions. Load balancers use readiness checks for traffic routing, orchestrators use liveness checks for restart decisions, and startup checks prevent premature restarts during initialization.
How It Works
Step 1: Implement Health Check Endpoints. Services expose three distinct endpoints with different responsibilities. The /health/live endpoint performs minimal checks—verify the HTTP server responds and critical threads aren’t deadlocked. This typically returns in under 50ms and should almost never fail unless the process is truly dead. The /health/ready endpoint performs comprehensive checks: database connection pool has available connections, cache responds within 100ms, message queue is reachable, disk space exceeds 10% free. This might take 200-500ms and fails when dependencies are degraded. The /health/startup endpoint checks initialization completion: database migrations finished, configuration loaded, caches warmed. This only matters during service startup and can take several seconds.
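The three endpoints can be sketched as plain handler functions returning an HTTP status code and a JSON-serializable payload. This is an illustrative skeleton, not any particular framework's API; names like `STARTUP_DONE` and the dependency flags are placeholders.

```python
import time

START_TIME = time.monotonic()
STARTUP_DONE = False  # flip to True once migrations, config, and cache warming finish

def liveness():
    # Minimal check: if this handler runs at all, the process and its HTTP
    # server are alive. Return fast, never touch dependencies.
    return 200, {"status": "alive"}

def readiness(db_ok: bool, cache_ok: bool, queue_ok: bool):
    # Deep check: every critical dependency must be reachable to accept traffic.
    details = {"database": db_ok, "cache": cache_ok, "queue": queue_ok}
    healthy = all(details.values())
    return (200 if healthy else 503), {
        "status": "healthy" if healthy else "unhealthy",
        "details": details,
    }

def startup():
    # Only meaningful during initialization; once it passes, the orchestrator
    # switches over to the liveness and readiness probes.
    if STARTUP_DONE:
        return 200, {"status": "started"}
    return 503, {"status": "starting",
                 "uptime_s": round(time.monotonic() - START_TIME, 1)}
```

A real service would wire these into its HTTP router; the point is the asymmetry of the checks, not the routing.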
Step 2: Aggregate Dependency Health. Each dependency check runs with strict timeouts and returns a status (healthy/degraded/unhealthy) plus metadata. For a database check, attempt a simple SELECT 1 query with a 100ms timeout. If it succeeds, return healthy. If it times out or fails, check if a circuit breaker is open—if so, return degraded (dependency is known-bad, don’t keep hammering it). Aggregate these individual checks using a policy: all critical dependencies must be healthy for overall health, but degraded non-critical dependencies (like a recommendation service) only warrant a warning in the response payload. Netflix’s approach is instructive: they mark instances as “out of service” if any critical dependency fails but keep serving traffic with degraded functionality if optional dependencies fail.
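The aggregation policy above can be sketched as follows: each check runs with its own timeout, open circuit breakers short-circuit to "degraded" without probing the dependency, and overall health requires every critical check to be healthy. The `Check` structure and names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    fn: Callable[[], bool]    # returns True if the dependency responds
    timeout_s: float
    critical: bool
    breaker_open: bool = False  # maintained by a circuit breaker elsewhere

def run_checks(checks):
    results = {}
    to_run = []
    for c in checks:
        if c.breaker_open:
            results[c.name] = "degraded"  # known-bad: don't hammer it
        else:
            to_run.append(c)
    pool = ThreadPoolExecutor(max_workers=max(len(to_run), 1))
    futures = {pool.submit(c.fn): c for c in to_run}
    for future, check in futures.items():
        try:
            ok = future.result(timeout=check.timeout_s)
            results[check.name] = "healthy" if ok else "unhealthy"
        except Exception:  # timed out or the check itself raised
            results[check.name] = "unhealthy"
    # Note: a timed-out thread keeps running in the background; production
    # implementations need cancellable checks.
    pool.shutdown(wait=False)
    critical_ok = all(results[c.name] == "healthy" for c in checks if c.critical)
    return (200 if critical_ok else 503), results
```

Non-critical failures still appear in the payload as warnings, but only critical failures flip the status code to 503, matching the aggregation flow diagrammed later in this section.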
Step 3: Configure Load Balancer Integration. Load balancers poll the readiness endpoint at regular intervals (typically 5-10 seconds). Configure failure thresholds carefully: require 3-5 consecutive failures before removing an instance to avoid flapping from transient network issues, but restore instances after just 1-2 successful checks to quickly recover capacity. Set health check timeouts shorter than the interval (e.g., 2-second timeout with 5-second interval) to prevent check accumulation. At LinkedIn, they learned that overly aggressive health checks (1-second intervals) created thundering herd problems when dependencies failed—thousands of instances simultaneously hammering a degraded database made recovery impossible.
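The removal/restoration logic described above amounts to a small state machine per instance. A sketch with illustrative defaults (3 consecutive failures to remove, 2 consecutive successes to restore):

```python
class InstanceHealthTracker:
    """Tracks one instance's consecutive health check results, LB-style."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.in_rotation = True
        self._fails = 0
        self._successes = 0

    def record(self, check_passed: bool) -> bool:
        """Feed one check result; returns whether the instance should
        currently receive traffic."""
        if check_passed:
            self._fails = 0
            if not self.in_rotation:
                self._successes += 1
                if self._successes >= self.recover_threshold:
                    self.in_rotation = True
        else:
            self._successes = 0
            self._fails += 1
            if self._fails >= self.fail_threshold:
                self.in_rotation = False
        return self.in_rotation
```

The asymmetry (slow to remove, quick to restore) is the point: it absorbs transient blips without flapping while recovering capacity fast.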
Step 4: Implement Graceful Degradation. When the readiness check fails, the service should stop accepting new requests but finish processing in-flight requests. Return 503 immediately for new requests while existing requests complete normally. This prevents the “half-dead instance” problem where a service fails health checks but continues processing requests poorly, degrading user experience. Set a maximum graceful shutdown period (30-60 seconds) after which the process forcibly terminates. Kubernetes implements this elegantly: when a pod fails readiness checks, it’s removed from service endpoints immediately, but the pod continues running to finish existing work before termination.
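One way to sketch the draining behavior: a shared controller that flips the readiness status to 503 and rejects new work, while in-flight requests are counted and allowed to finish. The names and the 30-second default are illustrative.

```python
import threading

class DrainController:
    def __init__(self, max_drain_seconds=30):
        self.draining = False
        self.in_flight = 0
        self.max_drain_seconds = max_drain_seconds  # hard deadline enforced elsewhere
        self._lock = threading.Lock()

    def readiness_status(self) -> int:
        # The LB sees 503 and stops sending new traffic.
        return 503 if self.draining else 200

    def request_started(self) -> bool:
        with self._lock:
            if self.draining:
                return False       # reject new work immediately
            self.in_flight += 1
            return True

    def request_finished(self):
        with self._lock:
            self.in_flight -= 1

    def begin_drain(self):
        self.draining = True       # existing requests keep running to completion
```

After `max_drain_seconds` the process would be forcibly terminated regardless of `in_flight`, which this sketch leaves to the surrounding shutdown logic.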
Step 5: Monitor Health Check Metrics. Instrument health checks themselves to detect systemic issues. Track check duration (p50, p99), failure rates by dependency, and time-to-recovery after failures. If database health checks suddenly spike from 10ms to 200ms across all instances, that’s an early warning of database degradation before complete failure. Stripe monitors the correlation between health check failures and actual request errors—if health checks fail but requests succeed, the checks are too sensitive; if requests fail but health checks pass, the checks are too shallow.
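A minimal sketch of instrumenting the checks themselves, assuming no metrics library: wrap each check to record its duration, then read percentiles so a creeping slowdown (10ms to 200ms) surfaces before checks start failing outright.

```python
import time

class CheckMetrics:
    def __init__(self):
        self.durations = {}  # dependency name -> list of durations in seconds

    def timed(self, name, fn):
        # Run a check while recording how long it took.
        start = time.perf_counter()
        try:
            return fn()
        finally:
            self.durations.setdefault(name, []).append(time.perf_counter() - start)

    def percentile(self, name, pct):
        # Nearest-rank percentile over recorded samples; None if no data.
        samples = sorted(self.durations.get(name, []))
        if not samples:
            return None
        idx = min(int(len(samples) * pct / 100), len(samples) - 1)
        return samples[idx]
```

In production these numbers would be exported to Prometheus or similar; the value is the trend, not any single sample.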
Health Check Dependency Aggregation Flow
graph TB
Request["Health Check Request<br/>/health/ready"] --> Orchestrator["Health Check<br/>Orchestrator"]
Orchestrator --"Parallel checks"--> DB_Check["Database Check<br/><i>SELECT 1</i>"]
Orchestrator --"Parallel checks"--> Cache_Check["Cache Check<br/><i>PING</i>"]
Orchestrator --"Parallel checks"--> Queue_Check["Queue Check<br/><i>Connection test</i>"]
DB_Check --"100ms timeout"--> DB_Result{"Result?"}
Cache_Check --"50ms timeout"--> Cache_Result{"Result?"}
Queue_Check --"100ms timeout"--> Queue_Result{"Result?"}
DB_Result --"Success"--> DB_Healthy["✓ Healthy"]
DB_Result --"Timeout/Error"--> DB_CB{"Circuit<br/>Breaker?"}
DB_CB --"Open"--> DB_Degraded["⚠ Degraded<br/><i>Known failure</i>"]
DB_CB --"Closed"--> DB_Unhealthy["✗ Unhealthy"]
Cache_Result --"Success"--> Cache_Healthy["✓ Healthy"]
Cache_Result --"Timeout/Error"--> Cache_Unhealthy["✗ Unhealthy"]
Queue_Result --"Success"--> Queue_Healthy["✓ Healthy"]
Queue_Result --"Timeout/Error"--> Queue_Unhealthy["✗ Unhealthy"]
DB_Healthy & Cache_Healthy & Queue_Healthy --> Aggregator["Aggregate Results<br/><i>Policy: All critical healthy</i>"]
DB_Degraded & Cache_Healthy & Queue_Healthy --> Aggregator
DB_Unhealthy --> Aggregator
Cache_Unhealthy --> Aggregator
Queue_Unhealthy --> Aggregator
Aggregator --> Decision{"All Critical<br/>Healthy?"}
Decision --"Yes"--> Response200["HTTP 200<br/>{status: healthy, details: {...}}"]
Decision --"No"--> Response503["HTTP 503<br/>{status: unhealthy, details: {...}}"]
Health checks run dependency validations in parallel with strict timeouts, use circuit breakers to fail fast on known-bad dependencies, and aggregate results based on criticality. Critical dependency failures return 503, while degraded non-critical dependencies return 200 with warnings.
Load Balancer Health Check Integration
sequenceDiagram
participant LB as Load Balancer
participant I1 as Instance 1
participant I2 as Instance 2
participant I3 as Instance 3
Note over LB: Poll every 5 seconds
LB->>I1: GET /health/ready
I1-->>LB: 200 OK (healthy)
LB->>I2: GET /health/ready
I2-->>LB: 200 OK (healthy)
LB->>I3: GET /health/ready
I3-->>LB: 200 OK (healthy)
Note over LB,I3: All instances in rotation
Note over I2: Database connection<br/>pool exhausted
LB->>I2: GET /health/ready
I2-->>LB: 503 Unhealthy (failure 1/3)
Note over LB: Continue routing to I2<br/>(below threshold)
LB->>I2: GET /health/ready
I2-->>LB: 503 Unhealthy (failure 2/3)
LB->>I2: GET /health/ready
I2-->>LB: 503 Unhealthy (failure 3/3)
Note over LB: Remove I2 from rotation<br/>(threshold reached)
Note over LB,I3: Traffic to I1 and I3 only
Note over I2: Connection pool<br/>recovers
LB->>I2: GET /health/ready
I2-->>LB: 200 OK (success 1/2)
LB->>I2: GET /health/ready
I2-->>LB: 200 OK (success 2/2)
Note over LB: Restore I2 to rotation<br/>(recovery threshold met)
Note over LB,I3: All instances in rotation
Load balancers poll readiness endpoints at regular intervals and use failure thresholds to prevent flapping. Instances require 3 consecutive failures before removal but only 2 successes for restoration, optimizing for availability during transient issues while quickly recovering capacity.
Preventing Cascading Failures During Dependency Outages
graph TB
subgraph Normal Operation
N_HC["Health Check"] --"1. Query"--> N_DB[("Database")]
N_DB --"2. Response 10ms"--> N_HC
N_HC --"3. Return 200"--> N_LB["Load Balancer"]
end
subgraph Dependency Failure - Without Protection
F_HC["Health Check"] --"1. Query"--> F_DB[("Database<br/><i>Overloaded</i>")]
F_DB --"2. Timeout 1000ms"--> F_HC
F_HC --"3. Retry immediately"--> F_DB
F_DB --"4. Timeout 1000ms"--> F_HC
F_HC --"5. Return 503"--> F_LB["Load Balancer"]
Note1["❌ Problem:<br/>• Health checks hammer DB<br/>• Prevents recovery<br/>• All instances fail checks"]
end
subgraph Dependency Failure - With Protection
P_HC["Health Check"] --"1. Query"--> P_DB[("Database<br/><i>Overloaded</i>")]
P_DB --"2. Timeout 100ms"--> P_HC
P_HC --"3. Check circuit breaker"--> P_CB{"Circuit<br/>Breaker"}
P_CB --"Open"--> P_Skip["Skip check<br/><i>Use cached result</i>"]
P_CB --"Closed"--> P_Open["Open circuit<br/><i>After 3 failures</i>"]
P_Skip --"4. Return 503<br/><i>Known failure</i>"--> P_LB["Load Balancer"]
P_Open --"5. Wait 30s<br/><i>Exponential backoff</i>"--> P_Retry["Retry after<br/>backoff period"]
Note2["✓ Solution:<br/>• Strict timeout (100ms)<br/>• Circuit breaker stops hammering<br/>• Exponential backoff<br/>• Separate connection pool"]
end
subgraph Grace Period Strategy
G_HC["Health Check"] --"1. Dependency fails"--> G_Timer{"Grace<br/>Period?"}
G_Timer --"< 30s since failure"--> G_Cache["Return last<br/>healthy status"]
G_Timer --"> 30s since failure"--> G_Fail["Return 503"]
G_Cache --"2. Allow failover time"--> G_LB["Load Balancer"]
Note3["✓ Allows time for:<br/>• Database failover (10-15s)<br/>• Connection pool refresh<br/>• Transient issue recovery"]
end
Preventing cascading failures requires circuit breakers that stop checking failed dependencies, strict timeouts to fail fast, exponential backoff to reduce load, and grace periods that allow time for dependency failover. Without these protections, health checks can prevent recovery by overwhelming already-stressed dependencies.
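The breaker-plus-backoff protection can be sketched as a small class: after N consecutive failures the breaker opens and `should_check` tells the health check to skip the dependency, allowing a probe again only after an exponentially growing backoff. Thresholds and the injectable clock are illustrative.

```python
import time

class HealthCheckBreaker:
    def __init__(self, open_after=3, base_backoff_s=1.0, max_backoff_s=30.0,
                 clock=time.monotonic):
        self.open_after = open_after
        self.base = base_backoff_s
        self.max = max_backoff_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def _backoff(self):
        # 1s, 2s, 4s ... capped at max_backoff_s.
        return min(self.base * (2 ** max(self.failures - self.open_after, 0)),
                   self.max)

    def should_check(self) -> bool:
        if self.opened_at is None:
            return True
        # While open, probe again only once the backoff window has elapsed.
        return self.clock() - self.opened_at >= self._backoff()

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.open_after:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

While the breaker is open, the health check would report the dependency as degraded from cached state instead of probing it, which is exactly what keeps a recovering database from being hammered.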
Variants
Shallow vs Deep Health Checks: Shallow checks verify only the service’s own health (process alive, memory available, threads responsive) and return in under 50ms. Deep checks validate all dependencies and can take 500ms or more. Use shallow checks for liveness probes and high-frequency monitoring (every 1-2 seconds) to quickly detect process failures. Use deep checks for readiness probes at lower frequency (every 10-30 seconds) to make traffic routing decisions. The trade-off is detection speed versus accuracy—shallow checks catch failures faster but miss dependency issues, while deep checks provide complete health pictures but add latency and load.
Synchronous vs Asynchronous Health Checks: Synchronous checks perform all validation when the endpoint is called, blocking until complete. Asynchronous checks run validation continuously in background threads and cache results, making the endpoint return instantly with the last known status. Synchronous checks are simpler to implement and always current but can timeout under load. Asynchronous checks scale better and never timeout but can return stale status if background checks hang. Use synchronous for low-traffic services and asynchronous for high-traffic services where health check load becomes significant. Google’s services typically use asynchronous checks with 1-second refresh intervals, ensuring health endpoints return in under 10ms even when checking dozens of dependencies.
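The asynchronous variant can be sketched with a background thread that refreshes a cached status on an interval, so the endpoint itself just reads the cache and returns instantly regardless of how slow the underlying checks are. Interval and names are illustrative.

```python
import threading

class CachedHealthChecker:
    def __init__(self, check_fn, interval_s=1.0):
        self.check_fn = check_fn          # runs the (possibly slow) deep checks
        self.interval_s = interval_s
        self.status = (503, {"status": "unknown"})  # until the first refresh
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            try:
                self.status = self.check_fn()
            except Exception:
                self.status = (503, {"status": "check crashed"})
            self._stop.wait(self.interval_s)

    def endpoint(self):
        # Never runs checks inline; returns the last cached result.
        return self.status

    def stop(self):
        self._stop.set()
```

The staleness trade-off discussed above is visible here: if `check_fn` hangs, `endpoint` keeps serving the last cached status, so a production version also needs a freshness timestamp and a staleness cutoff.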
Binary vs Graduated Health Status: Binary health returns only healthy (200) or unhealthy (503). Graduated health returns multiple states: healthy (200), degraded (200 with warnings), unhealthy (503), and unknown (500). Binary status is simpler for load balancers to interpret—either route traffic or don’t. Graduated status enables sophisticated routing policies: prefer healthy instances, use degraded instances only when healthy capacity is exhausted, never route to unhealthy instances. AWS Application Load Balancer supports only binary health, while service meshes like Istio support graduated health with weighted routing. Use binary for simple deployments and graduated when you need fine-grained traffic control during partial failures.
Liveness vs Readiness vs Startup Probe Decision Tree
flowchart TB
Start(["Health Check<br/>Decision"]) --> Question1{"What are you<br/>checking?"}
Question1 --"Process is alive"--> Liveness["Use Liveness Probe<br/>/health/live"]
Question1 --"Can accept traffic"--> Readiness["Use Readiness Probe<br/>/health/ready"]
Question1 --"Finished initialization"--> Startup["Use Startup Probe<br/>/health/startup"]
Liveness --> L_Checks["Minimal Checks:<br/>• HTTP server responds<br/>• No thread deadlock<br/>• Memory available"]
L_Checks --> L_Timing["⏱ < 50ms response<br/>🔄 Every 10s<br/>❌ Failure = Restart"]
Readiness --> R_Checks["Comprehensive Checks:<br/>• Database connectivity<br/>• Cache availability<br/>• Queue reachable<br/>• Disk space > 10%"]
R_Checks --> R_Timing["⏱ < 500ms response<br/>🔄 Every 5-10s<br/>❌ Failure = Remove from LB"]
Startup --> S_Checks["Initialization Checks:<br/>• DB migrations done<br/>• Config loaded<br/>• Caches warmed<br/>• Dependencies ready"]
S_Checks --> S_Timing["⏱ < 5s response<br/>🔄 Every 5s<br/>❌ Failure = Restart after timeout"]
L_Timing --> L_Example["Example: Process<br/>deadlocked, needs restart"]
R_Timing --> R_Example["Example: Database down,<br/>can't serve traffic"]
S_Timing --> S_Example["Example: Still loading<br/>10GB cache, need time"]
The three probe types serve distinct purposes with different check depths and failure consequences. Liveness detects dead processes (restart), readiness validates traffic eligibility (remove from LB), and startup prevents premature restarts during initialization (wait longer).
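The decision tree maps directly onto Kubernetes probe configuration. A hedged sketch with illustrative port and threshold values (tune per service):

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3        # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3        # remove from endpoints, no restart
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 5
  failureThreshold: 24       # allow up to ~2 minutes of initialization
```

While the startup probe is failing (within its threshold), Kubernetes suppresses the liveness and readiness probes, which is what prevents premature restarts during slow initialization.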
Trade-offs
Check Frequency vs System Load: Frequent health checks (every 1-2 seconds) detect failures quickly, enabling sub-10-second recovery times, but generate significant load—1000 instances checking 5 dependencies every 2 seconds creates 500 requests/second against each dependency (2,500/second in total). Infrequent checks (every 30-60 seconds) minimize load but delay failure detection, potentially routing traffic to failed instances for a full minute. The decision framework: use frequent checks for user-facing services where every second of downtime matters (e-commerce checkout, payment processing), infrequent checks for batch processing systems where 60-second detection delays are acceptable, and adaptive intervals that increase frequency when degradation is detected.
Comprehensive vs Minimal Dependency Checks: Comprehensive checks validate every dependency (databases, caches, queues, downstream services), providing complete health pictures but taking 500ms+ and creating tight coupling—your health depends on every dependency’s health. Minimal checks validate only critical path dependencies, returning in under 100ms and maintaining independence but potentially missing issues. Choose comprehensive checks for services where any dependency failure prevents useful work (API gateways that need all backend services). Choose minimal checks for services with graceful degradation (recommendation engines that can serve cached results when real-time services fail). Uber’s approach: readiness checks validate only dependencies required for the next request, not all possible dependencies.
Fail-Fast vs Fail-Tolerant Health Checks: Fail-fast health checks return unhealthy immediately when any critical dependency fails, removing instances from rotation quickly but potentially causing cascading failures when dependencies experience transient issues. Fail-tolerant checks use circuit breakers and retries, marking instances unhealthy only after sustained failures, maintaining availability during transient issues but potentially serving degraded traffic longer. Use fail-fast for services where correctness matters more than availability (financial transactions, inventory updates). Use fail-tolerant for services where availability matters more than perfect correctness (content recommendations, social feeds). The key metric: if a dependency fails for 10 seconds, would you rather serve no traffic (fail-fast) or potentially degraded traffic (fail-tolerant)?
When to Use (and When Not To)
Implement health endpoint monitoring when you have multiple service instances behind load balancers or orchestrators that need automated traffic routing decisions. This is essential for any service running in Kubernetes, ECS, or behind AWS ALB/NLB where manual instance management is impractical. The pattern is critical when services have complex dependencies—if your service depends on databases, caches, and downstream APIs, health checks prevent routing traffic to instances where those dependencies have failed.
Use graduated health checks (healthy/degraded/unhealthy) when you can serve useful traffic even when some dependencies fail. A product catalog service might return cached data when the inventory service is down—mark it as degraded, not unhealthy. Use binary checks (healthy/unhealthy) for services that can’t function with any dependency failure, like payment processing that requires both the payment gateway and fraud detection service.
Anti-patterns to avoid:
- Don’t implement health checks that call other services’ health checks—this creates a dependency graph where one service’s failure cascades to mark all dependent services unhealthy, even if they could serve cached data.
- Don’t make health checks perform expensive operations like full database scans or cache warming—health checks should verify that dependencies can respond, not that they’re performing optimally.
- Don’t use the same endpoint for both liveness and readiness—if the shared endpoint fails because a dependency is down, the liveness probe fails too and Kubernetes restarts the pod, causing restart loops during dependency failures instead of simply removing the pod from rotation.
- Don’t ignore health check failures in development—if checks fail locally, they’ll fail in production, but you won’t notice until deployment.
Real-World Examples
company: LinkedIn
system: Service Infrastructure
implementation: LinkedIn implements a three-tier health check system across their microservices platform. Services expose /liveness (process health), /readiness (traffic eligibility), and /status (detailed diagnostics) endpoints. The readiness check validates database connection pools, Kafka producer health, and downstream service circuit breaker states with a 500ms timeout. They learned through production incidents that health checks must use separate connection pools from application traffic—during a database overload incident, health checks competed with application queries for connections, causing healthy instances to fail checks and get removed from rotation, which concentrated load on remaining instances and created a cascading failure. They now allocate dedicated connection pool capacity for health checks (10% of total pool size) and use exponential backoff when dependencies fail to prevent thundering herd problems. Their load balancers require 3 consecutive failures before removal but only 1 success for restoration, optimizing for availability during transient issues.
interesting_detail: LinkedIn discovered that health check failures during deployments caused unnecessary rollbacks. When new code versions took 30 seconds to warm caches, health checks failed and triggered automatic rollback before the service was truly ready. They introduced startup probes with 60-second timeouts and 5-second intervals specifically for deployment scenarios, separate from the 10-second readiness checks used during normal operation.
company: Netflix
system: Eureka Service Discovery
implementation: Netflix’s Eureka service registry implements a sophisticated health check aggregation system where services report their health status along with dependency health details. Each service runs local health checks every 30 seconds and pushes results to Eureka rather than waiting for Eureka to poll—this push model reduces load on Eureka and enables faster propagation of health changes. Services report not just binary health but a detailed status object including: overall health, individual dependency health (database, cache, downstream services), current load metrics (CPU, memory, request rate), and capability flags (can serve reads, can serve writes, can serve cached data). Eureka uses this rich health data to implement intelligent routing: when a service reports degraded database health but healthy cache, Eureka marks it as available for read-only traffic but unavailable for writes.
interesting_detail: Netflix’s health checks include a “self-preservation mode” that prevents cascading failures. If more than 15% of instances in a service fail health checks simultaneously, Eureka assumes the health checks themselves are broken (network partition, monitoring system failure) rather than all instances genuinely failing, and continues routing traffic to all instances. This prevents scenarios where a monitoring system failure causes Eureka to mark all instances unhealthy and stop all traffic.
company: Stripe
system: API Infrastructure
implementation: Stripe’s API services implement health checks with strict SLAs: liveness checks must return within 50ms, readiness checks within 200ms, or they’re considered failed. Each health check validates specific capabilities: the /health/ready endpoint checks that the service can connect to its primary database (with a 100ms timeout), that Redis cache responds (50ms timeout), and that the service’s internal queue has fewer than 1000 pending items (indicating it’s keeping up with load). They use a weighted health score system where critical dependencies (primary database) have weight 1.0 and optional dependencies (analytics pipeline) have weight 0.3. An instance is marked healthy only if the weighted sum of healthy dependencies exceeds 0.8. This allows services to remain in rotation with degraded optional dependencies while failing fast when critical dependencies fail.
interesting_detail: Stripe discovered that health checks during database failovers caused unnecessary instance churn. When their primary database failed over to a replica (taking 10-15 seconds), health checks failed and load balancers removed all instances from rotation, even though the instances would recover automatically once failover completed. They implemented a “grace period” where health checks return the last known healthy status for up to 30 seconds after a critical dependency fails, giving dependencies time to failover before marking instances unhealthy. This reduced unnecessary instance removals by 80% during database maintenance windows.
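The grace-period strategy from Stripe's example can be sketched as a thin wrapper over the dependency status: after a critical dependency starts failing, keep reporting the last known healthy status for up to `grace_s` seconds so failover can complete before instances are pulled from rotation. The injectable clock and the 30-second default are illustrative.

```python
import time

class GracePeriodHealth:
    def __init__(self, grace_s=30.0, clock=time.monotonic):
        self.grace_s = grace_s
        self.clock = clock
        self.failing_since = None  # when the current failure streak began

    def report(self, dependency_healthy: bool) -> int:
        if dependency_healthy:
            self.failing_since = None
            return 200
        if self.failing_since is None:
            self.failing_since = self.clock()
        if self.clock() - self.failing_since < self.grace_s:
            return 200   # within grace period: report last healthy status
        return 503       # sustained failure: mark unhealthy
```

A 10-15 second database failover then never surfaces as an unhealthy instance, while a genuinely dead dependency still flips the status after 30 seconds.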
Interview Essentials
Mid-Level
Explain the difference between liveness and readiness probes and when each should fail. Liveness checks verify the process is alive and should almost never fail—only when the process is deadlocked or corrupted. Readiness checks verify the service can handle traffic and should fail when dependencies are unavailable or the service is overloaded. Demonstrate understanding that failing liveness causes restart (expensive) while failing readiness only removes from load balancer (recoverable).
Describe how to implement a health check that validates database connectivity. Show you’d use a simple query like SELECT 1 with a strict timeout (100ms), use a dedicated connection pool for health checks separate from application traffic, and implement exponential backoff if checks fail to avoid thundering herd. Explain that you’d return 200 if the query succeeds within timeout, 503 if it fails or times out.
Walk through how load balancers use health checks to make routing decisions. Explain the polling model (LB queries endpoint every N seconds), failure threshold (require X consecutive failures before removal), and recovery threshold (require Y consecutive successes before restoration). Show you understand the trade-off between aggressive removal (fast failure detection but flapping risk) and conservative removal (stable but slow detection).
Senior
Design a health check system for a service with 10 dependencies of varying criticality. Demonstrate you’d categorize dependencies as critical (database, auth service) vs optional (analytics, recommendations), implement separate checks for each with appropriate timeouts, aggregate results with a policy (all critical must be healthy, optional failures only warn), and return detailed status in the response payload. Discuss how you’d use circuit breakers to fail fast on known-bad dependencies and prevent health check load from overwhelming failed dependencies.
Explain how you’d prevent health check failures from causing cascading failures during dependency outages. Show you’d implement circuit breakers that open after N failures and stop checking the dependency for M seconds, use separate connection pools for health checks, implement exponential backoff when dependencies fail, and consider a grace period where services remain healthy for 30-60 seconds after critical dependency failures to allow for failover. Discuss the trade-off between failing fast (remove bad instances quickly) and failing tolerant (maintain availability during transient issues).
Describe how you’d tune health check intervals and thresholds for a high-traffic service. Walk through the calculation: if you have 1000 instances checking 5 dependencies every 10 seconds, that’s 100 requests/second per dependency (500/second in total). Explain you’d use longer intervals (30-60 seconds) for deep checks, shorter intervals (5-10 seconds) for shallow checks, implement asynchronous health checks that cache results to avoid blocking, and monitor health check duration to detect dependency degradation early. Discuss how you’d adjust thresholds based on observed failure patterns—if transient network issues cause 1-second failures, require 3-5 consecutive failures before removal.
Staff+
Design a health check strategy for a multi-region, multi-tier system where services depend on both regional and global dependencies. Discuss how you’d implement separate health checks for regional dependencies (regional database, cache) vs global dependencies (global configuration service, cross-region replication), use different failure policies (fail fast on regional, fail tolerant on global), and coordinate health status across regions to prevent simultaneous region-wide failures. Explain how you’d handle scenarios where a global dependency fails—should all regions mark themselves unhealthy, or should they serve degraded traffic using cached global state?
Explain how you’d design health checks to support gradual rollouts and canary deployments. Describe implementing version-aware health checks that report the deployed version, capability flags that indicate which features are enabled, and load metrics that enable intelligent traffic shifting. Discuss how you’d use health check data to automatically roll back bad deployments (if error rate exceeds threshold within 5 minutes, mark canary unhealthy and shift traffic back), gradually increase canary traffic (shift 1% → 5% → 25% → 100% as health remains good), and implement safeguards against bad health checks causing unnecessary rollbacks.
Discuss the architectural trade-offs between push-based health reporting (services push status to registry) vs pull-based health checking (registry polls services). Analyze that push-based scales better (registry doesn’t need to poll thousands of instances), propagates changes faster (services report immediately when health changes), but requires services to maintain connections to the registry and handle registry failures gracefully. Pull-based is simpler (services just expose HTTP endpoint), more reliable (registry failure doesn’t affect service operation), but generates more load and has slower propagation. Explain when you’d choose each approach and how you’d implement hybrid systems that use push for fast propagation with pull as fallback.
Common Interview Questions
How do you prevent health checks from overwhelming dependencies during outages? Use circuit breakers that stop checking failed dependencies, implement exponential backoff (1s → 2s → 4s → 8s between checks), use separate connection pools for health checks, and implement a grace period where services remain healthy briefly after dependency failures to allow for failover.
What should a health check return when a non-critical dependency fails? Return 200 (healthy) with a warning in the response payload indicating degraded functionality. The service can still handle traffic, just with reduced capabilities. Only return 503 (unhealthy) when critical dependencies fail and the service genuinely can’t serve useful traffic.
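The 200-with-warning vs 503 convention can be captured in a tiny response builder. This is a hypothetical helper (`build_health_response` is not a standard API), assuming the JSON shape described in this guide:

```python
import json


def build_health_response(critical_ok: bool, warnings: list) -> tuple:
    """Map dependency results to (http_status, json_body).

    Critical failure -> 503 unhealthy; optional failures -> 200 degraded.
    """
    if not critical_ok:
        return 503, json.dumps({"status": "unhealthy", "warnings": warnings})
    status = "degraded" if warnings else "healthy"
    return 200, json.dumps({"status": status, "warnings": warnings})
```

The load balancer keys off the status code alone; the JSON body with warnings is for humans and dashboards.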
How do you handle health checks during deployments when services are warming up? Implement startup probes separate from readiness probes with longer timeouts (60+ seconds) and longer intervals (5-10 seconds). Configure orchestrators to use startup probes during initial deployment and switch to readiness probes once startup completes. This prevents premature restarts of slow-starting services.
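In Kubernetes-style probe configuration, the startup allowance is the probe interval multiplied by the allowed failure count, so a "60+ second" budget is expressed as, say, a 5-second period with 12 failures allowed:

```python
def startup_budget(period_seconds: int, failure_threshold: int) -> int:
    """Max seconds a container may take to start before the orchestrator
    restarts it (Kubernetes semantics: periodSeconds * failureThreshold)."""
    return period_seconds * failure_threshold
```

While the startup probe is failing within this budget, liveness and readiness probes are held off, which is what prevents premature restarts of slow-starting services.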
Should health checks validate downstream service health? No, health checks should only validate that you can reach downstream services, not that they’re healthy. If you check downstream health, one service failure cascades to mark all dependent services unhealthy, even if they could serve cached data. Use circuit breakers to track downstream health separately from your own health status.
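The distinction above is concrete in code: a reachability check opens a connection and stops there, rather than calling the dependency’s /health endpoint. A minimal sketch using a plain TCP connect (a real service would reuse its client library’s connectivity check instead):

```python
import socket


def dependency_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Verify only that we can reach the dependency -- deliberately NOT
    querying its health endpoint, so its internal problems don't cascade
    into our own health status."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```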
Red Flags to Avoid
Implementing health checks that perform expensive operations like full database scans, cache warming, or complex business logic. Health checks should be fast (under 500ms) and lightweight, verifying that dependencies can respond, not that they’re performing optimally.
Using the same endpoint for liveness and readiness checks. This causes orchestrators to restart pods when dependencies fail (readiness issue) rather than just removing them from load balancer rotation. Always implement separate endpoints with different semantics.
Not implementing timeouts on health checks or using timeouts longer than the check interval. This causes health check accumulation where slow checks pile up and eventually exhaust resources. Always use timeouts shorter than the interval (e.g., 2s timeout with 5s interval).
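One way to enforce a hard deadline shorter than the probe interval is to run each check in a worker pool and treat a timeout as failure, so slow checks can’t accumulate. A sketch (names are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CheckTimeout

_pool = ThreadPoolExecutor(max_workers=2)   # bounded: slow checks can't pile up


def run_check(check_fn, timeout: float = 2.0) -> bool:
    """Run a dependency check with a hard deadline; a timeout counts as
    failure rather than blocking the next scheduled check."""
    future = _pool.submit(check_fn)
    try:
        return bool(future.result(timeout=timeout))
    except CheckTimeout:
        return False
```

The 2-second default pairs with a 5-second interval, per the rule of thumb above.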
Ignoring health check failures in development or marking them as “flaky tests” to ignore. Health checks that fail locally will fail in production, but you won’t notice until deployment causes incidents. Fix health check failures immediately or remove the checks if they’re not meaningful.
Implementing health checks that call other services’ health check endpoints. This creates a dependency graph where failures cascade and makes it impossible to determine which service actually failed. Health checks should only validate your own service’s ability to reach dependencies, not the dependencies’ health.
Key Takeaways
Health endpoint monitoring enables automated failure detection and traffic routing by exposing standardized HTTP endpoints that return structured health status. Implement three distinct endpoints: liveness (process alive?), readiness (can accept traffic?), and startup (finished initialization?) with different check depths and failure semantics.
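A minimal sketch of the three endpoints using only the standard library. The `state` flags are illustrative stand-ins; a real service would derive them from actual initialization and dependency status:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative process state; real services compute this from live signals.
state = {"started": False, "ready": False}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            ok = True                               # serving at all means the process is alive
        elif self.path == "/health/startup":
            ok = state["started"]                   # initialization finished?
        elif self.path == "/health/ready":
            ok = state["started"] and state["ready"]  # eligible for traffic?
        else:
            self.send_error(404)
            return
        body = json.dumps({"status": "healthy" if ok else "unhealthy"}).encode()
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):                   # keep probe traffic out of logs
        pass
```

Note the asymmetry: liveness succeeds whenever the process can respond, while readiness additionally requires startup completion and traffic eligibility.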
Aggregate dependency health with strict timeouts and circuit breakers to prevent health checks from overwhelming failed dependencies. Categorize dependencies as critical (all must be healthy) vs optional (failures only warn), use separate connection pools for health checks, and implement exponential backoff when dependencies fail.
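The critical-vs-optional aggregation rule can be folded into one small function (a sketch with hypothetical names, assuming per-dependency boolean results are already collected):

```python
def aggregate_health(results: dict, critical: set) -> tuple:
    """Fold per-dependency results into one status.

    Any critical failure -> "unhealthy"; optional failures -> "degraded";
    otherwise "healthy". Returns (status, failed_dependency_names).
    """
    failed = [name for name, ok in results.items() if not ok]
    if any(name in critical for name in failed):
        return "unhealthy", failed
    return ("degraded" if failed else "healthy"), failed
```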
Configure load balancer integration with appropriate failure thresholds (3-5 consecutive failures before removal) and intervals (5-10 seconds for readiness, 30-60 seconds for deep checks) to balance fast failure detection against flapping risk. Monitor health check duration and failure rates to detect dependency degradation early.
Implement graceful degradation where services stop accepting new requests but finish in-flight work when health checks fail. Return 503 immediately for new requests while existing requests complete, preventing the half-dead instance problem where services fail checks but continue processing requests poorly.
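The drain-on-shutdown behavior described above can be sketched as a small gate: once draining begins, the readiness check fails and new requests are rejected, while in-flight work is still counted down to completion (class and method names are illustrative):

```python
class DrainingGate:
    """Graceful-degradation sketch: fail readiness and reject new work
    during drain, but let in-flight requests finish."""

    def __init__(self):
        self.draining = False
        self.in_flight = 0

    def readiness_status(self) -> int:
        return 503 if self.draining else 200   # LB removes us once draining

    def try_accept(self) -> bool:
        if self.draining:
            return False                        # reject new requests immediately
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1                     # an in-flight request completed

    def begin_drain(self):
        self.draining = True

    def drained(self) -> bool:
        return self.draining and self.in_flight == 0
```

Shutdown then waits for `drained()` before terminating the process, avoiding the half-dead state where an instance fails checks yet keeps limping through requests.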
Use graduated health status (healthy/degraded/unhealthy) for services that can serve useful traffic with failed optional dependencies, and binary status (healthy/unhealthy) for services that require all dependencies. Never check downstream services’ health endpoints directly—validate only that you can reach them, not that they’re healthy.