Monitoring in System Design: Metrics, Logs & Traces
After this topic, you will be able to:
- Differentiate between monitoring, observability, and telemetry
- Analyze the relationship between metrics, logs, and traces
- Identify appropriate monitoring strategies for different system characteristics
- Evaluate the trade-offs between monitoring granularity and cost
TL;DR
Monitoring is the practice of collecting, aggregating, and analyzing data about system behavior to detect problems, understand performance, and inform operational decisions. It differs from observability (understanding internal state from external outputs) and relies on three pillars: metrics (numeric measurements over time), logs (discrete event records), and traces (request flow through distributed systems). Effective monitoring requires balancing coverage, granularity, and cost while aligning with business SLAs.
Cheat Sheet: Monitoring = proactive data collection for known failure modes. Observability = debugging unknown problems. Use metrics for trends, logs for context, traces for distributed debugging. Monitor health (is it up?), availability (can users access it?), performance (is it fast enough?), and business metrics (are we making money?).
Why This Matters
When Netflix streams video to 230 million subscribers across 190 countries, a single service degradation can cost millions in revenue and customer trust. Without monitoring, engineers fly blind—they discover outages from angry tweets instead of automated alerts, spend hours debugging issues that could be diagnosed in minutes with proper telemetry, and make capacity decisions based on gut feeling rather than data. In system design interviews, monitoring separates candidates who design systems from those who design systems that actually run in production. Interviewers want to see that you understand the operational reality: systems fail, performance degrades, and the only way to maintain reliability is through comprehensive, intelligent monitoring.
The stakes are concrete. Google’s SRE team calculates that every additional nine of availability (99.9% to 99.99%) requires exponentially more investment in monitoring and automation. Stripe’s payment processing system monitors over 1,000 metrics per service because a single missed anomaly could mean fraudulent transactions or failed payments. Uber’s ride-matching system uses distributed tracing to debug latency issues that span 50+ microservices. These aren’t academic exercises—monitoring is the foundation of reliability engineering, incident response, capacity planning, and ultimately, business success. Understanding monitoring architecture demonstrates that you think about systems holistically, not just the happy path.
The Landscape
The monitoring landscape has evolved dramatically from simple uptime checks to sophisticated observability platforms. Traditional monitoring focused on infrastructure metrics—CPU, memory, disk—using tools like Nagios and Cacti. This worked when applications were monolithic and failure modes were predictable. Modern distributed systems changed everything. When a single user request touches 20 microservices across 3 cloud regions, understanding system behavior requires correlating metrics, logs, and traces across all components.
Today’s ecosystem divides into several layers. Time-series databases like Prometheus, InfluxDB, and Datadog store metrics with high cardinality and query performance. Log aggregation platforms like Elasticsearch, Splunk, and Loki collect and index billions of log lines daily. Distributed tracing systems like Jaeger, Zipkin, and AWS X-Ray track requests across service boundaries. Observability platforms like Honeycomb and Lightstep combine all three pillars with advanced analytics. Cloud providers offer integrated solutions—AWS CloudWatch, Google Cloud Monitoring, Azure Monitor—that reduce operational overhead but may lack flexibility.
The philosophical divide between monitoring and observability shapes tool selection. Monitoring answers known questions: “Is the database responding?” “Is CPU above 80%?” You define metrics upfront based on anticipated failure modes. Observability answers unknown questions: “Why is this specific user’s checkout failing?” “What changed between 2 PM and 2:15 PM?” It requires high-cardinality data and exploratory analysis. In practice, production systems need both. You monitor known failure patterns with alerts and dashboards, then use observability tools to debug novel issues that inevitably arise.
Key Areas
Health Monitoring tracks whether services are alive and responding. This includes heartbeat checks, health endpoints, and process monitoring. Health monitoring is your first line of defense—it detects complete failures like crashed processes or network partitions. Twitter’s fail whale appeared when health checks didn’t catch cascading failures fast enough. Modern health monitoring goes beyond simple ping checks to include dependency health (is the database reachable?), resource exhaustion (are we out of file descriptors?), and startup/shutdown lifecycle events. The key insight: a service can be “up” but unhealthy if its dependencies are degraded.
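A dependency-aware health check can be sketched in a few lines. This is a minimal illustration of the idea above (a service can be "up" but unhealthy if a dependency is degraded); the hostnames, ports, and the "healthy"/"degraded" status vocabulary are assumptions for the example, not from any specific platform:

```python
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the dependency succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def health_report(deps: dict) -> dict:
    """Aggregate dependency checks into a single health payload.

    The service reports 'degraded' (not 'healthy') when its own process
    is running but a dependency check fails -- being up is not enough.
    """
    results = {name: check() for name, check in deps.items()}
    status = "healthy" if all(results.values()) else "degraded"
    return {"status": status, "checks": results, "ts": time.time()}
```

A health endpoint would then serve something like `health_report({"database": lambda: check_tcp("db.internal", 5432)})`, letting the load balancer distinguish a crashed process from one whose database is unreachable.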
Performance Monitoring measures how fast your system responds and where time is spent. This includes latency percentiles (p50, p95, p99), throughput (requests per second), and resource utilization. Netflix monitors video start time at multiple percentiles because a slow p99 means some users wait 10+ seconds while the average looks fine. Performance monitoring reveals bottlenecks, capacity limits, and degradation before users notice. It answers: “Are we meeting our SLAs?” and “Where should we optimize next?” The challenge is balancing granularity (per-endpoint metrics) with cardinality explosion (millions of unique metric combinations).
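The point about p99 hiding behind a healthy-looking average is easy to demonstrate with synthetic data (the latency distribution below is invented for illustration; production systems compute percentiles from histograms rather than raw samples, precisely because storing every observation does not scale):

```python
import random
import statistics

# Toy latency samples (ms): mostly fast requests plus a slow tail.
random.seed(42)
samples = [random.expovariate(1 / 120) for _ in range(10_000)]
samples += [random.uniform(2_000, 10_000) for _ in range(100)]  # slow 1%

# statistics.quantiles with n=100 yields the 1st..99th percentiles.
q = statistics.quantiles(samples, n=100)
p50, p95, p99 = q[49], q[94], q[98]

# The mean looks fine while p99 users wait seconds.
print(f"mean={statistics.mean(samples):.0f}ms "
      f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

Running this shows the mean sitting far below p99: exactly the Netflix scenario where the average looks fine while the slowest users wait many seconds.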
Availability Monitoring measures whether users can successfully complete critical workflows. This differs from health monitoring—a service can be healthy but unavailable if it’s returning errors, throttling requests, or serving stale data. Amazon monitors checkout completion rate, not just API uptime, because a 200 OK response doesn’t mean the order succeeded. Availability monitoring tracks error rates, success rates, and business-critical transactions. It connects technical metrics to business impact. The key question: “Can users do what they came here to do?”
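The "200 OK doesn't mean the order succeeded" distinction translates directly into how you compute availability. A rough sketch (the `order_state` field and its values are hypothetical, not from any real API):

```python
def order_succeeded(status_code: int, body: dict) -> bool:
    """Business success requires inspecting the response body, not
    just the HTTP status: a 200 can still carry a failed order."""
    return status_code == 200 and body.get("order_state") == "confirmed"

def availability(outcomes) -> float:
    """Workflow-level availability: fraction of checkouts that
    actually completed, which is what the user experiences."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Measured this way, a service returning `200 OK` with `{"order_state": "payment_failed"}` counts against availability, even though uptime-based monitoring would report it as perfectly healthy.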
Security Monitoring detects anomalous behavior that might indicate attacks, breaches, or abuse. This includes authentication failures, unusual access patterns, privilege escalations, and data exfiltration attempts. Stripe monitors API key usage patterns to detect compromised keys before they cause damage. Security monitoring overlaps with traditional security information and event management (SIEM) but focuses on application-level threats. It requires baseline understanding of normal behavior to identify anomalies—a sudden spike in database queries might be an attack or a legitimate marketing campaign.
Business Metrics Monitoring tracks outcomes that matter to the company: revenue, conversions, user engagement, feature adoption. Spotify monitors daily active users, stream starts, and subscription conversions alongside technical metrics. When technical metrics look healthy but business metrics drop, you’ve found a user-impacting issue that traditional monitoring missed. This bridges the gap between engineering and product teams, ensuring technical investments align with business goals. The insight: uptime is meaningless if users aren’t successfully using your product.
Monitoring Layers: From Infrastructure to Business Impact
graph TB
subgraph Business Layer
B1["Revenue per Hour<br/><i>$50K/hr target</i>"]
B2["Checkout Completion<br/><i>98.5% success</i>"]
B3["User Engagement<br/><i>Daily Active Users</i>"]
end
subgraph Application Layer
A1["Availability<br/><i>Can users complete workflows?</i>"]
A2["Performance<br/><i>P99 < 500ms</i>"]
A3["Security<br/><i>Failed auth attempts</i>"]
end
subgraph Service Layer
S1["Health Checks<br/><i>Service alive?</i>"]
S2["Error Rates<br/><i>4xx/5xx responses</i>"]
S3["Throughput<br/><i>Requests/sec</i>"]
end
subgraph Infrastructure Layer
I1["CPU Usage<br/><i>< 80%</i>"]
I2["Memory<br/><i>Available RAM</i>"]
I3["Network<br/><i>Bandwidth/Latency</i>"]
end
I1 & I2 & I3 --"Supports"--> S1 & S2 & S3
S1 & S2 & S3 --"Enables"--> A1 & A2 & A3
A1 & A2 & A3 --"Drives"--> B1 & B2 & B3
Alert["Alert Priority"]
B2 --"Critical: Revenue Impact"--> Alert
A1 --"High: User-Facing"--> Alert
S1 --"Medium: Service Issue"--> Alert
I1 --"Low: May Cause Future Issues"--> Alert
Monitoring operates in layers, from infrastructure metrics to business outcomes. Each layer builds on the one below. Alert priority should reflect business impact—a service can be healthy (infrastructure) but unavailable (application) if it’s returning errors. Modern monitoring connects technical metrics to business KPIs.
How Things Connect
Monitoring types aren’t independent—they form a layered defense system where each layer informs the others. Health monitoring is foundational: if a service is down, performance and availability metrics become meaningless. Performance monitoring builds on health: a healthy service might still be slow, degrading user experience. Availability monitoring synthesizes both: a service can be healthy and fast but still unavailable if it’s returning errors or rejecting requests due to rate limiting.
The three pillars connect through correlation. When a performance metric (p99 latency) spikes, you pivot to logs to find error messages, then to traces to identify which service in the call chain is slow. Modern observability platforms automate this correlation—Honeycomb lets you click from a metric spike directly to example traces. Without correlation, you’re context-switching between tools, losing precious minutes during incidents.
Monitoring feeds into capacity planning and architecture decisions. If performance monitoring shows database CPU consistently above 70%, you need to scale vertically, add read replicas, or introduce caching. If availability monitoring reveals that 5% of requests fail during deployments, you need better deployment strategies like canary releases or circuit breakers. Security monitoring might reveal that certain API endpoints are being abused, prompting rate limiting or authentication changes. Business metrics monitoring might show that a technically successful feature has low adoption, questioning whether engineering effort was well-spent.
The feedback loop is continuous: monitoring reveals problems, you fix them, then adjust monitoring to catch similar issues earlier. Netflix’s Chaos Engineering practice deliberately injects failures while monitoring system response, improving both system resilience and monitoring coverage. This iterative refinement is how monitoring evolves from basic uptime checks to sophisticated anomaly detection.
Incident Investigation Flow: Correlating Monitoring Signals
sequenceDiagram
participant Alert as Alert System
participant Metrics as Metrics DB<br/>(Prometheus)
participant Logs as Log Aggregator<br/>(Elasticsearch)
participant Traces as Trace System<br/>(Jaeger)
participant Engineer as On-Call Engineer
Note over Alert,Traces: 14:32:00 - P99 Latency Spike Detected
Alert->>Engineer: 1. Alert: P99 > 2000ms (SLO breach)
Engineer->>Metrics: 2. Query: Show latency by endpoint
Metrics-->>Engineer: /api/checkout: 2500ms (normal: 200ms)
Engineer->>Metrics: 3. Query: Error rate for /api/checkout
Metrics-->>Engineer: 15% errors (normal: 0.5%)
Engineer->>Logs: 4. Search: errors in last 5 min
Logs-->>Engineer: "Database connection timeout"<br/>"Retry exhausted after 3 attempts"
Engineer->>Traces: 5. Query: Sample traces for /api/checkout
Traces-->>Engineer: Trace abc123:<br/>API Gateway: 10ms<br/>Payment Service: 50ms<br/>Database: 2400ms (timeout)
Note over Engineer: Root Cause: Database connection pool exhausted
Engineer->>Metrics: 6. Query: DB connection pool metrics
Metrics-->>Engineer: Active: 100/100 (maxed out)<br/>Wait time: 2000ms
Note over Engineer: Resolution: Scale DB connection pool<br/>Add connection pool monitoring
During incidents, engineers correlate signals across monitoring pillars. The flow starts with a metric alert, uses logs to find error messages, examines traces to identify bottlenecks, then returns to metrics for confirmation. This correlation reduces mean time to resolution (MTTR) from hours to minutes.
Real-World Context
At Netflix, the monitoring infrastructure processes 2 billion metrics per minute using Atlas, their custom time-series database. They monitor everything from video bitrate and buffering events to microservice latency and AWS instance health. When a CDN degrades in Europe, automated systems detect increased buffering rates within seconds, reroute traffic to healthy CDNs, and alert on-call engineers—all before most users notice. Their monitoring philosophy: measure everything, alert on business impact, automate remediation where possible.
Uber’s observability platform handles 100 million traces per minute across 4,000+ microservices. When a rider reports “my driver location isn’t updating,” engineers query traces by rider ID to see the exact request path, identifying that a specific geolocation service was timing out. Without distributed tracing, debugging this would require manually checking logs across dozens of services. Uber’s investment in observability directly reduces mean time to resolution (MTTR) from hours to minutes.
Stripe monitors payment success rates with 99.99% accuracy requirements. Their monitoring distinguishes between different failure modes: network timeouts, bank declines, fraud blocks, and system errors. Each has different remediation strategies. They use anomaly detection to catch subtle degradations—if Visa approval rates drop 0.5%, that’s millions in lost revenue. Their monitoring includes synthetic transactions that continuously test critical payment flows, catching issues before real customers are affected.
Google’s SRE teams pioneered the concept of Service Level Objectives (SLOs) backed by monitoring. Instead of alerting on every metric threshold breach, they alert when error budgets are being consumed too quickly. If a service has a 99.9% availability SLO (43 minutes downtime per month), monitoring tracks actual availability against this budget. This aligns engineering effort with business requirements and prevents alert fatigue from non-critical issues.
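The error-budget arithmetic described above is worth making concrete. For a 99.9% monthly SLO, the budget is the allowed downtime, and alerting keys off how fast that budget is being consumed rather than individual threshold breaches (a minimal sketch; the 30-day month is an assumption):

```python
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# The budget: total downtime the SLO permits per month (~43.2 min).
budget_minutes = (1 - SLO) * MINUTES_PER_MONTH

def burn_rate(error_ratio: float) -> float:
    """How many times faster than sustainable the budget is burning.
    A burn rate of 1.0 exhausts the budget exactly at month's end."""
    return error_ratio / (1 - SLO)

# Example: a sustained 1% error rate burns the budget 10x too fast,
# so the entire month's budget would be gone in about 3 days.
print(round(budget_minutes, 1), burn_rate(0.01))
```

Alerting on burn rate is what prevents both missed incidents (slow leaks that never cross a threshold) and alert fatigue (brief blips that consume negligible budget).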
These companies share common patterns: comprehensive instrumentation (measure everything), intelligent alerting (reduce noise), automated response (fix common issues without human intervention), and continuous refinement (monitoring evolves with the system). They also invest heavily—Uber’s observability team has 50+ engineers, and infrastructure costs run into millions annually. The ROI is clear: faster incident response, better capacity planning, and higher reliability.
Netflix Monitoring Architecture: 2 Billion Metrics Per Minute
graph LR
subgraph Client Layer
Mobile["Mobile Apps<br/><i>iOS/Android</i>"]
Web["Web Players<br/><i>Browser</i>"]
TV["Smart TVs<br/><i>Streaming Devices</i>"]
end
subgraph Microservices - 1000+ Services
API["API Gateway"]
Video["Video Service"]
Rec["Recommendation"]
CDN["CDN Router"]
end
subgraph Monitoring Infrastructure
Atlas[("Atlas<br/><i>Time-Series DB</i><br/>2B metrics/min")]
Mantis["Mantis<br/><i>Stream Processing</i>"]
Alerts["Alert Manager<br/><i>Anomaly Detection</i>"]
end
subgraph Observability Platform
Dash["Dashboards<br/><i>Real-time Visualization</i>"]
Auto["Auto-Remediation<br/><i>Traffic Rerouting</i>"]
Oncall["On-Call Engineers<br/><i>Incident Response</i>"]
end
Mobile & Web & TV --"1. Playback metrics<br/>(buffering, bitrate)"--> API
API & Video & Rec & CDN --"2. Service metrics<br/>(latency, errors)"--> Mantis
Mantis --"3. Aggregate & store"--> Atlas
Atlas --"4. Query metrics"--> Alerts
Alerts --"5a. Critical: CDN degraded"--> Auto
Alerts --"5b. Warning: High latency"--> Oncall
Auto --"6. Reroute traffic"--> CDN
Atlas --"7. Visualize trends"--> Dash
Dash -."Monitor impact".-> Oncall
Netflix’s monitoring processes 2 billion metrics per minute from 1,000+ microservices and millions of client devices. Atlas (custom time-series DB) stores metrics, Mantis performs real-time stream processing, and automated systems detect anomalies and reroute traffic before users notice degradation. This architecture enables sub-minute incident detection and response.
Interview Essentials
Mid-Level
Mid-level candidates should articulate the difference between monitoring and observability, explain the three pillars (metrics, logs, traces) with examples, and describe basic monitoring strategies for a given system. When designing a URL shortener, you should mention monitoring redirect latency (performance), tracking 404 rates (availability), and logging suspicious access patterns (security). Demonstrate awareness that monitoring isn’t an afterthought—it’s part of the initial design. Explain how you’d set up health checks for services and what metrics you’d track. Show understanding that different metrics have different cardinality and cost implications. Common mistake: treating monitoring as a checkbox (“we’ll use Prometheus”) without explaining what you’d monitor or why.
Senior
Senior candidates should design comprehensive monitoring strategies that balance coverage, cost, and operational burden. For a payment processing system, explain how you’d monitor transaction success rates, latency percentiles, fraud detection accuracy, and reconciliation completeness. Discuss trade-offs: high-cardinality metrics (per-merchant, per-card-type) provide better debugging but increase storage costs exponentially. Describe how monitoring informs architecture decisions—if logs show 80% of errors come from a specific dependency, you need circuit breakers or fallbacks. Explain alert design: why alert on symptoms (users can’t check out) rather than causes (database CPU high). Demonstrate understanding of monitoring in distributed systems: how do you correlate metrics across 20 microservices? How do you handle clock skew in distributed tracing? Show awareness of sampling strategies for high-volume systems.
Monitoring Trade-offs: Cardinality vs. Cost vs. Debugging Capability
graph TB
Start["Design Monitoring Strategy"]-->Decision1{"High Traffic System?<br/>(>10K req/sec)"}
Decision1 -->|Yes| Decision2{"Need Per-User<br/>Debugging?"}
Decision1 -->|No| LowCard["Low Cardinality<br/>Track aggregates only<br/>Cost: $"]
Decision2 -->|Yes| Sample["High Cardinality + Sampling<br/>Store 1% of traces<br/>All metrics aggregated<br/>Cost: $$$"]
Decision2 -->|No| MedCard["Medium Cardinality<br/>Per-endpoint metrics<br/>Sampled logs<br/>Cost: $$"]
Sample --> Example1["Example: Stripe<br/>Track per-merchant success rate<br/>Sample 1% of transactions<br/>Full traces for errors"]
MedCard --> Example2["Example: E-commerce<br/>Track per-endpoint latency<br/>Log errors only<br/>Sample 10% of traces"]
LowCard --> Example3["Example: Internal Tool<br/>Overall success rate<br/>Error logs only<br/>No distributed tracing"]
Example1 --> Cost1["Storage: 500TB/month<br/>Query: Complex<br/>Retention: 30 days<br/>Bill: $50K/month"]
Example2 --> Cost2["Storage: 50TB/month<br/>Query: Moderate<br/>Retention: 90 days<br/>Bill: $5K/month"]
Example3 --> Cost3["Storage: 5TB/month<br/>Query: Simple<br/>Retention: 1 year<br/>Bill: $500/month"]
Senior engineers balance monitoring granularity with cost constraints. High-cardinality metrics (per-user, per-merchant) enable better debugging but increase storage costs exponentially. The solution: strategic sampling, aggregation, and retention policies. A payment system might track all merchant-level metrics but sample only 1% of individual transaction traces.
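The "keep all error traces, sample 1% of the rest" strategy is straightforward to sketch. One detail matters in distributed systems: the sampling decision should be deterministic per trace ID, so every service that sees the same trace makes the same keep/drop choice (the rate and hashing scheme here are illustrative assumptions):

```python
import zlib

SAMPLE_RATE = 0.01  # keep 1% of successful traces

def keep_trace(trace_id: str, had_error: bool) -> bool:
    """Keep every trace containing an error; for the rest, hash the
    trace ID into buckets so the 1% decision is consistent across all
    services participating in the same trace."""
    if had_error:
        return True  # full traces for all failures
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < SAMPLE_RATE * 10_000
```

Hashing rather than random sampling is the key design choice: a random coin flip per service would produce fragmented traces where some spans were kept and others dropped.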
Staff+
Staff+ candidates should be able to architect monitoring platforms end to end and articulate organizational monitoring strategy. Discuss building vs. buying: when does it make sense to build custom monitoring infrastructure like Netflix’s Atlas vs. using commercial solutions? Explain how monitoring strategy evolves with company growth—what works at 10 services doesn’t work at 1,000. Describe monitoring for monitoring: how do you ensure your observability platform is reliable? Discuss cost optimization: Datadog bills can reach millions annually; how do you control cardinality explosion while maintaining debugging capability? Explain monitoring’s role in organizational practices: SLO-based alerting, error budgets, on-call rotation design, incident response runbooks. Describe advanced techniques: anomaly detection using machine learning, automated root cause analysis, predictive alerting before SLO violations. Show understanding of monitoring’s business impact: how do you justify monitoring infrastructure investment to executives? How do you measure monitoring effectiveness?
Common Interview Questions
- “How would you monitor this system?” This appears in nearly every system design interview; the interviewer wants to see that you think about operational concerns, not just happy-path functionality.
- “What metrics would you track?” Tests whether you understand the difference between vanity metrics and actionable metrics.
- “How would you debug a latency spike?” Evaluates your ability to use monitoring tools for troubleshooting.
- “How do you prevent alert fatigue?” Assesses your understanding of intelligent alerting strategies.
- “What’s the difference between monitoring and observability?” Checks conceptual understanding.
- “How would you handle monitoring at scale?” For senior+ roles, this tests your awareness of cardinality, sampling, and cost management.

Red Flags to Avoid
Treating monitoring as an afterthought (“we’ll add monitoring later”) shows lack of production experience. Suggesting that you “monitor everything” without discussing cost or cardinality constraints reveals naivety about scale. Confusing metrics, logs, and traces or not knowing when to use each suggests surface-level understanding. Proposing alerts on every metric threshold without considering alert fatigue demonstrates poor operational judgment. Not connecting monitoring to business outcomes (SLAs, user experience) shows inability to think beyond technical metrics. Ignoring distributed systems challenges (clock skew, trace correlation, eventual consistency) in a microservices design reveals gaps in real-world experience. Focusing only on infrastructure metrics (CPU, memory) without application-level metrics (error rates, latency) misses the point of modern observability.
Key Takeaways
Monitoring is proactive data collection for known failure modes; observability is debugging unknown problems through high-cardinality data and exploratory analysis. Production systems need both: monitoring for automated alerting, observability for incident investigation.
The three pillars—metrics (aggregated numbers), logs (discrete events), traces (request flows)—provide complementary views. Use metrics for trends and alerts, logs for debugging context, traces for distributed system latency analysis. Effective monitoring correlates all three.
Monitor at multiple layers: health (is it alive?), performance (is it fast?), availability (can users succeed?), security (are we under attack?), and business metrics (are we achieving goals?). Each layer informs the others and connects technical metrics to business impact.
Balance monitoring granularity with cost: high-cardinality metrics enable better debugging but increase storage costs exponentially. Use sampling, aggregation, and retention policies to control costs while maintaining debugging capability. Not everything needs millisecond resolution forever.
Design monitoring for your operational model: alert on symptoms (user-facing issues) not causes (infrastructure metrics), use SLOs and error budgets to align with business requirements, and automate remediation for common failures. Monitoring should reduce MTTR and prevent alert fatigue, not create more operational burden.
Three Pillars
The “three pillars of observability”—metrics, logs, and traces—provide complementary views into system behavior. Understanding when to use each pillar is crucial for effective monitoring strategy.
Metrics are numeric measurements aggregated over time: request count, error rate, CPU percentage, queue depth. They’re cheap to collect and store because they compress information—instead of recording every request, you track “500 requests in the last minute.” Metrics excel at showing trends, triggering alerts, and answering aggregate questions. Prometheus, the industry standard for metrics, can handle millions of time series. The limitation: metrics lose individual request context. You know error rate increased, but not which specific requests failed or why. Metrics answer “what” and “when” but rarely “why.”
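The compression property described above—"500 requests in the last minute" instead of 500 individual records—is the essence of how metrics clients work. A toy counter in that spirit (this is a hand-rolled sketch, not the real Prometheus client API):

```python
from collections import defaultdict

class Counter:
    """A minimal labeled counter: many events collapse into one
    number per label set, which is why metrics are cheap to store
    but lose individual-request context."""

    def __init__(self):
        self._values = defaultdict(float)

    def inc(self, labels: tuple = (), amount: float = 1.0):
        self._values[labels] += amount

    def value(self, labels: tuple = ()) -> float:
        return self._values[labels]

requests = Counter()
for _ in range(500):
    requests.inc(labels=("GET", "/api/checkout"))

# 500 requests compressed into a single number for this label set.
print(requests.value(("GET", "/api/checkout")))
```

Note the cardinality trap hiding in `labels`: every distinct label combination is a separate time series, which is exactly how per-user or per-merchant labels blow up storage costs.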
Logs are discrete event records with structured or unstructured data: “User 12345 failed login at 2024-01-15 14:32:01 from IP 192.168.1.1 with error ‘invalid password’.” Logs provide context that metrics lack—you can search for specific users, errors, or patterns. The challenge is volume and cost. A busy service might generate terabytes of logs daily. Elasticsearch clusters at Uber store petabytes of logs, requiring sophisticated retention policies and sampling strategies. Logs answer “why” and “who” but are expensive to query at scale. Modern structured logging (JSON format) enables better searching and correlation than traditional text logs.
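Structured logging in JSON, as mentioned above, can be done with the standard library alone—each record becomes one machine-parseable line that aggregators can filter by field instead of grepping free text. A minimal sketch (the `fields` attribute convention is an assumption of this example, not a stdlib standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, merging in any
    structured fields attached via the `extra` mechanism."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Emits one JSON line with searchable user_id and error fields.
logger.warning("login failed",
               extra={"fields": {"user_id": 12345,
                                 "error": "invalid password"}})
```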
Traces track individual requests as they flow through distributed systems. When a user loads their Twitter timeline, that request might touch 15 microservices. Distributed tracing instruments each service to record timing, errors, and metadata, then stitches these spans into a complete trace. This reveals bottlenecks (“90% of latency is in the recommendation service”), cascading failures, and inter-service dependencies. Google’s Dapper paper pioneered this approach; now Jaeger and Zipkin are open-source standards. The cost: instrumentation overhead and storage for high-cardinality trace data. Traces answer “where” and “how long” across service boundaries.
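The span-stitching mechanic is easier to see in miniature. The sketch below hand-rolls what tracing clients for systems like Jaeger or Zipkin record per span—it is illustrative only, and in a real deployment the `trace_id` would be propagated across services in request headers rather than passed in-process:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed unit of work. All spans in a request share trace_id,
    and parent_id links them into a tree, so a collector can rebuild
    the full request path and show where the time went."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

trace_id = uuid.uuid4().hex
root = Span("api-gateway", trace_id)
child = Span("payment-service", trace_id, parent_id=root.span_id)
child.finish()
root.finish()
```

Because `child.parent_id` points at `root.span_id` and both carry the same `trace_id`, a collector receiving these spans out of order from different hosts can still reassemble the call tree.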
In practice, you need all three. When an alert fires (metric), you check logs for error messages, then examine traces to understand the request path. The art is knowing which pillar to start with: metrics for trends and alerts, logs for debugging specific errors, traces for distributed system latency analysis.
Three Pillars of Observability: Complementary Views
graph TB
subgraph System Under Observation
Service["Payment Service<br/><i>Processing Transactions</i>"]
end
subgraph Metrics Layer
M1["Request Rate<br/><i>500 req/min</i>"]
M2["Error Rate<br/><i>2.5%</i>"]
M3["P99 Latency<br/><i>450ms</i>"]
end
subgraph Logs Layer
L1["2024-01-15 14:32:01<br/>User 12345 payment failed<br/>Error: Card declined"]
L2["2024-01-15 14:32:15<br/>Retry attempt 1<br/>Status: Success"]
end
subgraph Traces Layer
T1["Trace ID: abc123<br/>Total: 450ms"]
T2["API Gateway: 10ms"]
T3["Payment Service: 50ms"]
T4["Bank API: 380ms"]
T5["Database: 10ms"]
T1 --> T2
T2 --> T3
T3 --> T4
T3 --> T5
end
Service --"Aggregated Numbers"--> M1 & M2 & M3
Service --"Discrete Events"--> L1 & L2
Service --"Request Flow"--> T1
Questions["Questions Answered"]
M1 -."WHAT: Error rate increased".-> Questions
L1 -."WHY: Card declined for user 12345".-> Questions
T4 -."WHERE: 84% time in Bank API".-> Questions
The three pillars provide complementary views into system behavior. Metrics show trends and trigger alerts (WHAT happened), logs provide debugging context (WHY it happened), and traces reveal request paths across services (WHERE time was spent). Effective monitoring correlates all three during incident investigation.