Performance Monitoring: Latency, Throughput & Errors

Intermediate · 30 min read · Updated 2026-02-11

TL;DR

Performance monitoring is the continuous observation and measurement of system behavior under load to detect degradation before it causes failures. It tracks metrics like latency, throughput, error rates, and resource utilization to enable proactive intervention. Think of it as your system’s vital signs monitor—catching problems when they’re still treatable rather than waiting for cardiac arrest.

Cheat Sheet: Monitor the four golden signals (latency, traffic, errors, saturation). Use percentiles (p50, p95, p99) not averages. Set SLOs with error budgets. Alert on symptoms (user impact) not causes (CPU spikes). Aggregate metrics at multiple time windows (1m, 5m, 1h) for different response speeds.

The Analogy

Performance monitoring is like the dashboard in your car. Your speedometer, fuel gauge, and temperature warning don’t prevent problems—they give you early signals so you can act before you’re stranded on the highway. Just as you wouldn’t wait for your engine to seize before checking the temperature gauge, you don’t wait for your system to crash before checking latency. The difference is that modern distributed systems have thousands of gauges across hundreds of “cars” (services), so you need automated systems to watch them all and alert you only when something meaningful is wrong—not every time a single gauge flickers.

Why This Matters in Interviews

Performance monitoring comes up in almost every system design interview when discussing production operations, especially for high-scale systems. Interviewers want to see that you understand the difference between monitoring and observability, that you know what metrics actually matter (not just “we’ll monitor CPU”), and that you can design alerting that catches real problems without creating noise. Senior+ candidates are expected to discuss SLOs, error budgets, and how monitoring influences architectural decisions. This topic often bridges into discussions about incident response, capacity planning, and reliability engineering.


Core Concept

Performance monitoring is the systematic collection, aggregation, and analysis of metrics that describe how your system behaves under real-world conditions. Unlike traditional application logs that tell you what happened, performance monitoring tells you how well it happened—measuring the quality of service your users experience. At companies like Netflix, where a 1-second increase in latency can cost millions in lost engagement, performance monitoring isn’t optional infrastructure—it’s the nervous system that keeps the entire platform healthy.

The fundamental challenge is that distributed systems fail in complex, cascading ways. A database slowdown doesn’t immediately crash your API—it gradually increases latency, which causes connection pool exhaustion, which triggers retries, which amplifies load, which eventually causes total failure. Performance monitoring catches this cascade at the “gradually increases latency” stage, when you can still add capacity, optimize queries, or shed load gracefully. Without it, you’re flying blind until users start complaining on Twitter.

Effective performance monitoring requires three components working together: metric collection (gathering data points from every service), metric storage (time-series databases that can handle millions of data points per second), and metric analysis (dashboards, alerts, and anomaly detection that turn raw numbers into actionable insights). The art is in choosing what to measure—too little and you miss critical signals, too much and you drown in noise.

How It Works

Step 1: Instrumentation and Metric Collection

Every service in your system emits metrics at regular intervals (typically every 10-60 seconds). This happens through instrumentation libraries embedded in your application code. For example, when your API handles a request, the instrumentation automatically records the request duration, HTTP status code, and endpoint. These metrics are pushed to a local agent or pulled by a centralized collector. At Uber, each microservice runs a sidecar agent that collects metrics and forwards them to a central aggregation tier, handling over 100 million metrics per second across their fleet.
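The recording step can be sketched with a minimal in-process registry. This is illustrative only (the class and function names are made up for this example); a real service would use an instrumentation client library such as a Prometheus or StatsD client rather than hand-rolling this:

```python
import time
from collections import defaultdict

class Metrics:
    """Toy in-process metrics registry; a real service would use an
    instrumentation library, this only shows what gets recorded."""
    def __init__(self):
        self.counters = defaultdict(int)    # request counts keyed by label set
        self.durations = defaultdict(list)  # raw latency samples per endpoint

    def record_request(self, endpoint, status_code, duration_s):
        # Keep labels low-cardinality: endpoint and status, never user IDs.
        self.counters[(endpoint, status_code)] += 1
        self.durations[endpoint].append(duration_s)

metrics = Metrics()

def handle_request(endpoint):
    start = time.perf_counter()
    status = 200  # stand-in for the real handler's work
    metrics.record_request(endpoint, status, time.perf_counter() - start)
    return status

handle_request("/checkout")
print(metrics.counters[("/checkout", 200)])  # 1
```

An agent or scraper would then periodically read these counters and durations and ship them to the aggregation tier.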

Step 2: Metric Aggregation and Storage

Raw metrics flow into a time-series database (TSDB) like Prometheus, InfluxDB, or M3DB. These databases are optimized for write-heavy workloads with time-stamped data points. The TSDB aggregates metrics across dimensions—for example, combining latency measurements from 500 API servers into percentile distributions (p50, p95, p99). This aggregation happens at multiple time windows: 1-minute for real-time alerts, 5-minute for dashboards, 1-hour for capacity planning. Retention policies automatically downsample older data to save storage—you might keep 1-second resolution for 24 hours, 1-minute resolution for 30 days, and 1-hour resolution for a year.
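The downsampling half of this step can be sketched as follows. The window scheme and function name are illustrative; real TSDBs do this through retention and rollup rules rather than application code:

```python
from statistics import mean

def downsample(points, window_s):
    """Roll raw (timestamp, value) points up into per-window averages,
    as a retention policy might when aging out fine-grained data."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % window_s, []).append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}

# Four fine-grained latency points collapse into two 1-minute points:
raw = [(0, 40.0), (30, 60.0), (60, 100.0), (90, 120.0)]
print(downsample(raw, 60))  # {0: 50.0, 60: 110.0}
```

Note that averaging is lossy: once latency samples are averaged, percentiles can no longer be recovered, which is why TSDBs keep percentile histograms rather than pre-averaged latency where possible.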

Step 3: Visualization and Dashboarding

Aggregated metrics are visualized in dashboards (Grafana, Datadog, SignalFx) that show system health at a glance. A typical service dashboard shows request rate, latency percentiles, error rate, and resource utilization in a 2x2 grid. Engineers use these dashboards during incidents to correlate symptoms across services—for example, noticing that API latency spiked exactly when database connection pool saturation hit 100%. The key is designing dashboards that answer specific questions: “Is the system healthy?” (executive dashboard), “Where is the bottleneck?” (troubleshooting dashboard), “Are we meeting our SLOs?” (reliability dashboard).

Step 4: Alerting and Anomaly Detection

The monitoring system continuously evaluates alert rules against incoming metrics. When a rule triggers (e.g., “p99 latency > 500ms for 5 minutes”), it sends notifications through PagerDuty, Slack, or email. Modern systems use dynamic thresholds based on historical patterns rather than static values—an alert that fires when traffic is 20% above the same-hour-last-week baseline, accounting for daily and weekly patterns. At Google, they use multi-window, multi-burn-rate alerts that fire faster for severe SLO violations and slower for minor ones, reducing alert fatigue while maintaining fast incident response.
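The multi-burn-rate idea can be sketched in a few lines. The 14.4 threshold is the standard "consume 2% of a 30-day budget in one hour" rate from Google's SRE material; the window set and function names here are illustrative:

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being consumed; 1.0 means the
    budget lasts exactly one SLO period."""
    return error_rate / (1.0 - slo_target)

def should_page(windows, slo_target=0.999, threshold=14.4):
    """Page only when every window exceeds the burn-rate threshold,
    so a spike that is already over cannot wake anyone up."""
    return all(burn_rate(rate, slo_target) >= threshold
               for rate in windows.values())

# 2% errors against a 99.9% SLO is a 20x burn rate:
print(should_page({"1h": 0.02, "5m": 0.02}))    # True: sustained, page now
print(should_page({"1h": 0.02, "5m": 0.0005}))  # False: already recovered
```

Severe violations trip the short window quickly; minor ones only page once the long window confirms they are sustained.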

Step 5: Continuous Analysis and Improvement

Performance data feeds back into system design decisions. Weekly SLO reviews identify services that consistently burn through their error budget, triggering reliability improvements. Capacity planning uses historical growth trends to predict when you’ll need more infrastructure. Post-incident reviews analyze metric timelines to understand failure modes and add new monitoring to catch similar issues earlier. This creates a virtuous cycle where monitoring improves the system, which generates better monitoring data, which enables better improvements.

End-to-End Performance Monitoring Pipeline

graph LR
    subgraph Services
        API["API Service"]
        DB["Database Service"]
        Cache["Cache Service"]
    end
    
    subgraph Collection Layer
        Agent1["Metrics Agent"]
        Agent2["Metrics Agent"]
        Agent3["Metrics Agent"]
    end
    
    subgraph SP["Storage & Processing"]
        TSDB[("Time-Series DB<br/><i>Prometheus/M3DB</i>")]
        Aggregator["Aggregation Engine<br/><i>1m, 5m, 1h windows</i>"]
    end
    
    subgraph AA["Analysis & Action"]
        Dashboard["Dashboards<br/><i>Grafana</i>"]
        Alerting["Alert Manager"]
        PagerDuty["PagerDuty/Slack"]
    end
    
    API --"1. Emit metrics<br/>every 10s"--> Agent1
    DB --"1. Emit metrics<br/>every 10s"--> Agent2
    Cache --"1. Emit metrics<br/>every 10s"--> Agent3
    
    Agent1 & Agent2 & Agent3 --"2. Push/Pull<br/>raw metrics"--> TSDB
    
    TSDB --"3. Store &<br/>aggregate"--> Aggregator
    
    Aggregator --"4. Query<br/>metrics"--> Dashboard
    Aggregator --"5. Evaluate<br/>alert rules"--> Alerting
    
    Alerting --"6. Trigger<br/>notifications"--> PagerDuty
    
    Dashboard --"7. Engineer<br/>investigates"--> Services

The complete monitoring pipeline showing how metrics flow from instrumented services through collection, storage, aggregation, and finally to dashboards and alerts. Each step happens continuously, with metrics aggregated at multiple time windows (1m, 5m, 1h) to support both real-time alerting and long-term trend analysis.

Key Principles

The Four Golden Signals (Google SRE)

Google’s Site Reliability Engineering team distilled monitoring to four essential metrics: latency (how long requests take), traffic (how many requests you’re serving), errors (rate of failed requests), and saturation (how full your system is). These signals work together to diagnose problems—high latency with high saturation suggests resource exhaustion, while a high error rate with normal latency suggests application bugs. At Stripe, every service dashboard starts with these four signals before showing service-specific metrics. The principle is that if you can only monitor four things, these four tell you the most about user experience and system health.
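These diagnostic pairings can be expressed as a toy triage function. The thresholds and messages are illustrative, not universal, and a real system would compare signals against baselines rather than fixed cutoffs:

```python
def diagnose(latency_p99_ms, traffic_rps, error_rate, saturation):
    """Toy triage over the four golden signals; thresholds are
    illustrative. traffic_rps would feed baseline comparisons in a
    fuller version."""
    if latency_p99_ms > 500 and saturation > 0.9:
        return "resource exhaustion: add capacity or shed load"
    if error_rate > 0.001 and latency_p99_ms <= 500:
        return "application bug: errors without slowdown"
    if latency_p99_ms > 500:
        return "latency regression: check downstream dependencies"
    return "healthy"

# e.g. p99 of 380ms at 15K req/s with 0.08% errors and 65% saturation:
print(diagnose(380, 15_000, 0.0008, 0.65))  # healthy
print(diagnose(900, 15_000, 0.0008, 0.95))  # resource exhaustion: ...
```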

Use Percentiles, Not Averages

Averages lie. If 99% of your requests complete in 50ms but 1% take 10 seconds, your average latency looks great (149.5ms) while 1 in 100 users has a terrible experience. Percentiles tell the truth: p50 (median) shows the typical experience, p95 shows what most users see, p99 catches outliers that still affect thousands of users at scale. Amazon famously optimizes for p99.9 latency because even rare slowdowns matter when you serve billions of requests. The principle is to measure and optimize for the experience of your unluckiest users, not your average user.
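The arithmetic from that example, checked with only the standard library:

```python
from statistics import mean, median

latencies_ms = [50] * 99 + [10_000]  # 99 fast requests, one 10-second outlier

print(mean(latencies_ms))        # 149.5  (the average "looks great")
print(median(latencies_ms))      # 50     (p50: the typical user)
print(sorted(latencies_ms)[-1])  # 10000  (the unluckiest user's experience)
```

Production systems compute p95/p99 from latency histograms rather than raw samples, since storing every sample at scale is impractical.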

Alert on Symptoms, Not Causes

Bad monitoring alerts on “CPU > 80%” or “disk space < 10GB”—these are causes that might not affect users. Good monitoring alerts on “API latency p99 > 1s” or “error rate > 0.1%”—these are symptoms that definitely affect users. At Netflix, they alert when streaming start time exceeds their SLO, not when a particular microservice’s memory usage is high. The principle is that you care about user impact, and you should only wake up engineers when users are actually experiencing problems. This dramatically reduces alert fatigue and focuses incident response on what matters.

Monitor at Multiple Time Scales

A 10-second latency spike might be noise, but if it lasts 5 minutes it’s an incident. Effective monitoring uses multiple time windows: 1-minute windows for immediate alerts (“the system is on fire right now”), 5-minute windows for sustained problems (“this isn’t a blip, it’s a real issue”), and 1-hour windows for trends (“we’re slowly degrading”). At Uber, their alerting system requires anomalies to persist across multiple time windows before paging, which separates transient spikes from legitimate problems. The principle is that different problems have different time signatures, and your monitoring should match the problem’s natural timescale.
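A minimal sketch of the "must persist for the whole window" rule. Thirty samples at a 10-second scrape interval approximates a 5-minute window; the class name and values are illustrative:

```python
from collections import deque

class SustainedThreshold:
    """Fire only when every sample in the window exceeds the threshold,
    so a single 10-second spike cannot page anyone on its own."""
    def __init__(self, threshold_ms, window_samples):
        self.threshold = threshold_ms
        self.recent = deque(maxlen=window_samples)

    def observe(self, p99_ms):
        self.recent.append(p99_ms)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)

# A 5-minute window of 10-second scrapes is 30 samples:
alert = SustainedThreshold(threshold_ms=500, window_samples=30)
fired = [alert.observe(650) for _ in range(30)]
print(fired[0], fired[-1])  # False True: fires only once sustained
```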

Establish SLOs and Error Budgets

Service Level Objectives (SLOs) define what “good” looks like—for example, “99.9% of requests complete in under 200ms.” Your error budget is how much you can violate this SLO—if you promise 99.9% availability, you have 43 minutes of downtime per month. When you’re within budget, you can take risks and ship fast. When you’ve burned your budget, you focus on reliability. At Google, teams that exceed their error budget are blocked from launching new features until they improve reliability. The principle is that monitoring should drive decision-making, not just provide data—it should tell you whether to prioritize speed or stability.
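The budget arithmetic is simple enough to verify directly (the function name is made up for illustration):

```python
def error_budget_minutes(slo, period_days=30):
    """Downtime allowed per period while still meeting an availability SLO."""
    return (1.0 - slo) * period_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2  ("three nines")
print(round(error_budget_minutes(0.9999), 2))  # 4.32  ("four nines")
```

Each extra nine shrinks the budget tenfold, which is why SLO targets should be chosen deliberately rather than defaulting to "as many nines as possible."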

The Four Golden Signals Monitoring Dashboard

graph TB
    subgraph Golden Signals Dashboard
        subgraph Latency
            L1["p50: 45ms<br/>p95: 120ms<br/>p99: 380ms"]
            L2["Target: p99 < 500ms"]
        end
        
        subgraph Traffic
            T1["Current: 15K req/s<br/>Peak: 18K req/s<br/>Baseline: 12K req/s"]
        end
        
        subgraph Errors
            E1["Error Rate: 0.08%<br/>5xx: 0.05%<br/>4xx: 0.03%"]
            E2["Target: < 0.1%"]
        end
        
        subgraph Saturation
            S1["CPU: 65%<br/>Memory: 72%<br/>Connections: 450/1000"]
            S2["Headroom: 35%"]
        end
    end
    
    Diagnosis["Diagnosis: System Healthy<br/><i>All signals within SLO</i>"]
    
    L1 --> Diagnosis
    T1 --> Diagnosis
    E1 --> Diagnosis
    S1 --> Diagnosis

Google’s Four Golden Signals provide a complete picture of system health. Latency shows user experience quality (using percentiles, not averages), Traffic indicates load levels, Errors measure reliability, and Saturation reveals how close you are to resource limits. Together, these four metrics enable rapid diagnosis—high latency with high saturation suggests resource exhaustion, while a high error rate with normal latency suggests application bugs.

Multi-Window Alert Evaluation Strategy

graph TB
    Metric["Incoming Metric:<br/>API p99 Latency"] --> W1["1-Minute Window<br/>Current: 650ms"]
    Metric --> W5["5-Minute Window<br/>Average: 580ms"]
    Metric --> W60["1-Hour Window<br/>Average: 420ms"]
    
    W1 --> E1{"Exceeds 500ms<br/>threshold?"}
    W5 --> E5{"Sustained above<br/>threshold?"}
    W60 --> E60{"Significant deviation<br/>from baseline?"}
    
    E1 --"Yes"--> Check1["Potential Issue<br/><i>Wait for confirmation</i>"]
    E1 --"No"--> OK1["Normal Spike"]
    
    E5 --"Yes"--> Check2["Confirmed Issue<br/><i>Not transient</i>"]
    E5 --"No"--> OK2["Transient Spike"]
    
    E60 --"Yes"--> Alert["🚨 FIRE ALERT<br/>Page On-Call Engineer"]
    E60 --"No"--> OK3["Within Normal Range"]
    
    Check1 & Check2 --> Alert
    
    Alert --> Action["Action Required:<br/>1. Check dashboard<br/>2. Review runbook<br/>3. Investigate root cause"]

Multi-window alerting prevents false alarms by requiring anomalies to persist across multiple time scales. A 10-second spike might trigger the 1-minute window but won’t fire an alert unless it’s sustained for 5 minutes and represents a significant deviation from the 1-hour baseline. This approach filters transient noise while ensuring real incidents are caught quickly.


Deep Dive

Types / Variants

Infrastructure Monitoring

Infrastructure monitoring tracks the health of your underlying compute, storage, and network resources. This includes CPU utilization, memory usage, disk I/O, network throughput, and system-level metrics like load average and context switches. Tools like Datadog, New Relic Infrastructure, and CloudWatch collect these metrics from every host, container, or VM in your fleet. When to use: Always—this is the foundation layer that everything else depends on. Pros: Catches resource exhaustion before it causes application failures, enables capacity planning, helps with cost optimization. Cons: Doesn’t tell you about user experience, can create false alarms (high CPU doesn’t always mean problems). Example: At Airbnb, infrastructure monitoring detected that their Elasticsearch cluster was hitting disk I/O limits during peak search traffic, prompting them to migrate to faster SSDs before users experienced slowdowns.

Application Performance Monitoring (APM)

APM tracks the behavior of your application code, including request traces, database query performance, external API calls, and code-level profiling. Tools like Datadog APM, New Relic APM, and Dynatrace instrument your application to capture detailed traces showing exactly where time is spent in each request. When to use: For any user-facing application where you need to optimize latency and debug performance issues. Pros: Pinpoints exact bottlenecks in your code, shows dependencies between services, enables optimization based on real usage patterns. Cons: Adds overhead to your application (typically 1-5%), requires code instrumentation, generates massive amounts of data. Example: Shopify uses APM to trace every checkout request across their 100+ microservices, identifying that a single slow database query in their inventory service was adding 200ms to checkout latency for 5% of requests.

Real User Monitoring (RUM)

RUM collects performance data directly from end-user browsers or mobile apps, measuring actual user experience including page load times, JavaScript errors, and API latency as seen by real users across different networks and devices. Tools like Google Analytics, Datadog RUM, and custom beacon implementations send performance data back to your monitoring system. When to use: For any customer-facing web or mobile application where user experience varies by geography, device, or network. Pros: Measures actual user experience not synthetic tests, catches issues specific to certain browsers/devices/regions, correlates performance with business metrics. Cons: Privacy concerns require careful implementation, can’t monitor backend-only services, data is noisy due to user environment variability. Example: Pinterest discovered through RUM that their mobile web experience was 3x slower in India than the US, not due to their servers but due to 3G network latency, leading them to build a lightweight version specifically for emerging markets.

Synthetic Monitoring

Synthetic monitoring runs automated tests against your system from various locations, simulating user behavior to proactively detect issues. This includes uptime checks (“is the site responding?”), transaction tests (“can users complete checkout?”), and API health checks. Tools like Pingdom, Catchpoint, and custom scripts run these tests every 1-5 minutes from multiple geographic regions. When to use: For critical user flows and external-facing APIs where you want to detect problems before users report them. Pros: Catches issues 24/7 even during low-traffic periods, provides consistent baseline for comparison, can test from user locations you don’t have real traffic from. Cons: Doesn’t reflect real user behavior, can miss issues that only appear under load, costs money to run continuously. Example: Stripe runs synthetic tests that attempt to create charges, refunds, and payouts every minute from 20 global locations, alerting within 60 seconds if any critical API endpoint fails.
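One probe iteration might look like the sketch below. The fetch function is injected (in production it would wrap urllib or an HTTP client) so the logic can be exercised without a network; all names here are hypothetical:

```python
import time

def synthetic_check(fetch, url, latency_slo_s=1.0):
    """Run one synthetic probe: fetch the URL, time it, and judge
    health against a status code and a latency SLO."""
    start = time.perf_counter()
    status = fetch(url)
    elapsed = time.perf_counter() - start
    return {
        "url": url,
        "status": status,
        "latency_s": elapsed,
        "healthy": status == 200 and elapsed < latency_slo_s,
    }

# A fake fetcher standing in for a real HTTP GET against /health:
result = synthetic_check(lambda url: 200, "https://example.com/health")
print(result["healthy"])  # True
```

A scheduler would run this every minute from each probe region and feed the results into the same alerting pipeline as real traffic metrics.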

Log Aggregation and Analysis

While technically distinct from metrics monitoring, log aggregation (ELK stack, Splunk, Loki) is essential for performance monitoring because logs provide the context behind metric anomalies. Structured logs with request IDs, latency, and status codes can be aggregated into metrics, while detailed error logs explain why errors occurred. When to use: Always, as a complement to metrics—metrics tell you what’s wrong, logs tell you why. Pros: Provides detailed context for debugging, can be converted into metrics retroactively, captures information that’s hard to represent as metrics. Cons: Expensive to store at scale, requires careful indexing for fast queries, can overwhelm engineers with too much data. Example: At Uber, when their metrics showed a spike in payment failures, they queried their log aggregation system to find that all failures had the same error message about a third-party payment processor timeout, immediately identifying the root cause.

Distributed Tracing

Distributed tracing follows individual requests as they flow through multiple services, creating a detailed timeline showing exactly how much time was spent in each service and where bottlenecks occurred. Tools like Jaeger, Zipkin, and AWS X-Ray instrument your services to propagate trace IDs and record spans. When to use: In microservices architectures where requests touch multiple services and you need to understand end-to-end latency. Pros: Shows exact request flow and timing across services, identifies which service is causing slowdowns, helps understand service dependencies. Cons: High overhead if you trace every request (typically sample 1-10%), complex to implement across heterogeneous services, generates enormous data volumes. Example: Netflix traces a sample of streaming requests to understand the full path from “user clicks play” to “video starts streaming,” discovering that 30% of their latency came from a single authentication check that could be cached.

Monitoring Types and Their Coverage

graph TB
    subgraph User Layer
        Browser["Web Browser<br/><i>Chrome, Safari</i>"]
        Mobile["Mobile App<br/><i>iOS, Android</i>"]
    end
    
    subgraph Application Layer
        LB["Load Balancer"]
        API["API Service"]
        Worker["Background Worker"]
    end
    
    subgraph Data Layer
        Cache[("Redis Cache")]
        DB[("PostgreSQL")]
        Queue["Message Queue"]
    end
    
    subgraph Infrastructure Layer
        K8s["Kubernetes Cluster"]
        EC2["EC2 Instances"]
        Network["Network/VPC"]
    end
    
    RUM["Real User Monitoring<br/><i>Actual user experience</i>"]
    Synthetic["Synthetic Monitoring<br/><i>Simulated transactions</i>"]
    APM["Application Performance<br/><i>Code-level traces</i>"]
    Infra["Infrastructure Monitoring<br/><i>Resource utilization</i>"]
    Tracing["Distributed Tracing<br/><i>Request flow</i>"]
    
    Browser & Mobile -."measures".-> RUM
    Synthetic -."tests".-> LB
    
    API & Worker -."instruments".-> APM
    API & Worker -."traces".-> Tracing
    
    K8s & EC2 & Network -."monitors".-> Infra
    Cache & DB & Queue -."monitors".-> Infra
    
    Tracing -."follows requests<br/>across services".-> API
    Tracing -.-> Worker
    Tracing -.-> Cache
    Tracing -.-> DB

Different monitoring types provide complementary views of system health. RUM measures actual user experience, Synthetic catches issues proactively, APM pinpoints code-level bottlenecks, Infrastructure tracks resource health, and Distributed Tracing follows requests across services. A comprehensive monitoring strategy combines all five types to cover the full stack from user browser to infrastructure.

Trade-offs

Push vs. Pull Metrics Collection

In push-based systems (StatsD, Datadog), your application actively sends metrics to a collector. In pull-based systems (Prometheus), a central server periodically scrapes metrics from your application’s HTTP endpoint. Push is better for short-lived jobs (Lambda functions), firewalled environments, and when you want services to control their own metric emission. Pull is better for service discovery (the scraper automatically finds new instances), reducing load on your services (they just expose an endpoint), and ensuring you don’t lose metrics if the collector is down (metrics are still available at the endpoint). Decision framework: Use push for serverless/ephemeral workloads and edge locations; use pull for long-lived services in Kubernetes where service discovery is built-in. Many companies use both—Uber uses push for their core services but pull for Kubernetes workloads.
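In the pull model, the service’s only job is to serve its current counter values over HTTP. A sketch of rendering the Prometheus text exposition format follows; the data layout is invented for illustration, and the optional HELP/TYPE metadata lines are omitted:

```python
def render_exposition(counters):
    """Render counters in the Prometheus text exposition format that a
    pull-based scraper reads from the service's /metrics endpoint."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

counters = {("http_requests_total", (("code", "200"), ("path", "/api"))): 1024}
print(render_exposition(counters))
# http_requests_total{code="200",path="/api"} 1024
```

Because the endpoint always reflects current state, a scraper outage loses only the samples it failed to collect, not the underlying counters.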

High-Cardinality vs. Low-Cardinality Metrics

Cardinality refers to the number of unique combinations of metric labels. Low-cardinality metrics have few labels (service, region, status_code), while high-cardinality metrics have many (including user_id, request_id, or IP address). High-cardinality metrics provide detailed insights but explode your storage costs and query performance—adding user_id to your metrics increases cardinality from thousands to millions. Low-cardinality metrics are cheap to store and fast to query but lose detail. Decision framework: Use low-cardinality metrics for real-time dashboards and alerts (service-level aggregates), use high-cardinality metrics sparingly for debugging specific issues (sample 1% of requests), and use logs/traces for truly high-cardinality data like individual request IDs. At Stripe, they limit metric cardinality to 1000 unique combinations per metric name, forcing engineers to aggregate at the service level rather than per-customer.
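The explosion is just multiplication: the number of distinct time series a metric produces is the product of each label's unique-value counts. A quick check, with illustrative label sets:

```python
def series_count(labels):
    """Distinct time series for one metric name: the product of each
    label's unique-value counts."""
    total = 1
    for values in labels.values():
        total *= len(values)
    return total

low = {"service": ["api"], "region": ["us", "eu"], "status": ["2xx", "4xx", "5xx"]}
print(series_count(low))  # 6 series: cheap to store and query

high = dict(low, user_id=[f"u{i}" for i in range(1_000_000)])
print(series_count(high))  # 6000000 series: a cardinality explosion
```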

Real-Time vs. Batch Processing

Real-time monitoring processes metrics as they arrive, enabling sub-minute alerting but requiring expensive streaming infrastructure. Batch processing aggregates metrics every 5-15 minutes, reducing costs but delaying alerts. Real-time is critical for user-facing incidents (you want to know within 60 seconds if your API is down), while batch is sufficient for capacity planning and trend analysis. Decision framework: Use real-time processing for golden signals and SLO monitoring where every minute of downtime matters; use batch processing for cost metrics, usage analytics, and long-term trends. At Netflix, they run real-time monitoring for streaming quality metrics but batch-process content recommendation metrics that don’t require immediate action.

Sampling vs. Full Collection

Full collection captures every metric, log, and trace, providing complete visibility but generating massive data volumes. Sampling captures a representative subset (1-10%), dramatically reducing costs but risking that you miss rare but important events. For metrics, you typically collect everything because they’re already aggregated. For traces, you sample heavily because each trace is large. For logs, you might collect all errors but sample info-level logs. Decision framework: Always collect all metrics (they’re cheap), sample traces at 1-10% for normal traffic but increase to 100% during incidents, collect all error logs but sample info/debug logs at 1%. At Google, they use adaptive sampling that automatically increases sampling rates when anomalies are detected, giving them detailed data exactly when they need it.
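A head-sampling sketch with an incident override. The rates, names, and injectable rng are illustrative, and true adaptive sampling (as in the Google example) adjusts rates automatically rather than via a flag:

```python
import random

def trace_sampler(base_rate=0.01, incident_rate=1.0):
    """Keep ~1% of traces normally, everything while an incident is
    active. rng is injectable for deterministic testing."""
    def should_sample(incident_active, rng=random.random):
        rate = incident_rate if incident_active else base_rate
        return rng() < rate
    return should_sample

sample = trace_sampler()
print(sample(incident_active=True))                    # True: full sampling
print(sample(incident_active=False, rng=lambda: 0.5))  # False: 0.5 >= 1%
```

The sampling decision is made once at the trace root and propagated downstream, so a trace is either kept whole or dropped whole.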

Alerting Sensitivity vs. Specificity

Sensitive alerts catch every problem but create false alarms (alert fatigue), while specific alerts only fire for real issues but might miss some problems. Too sensitive and engineers ignore alerts; too specific and you miss incidents. The tradeoff is between mean time to detect (MTTD) and alert noise. Decision framework: For critical user-facing services, err toward sensitivity—better to investigate a false alarm than miss an outage. For internal tools, err toward specificity—engineers can tolerate some degradation without being paged. Use multi-window alerts (must be bad for 5 minutes) and multi-burn-rate alerts (different thresholds for different severities) to balance both. At Uber, they tune alerts to fire when they’ve burned 2% of their monthly error budget in one hour, which catches real problems while filtering transient spikes.

Alert Sensitivity vs. Specificity Tradeoff

graph LR
    subgraph High Sensitivity Low Specificity
        HS["Alert Threshold:<br/>p99 > 300ms for 1 min"]
        HS_Pro["✓ Catches all incidents<br/>✓ Fast detection (MTTD=1m)"]
        HS_Con["✗ Many false alarms<br/>✗ Alert fatigue<br/>✗ Engineers ignore alerts"]
    end
    
    subgraph Balanced Approach
        Bal["Alert Threshold:<br/>p99 > 500ms for 5 min<br/>+ 2% error budget burn"]
        Bal_Pro["✓ Catches real incidents<br/>✓ Few false alarms<br/>✓ Engineers trust alerts"]
        Bal_Con["⚠ Slightly slower detection<br/>⚠ Might miss brief incidents"]
    end
    
    subgraph Low Sensitivity High Specificity
        LS["Alert Threshold:<br/>p99 > 1000ms for 15 min"]
        LS_Pro["✓ Zero false alarms<br/>✓ Only severe incidents"]
        LS_Con["✗ Misses moderate issues<br/>✗ Slow detection (MTTD=15m)<br/>✗ Users complain first"]
    end
    
    HS --> HS_Pro
    HS --> HS_Con
    Bal --> Bal_Pro
    Bal --> Bal_Con
    LS --> LS_Pro
    LS --> LS_Con
    
    Decision["Decision Framework:<br/>Critical user-facing → Balanced<br/>Internal tools → High Specificity<br/>SLO violations → Multi-burn-rate"]
    
    Bal -."recommended for<br/>most services".-> Decision

The alert sensitivity-specificity tradeoff balances fast incident detection against alert fatigue. High sensitivity catches everything but creates noise; high specificity fires only for severe issues but misses moderate problems. The balanced approach uses multi-window evaluation and error-budget burn rates to achieve both fast detection and high precision, which is why Google’s SRE practices recommend it for production services.