Latency vs Throughput: System Design Trade-offs
After this topic, you will be able to:
- Calculate latency and throughput metrics for distributed systems
- Compare trade-offs between optimizing for latency versus throughput
- Apply latency-throughput analysis to design decisions in interview scenarios
TL;DR
Latency measures how long a single request takes (time per operation), while throughput measures how many requests a system can handle per unit time (operations per time). They’re inversely related: optimizing for one often degrades the other. In interviews, demonstrate understanding that real systems need balanced optimization—maximize throughput while keeping latency within acceptable SLA bounds.
Cheat Sheet: Latency = response time (ms). Throughput = requests/sec. Batching increases throughput but raises latency. Parallelism can improve both.
The Analogy
Think of a highway toll booth. Latency is how long it takes one car to get through the booth—from entering the lane to exiting (maybe 30 seconds). Throughput is how many cars the booth processes per hour (maybe 120 cars/hour). You could reduce latency by having the attendant work faster, getting each car through in 15 seconds. Or you could increase throughput by making cars wait in line while you batch-process payments for multiple cars at once—but now each individual car waits longer. The best solution? Add more toll booths (horizontal scaling) so you get both low latency AND high throughput.
Why This Matters in Interviews
Interviewers use latency vs throughput to assess whether you understand the fundamental tension in system design. Junior candidates often say “make it faster” without specifying which metric matters. Strong candidates ask clarifying questions: “Are we optimizing for p99 latency or peak throughput?” They recognize that Netflix prioritizes throughput (serving millions of streams) while trading firms prioritize latency (microsecond order execution). This topic surfaces in nearly every system design interview because every architectural decision—caching, batching, replication, load balancing—involves latency-throughput trade-offs. Demonstrating nuanced understanding here signals you’ve built real production systems.
Core Concept
Latency and throughput are the two fundamental performance metrics in distributed systems, yet they measure completely different things. Latency measures responsiveness—how quickly a system responds to a single request, typically measured in milliseconds or seconds. Throughput measures capacity—how many requests a system can process per unit time, measured in requests per second (RPS), queries per second (QPS), or transactions per second (TPS). The critical insight: these metrics are often inversely related. Techniques that improve throughput (like batching database writes) frequently increase latency for individual requests. Understanding this trade-off is essential because different systems have different priorities. A stock trading platform might accept lower throughput to achieve sub-millisecond latency, while a batch analytics system prioritizes throughput over individual query speed.
Latency vs Throughput: Fundamental Difference
graph LR
subgraph Latency Measurement
A["Request Arrives"] -->|"Time: 0ms"| B["Processing"]
B -->|"Time: 50ms"| C["Response Sent"]
C -.->|"Latency = 50ms<br/>(time per operation)"| D["Single Request<br/>Lifecycle"]
end
subgraph Throughput Measurement
E["Request 1"] -->|"0ms"| H["System"]
F["Request 2"] -->|"10ms"| H
G["Request 3"] -->|"20ms"| H
I["Request 4"] -->|"30ms"| H
J["Request 5"] -->|"40ms"| H
H -.->|"Throughput = 100 RPS<br/>(operations per time)"| K["Multiple Requests<br/>Per Second"]
end
Latency measures how long a single request takes from start to finish (vertical time axis), while throughput measures how many requests the system processes per unit time (horizontal request volume). These are fundamentally different metrics measuring different aspects of system performance.
Batching Trade-off: Throughput vs Latency
graph TB
subgraph Without Batching - Low Throughput High Latency per Request
A1["Request 1"] -->|"Network call: 5ms"| S1["Server"]
A2["Request 2"] -->|"Network call: 5ms"| S1
A3["Request 3"] -->|"Network call: 5ms"| S1
S1 -.->|"100 requests = 100 network calls<br/>Throughput: 100 RPS<br/>Latency: 5ms per request"| R1["Result"]
end
subgraph With Batching - High Throughput Higher Latency
B1["Request 1<br/>(wait 10ms)"] --> Batch["Batch Buffer<br/>(100 requests)"]
B2["Request 2<br/>(wait 9ms)"] --> Batch
B3["Request 100<br/>(wait 0ms)"] --> Batch
Batch -->|"Single network call: 5ms<br/>Process 100 requests"| S2["Server"]
S2 -.->|"100 requests = 1 network call<br/>Throughput: 10,000 RPS (100x improvement)<br/>Latency: 5-15ms (first request waits longest)"| R2["Result"]
end
Batching amortizes fixed costs (like network round-trips) across multiple operations, dramatically improving throughput. However, it introduces waiting time—the first request in a batch must wait for the batch to fill. Kafka producers use this pattern: collecting, say, 100 messages for up to 10ms turns 100 network calls into one, at the cost of up to 10ms of added latency for the earliest messages in each batch.
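To make the arithmetic concrete, here is a minimal Python sketch of one batch cycle. The batch sizes and call costs are hypothetical illustration values, not any real producer's defaults:

```python
def batched_metrics(batch_size: int, call_cost_ms: float, fill_time_ms: float):
    """Return (throughput_rps, worst_case_latency_ms) for one batch cycle."""
    cycle_ms = fill_time_ms + call_cost_ms           # time to fill, then one network call
    throughput = batch_size / (cycle_ms / 1000.0)    # requests completed per second
    worst_latency = fill_time_ms + call_cost_ms      # the first request waits the longest
    return throughput, worst_latency

# No batching: one 5ms network call per request.
print(batched_metrics(1, 5.0, 0.0))      # (200.0, 5.0)

# Batch of 100, filled over 10ms, sent in one 5ms call.
print(batched_metrics(100, 5.0, 10.0))   # (6666.66..., 15.0)
```

Raising the batch size raises throughput but also raises the worst-case wait; tuning that knob is exactly the trade-off the diagram above shows.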
Little’s Law: Concurrency, Latency, and Throughput Relationship
graph TB
subgraph "Little's Law Formula"
Formula["<b>Throughput = Concurrency / Latency</b><br/><br/>Example: 1,000 RPS = 100 connections / 0.1s"]
end
subgraph "Scenario 1: Reduce Latency"
S1A["Original:<br/>Latency = 100ms<br/>Concurrency = 100<br/>Throughput = 1,000 RPS"]
S1A -->|"Add caching"| S1B["Optimized:<br/>Latency = 50ms<br/>Concurrency = 100<br/>Throughput = 2,000 RPS"]
S1B -.->|"Same resources,<br/>2x throughput"| S1R["✓ Win-Win"]
end
subgraph "Scenario 2: Increase Concurrency"
S2A["Original:<br/>Latency = 100ms<br/>Concurrency = 100<br/>Throughput = 1,000 RPS"]
S2A -->|"Add connection pool"| S2B["Scaled:<br/>Latency = 100ms<br/>Concurrency = 500<br/>Throughput = 5,000 RPS"]
S2B -.->|"More resources,<br/>5x throughput"| S2R["✓ Horizontal Scale"]
end
subgraph "Scenario 3: Batching Trade-off"
S3A["Original:<br/>Latency = 10ms<br/>Concurrency = 100<br/>Throughput = 10,000 RPS"]
S3A -->|"Add 50ms batching"| S3B["Batched:<br/>Latency = 60ms<br/>Concurrency = 100<br/>Throughput = 1,667 RPS"]
S3B -.->|"Higher latency,<br/>lower throughput per node"| S3R["⚠ But 10x efficiency<br/>per batch operation"]
end
Little’s Law (Throughput = Concurrency / Latency) reveals three optimization strategies: (1) Reduce latency through caching or optimization—improves throughput without adding resources. (2) Increase concurrency through horizontal scaling—improves throughput but requires more infrastructure. (3) Batching—reduces per-operation cost but increases individual request latency. The formula helps capacity planning: to support 10,000 RPS with 100ms latency, you need 1,000 concurrent connections.
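The three scenarios check out with a few lines of Python:

```python
def throughput_rps(concurrency: int, latency_s: float) -> float:
    """Little's Law, solved for throughput."""
    return concurrency / latency_s

print(round(throughput_rps(100, 0.10)))  # 1000 (baseline)
print(round(throughput_rps(100, 0.05)))  # 2000 (scenario 1: caching halves latency)
print(round(throughput_rps(500, 0.10)))  # 5000 (scenario 2: 5x concurrency)
print(round(throughput_rps(100, 0.06)))  # 1667 (scenario 3: batching raises latency to 60ms)
```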
Queueing Theory: Latency Increases Non-linearly Near Capacity
graph TB
subgraph System Utilization vs Latency
direction TB
U1["30% Utilization<br/>Latency: 10ms<br/>No queueing"] -->|"Increase load"| U2["50% Utilization<br/>Latency: 12ms<br/>Minimal queueing"]
U2 -->|"Increase load"| U3["70% Utilization<br/>Latency: 20ms<br/>Moderate queueing"]
U3 -->|"Increase load"| U4["85% Utilization<br/>Latency: 50ms<br/>Heavy queueing"]
U4 -->|"Increase load"| U5["95% Utilization<br/>Latency: 200ms+<br/>System near collapse"]
U3 -.->|"Knee of the curve<br/>(safe operating point)"| Safe["Netflix targets<br/>60-70% utilization<br/>for latency-sensitive<br/>services"]
U5 -.->|"Danger zone"| Danger["Exponential latency<br/>increase due to<br/>queueing delays"]
end
subgraph Why This Happens
Q1["Low utilization:<br/>Requests arrive<br/>to idle servers"] -->|"Process immediately"| Q2["Latency = Processing time only"]
Q3["High utilization:<br/>Requests arrive<br/>to busy servers"] -->|"Wait in queue"| Q4["Latency = Queue wait + Processing time"]
Q4 -.->|"At 90%+ utilization,<br/>queue wait >> processing time"| Q5["Tail latency<br/>explodes"]
end
Systems behave non-linearly near capacity. At 70% utilization, latency might be acceptable (20ms), but at 90% utilization, queueing delays push latency to 100ms+. This is why load testing at production scale is critical—systems tested at 30% load behave completely differently at 85% load. Keep production utilization below the ‘knee of the curve’ (typically 60-70%) to maintain predictable latency.
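The shape of this curve follows from basic queueing theory. As a sketch, assume a simple M/M/1 model (single server, random arrivals), where average time in the system is service time divided by (1 − utilization). The exact figures differ from the illustrative ones above, but the non-linear blow-up is the same:

```python
def mm1_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Average time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

# With a 10ms service time, latency stays near 10-15ms up to ~50% utilization,
# then explodes: 85% -> ~66.7ms, 95% -> ~200ms.
for u in (0.30, 0.50, 0.70, 0.85, 0.95):
    print(f"{u:.0%} utilization -> {mm1_latency_ms(10.0, u):.1f} ms")
```

At 100% utilization the formula diverges, which is the mathematical version of "system near collapse" in the diagram.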
How It Works
Latency encompasses the entire request lifecycle: network transmission time, queueing delays, processing time, and response transmission. For a database query, this includes network round-trip (typically 1-5ms within a datacenter), queue wait time if the database is busy (0-100ms+), query execution (1-1000ms depending on complexity), and result serialization. Throughput depends on how many requests can be processed concurrently and how efficiently resources are utilized. A single-threaded server might handle 100 RPS with 10ms latency, but adding 10 threads could increase throughput to 1000 RPS—though latency might rise to 15ms due to context switching and resource contention. The relationship follows Little’s Law: Throughput = Concurrency / Latency. If you have 100 concurrent requests and average latency is 100ms (0.1s), your throughput is 100/0.1 = 1000 RPS. This formula reveals why reducing latency OR increasing concurrency both improve throughput.
Request Latency Components Breakdown
graph LR
Client["Client"] -->|"1. Network transmission<br/>1-5ms (datacenter)<br/>50-200ms (cross-region)"| LB["Load Balancer"]
LB -->|"2. Queue wait time<br/>0-100ms+ (under load)"| Queue["Request Queue"]
Queue -->|"3. Processing time<br/>10-1000ms"| Server["Application Server"]
Server -->|"4. Database query<br/>1-100ms"| DB[("Database")]
DB -->|"5. Result serialization<br/>1-10ms"| Server
Server -->|"6. Response transmission<br/>1-5ms"| Client
Server -.->|"Total Latency = Sum of all components<br/>Example: 5ms + 20ms + 50ms + 30ms + 5ms + 5ms = 115ms"| Total["End-to-End<br/>Latency"]
Request latency is the sum of multiple components: network transmission, queueing delays, processing time, and database access. Under load, queueing latency often dominates. Each component can be optimized independently—caching reduces database latency, connection pooling reduces queueing, CDNs reduce network latency.
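Summing the example hop costs from the diagram reproduces the 115ms total:

```python
# Example hop costs from the diagram above (illustrative values).
components_ms = {
    "network_in": 5,      # client -> load balancer
    "queue_wait": 20,     # buffered under load
    "processing": 50,     # application server work
    "db_query": 30,       # database round-trip
    "serialization": 5,   # result serialization
    "network_out": 5,     # response back to client
}

total_ms = sum(components_ms.values())
print(f"end-to-end latency: {total_ms} ms")  # end-to-end latency: 115 ms
```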
Key Principles
Measure What Matters for Your Use Case
Different systems optimize for different metrics. User-facing APIs prioritize p95 or p99 latency (the experience of the slowest 5% or 1% of users) because a few slow requests create bad user experiences. Batch processing systems prioritize throughput because total job completion time matters more than individual record processing time.
Example: Google Search optimizes for p99 latency under 200ms because users notice delays. Google’s MapReduce batch jobs optimize for throughput (petabytes processed per hour) because individual record latency is irrelevant.
Batching Trades Latency for Throughput
Processing requests in batches amortizes fixed costs (like network round-trips or disk seeks) across multiple operations, dramatically improving throughput. But batching introduces delay—requests must wait for a batch to fill before processing begins. The batch size determines the trade-off point.
Example: Kafka producers batch messages before sending to brokers. A batch of 100 messages with a 10ms wait replaces 100 network calls with one, cutting per-message network overhead while adding up to 10ms of latency for the first message in each batch. Uber’s payment system batches database writes every 100ms, processing 10,000 transactions per batch rather than issuing each write individually.
Parallelism Can Improve Both Metrics
Unlike batching, parallelism (horizontal scaling, connection pooling, async processing) can improve both latency and throughput simultaneously by utilizing more resources. However, there are diminishing returns due to coordination overhead and shared resource contention.
Example: Netflix’s API gateway uses connection pooling with 100 concurrent connections per backend service. This allows 100 requests to be in flight simultaneously, improving throughput from 10 RPS (serial) to 1,000 RPS (parallel) while keeping individual request latency at 100ms. Beyond 100 connections, latency increases due to backend database contention.
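A small Python experiment illustrates the parallelism principle, simulating I/O-bound 100ms requests with time.sleep: threads multiply throughput while each request still takes about 100ms. Exact numbers vary by machine:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(_):
    """Simulate an I/O-bound request: ~100ms of waiting, negligible CPU."""
    time.sleep(0.1)

n_requests = 20

# Serial baseline: Little's Law with concurrency 1 -> ~10 RPS.
start = time.perf_counter()
for i in range(n_requests):
    handle_request(i)
serial_rps = n_requests / (time.perf_counter() - start)

# 20 threads in flight: same ~100ms per request, roughly 20x the throughput.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(handle_request, range(n_requests)))
parallel_rps = n_requests / (time.perf_counter() - start)

print(f"serial: {serial_rps:.0f} RPS, parallel: {parallel_rps:.0f} RPS")
```

This only works cleanly for I/O-bound waits; CPU-bound work or shared-resource contention brings back the coordination overhead mentioned above.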
Deep Dive
Types / Variants
Latency has multiple definitions depending on context. Network latency is the time for a packet to travel from source to destination (typically 1-5ms within a datacenter, 50-200ms cross-continent). Processing latency is the time spent executing business logic. Queueing latency is wait time in buffers when the system is overloaded—often the dominant factor during peak traffic. Throughput also varies by scope: per-node throughput measures a single server’s capacity, system throughput measures the entire distributed system, and effective throughput accounts for retries and failed requests. For example, a database might claim 10,000 QPS throughput, but if 20% of queries timeout and retry, effective throughput is only 8,000 QPS. Percentile latencies (p50, p95, p99) are more meaningful than averages because they reveal tail behavior. A system with 10ms average latency but 500ms p99 latency has serious problems affecting 1% of users.
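A quick illustration with synthetic data shows how an average hides a tail:

```python
import statistics

# 99 requests at 10ms plus one 500ms outlier (a GC pause, say).
latencies_ms = [10.0] * 99 + [500.0]

mean = statistics.fmean(latencies_ms)
# statistics.quantiles(n=100) returns 99 cut points; index 98 is the p99 boundary.
p99 = statistics.quantiles(latencies_ms, n=100)[98]

print(f"mean: {mean:.1f} ms, p99: {p99:.1f} ms")  # mean stays ~15ms; p99 exposes the outlier
```

One request in a hundred barely moves the mean, which is exactly why SLAs should be stated on percentiles.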
Trade-offs
Request Processing Strategy: Synchronous vs Asynchronous Batching
Synchronous processing gives low latency but lower throughput; asynchronous batching gives higher latency but high throughput. Use synchronous for user-facing APIs where responsiveness matters (payment confirmation, search results). Use async batching for background jobs where total completion time matters more than individual item latency (email sending, analytics aggregation). Uber uses sync for ride requests (users wait) but async batching for driver location updates (10-second batches are acceptable).
Resource Allocation: Low Latency vs High Utilization
Optimizing for low latency means more resources per request and lower utilization; optimizing for high throughput means packing more requests at higher utilization. Trading systems run at 30-40% CPU utilization to keep latency low and predictable—headroom prevents queueing delays. Batch processing systems run at 90%+ utilization to maximize throughput. The decision depends on cost tolerance and latency SLAs. If your p99 latency SLA is 100ms, you need enough headroom that the 99th-percentile request doesn’t queue.
Caching Strategy: Cache-aside vs Write-through
Cache-aside is simpler but accepts slightly higher latency on a miss; write-through gives consistent low read latency but is more complex. Cache-aside accepts occasional cache misses with 100ms+ latency while most requests hit cache at 1ms. Write-through ensures consistent latency but requires synchronous cache updates, reducing write throughput. Reddit uses cache-aside for comment threads (occasional misses acceptable) but write-through for vote counts (consistent read latency required).
Common Pitfalls
Optimizing Average Latency Instead of Tail Latency
Why it happens: Averages hide outliers. A system with 10ms average latency might have 1% of requests taking 5 seconds due to garbage collection pauses or network retries. These tail latencies destroy user experience but don’t show up in averages.
How to avoid: Always measure and optimize p95, p99, and p99.9 latencies. Set SLAs on percentiles, not averages. Google’s SRE teams use p99 latency as the primary metric for user-facing services. Implement timeout budgets and circuit breakers to prevent tail latencies from cascading.
Ignoring Queueing Theory Under Load
Why it happens: Systems behave non-linearly near capacity. At 70% utilization, latency might be 10ms. At 90% utilization, queueing delays can push latency to 100ms+. Developers test at low load and are surprised when production latency spikes.
How to avoid: Use load testing to find the knee of the latency curve—the utilization point where latency starts increasing exponentially. Keep production utilization below this point. Netflix targets 60-70% utilization for latency-sensitive services, leaving headroom for traffic spikes.
Confusing Bandwidth with Latency
Why it happens: Bandwidth (data transfer rate) and latency (round-trip time) are independent. You can have high bandwidth but high latency (satellite internet: 50 Mbps but 600ms latency) or low bandwidth but low latency (dial-up: 56 Kbps but 30ms latency).
How to avoid: Recognize that adding bandwidth doesn’t reduce latency—it only helps with large payloads. For small API requests (under 1KB), network latency dominates. Use CDNs and edge computing to reduce physical distance, not just bandwidth. Cloudflare’s edge network reduces latency by serving content from nearby datacenters, not by increasing bandwidth.
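A back-of-envelope sketch (idealized: round-trip time plus serialization delay, ignoring TCP handshakes and congestion) shows why latency dominates small transfers:

```python
def transfer_time_ms(payload_bytes: int, bandwidth_mbps: float, rtt_ms: float) -> float:
    """One-shot transfer time: round-trip latency plus time to push the bits."""
    transmit_ms = payload_bytes * 8 / (bandwidth_mbps * 1e6) * 1000
    return rtt_ms + transmit_ms

# A 1 KB API response:
print(transfer_time_ms(1024, 50.0, 600.0))   # satellite: ~600.2 ms (RTT dominates)
print(transfer_time_ms(1024, 0.056, 30.0))   # dial-up:   ~176.3 ms (wins despite ~900x less bandwidth)
```

For a multi-gigabyte download the ranking flips, which is why the payload size determines whether bandwidth or latency is the bottleneck.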
Math & Calculations
Formula
Little’s Law: Throughput = Concurrency / Latency
Where:
- Throughput = requests per second (RPS)
- Concurrency = number of requests in-flight simultaneously
- Latency = average time per request (seconds)
Rearranged: Latency = Concurrency / Throughput or Concurrency = Throughput × Latency
Variables
Throughput
Measured in requests/second (RPS), queries/second (QPS), or transactions/second (TPS). For a web server, this might be 1000 RPS.
Latency
Measured in seconds or milliseconds. For a database query, this might be 0.05 seconds (50ms).
Concurrency
Number of simultaneous in-flight requests. For a connection pool, this might be 50 connections.
Worked Example
Scenario: You’re designing an API gateway for an e-commerce checkout service. Each request takes 100ms to process (latency). You need to support 10,000 RPS (throughput). How many concurrent connections do you need?
Calculation:
Concurrency = Throughput × Latency
Concurrency = 10,000 RPS × 0.1 seconds
Concurrency = 1,000 concurrent requests
Interpretation: You need a connection pool with at least 1,000 connections to your backend services. If you only provision 500 connections, requests will queue, increasing latency. At 1,000 connections with 100ms latency, you achieve exactly 10,000 RPS.
Trade-off Analysis: If you reduce latency to 50ms through caching, you only need 500 concurrent connections for the same throughput (10,000 RPS × 0.05s = 500). This is why latency optimization directly improves resource efficiency. Alternatively, if latency increases to 200ms due to a slow database, you’d need 2,000 connections—doubling infrastructure costs.
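The same arithmetic as a reusable sketch (function names are illustrative):

```python
def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: in-flight requests needed to sustain a target throughput."""
    return throughput_rps * latency_s

def max_throughput(concurrency: float, latency_s: float) -> float:
    """Little's Law rearranged: throughput a given connection pool can sustain."""
    return concurrency / latency_s

print(round(required_concurrency(10_000, 0.10)))  # 1000 (the worked example)
print(round(required_concurrency(10_000, 0.05)))  # 500  (caching halves latency)
print(round(required_concurrency(10_000, 0.20)))  # 2000 (slow database doubles it)
```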
Real-World Examples
Google Search (query serving infrastructure)
Google optimizes for p99 latency under 200ms for search queries while handling 100,000+ QPS. They use aggressive caching (90%+ hit rate), speculative execution (send queries to multiple datacenters, use the fastest response), and tail-tolerant techniques (cancel slow requests after 150ms, fall back to cached results). This prioritizes latency over throughput efficiency—they’re willing to spend compute resources on redundant requests to guarantee fast responses. The trade-off: higher infrastructure costs but better user experience.
Uber (dispatch system matching riders with drivers)
Uber’s dispatch system prioritizes throughput during peak hours (processing 10,000+ match requests per second) while maintaining acceptable latency (under 5 seconds for a match). They batch geospatial queries—instead of checking each driver individually, they query all drivers within a radius simultaneously. This increases throughput roughly 10x but adds 1-2 seconds of latency. The trade-off is acceptable because users tolerate a few seconds of wait time, and batching lets the system handle surge pricing events without collapsing.
Netflix (video encoding pipeline)
Netflix’s encoding pipeline optimizes purely for throughput—processing petabytes of video per day. Individual encoding jobs might take hours (high latency), but the system processes thousands of videos concurrently (high throughput). They use massive batch processing on AWS EC2 spot instances, accepting variable latency (2-12 hours depending on spot instance availability) to maximize cost efficiency. This is the opposite of their streaming API, which optimizes for low latency (sub-second startup time). The lesson: different subsystems in the same company optimize for different metrics.
Interview Expectations
Mid-Level
Define latency and throughput correctly with units (ms vs RPS). Recognize that batching increases throughput but raises latency. Calculate basic throughput given latency and concurrency using Little’s Law. Identify that caching reduces latency. Avoid saying ‘make it faster’ without specifying which metric. Demonstrate awareness that trade-offs exist but may not articulate specific decision frameworks.
Senior
Explain the inverse relationship between latency and throughput with concrete examples (batching, connection pooling). Use percentile latencies (p95, p99) instead of averages. Apply Little’s Law to capacity planning scenarios. Discuss queueing theory—explain why latency increases non-linearly near capacity. Propose specific optimizations: ‘We could batch writes every 50ms to increase throughput from 1,000 to 10,000 TPS, accepting 50ms additional latency.’ Reference real systems (Kafka batching, database connection pools).
Staff+
Articulate nuanced trade-offs based on business requirements: ‘For this payment API, we need p99 latency under 100ms per regulatory requirements, so we’ll optimize for latency even if it means lower throughput and higher costs.’ Discuss tail latency mitigation strategies (hedged requests, deadline propagation, circuit breakers). Explain how to measure and monitor both metrics in production (histograms, not averages). Propose architectural patterns that improve both metrics simultaneously (horizontal scaling, async processing). Demonstrate experience: ‘At my last company, we reduced p99 latency from 500ms to 50ms by…’ Recognize when to optimize for one metric vs the other based on system type (user-facing vs batch processing).
Common Interview Questions
How would you improve throughput for a database write-heavy workload? (Answer: Batching, write-behind caching, sharding)
Your API’s p99 latency is 2 seconds but average is 50ms. What’s wrong? (Answer: Tail latency problem—investigate GC pauses, slow queries, network retries)
You need to support 50,000 RPS with 20ms latency. How many concurrent connections? (Answer: 50,000 × 0.02 = 1,000 connections)
When would you optimize for throughput over latency? (Answer: Batch processing, analytics, background jobs where individual item latency doesn’t matter)
Red Flags to Avoid
Claiming you can optimize both latency and throughput without trade-offs (shows lack of real-world experience)
Using average latency instead of percentiles (misses tail latency problems)
Not asking clarifying questions about which metric matters more for the use case
Suggesting ‘add more servers’ without explaining how that affects latency vs throughput
Confusing bandwidth with latency or using terms interchangeably
Key Takeaways
Latency measures time per operation (ms), throughput measures operations per time (RPS). They’re inversely related—optimizing one often degrades the other.
Use Little’s Law (Throughput = Concurrency / Latency) for capacity planning. To support 10,000 RPS with 100ms latency, you need 1,000 concurrent connections.
Batching increases throughput by amortizing fixed costs but adds latency. Parallelism (horizontal scaling) can improve both metrics simultaneously.
Always measure percentile latencies (p95, p99), not averages. Tail latencies destroy user experience and indicate systemic problems.
Different systems optimize for different metrics: user-facing APIs prioritize latency, batch processing prioritizes throughput. Ask clarifying questions in interviews to determine which matters more.
Related Topics
Prerequisites
Performance vs Scalability - Understand broader performance concepts before diving into specific metrics
Availability in Numbers - SLA definitions that constrain latency requirements
Next Steps
Caching Strategies - Primary technique for reducing latency
Load Balancing - Distributing load to improve throughput
Database Optimization - Specific latency and throughput optimizations for data layer
Related
CAP Theorem (and PACELC) - Trade-offs between consistency, availability, and latency
Horizontal vs Vertical Scaling - Scaling strategies that affect both metrics