Throttling Pattern: Protecting Services from Overload
TL;DR
Throttling controls resource consumption by limiting the rate at which operations can be performed, protecting systems from overload while maintaining availability for legitimate users. Unlike rate limiting, which focuses on request counts, throttling encompasses broader resource control, including CPU, memory, bandwidth, and concurrent connections. Cheat Sheet: Token bucket for burst handling, leaky bucket for smooth rates, fixed/sliding windows for simplicity, adaptive throttling for dynamic loads. Always throttle at multiple layers (API gateway, service, database) and return 429 with Retry-After headers.
The Analogy
Think of throttling like a highway toll booth system during rush hour. The toll plaza has a fixed number of lanes (capacity), and cars must wait in line if all lanes are full. During normal traffic, cars flow through quickly. During peak hours, the system doesn’t crash—it just makes cars wait their turn. Some toll systems even have FastPass lanes (priority queues) for certain vehicles, and they can open additional lanes dynamically when traffic surges. The key insight: the highway doesn’t reject cars entirely; it controls the flow rate to prevent gridlock while maximizing throughput within safe operating limits.
Why This Matters in Interviews
Throttling appears in virtually every system design interview involving high-scale APIs, microservices, or user-facing applications. Interviewers use it to assess your understanding of availability patterns, resource protection, and operational maturity. Strong candidates distinguish throttling from rate limiting, explain multiple implementation strategies with tradeoffs, and discuss distributed coordination challenges. The topic often emerges when discussing: API design (“How do you prevent abuse?”), availability guarantees (“What happens during traffic spikes?”), or cost optimization (“How do you control cloud spending?”). Interviewers look for practical experience—mentioning specific algorithms, discussing monitoring strategies, and understanding the business impact of throttling decisions.
Core Concept
Throttling is a reliability pattern that controls the consumption of resources by limiting the rate, volume, or concurrency of operations within a system. While often conflated with rate limiting, throttling is broader: it encompasses controlling any resource consumption including API requests, database queries, CPU cycles, memory allocation, network bandwidth, and concurrent connections. The fundamental goal is maintaining system stability and meeting service level agreements (SLAs) even when demand exceeds capacity.
The pattern operates on a simple principle: when resource demand approaches or exceeds available capacity, the system deliberately slows down or rejects requests rather than allowing cascading failures. This controlled degradation is preferable to uncontrolled system collapse. A properly throttled system continues serving requests within its capacity while gracefully handling excess load through queuing, rejection, or delayed processing. The key distinction from other availability patterns is that throttling is proactive—it prevents problems before they occur rather than reacting to failures.
Throttling decisions happen at multiple layers of a system architecture. At the API gateway, you might throttle requests per user or API key. At the service layer, you throttle based on resource availability (CPU, memory, thread pools). At the database layer, you throttle query complexity or connection counts. Each layer protects different resources, and effective systems implement throttling at all critical points. The challenge lies in coordinating these layers to provide consistent behavior while avoiding over-throttling that wastes available capacity.
How It Works
Step 1: Define Resource Limits and Policies The system establishes throttling policies based on capacity planning and SLA requirements. For example, an API might support 10,000 requests per second (RPS) total capacity, allocated as 100 RPS per user with burst allowance of 200 requests. These limits derive from load testing, cost constraints, and business requirements. Policies specify what to throttle (requests, bandwidth, connections), the measurement window (per second, minute, hour), and the scope (per user, per IP, per API key, global).
Step 2: Track Resource Consumption As requests arrive, the throttling mechanism tracks consumption against defined limits. This tracking happens in real-time using counters, tokens, or sliding windows. For distributed systems, this often involves a shared state store like Redis where counters are incremented atomically. The tracking granularity matters: per-second tracking catches short bursts but requires more frequent counter resets, while per-minute tracking smooths out spikes but may allow brief overload periods.
Step 3: Evaluate Against Thresholds Each incoming request triggers an evaluation: has this client/resource exceeded its allocation? The evaluation considers multiple dimensions—the immediate rate, accumulated usage over the window, and potentially priority levels. For example, a premium user might have higher limits than free tier users. The evaluation must be fast (sub-millisecond) to avoid becoming a bottleneck itself.
Step 4: Apply Throttling Action When limits are exceeded, the system takes action. The most common response is rejection with HTTP 429 (Too Many Requests) including Retry-After headers indicating when the client can retry. Alternative actions include queuing the request for delayed processing, reducing response quality (returning cached or partial data), or applying backpressure to upstream systems. The choice depends on the use case: user-facing APIs typically reject immediately, while background job systems might queue.
Step 5: Monitor and Adapt Effective throttling systems continuously monitor throttling rates, resource utilization, and user impact. High throttling rates might indicate insufficient capacity or misconfigured limits. Adaptive throttling systems adjust limits dynamically based on observed system health—tightening limits when CPU or memory pressure increases, relaxing them when resources are underutilized. This feedback loop prevents both over-provisioning (wasted cost) and under-provisioning (poor user experience).
Token Bucket Throttling Flow
graph LR
Client["Client"]
Gateway["API Gateway<br/><i>Token Bucket</i>"]
Bucket[("Token Bucket<br/>Capacity: 5000<br/>Refill: 1000/sec")]
Service["Backend Service"]
Client --"1. Send Request"--> Gateway
Gateway --"2. Check Tokens"--> Bucket
Bucket --"3a. Tokens Available<br/>(Consume 1 token)"--> Gateway
Gateway --"4a. Forward Request"--> Service
Service --"5a. Response 200"--> Client
Bucket -."3b. No Tokens<br/>(Bucket Empty)".-> Gateway
Gateway -."4b. Reject<br/>HTTP 429".-> Client
Bucket --"Background: Refill<br/>+1000 tokens/sec"--> Bucket
Token bucket algorithm allows burst traffic by maintaining a bucket of tokens that refill at a constant rate. Requests consume tokens when available; when the bucket is empty, requests are throttled with HTTP 429 until tokens refill.
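The refill-on-read variant of this algorithm fits in a few lines. Below is a minimal in-process sketch (illustrative, not any particular gateway's implementation; class and method names are ours):

```python
import time

class TokenBucket:
    """Token bucket: refills continuously over time, allows bursts up to capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # request proceeds
        return False      # request should be rejected with HTTP 429
```

With `TokenBucket(capacity=5000, refill_rate=1000)`, a client can burst 5,000 requests immediately, then sustain 1,000 RPS, matching the diagram above.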
Key Principles
Principle 1: Fail Fast and Explicitly When throttling kicks in, reject requests immediately with clear error messages rather than letting them time out or fail mysteriously. Return HTTP 429 with Retry-After headers specifying when to retry. This explicit communication allows clients to implement proper backoff strategies and helps developers debug issues quickly. Netflix’s API gateway returns detailed throttling metadata including the limit, current usage, and reset time, enabling clients to make intelligent retry decisions.
Principle 2: Throttle at Multiple Layers Implement throttling at every critical resource boundary: API gateway, service layer, database, and external API calls. Each layer protects different resources and provides defense in depth. For example, Stripe throttles at the API gateway (requests per second), at the service layer (concurrent processing), and at the database layer (query complexity). This layered approach prevents any single resource from becoming a bottleneck and provides granular control over different resource types.
Principle 3: Differentiate by Priority Not all requests are equal—implement priority-based throttling where critical operations get preferential treatment. Uber prioritizes active ride requests over historical trip queries. During high load, background analytics jobs are throttled more aggressively than real-time ride matching. This requires classifying requests by business impact and implementing separate throttling buckets or queues for different priority levels.
Principle 4: Design for Distributed Coordination In distributed systems, throttling state must be shared across instances to enforce global limits accurately. Using local (per-instance) throttling alone leads to the N×limit problem, where N instances each allow the full limit. Redis or similar distributed counters provide shared state but introduce latency and a potential single point of failure. The tradeoff: accept some inaccuracy with local throttling plus periodic synchronization, or pay the cost of distributed coordination for strict enforcement. Twitter uses a hybrid approach: strict limits for critical operations (posting tweets) with distributed coordination, looser limits for read operations with local-only throttling.
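The N×limit problem is easy to demonstrate: three instances, each enforcing the intended global limit locally, admit three times that limit. The numbers below are hypothetical:

```python
class LocalCounter:
    """Per-instance fixed-window counter with no coordination."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        self.count += 1
        return self.count <= self.limit

# Hypothetical setup: 3 instances, intended GLOBAL limit of 100 RPS.
instances = [LocalCounter(limit=100) for _ in range(3)]

# A load balancer spreads 300 requests evenly; every one is admitted,
# so the effective global limit is 3 x 100 = 300 -- the N x limit problem.
admitted = sum(instances[i % 3].allow() for i in range(300))
print(admitted)  # 300, despite an intended global limit of 100
```

Moving the counter into a shared store (e.g. an atomic Redis INCR per `user:window` key) restores the global limit at the cost of a network round trip per check.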
Principle 5: Make Throttling Observable Throttling is a critical operational signal—instrument it heavily. Track throttling rates by endpoint, user, and reason. Alert when throttling exceeds thresholds, indicating either attacks, misconfigured clients, or insufficient capacity. Expose throttling metrics to users through dashboards or API responses. AWS CloudWatch provides detailed throttling metrics for all services, allowing customers to understand their usage patterns and optimize accordingly. Without observability, throttling becomes a black box that frustrates users and obscures capacity planning needs.
Multi-Layer Throttling Architecture
graph TB
Client["Client Application"]
subgraph L1["Layer 1: Edge"]
CDN["CDN/WAF<br/><i>DDoS Protection</i>"]
LB["Load Balancer<br/><i>Connection Limits</i>"]
end
subgraph L2["Layer 2: API Gateway"]
Gateway["API Gateway<br/><i>Per-User Rate Limits</i><br/>100 RPS/user"]
end
subgraph L3["Layer 3: Service Layer"]
Service["Application Service<br/><i>Concurrent Request Limits</i><br/>Max 500 concurrent"]
Queue["Priority Queue<br/><i>Critical vs Background</i>"]
end
subgraph L4["Layer 4: Data Layer"]
Cache["Redis Cache<br/><i>Connection Pool Limits</i>"]
DB[("Database<br/><i>Query Complexity Limits</i><br/>Max 10 joins")]
end
Client --"1. HTTP Request"--> CDN
CDN --"2. Pass DDoS Check"--> LB
LB --"3. Route"--> Gateway
Gateway --"4. Check User Quota"--> Service
Service --"5. Enqueue by Priority"--> Queue
Queue --"6. Process Request"--> Cache
Cache --"7. Cache Miss"--> DB
DB --"8. Query Result"--> Service
Service --"9. Response"--> Client
Gateway -."Throttle: 429<br/>User exceeded 100 RPS".-> Client
Service -."Throttle: 503<br/>Service at capacity".-> Client
DB -."Throttle: Query rejected<br/>Too complex".-> Service
Effective throttling implements controls at multiple layers—edge (DDoS), gateway (per-user limits), service (concurrency), and data (query complexity). Each layer protects different resources and provides defense in depth against overload.
Deep Dive
Types / Variants
Token Bucket Algorithm The token bucket maintains a bucket of tokens that refill at a constant rate. Each request consumes one or more tokens. If tokens are available, the request proceeds; otherwise, it’s throttled. The bucket has a maximum capacity (burst size), allowing short bursts above the steady-state rate. For example, AWS API Gateway uses token bucket with a burst of 5,000 requests and refill rate of 10,000 requests per second. This means you can send 5,000 requests instantly, then sustain 10,000 RPS indefinitely. When to use: When you need to allow bursts while maintaining average rate limits. Pros: Handles bursty traffic naturally, simple to implement, memory efficient. Cons: Burst allowance can be exploited, requires careful tuning of bucket size vs refill rate. Example: Shopify’s API uses token bucket to allow merchants to burst during flash sales while preventing sustained abuse.
Leaky Bucket Algorithm The leaky bucket processes requests at a fixed rate regardless of input rate, like water dripping from a bucket with a hole. Incoming requests fill the bucket; if the bucket overflows, requests are rejected. Unlike token bucket, this enforces strict output rate smoothing. For example, a leaky bucket with 100 RPS capacity processes exactly 100 requests per second even if 1,000 arrive simultaneously. When to use: When you need predictable, smooth output rates to protect downstream systems. Pros: Perfectly smooth output rate, protects downstream from bursts, simple conceptual model. Cons: Doesn’t allow bursts, may waste capacity during low traffic, adds latency (queuing delay). Example: Network routers use leaky bucket for traffic shaping to prevent congestion.
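A queue-based sketch of the leaky bucket (illustrative; here the drain is computed lazily on each arrival rather than by a background worker, which keeps the example self-contained):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed rate; arrivals
    beyond the queue's capacity are rejected (the bucket 'overflows')."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue: deque = deque()
        self.last_leak = time.monotonic()

    def _leak(self) -> None:
        now = time.monotonic()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # these requests get processed
            self.last_leak = now

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) >= self.capacity:
            return False  # overflow: reject
        self.queue.append(request)
        return True       # accepted; will be processed at the fixed rate
```

Note the contrast with the token bucket: no matter how fast requests arrive, the downstream system only ever sees `leak_rate` requests per second.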
Fixed Window Counter Divides time into fixed windows (e.g., per minute) and counts requests in each window. When a window’s limit is reached, subsequent requests are throttled until the next window starts. Simple to implement with a counter that resets at window boundaries. For example, allow 1,000 requests per minute, resetting at :00 seconds. When to use: When simplicity matters more than precision, for coarse-grained throttling. Pros: Extremely simple to implement, memory efficient, works well with time-series databases. Cons: Boundary problem—users can send 2× limit by sending at end of one window and start of next. Example: GitHub’s API uses hourly fixed windows for rate limiting, resetting at the top of each hour.
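A fixed-window counter fits in a handful of lines; passing `now` explicitly (an illustrative convenience, useful for testing) makes the boundary problem easy to see:

```python
import time

class FixedWindowCounter:
    """Fixed-window counter: one counter per window, reset at each boundary."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.window_id = None
        self.count = 0

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now) // self.window
        if window_id != self.window_id:  # crossed a boundary: reset counter
            self.window_id = window_id
            self.count = 0
        self.count += 1
        return self.count <= self.limit
```

With a limit of 2 per minute, two requests at t=59s and two more at t=61s all succeed: four requests in roughly two seconds, double the intended rate, which is exactly the boundary exploit described above.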
Sliding Window Log Maintains a log of request timestamps and counts requests within a sliding time window. For each request, remove timestamps older than the window and check if the count exceeds the limit. For example, for 100 requests per minute, maintain timestamps of the last 100 requests and reject if the oldest is less than 60 seconds old. When to use: When you need precise rate limiting without boundary issues. Pros: No boundary problem, precise enforcement, handles variable request patterns well. Cons: Memory intensive (stores all timestamps), expensive to compute (requires sorting/filtering). Example: Redis-based rate limiters often use sorted sets to implement sliding window logs efficiently.
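The sliding-window log can be sketched with a deque of timestamps (illustrative; a production Redis implementation would use a sorted set with ZREMRANGEBYSCORE for eviction):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding-window log: keep timestamps, count those still inside the window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log: deque = deque()  # timestamps of admitted requests

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Unlike the fixed window, there is no boundary to exploit: the count is always taken over the trailing 60 seconds, at the cost of storing one timestamp per admitted request.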
Sliding Window Counter Hybrid approach combining fixed window simplicity with sliding window accuracy. Maintains counters for current and previous windows, then estimates current sliding window count using weighted average. For example, if you’re 30% into the current minute, count = 0.7 × previous_minute + current_minute. When to use: When you need better accuracy than fixed window without the memory cost of sliding log. Pros: Memory efficient (only two counters), more accurate than fixed window, good enough for most use cases. Cons: Approximate (not perfectly accurate), slightly more complex than fixed window. Example: Cloudflare’s rate limiting uses sliding window counters to balance accuracy and performance at scale.
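The weighted-average estimate above can be sketched as follows (illustrative; only two counters are kept, which is the whole point of the variant):

```python
import time

class SlidingWindowCounter:
    """Sliding-window counter: weight the previous window's count by how much
    of it still overlaps the sliding window, then add the current count."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_id = 0
        self.current = 0
        self.previous = 0

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        if window_id != self.window_id:
            # Slide forward: current becomes previous (zero if windows were skipped)
            self.previous = self.current if window_id == self.window_id + 1 else 0
            self.current = 0
            self.window_id = window_id
        elapsed_fraction = (now % self.window) / self.window
        # e.g. 30% into the window: estimate = 0.7 * previous + current
        estimate = self.previous * (1 - elapsed_fraction) + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

The estimate assumes the previous window's requests were evenly spread, which is why the method is approximate rather than exact.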
Adaptive Throttling Dynamically adjusts throttling limits based on system health metrics like CPU usage, memory pressure, error rates, or latency. When the system is healthy, limits are relaxed; when under stress, limits tighten automatically. Google’s load shedding uses adaptive throttling that reduces accepted traffic when backend latency increases. When to use: For systems with variable capacity or unpredictable load patterns. Pros: Automatically responds to system health, maximizes utilization, prevents cascading failures. Cons: Complex to tune, can create feedback loops, requires sophisticated monitoring. Example: Netflix’s Zuul gateway implements adaptive throttling that tightens limits when circuit breakers trip or latency increases, automatically protecting the system during partial failures.
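An AIMD-style (additive-increase, multiplicative-decrease) limiter is one simple way to realize this feedback loop. This is a simplified sketch with assumed thresholds and step sizes, not any vendor's actual algorithm:

```python
class AdaptiveLimit:
    """AIMD-style adaptive limit: grow the limit additively while the backend
    is healthy, cut it multiplicatively when latency breaches the SLO."""

    def __init__(self, initial_limit: float, floor: float, ceiling: float,
                 latency_slo_ms: float):
        self.limit = initial_limit
        self.floor = floor              # never throttle below this
        self.ceiling = ceiling          # never admit more than this
        self.latency_slo_ms = latency_slo_ms

    def observe(self, p99_latency_ms: float) -> float:
        """Feed in a health signal (here p99 latency); returns the new limit."""
        if p99_latency_ms > self.latency_slo_ms:
            self.limit = max(self.floor, self.limit * 0.5)   # back off hard
        else:
            self.limit = min(self.ceiling, self.limit + 10)  # probe upward
        return self.limit
```

The multiplicative decrease reacts quickly to overload while the slow additive increase avoids the oscillating feedback loops mentioned above; the 0.5 factor and +10 step are tuning knobs, not universal constants.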
Throttling Algorithm Comparison
graph TB
Start["1000 Requests Arrive<br/>in 1 Second"]
subgraph Token Bucket
TB_Check{"Tokens Available?<br/>Bucket: 500 tokens<br/>Refill: 100/sec"}
TB_Allow["✓ Allow 500<br/>Throttle 500<br/>(Burst capacity used)"]
TB_Refill["Next second:<br/>Allow 100<br/>Throttle 900"]
end
subgraph Leaky Bucket
LB_Queue["Queue all 1000<br/>in bucket"]
LB_Process["✓ Process 100/sec<br/>(Fixed rate)<br/>Queue time: 10 sec"]
LB_Overflow{"Queue Full?"}
LB_Drop["✗ Drop overflow<br/>requests"]
end
subgraph Fixed Window
FW_Count{"Count in window?<br/>Limit: 500/min<br/>Window: 0-60sec"}
FW_Allow["✓ Allow 500<br/>Throttle 500"]
FW_Reset["At 60sec: Reset<br/>Allow next 500"]
FW_Exploit["⚠️ Boundary exploit:<br/>500 at 59sec<br/>+ 500 at 61sec<br/>= 1000 in 2sec"]
end
subgraph Sliding Window
SW_Check["Check last 60sec<br/>Count: 450"]
SW_Allow["✓ Allow 50 more<br/>Throttle 950<br/>(Precise limit)"]
SW_Cost["💾 Memory: Store<br/>1000 timestamps"]
end
Start --> TB_Check
Start --> LB_Queue
Start --> FW_Count
Start --> SW_Check
TB_Check -->|Yes| TB_Allow
TB_Allow --> TB_Refill
LB_Queue --> LB_Process
LB_Process --> LB_Overflow
LB_Overflow -->|Yes| LB_Drop
FW_Count --> FW_Allow
FW_Allow --> FW_Reset
FW_Reset -.-> FW_Exploit
SW_Check --> SW_Allow
SW_Allow --> SW_Cost
Different throttling algorithms handle the same burst traffic differently. Token bucket allows bursts up to bucket capacity, leaky bucket enforces smooth output rate with queuing, fixed window has boundary exploits, and sliding window provides precise limits at higher memory cost.
Trade-offs
Accuracy vs Performance Precise throttling (sliding window log) requires tracking every request timestamp, consuming memory and CPU for lookups. Approximate throttling (fixed window, sliding window counter) uses simple counters with minimal overhead but allows some boundary violations. Decision framework: Use precise throttling for security-critical operations (authentication attempts, payment processing) where accuracy matters more than performance. Use approximate throttling for high-volume read APIs where 10-20% inaccuracy is acceptable. Stripe uses precise throttling for payment APIs but approximate throttling for dashboard analytics.
Local vs Distributed State Local throttling (per-instance counters) is fast and resilient but allows N× the intended limit across N instances. Distributed throttling (shared Redis counters) enforces global limits accurately but introduces latency, network dependencies, and potential single points of failure. Decision framework: Use distributed state for strict limits on critical operations (API quotas, payment limits) where accuracy is essential. Use local state with over-provisioning for read-heavy operations where approximate limits suffice. Twitter uses distributed throttling for tweet posting (strict limits) but local throttling for timeline reads (approximate limits acceptable).
Rejection vs Queuing Rejecting throttled requests immediately (fail fast) provides clear feedback but wastes work already done. Queuing requests for delayed processing maintains higher success rates but adds latency and memory pressure. Decision framework: Reject for synchronous user-facing APIs where latency matters (web requests, mobile apps)—users prefer fast failures over slow responses. Queue for asynchronous background work where eventual completion matters more than immediate response (batch jobs, analytics). Shopify rejects storefront API requests immediately but queues webhook deliveries for retry.
Static vs Adaptive Limits Static limits are predictable and easy to reason about but may over-provision (wasted cost) or under-provision (poor availability). Adaptive limits maximize utilization and automatically respond to system health but can be unpredictable and create feedback loops. Decision framework: Use static limits for stable, well-understood workloads with predictable capacity (CRUD APIs, database queries). Use adaptive limits for variable workloads or shared infrastructure where capacity fluctuates (multi-tenant systems, spot instances). AWS Lambda uses adaptive throttling that increases concurrency limits gradually based on observed success rates.
Coarse vs Fine-Grained Throttling Coarse-grained throttling (per-user, per-API) is simple to implement and reason about but treats all operations equally. Fine-grained throttling (per-endpoint, per-operation-type, per-resource) provides better control but increases complexity and configuration overhead. Decision framework: Start with coarse-grained throttling for MVP and simple systems. Add fine-grained throttling when you observe specific endpoints or operations causing disproportionate load. GitHub’s API started with simple per-user limits but added per-endpoint limits when they discovered certain operations (search, GraphQL) were much more expensive than others.
Local vs Distributed Throttling Tradeoff
graph TB
subgraph Local Throttling
L_Client["Client"]
L_LB["Load Balancer"]
L_Inst1["Instance 1<br/>Local Counter<br/>Limit: 100 RPS"]
L_Inst2["Instance 2<br/>Local Counter<br/>Limit: 100 RPS"]
L_Inst3["Instance 3<br/>Local Counter<br/>Limit: 100 RPS"]
L_Result["⚠️ Actual Global Limit:<br/>300 RPS<br/>(3× intended)"]
L_Client --> L_LB
L_LB --> L_Inst1 & L_Inst2 & L_Inst3
L_Inst1 & L_Inst2 & L_Inst3 --> L_Result
L_Pros["✓ Fast (no network)<br/>✓ Resilient (no SPOF)<br/>✓ Simple"]
L_Cons["✗ Inaccurate (N× limit)<br/>✗ No global view"]
end
subgraph Distributed Throttling
D_Client["Client"]
D_LB["Load Balancer"]
D_Inst1["Instance 1"]
D_Inst2["Instance 2"]
D_Inst3["Instance 3"]
D_Redis[("Redis<br/>Shared Counter<br/>Global Limit: 100 RPS")]
D_Result["✓ Actual Global Limit:<br/>100 RPS<br/>(Accurate)"]
D_Client --> D_LB
D_LB --> D_Inst1 & D_Inst2 & D_Inst3
D_Inst1 & D_Inst2 & D_Inst3 --"Check/Increment"--> D_Redis
D_Redis --> D_Result
D_Pros["✓ Accurate global limit<br/>✓ Coordinated across instances"]
D_Cons["✗ Network latency (+5-10ms)<br/>✗ Redis SPOF<br/>✗ Complex"]
end
Decision{"Decision Criteria"}
UseLocal["Use Local:<br/>• Read-heavy APIs<br/>• Approximate OK<br/>• High volume"]
UseDistributed["Use Distributed:<br/>• Write operations<br/>• Strict limits needed<br/>• Security-critical"]
Decision --> UseLocal
Decision --> UseDistributed
Local throttling is fast and resilient but allows N× the intended limit across N instances. Distributed throttling enforces accurate global limits but introduces latency and dependencies. Choose based on whether accuracy or performance is more critical for your use case.
Common Pitfalls
Pitfall 1: Not Communicating Limits to Clients Many systems throttle silently or with generic error messages, leaving clients guessing about limits and retry timing. This leads to aggressive retries that worsen the problem. Why it happens: Developers focus on implementing throttling logic but neglect the client experience. How to avoid: Always return HTTP 429 with detailed headers: X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (when limit resets), and Retry-After (when to retry). Include human-readable error messages explaining the limit and how to request increases. Document limits clearly in API documentation.
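A sketch of a well-labeled 429 response using the headers listed above (framework-agnostic; the helper name and the `(status, headers)` return shape are illustrative):

```python
import time

def throttle_response(limit: int, remaining: int, reset_epoch: int) -> tuple:
    """Build an HTTP 429 response (status, headers) carrying rate-limit
    metadata so clients can implement proper backoff instead of guessing."""
    retry_after = max(0, reset_epoch - int(time.time()))
    headers = {
        "X-RateLimit-Limit": str(limit),          # total allowed in the window
        "X-RateLimit-Remaining": str(remaining),  # requests left (0 when throttled)
        "X-RateLimit-Reset": str(reset_epoch),    # epoch seconds when the window resets
        "Retry-After": str(retry_after),          # seconds until a retry is safe
    }
    return 429, headers
```

A JSON body with a human-readable message ("Rate limit of 100 req/min exceeded; retry after 37s or contact support to raise your quota") completes the picture.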
Pitfall 2: Throttling Too Late in the Request Path Throttling after expensive operations (database queries, external API calls) wastes resources and provides minimal protection. A request that’s throttled after hitting the database has already consumed the resources you’re trying to protect. Why it happens: Throttling is added as an afterthought or implemented deep in the application layer. How to avoid: Implement throttling as early as possible—ideally at the API gateway or load balancer before requests reach application servers. Use multi-layer throttling: coarse limits at the gateway, fine-grained limits at the service layer. Uber throttles at the edge (API gateway) for basic limits, then again at the service layer for operation-specific limits.
Pitfall 3: Not Accounting for Distributed System Amplification A single user request might trigger multiple internal service calls, each subject to throttling. If not coordinated, this creates cascading throttling where a 1% throttle rate at the edge becomes 10% at downstream services. Why it happens: Each service implements throttling independently without considering the call graph. How to avoid: Implement hierarchical throttling where limits are allocated proportionally across the call chain. Use request tracing to understand amplification factors. Reserve capacity for internal service-to-service calls separate from external user requests. Netflix discovered that a single user request to their recommendation API triggered 50+ internal service calls, requiring careful throttling coordination.
Pitfall 4: Ignoring Thundering Herd After Throttling When many clients are throttled simultaneously, they often retry at the same time (when Retry-After expires), creating a thundering herd that overwhelms the system again. Why it happens: Clients implement naive retry logic without jitter. How to avoid: Add jitter to Retry-After values so clients retry at slightly different times. Implement exponential backoff with jitter on the client side. Use token bucket throttling which naturally spreads out retries as tokens refill gradually. Stripe adds random jitter (±20%) to Retry-After headers to prevent synchronized retries.
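Both ideas can be sketched in a few lines (function names are illustrative): spreading retries around a server-supplied Retry-After hint, and "full jitter" exponential backoff for clients retrying on their own:

```python
import random

def retry_delay(retry_after: float, jitter_fraction: float = 0.2) -> float:
    """Spread retries over [retry_after*(1-j), retry_after*(1+j)] so throttled
    clients don't all return at the same instant (Stripe-style +/-20% jitter)."""
    low = retry_after * (1 - jitter_fraction)
    high = retry_after * (1 + jitter_fraction)
    return random.uniform(low, high)

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: sleep a random time in
    [0, min(cap, base * 2**attempt)] before retry number `attempt`."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With 1,000 clients throttled at once and a 60-second Retry-After, `retry_delay(60)` spreads their retries over a 24-second span instead of a single instant.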
Pitfall 5: Not Monitoring Throttling as a Business Metric Throttling is often treated as purely technical, but high throttling rates indicate lost revenue, poor user experience, or insufficient capacity. Teams don’t notice until customers complain. Why it happens: Throttling metrics are buried in technical dashboards, not exposed to product or business teams. How to avoid: Track throttling as a key business metric alongside error rates and latency. Alert on unusual throttling spikes. Correlate throttling with revenue impact. Expose throttling rates in user-facing dashboards so customers can self-diagnose. Twilio provides real-time throttling dashboards showing which customers are hitting limits and estimated revenue impact, allowing proactive capacity planning.
Thundering Herd After Throttling
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant API as API Gateway
participant Service as Backend Service
Note over C1,C3: 1000 clients hit rate limit simultaneously
C1->>API: Request at t=0
C2->>API: Request at t=0
C3->>API: Request at t=0
API-->>C1: 429 Retry-After: 60s
API-->>C2: 429 Retry-After: 60s
API-->>C3: 429 Retry-After: 60s
Note over C1,C3: All clients wait exactly 60 seconds
rect rgb(255, 243, 205)
Note over C1,C3: ⚠️ Thundering Herd at t=60s
C1->>API: Retry at t=60
C2->>API: Retry at t=60
C3->>API: Retry at t=60
Note over API,Service: 1000 simultaneous requests<br/>overwhelm system again
API-->>C1: 503 Service Unavailable
API-->>C2: 503 Service Unavailable
API-->>C3: 503 Service Unavailable
end
Note over C1,C3: Better: Add jitter to retry timing
rect rgb(212, 237, 218)
Note over C1,C3: ✓ With Jitter (±20%)
API-->>C1: 429 Retry-After: 58s (jittered)
API-->>C2: 429 Retry-After: 60s (jittered)
API-->>C3: 429 Retry-After: 67s (jittered)
C1->>API: Retry at t=58
API->>Service: Process
Service-->>C1: 200 OK
C2->>API: Retry at t=60
API->>Service: Process
Service-->>C2: 200 OK
C3->>API: Retry at t=67
API->>Service: Process
Service-->>C3: 200 OK
Note over C1,C3: Requests spread over 9 seconds<br/>System handles load smoothly
end
When many clients are throttled simultaneously, they often retry at the same time (when Retry-After expires), creating a thundering herd that overwhelms the system again. Adding jitter (random variance) to retry timing spreads out the load and prevents synchronized retries.
Math & Calculations
Token Bucket Capacity Planning
Formula for token bucket sizing:
- Burst capacity (B) = Maximum requests allowed in a burst
- Refill rate (R) = Sustained requests per second
- Bucket size (S) = B tokens
- Refill interval (I) = 1/R seconds per token
Example: API Gateway Throttling
Suppose you want to allow:
- Sustained rate: 1,000 requests per second
- Burst allowance: 5,000 requests in first second
- After burst, sustain 1,000 RPS
Configuration:
- Bucket size S = 5,000 tokens
- Refill rate R = 1,000 tokens/second
- Each request consumes 1 token
Scenario 1: Burst then sustain
- t=0s: Bucket full (5,000 tokens), client sends 5,000 requests → all succeed, bucket empty
- t=1s: Bucket refilled with 1,000 tokens, client sends 1,000 requests → all succeed
- t=2s: Bucket refilled with 1,000 tokens, client sends 1,000 requests → all succeed
- Result: Client can sustain 1,000 RPS indefinitely after initial burst
Scenario 2: Exceeding sustained rate
- t=0s: Bucket full (5,000 tokens), client sends 2,000 requests → all succeed, 3,000 tokens remain
- t=1s: Bucket has 4,000 tokens (3,000 + 1,000 refill), client sends 2,000 requests → all succeed, 2,000 remain
- t=2s: Bucket has 3,000 tokens (2,000 + 1,000 refill), client sends 2,000 requests → all succeed, 1,000 remain
- t=3s: Bucket has 2,000 tokens (1,000 + 1,000 refill), client sends 2,000 requests → all succeed, 0 remain
- t=4s: Bucket has 1,000 tokens (0 + 1,000 refill), client sends 2,000 requests → 1,000 succeed, 1,000 throttled
- Result: Client can exceed sustained rate temporarily using burst capacity, then gets throttled
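Scenario 2 can be replayed with a small discrete-time simulation (a sketch using per-second ticks rather than continuous refill, matching the second-by-second accounting above):

```python
class Bucket:
    """Discrete per-second token bucket used to replay Scenario 2."""

    def __init__(self, size: int, refill: int):
        self.size = size      # bucket capacity (burst)
        self.refill = refill  # tokens added after each second
        self.tokens = size    # start full

    def tick(self, demand: int) -> int:
        """One second of traffic: serve up to `demand` requests, then refill."""
        served = min(demand, self.tokens)
        self.tokens = min(self.size, self.tokens - served + self.refill)
        return served

b = Bucket(size=5000, refill=1000)
served = [b.tick(2000) for _ in range(5)]  # 2,000 RPS demand for 5 seconds
print(served)  # [2000, 2000, 2000, 2000, 1000] -- throttling begins at t=4s
```

The simulation reproduces the hand calculation: the 5,000-token burst reserve absorbs the 1,000 RPS excess for four seconds, after which only the 1,000 RPS refill rate is served.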
Calculating Throttle Rate
Throttle rate = (Requests throttled / Total requests) × 100%
If your system receives 100,000 requests/minute and throttles 5,000:
- Throttle rate = (5,000 / 100,000) × 100% = 5%
Target throttle rates:
- < 0.1%: Normal operation, sufficient capacity
- 0.1-1%: Acceptable, monitor for trends
- 1-5%: Warning, investigate capacity or client behavior
- > 5%: Critical, immediate action needed
Distributed Throttling Accuracy
With N instances doing local throttling at limit L:
- Actual global limit = N × L (worst case)
- Example: 10 instances, 100 RPS limit each = 1,000 RPS global (10× intended)
With distributed throttling and coordination overhead C:
- Effective limit = L - C
- Example: a 1,000 RPS limit where coordination (a 50ms round trip to the shared counter) costs the equivalent of 20 requests/second yields 1,000 − 20 = 980 RPS effective
Cost-Benefit of Throttling
Calculate the cost of throttling vs over-provisioning:
- Over-provisioning cost = (Peak capacity - Average capacity) × Unit cost
- Throttling cost = Throttled requests × Revenue per request
Example (e-commerce API):
- Average load: 10,000 RPS, Peak load: 50,000 RPS
- Server cost: $100/month per 1,000 RPS capacity
- Revenue per request: $0.01
Option A (No throttling, provision for peak):
- Cost = 50 servers × $100 = $5,000/month
- Revenue loss = $0
Option B (Throttle at 20,000 RPS):
- Cost = 20 servers × $100 = $2,000/month
- Peak throttling = 30,000 requests/second × 3,600 seconds/hour × 2 hours/day × 30 days = 6.48B requests/month
- Revenue loss = 6.48B × $0.01 = $64.8M/month (unrealistic—assumes all throttled requests are lost sales)
Realistic calculation (80% of throttled users retry successfully):
- Actual revenue loss = 6.48B × 0.20 × $0.01 = $12.96M/month
- Net cost = $2,000 + $12,960,000 = $12,962,000/month
This shows throttling is only cost-effective when:
- Throttle rate is low (< 1%)
- Most throttled users retry successfully
- Peak traffic is rare and short-lived
For sustained high traffic, provisioning more capacity is cheaper than losing revenue to throttling.