Throttling Pattern: Protecting Services from Overload
TL;DR
Throttling controls resource consumption by limiting the rate at which operations can be performed, protecting systems from overload while maintaining availability for legitimate users. Unlike rate limiting, which focuses on request counts, throttling encompasses broader resource control, including CPU, memory, bandwidth, and concurrent connections. Cheat Sheet: Token bucket for burst handling, leaky bucket for smooth rates, fixed/sliding windows for simplicity, adaptive throttling for dynamic loads. Always throttle at multiple layers (API gateway, service, database) and return 429 with Retry-After headers.
The Analogy
Think of throttling like a highway toll booth system during rush hour. The toll plaza has a fixed number of lanes (capacity), and cars must wait in line if all lanes are full. During normal traffic, cars flow through quickly. During peak hours, the system doesn’t crash—it just makes cars wait their turn. Some toll systems even have FastPass lanes (priority queues) for certain vehicles, and they can open additional lanes dynamically when traffic surges. The key insight: the highway doesn’t reject cars entirely; it controls the flow rate to prevent gridlock while maximizing throughput within safe operating limits.
Why This Matters in Interviews
Throttling appears in virtually every system design interview involving high-scale APIs, microservices, or user-facing applications. Interviewers use it to assess your understanding of availability patterns, resource protection, and operational maturity. Strong candidates distinguish throttling from rate limiting, explain multiple implementation strategies with tradeoffs, and discuss distributed coordination challenges. The topic often emerges when discussing: API design (“How do you prevent abuse?”), availability guarantees (“What happens during traffic spikes?”), or cost optimization (“How do you control cloud spending?”). Interviewers look for practical experience—mentioning specific algorithms, discussing monitoring strategies, and understanding the business impact of throttling decisions.
Core Concept
Throttling is a reliability pattern that controls the consumption of resources by limiting the rate, volume, or concurrency of operations within a system. While often conflated with rate limiting, throttling is broader: it encompasses controlling any resource consumption including API requests, database queries, CPU cycles, memory allocation, network bandwidth, and concurrent connections. The fundamental goal is maintaining system stability and meeting service level agreements (SLAs) even when demand exceeds capacity.
The pattern operates on a simple principle: when resource demand approaches or exceeds available capacity, the system deliberately slows down or rejects requests rather than allowing cascading failures. This controlled degradation is preferable to uncontrolled system collapse. A properly throttled system continues serving requests within its capacity while gracefully handling excess load through queuing, rejection, or delayed processing. The key distinction from other availability patterns is that throttling is proactive—it prevents problems before they occur rather than reacting to failures.
Throttling decisions happen at multiple layers of a system architecture. At the API gateway, you might throttle requests per user or API key. At the service layer, you throttle based on resource availability (CPU, memory, thread pools). At the database layer, you throttle query complexity or connection counts. Each layer protects different resources, and effective systems implement throttling at all critical points. The challenge lies in coordinating these layers to provide consistent behavior while avoiding over-throttling that wastes available capacity.
How It Works
Step 1: Define Resource Limits and Policies The system establishes throttling policies based on capacity planning and SLA requirements. For example, an API might support 10,000 requests per second (RPS) total capacity, allocated as 100 RPS per user with burst allowance of 200 requests. These limits derive from load testing, cost constraints, and business requirements. Policies specify what to throttle (requests, bandwidth, connections), the measurement window (per second, minute, hour), and the scope (per user, per IP, per API key, global).
Step 2: Track Resource Consumption As requests arrive, the throttling mechanism tracks consumption against defined limits. This tracking happens in real-time using counters, tokens, or sliding windows. For distributed systems, this often involves a shared state store like Redis where counters are incremented atomically. The tracking granularity matters: per-second tracking catches short bursts but requires more frequent counter resets, while per-minute tracking smooths out spikes but may allow brief overload periods.
Step 3: Evaluate Against Thresholds Each incoming request triggers an evaluation: has this client/resource exceeded its allocation? The evaluation considers multiple dimensions—the immediate rate, accumulated usage over the window, and potentially priority levels. For example, a premium user might have higher limits than free tier users. The evaluation must be fast (sub-millisecond) to avoid becoming a bottleneck itself.
Step 4: Apply Throttling Action When limits are exceeded, the system takes action. The most common response is rejection with HTTP 429 (Too Many Requests) including Retry-After headers indicating when the client can retry. Alternative actions include queuing the request for delayed processing, reducing response quality (returning cached or partial data), or applying backpressure to upstream systems. The choice depends on the use case: user-facing APIs typically reject immediately, while background job systems might queue.
Step 5: Monitor and Adapt Effective throttling systems continuously monitor throttling rates, resource utilization, and user impact. High throttling rates might indicate insufficient capacity or misconfigured limits. Adaptive throttling systems adjust limits dynamically based on observed system health—tightening limits when CPU or memory pressure increases, relaxing them when resources are underutilized. This feedback loop prevents both over-provisioning (wasted cost) and under-provisioning (poor user experience).
Token Bucket Throttling Flow
graph LR
Client["Client"]
Gateway["API Gateway<br/><i>Token Bucket</i>"]
Bucket[("Token Bucket<br/>Capacity: 5000<br/>Refill: 1000/sec")]
Service["Backend Service"]
Client --"1. Send Request"--> Gateway
Gateway --"2. Check Tokens"--> Bucket
Bucket --"3a. Tokens Available<br/>(Consume 1 token)"--> Gateway
Gateway --"4a. Forward Request"--> Service
Service --"5a. Response 200"--> Client
Bucket -."3b. No Tokens<br/>(Bucket Empty)".-> Gateway
Gateway -."4b. Reject<br/>HTTP 429".-> Client
Bucket --"Background: Refill<br/>+1000 tokens/sec"--> Bucket
Token bucket algorithm allows burst traffic by maintaining a bucket of tokens that refill at a constant rate. Requests consume tokens when available; when the bucket is empty, requests are throttled with HTTP 429 until tokens refill.
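The refill-on-read variant of this algorithm fits in a few lines. Below is a minimal in-process sketch (illustrative, not any particular gateway's implementation; class and method names are ours):

```python
import time

class TokenBucket:
    """Token bucket: refills continuously over time, allows bursts up to capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # request proceeds
        return False      # request should be rejected with HTTP 429
```

With `TokenBucket(capacity=5000, refill_rate=1000)`, a client can burst 5,000 requests immediately, then sustain 1,000 RPS, matching the diagram above.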
Key Principles
Principle 1: Fail Fast and Explicitly When throttling kicks in, reject requests immediately with clear error messages rather than letting them time out or fail mysteriously. Return HTTP 429 with Retry-After headers specifying when to retry. This explicit communication allows clients to implement proper backoff strategies and helps developers debug issues quickly. Netflix’s API gateway returns detailed throttling metadata including the limit, current usage, and reset time, enabling clients to make intelligent retry decisions.
Principle 2: Throttle at Multiple Layers Implement throttling at every critical resource boundary: API gateway, service layer, database, and external API calls. Each layer protects different resources and provides defense in depth. For example, Stripe throttles at the API gateway (requests per second), at the service layer (concurrent processing), and at the database layer (query complexity). This layered approach prevents any single resource from becoming a bottleneck and provides granular control over different resource types.
Principle 3: Differentiate by Priority Not all requests are equal—implement priority-based throttling where critical operations get preferential treatment. Uber prioritizes active ride requests over historical trip queries. During high load, background analytics jobs are throttled more aggressively than real-time ride matching. This requires classifying requests by business impact and implementing separate throttling buckets or queues for different priority levels.
Principle 4: Design for Distributed Coordination In distributed systems, throttling state must be shared across instances to enforce global limits accurately. Using local (per-instance) throttling alone leads to the N×limit problem, where N instances each allow the full limit. Redis or similar distributed counters provide shared state but introduce latency and a potential single point of failure. The tradeoff: accept some inaccuracy with local throttling plus periodic synchronization, or pay the cost of distributed coordination for strict enforcement. Twitter uses a hybrid approach: strict limits for critical operations (posting tweets) with distributed coordination, looser limits for read operations with local-only throttling.
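The N×limit problem is easy to demonstrate: three instances, each enforcing the intended global limit locally, admit three times that limit. The numbers below are hypothetical:

```python
class LocalCounter:
    """Per-instance fixed-window counter with no coordination."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        self.count += 1
        return self.count <= self.limit

# Hypothetical setup: 3 instances, intended GLOBAL limit of 100 RPS.
instances = [LocalCounter(limit=100) for _ in range(3)]

# A load balancer spreads 300 requests evenly; every one is admitted,
# so the effective global limit is 3 x 100 = 300 -- the N x limit problem.
admitted = sum(instances[i % 3].allow() for i in range(300))
print(admitted)  # 300, despite an intended global limit of 100
```

Moving the counter into a shared store (e.g. an atomic Redis INCR per `user:window` key) restores the global limit at the cost of a network round trip per check.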
Principle 5: Make Throttling Observable Throttling is a critical operational signal—instrument it heavily. Track throttling rates by endpoint, user, and reason. Alert when throttling exceeds thresholds, indicating either attacks, misconfigured clients, or insufficient capacity. Expose throttling metrics to users through dashboards or API responses. AWS CloudWatch provides detailed throttling metrics for all services, allowing customers to understand their usage patterns and optimize accordingly. Without observability, throttling becomes a black box that frustrates users and obscures capacity planning needs.
Multi-Layer Throttling Architecture
graph TB
Client["Client Application"]
subgraph L1["Layer 1: Edge"]
CDN["CDN/WAF<br/><i>DDoS Protection</i>"]
LB["Load Balancer<br/><i>Connection Limits</i>"]
end
subgraph L2["Layer 2: API Gateway"]
Gateway["API Gateway<br/><i>Per-User Rate Limits</i><br/>100 RPS/user"]
end
subgraph L3["Layer 3: Service Layer"]
Service["Application Service<br/><i>Concurrent Request Limits</i><br/>Max 500 concurrent"]
Queue["Priority Queue<br/><i>Critical vs Background</i>"]
end
subgraph L4["Layer 4: Data Layer"]
Cache["Redis Cache<br/><i>Connection Pool Limits</i>"]
DB[("Database<br/><i>Query Complexity Limits</i><br/>Max 10 joins")]
end
Client --"1. HTTP Request"--> CDN
CDN --"2. Pass DDoS Check"--> LB
LB --"3. Route"--> Gateway
Gateway --"4. Check User Quota"--> Service
Service --"5. Enqueue by Priority"--> Queue
Queue --"6. Process Request"--> Cache
Cache --"7. Cache Miss"--> DB
DB --"8. Query Result"--> Service
Service --"9. Response"--> Client
Gateway -."Throttle: 429<br/>User exceeded 100 RPS".-> Client
Service -."Throttle: 503<br/>Service at capacity".-> Client
DB -."Throttle: Query rejected<br/>Too complex".-> Service
Effective throttling implements controls at multiple layers—edge (DDoS), gateway (per-user limits), service (concurrency), and data (query complexity). Each layer protects different resources and provides defense in depth against overload.
Deep Dive
Types / Variants
Token Bucket Algorithm The token bucket maintains a bucket of tokens that refill at a constant rate. Each request consumes one or more tokens. If tokens are available, the request proceeds; otherwise, it’s throttled. The bucket has a maximum capacity (burst size), allowing short bursts above the steady-state rate. For example, AWS API Gateway uses token bucket with a burst of 5,000 requests and refill rate of 10,000 requests per second. This means you can send 5,000 requests instantly, then sustain 10,000 RPS indefinitely. When to use: When you need to allow bursts while maintaining average rate limits. Pros: Handles bursty traffic naturally, simple to implement, memory efficient. Cons: Burst allowance can be exploited, requires careful tuning of bucket size vs refill rate. Example: Shopify’s API uses token bucket to allow merchants to burst during flash sales while preventing sustained abuse.
Leaky Bucket Algorithm The leaky bucket processes requests at a fixed rate regardless of input rate, like water dripping from a bucket with a hole. Incoming requests fill the bucket; if the bucket overflows, requests are rejected. Unlike token bucket, this enforces strict output rate smoothing. For example, a leaky bucket with 100 RPS capacity processes exactly 100 requests per second even if 1,000 arrive simultaneously. When to use: When you need predictable, smooth output rates to protect downstream systems. Pros: Perfectly smooth output rate, protects downstream from bursts, simple conceptual model. Cons: Doesn’t allow bursts, may waste capacity during low traffic, adds latency (queuing delay). Example: Network routers use leaky bucket for traffic shaping to prevent congestion.
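A queue-based sketch of the leaky bucket (illustrative; here the drain is computed lazily on each arrival rather than by a background worker, which keeps the example self-contained):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed rate; arrivals
    beyond the queue's capacity are rejected (the bucket 'overflows')."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue: deque = deque()
        self.last_leak = time.monotonic()

    def _leak(self) -> None:
        now = time.monotonic()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # these requests get processed
            self.last_leak = now

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) >= self.capacity:
            return False  # overflow: reject
        self.queue.append(request)
        return True       # accepted; will be processed at the fixed rate
```

Note the contrast with the token bucket: no matter how fast requests arrive, the downstream system only ever sees `leak_rate` requests per second.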
Fixed Window Counter Divides time into fixed windows (e.g., per minute) and counts requests in each window. When a window’s limit is reached, subsequent requests are throttled until the next window starts. Simple to implement with a counter that resets at window boundaries. For example, allow 1,000 requests per minute, resetting at :00 seconds. When to use: When simplicity matters more than precision, for coarse-grained throttling. Pros: Extremely simple to implement, memory efficient, works well with time-series databases. Cons: Boundary problem—users can send 2× limit by sending at end of one window and start of next. Example: GitHub’s API uses hourly fixed windows for rate limiting, resetting at the top of each hour.
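A fixed-window counter fits in a handful of lines; passing `now` explicitly (an illustrative convenience, useful for testing) makes the boundary problem easy to see:

```python
import time

class FixedWindowCounter:
    """Fixed-window counter: one counter per window, reset at each boundary."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.window_id = None
        self.count = 0

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now) // self.window
        if window_id != self.window_id:  # crossed a boundary: reset counter
            self.window_id = window_id
            self.count = 0
        self.count += 1
        return self.count <= self.limit
```

With a limit of 2 per minute, two requests at t=59s and two more at t=61s all succeed: four requests in roughly two seconds, double the intended rate, which is exactly the boundary exploit described above.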
Sliding Window Log Maintains a log of request timestamps and counts requests within a sliding time window. For each request, remove timestamps older than the window and check if the count exceeds the limit. For example, for 100 requests per minute, maintain timestamps of the last 100 requests and reject if the oldest is less than 60 seconds old. When to use: When you need precise rate limiting without boundary issues. Pros: No boundary problem, precise enforcement, handles variable request patterns well. Cons: Memory intensive (stores all timestamps), expensive to compute (requires sorting/filtering). Example: Redis-based rate limiters often use sorted sets to implement sliding window logs efficiently.
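The sliding-window log can be sketched with a deque of timestamps (illustrative; a production Redis implementation would use a sorted set with ZREMRANGEBYSCORE for eviction):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding-window log: keep timestamps, count those still inside the window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log: deque = deque()  # timestamps of admitted requests

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Unlike the fixed window, there is no boundary to exploit: the count is always taken over the trailing 60 seconds, at the cost of storing one timestamp per admitted request.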
Sliding Window Counter Hybrid approach combining fixed window simplicity with sliding window accuracy. Maintains counters for current and previous windows, then estimates current sliding window count using weighted average. For example, if you’re 30% into the current minute, count = 0.7 × previous_minute + current_minute. When to use: When you need better accuracy than fixed window without the memory cost of sliding log. Pros: Memory efficient (only two counters), more accurate than fixed window, good enough for most use cases. Cons: Approximate (not perfectly accurate), slightly more complex than fixed window. Example: Cloudflare’s rate limiting uses sliding window counters to balance accuracy and performance at scale.
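The weighted-average estimate above can be sketched as follows (illustrative; only two counters are kept, which is the whole point of the variant):

```python
import time

class SlidingWindowCounter:
    """Sliding-window counter: weight the previous window's count by how much
    of it still overlaps the sliding window, then add the current count."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_id = 0
        self.current = 0
        self.previous = 0

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        if window_id != self.window_id:
            # Slide forward: current becomes previous (zero if windows were skipped)
            self.previous = self.current if window_id == self.window_id + 1 else 0
            self.current = 0
            self.window_id = window_id
        elapsed_fraction = (now % self.window) / self.window
        # e.g. 30% into the window: estimate = 0.7 * previous + current
        estimate = self.previous * (1 - elapsed_fraction) + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

The estimate assumes the previous window's requests were evenly spread, which is why the method is approximate rather than exact.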
Adaptive Throttling Dynamically adjusts throttling limits based on system health metrics like CPU usage, memory pressure, error rates, or latency. When the system is healthy, limits are relaxed; when under stress, limits tighten automatically. Google’s load shedding uses adaptive throttling that reduces accepted traffic when backend latency increases. When to use: For systems with variable capacity or unpredictable load patterns. Pros: Automatically responds to system health, maximizes utilization, prevents cascading failures. Cons: Complex to tune, can create feedback loops, requires sophisticated monitoring. Example: Netflix’s Zuul gateway implements adaptive throttling that tightens limits when circuit breakers trip or latency increases, automatically protecting the system during partial failures.
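An AIMD-style (additive-increase, multiplicative-decrease) limiter is one simple way to realize this feedback loop. This is a simplified sketch with assumed thresholds and step sizes, not any vendor's actual algorithm:

```python
class AdaptiveLimit:
    """AIMD-style adaptive limit: grow the limit additively while the backend
    is healthy, cut it multiplicatively when latency breaches the SLO."""

    def __init__(self, initial_limit: float, floor: float, ceiling: float,
                 latency_slo_ms: float):
        self.limit = initial_limit
        self.floor = floor              # never throttle below this
        self.ceiling = ceiling          # never admit more than this
        self.latency_slo_ms = latency_slo_ms

    def observe(self, p99_latency_ms: float) -> float:
        """Feed in a health signal (here p99 latency); returns the new limit."""
        if p99_latency_ms > self.latency_slo_ms:
            self.limit = max(self.floor, self.limit * 0.5)   # back off hard
        else:
            self.limit = min(self.ceiling, self.limit + 10)  # probe upward
        return self.limit
```

The multiplicative decrease reacts quickly to overload while the slow additive increase avoids the oscillating feedback loops mentioned above; the 0.5 factor and +10 step are tuning knobs, not universal constants.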
Throttling Algorithm Comparison
graph TB
Start["1000 Requests Arrive<br/>in 1 Second"]
subgraph Token Bucket
TB_Check{"Tokens Available?<br/>Bucket: 500 tokens<br/>Refill: 100/sec"}
TB_Allow["✓ Allow 500<br/>Throttle 500<br/>(Burst capacity used)"]
TB_Refill["Next second:<br/>Allow 100<br/>Throttle 900"]
end
subgraph Leaky Bucket
LB_Queue["Queue all 1000<br/>in bucket"]
LB_Process["✓ Process 100/sec<br/>(Fixed rate)<br/>Queue time: 10 sec"]
LB_Overflow{"Queue Full?"}
LB_Drop["✗ Drop overflow<br/>requests"]
end
subgraph Fixed Window
FW_Count{"Count in window?<br/>Limit: 500/min<br/>Window: 0-60sec"}
FW_Allow["✓ Allow 500<br/>Throttle 500"]
FW_Reset["At 60sec: Reset<br/>Allow next 500"]
FW_Exploit["⚠️ Boundary exploit:<br/>500 at 59sec<br/>+ 500 at 61sec<br/>= 1000 in 2sec"]
end
subgraph Sliding Window
SW_Check["Check last 60sec<br/>Count: 450"]
SW_Allow["✓ Allow 50 more<br/>Throttle 950<br/>(Precise limit)"]
SW_Cost["💾 Memory: Store<br/>1000 timestamps"]
end
Start --> TB_Check
Start --> LB_Queue
Start --> FW_Count
Start --> SW_Check
TB_Check -->|Yes| TB_Allow
TB_Allow --> TB_Refill
LB_Queue --> LB_Process
LB_Process --> LB_Overflow
LB_Overflow -->|Yes| LB_Drop
FW_Count --> FW_Allow
FW_Allow --> FW_Reset
FW_Reset -.-> FW_Exploit
SW_Check --> SW_Allow
SW_Allow --> SW_Cost
Different throttling algorithms handle the same burst traffic differently. Token bucket allows bursts up to bucket capacity, leaky bucket enforces smooth output rate with queuing, fixed window has boundary exploits, and sliding window provides precise limits at higher memory cost.
Trade-offs
Accuracy vs Performance Precise throttling (sliding window log) requires tracking every request timestamp, consuming memory and CPU for lookups. Approximate throttling (fixed window, sliding window counter) uses simple counters with minimal overhead but allows some boundary violations. Decision framework: Use precise throttling for security-critical operations (authentication attempts, payment processing) where accuracy matters more than performance. Use approximate throttling for high-volume read APIs where 10-20% inaccuracy is acceptable. Stripe uses precise throttling for payment APIs but approximate throttling for dashboard analytics.
Local vs Distributed State Local throttling (per-instance counters) is fast and resilient but allows N× the intended limit across N instances. Distributed throttling (shared Redis counters) enforces global limits accurately but introduces latency, network dependencies, and potential single points of failure. Decision framework: Use distributed state for strict limits on critical operations (API quotas, payment limits) where accuracy is essential. Use local state with over-provisioning for read-heavy operations where approximate limits suffice. Twitter uses distributed throttling for tweet posting (strict limits) but local throttling for timeline reads (approximate limits acceptable).
Rejection vs Queuing Rejecting throttled requests immediately (fail fast) provides clear feedback but wastes work already done. Queuing requests for delayed processing maintains higher success rates but adds latency and memory pressure. Decision framework: Reject for synchronous user-facing APIs where latency matters (web requests, mobile apps)—users prefer fast failures over slow responses. Queue for asynchronous background work where eventual completion matters more than immediate response (batch jobs, analytics). Shopify rejects storefront API requests immediately but queues webhook deliveries for retry.
Static vs Adaptive Limits Static limits are predictable and easy to reason about but may over-provision (wasted cost) or under-provision (poor availability). Adaptive limits maximize utilization and automatically respond to system health but can be unpredictable and create feedback loops. Decision framework: Use static limits for stable, well-understood workloads with predictable capacity (CRUD APIs, database queries). Use adaptive limits for variable workloads or shared infrastructure where capacity fluctuates (multi-tenant systems, spot instances). AWS Lambda uses adaptive throttling that increases concurrency limits gradually based on observed success rates.
Coarse vs Fine-Grained Throttling Coarse-grained throttling (per-user, per-API) is simple to implement and reason about but treats all operations equally. Fine-grained throttling (per-endpoint, per-operation-type, per-resource) provides better control but increases complexity and configuration overhead. Decision framework: Start with coarse-grained throttling for MVP and simple systems. Add fine-grained throttling when you observe specific endpoints or operations causing disproportionate load. GitHub’s API started with simple per-user limits but added per-endpoint limits when they discovered certain operations (search, GraphQL) were much more expensive than others.
Local vs Distributed Throttling Tradeoff
graph TB
subgraph Local Throttling
L_Client["Client"]
L_LB["Load Balancer"]
L_Inst1["Instance 1<br/>Local Counter<br/>Limit: 100 RPS"]
L_Inst2["Instance 2<br/>Local Counter<br/>Limit: 100 RPS"]
L_Inst3["Instance 3<br/>Local Counter<br/>Limit: 100 RPS"]
L_Result["⚠️ Actual Global Limit:<br/>300 RPS<br/>(3× intended)"]
L_Client --> L_LB
L_LB --> L_Inst1 & L_Inst2 & L_Inst3
L_Inst1 & L_Inst2 & L_Inst3 --> L_Result
L_Pros["✓ Fast (no network)<br/>✓ Resilient (no SPOF)<br/>✓ Simple"]
L_Cons["✗ Inaccurate (N× limit)<br/>✗ No global view"]
end
subgraph Distributed Throttling
D_Client["Client"]
D_LB["Load Balancer"]
D_Inst1["Instance 1"]
D_Inst2["Instance 2"]
D_Inst3["Instance 3"]
D_Redis[("Redis<br/>Shared Counter<br/>Global Limit: 100 RPS")]
D_Result["✓ Actual Global Limit:<br/>100 RPS<br/>(Accurate)"]
D_Client --> D_LB
D_LB --> D_Inst1 & D_Inst2 & D_Inst3
D_Inst1 & D_Inst2 & D_Inst3 --"Check/Increment"--> D_Redis
D_Redis --> D_Result
D_Pros["✓ Accurate global limit<br/>✓ Coordinated across instances"]
D_Cons["✗ Network latency (+5-10ms)<br/>✗ Redis SPOF<br/>✗ Complex"]
end
Decision{"Decision Criteria"}
UseLocal["Use Local:<br/>• Read-heavy APIs<br/>• Approximate OK<br/>• High volume"]
UseDistributed["Use Distributed:<br/>• Write operations<br/>• Strict limits needed<br/>• Security-critical"]
Decision --> UseLocal
Decision --> UseDistributed
Local throttling is fast and resilient but allows N× the intended limit across N instances. Distributed throttling enforces accurate global limits but introduces latency and dependencies. Choose based on whether accuracy or performance is more critical for your use case.
Common Pitfalls
Pitfall 1: Not Communicating Limits to Clients Many systems throttle silently or with generic error messages, leaving clients guessing about limits and retry timing. This leads to aggressive retries that worsen the problem. Why it happens: Developers focus on implementing throttling logic but neglect the client experience. How to avoid: Always return HTTP 429 with detailed headers: X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (when limit resets), and Retry-After (when to retry). Include human-readable error messages explaining the limit and how to request increases. Document limits clearly in API documentation.
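A sketch of a well-labeled 429 response using the headers listed above (framework-agnostic; the helper name and the `(status, headers)` return shape are illustrative):

```python
import time

def throttle_response(limit: int, remaining: int, reset_epoch: int) -> tuple:
    """Build an HTTP 429 response (status, headers) carrying rate-limit
    metadata so clients can implement proper backoff instead of guessing."""
    retry_after = max(0, reset_epoch - int(time.time()))
    headers = {
        "X-RateLimit-Limit": str(limit),          # total allowed in the window
        "X-RateLimit-Remaining": str(remaining),  # requests left (0 when throttled)
        "X-RateLimit-Reset": str(reset_epoch),    # epoch seconds when the window resets
        "Retry-After": str(retry_after),          # seconds until a retry is safe
    }
    return 429, headers
```

A JSON body with a human-readable message ("Rate limit of 100 req/min exceeded; retry after 37s or contact support to raise your quota") completes the picture.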
Pitfall 2: Throttling Too Late in the Request Path Throttling after expensive operations (database queries, external API calls) wastes resources and provides minimal protection. A request that’s throttled after hitting the database has already consumed the resources you’re trying to protect. Why it happens: Throttling is added as an afterthought or implemented deep in the application layer. How to avoid: Implement throttling as early as possible—ideally at the API gateway or load balancer before requests reach application servers. Use multi-layer throttling: coarse limits at the gateway, fine-grained limits at the service layer. Uber throttles at the edge (API gateway) for basic limits, then again at the service layer for operation-specific limits.
Pitfall 3: Not Accounting for Distributed System Amplification A single user request might trigger multiple internal service calls, each subject to throttling. If not coordinated, this creates cascading throttling where a 1% throttle rate at the edge becomes 10% at downstream services. Why it happens: Each service implements throttling independently without considering the call graph. How to avoid: Implement hierarchical throttling where limits are allocated proportionally across the call chain. Use request tracing to understand amplification factors. Reserve capacity for internal service-to-service calls separate from external user requests. Netflix discovered that a single user request to their recommendation API triggered 50+ internal service calls, requiring careful throttling coordination.
Pitfall 4: Ignoring Thundering Herd After Throttling When many clients are throttled simultaneously, they often retry at the same time (when Retry-After expires), creating a thundering herd that overwhelms the system again. Why it happens: Clients implement naive retry logic without jitter. How to avoid: Add jitter to Retry-After values so clients retry at slightly different times. Implement exponential backoff with jitter on the client side. Use token bucket throttling which naturally spreads out retries as tokens refill gradually. Stripe adds random jitter (±20%) to Retry-After headers to prevent synchronized retries.
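Both ideas can be sketched in a few lines (function names are illustrative): spreading retries around a server-supplied Retry-After hint, and "full jitter" exponential backoff for clients retrying on their own:

```python
import random

def retry_delay(retry_after: float, jitter_fraction: float = 0.2) -> float:
    """Spread retries over [retry_after*(1-j), retry_after*(1+j)] so throttled
    clients don't all return at the same instant (Stripe-style +/-20% jitter)."""
    low = retry_after * (1 - jitter_fraction)
    high = retry_after * (1 + jitter_fraction)
    return random.uniform(low, high)

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: sleep a random time in
    [0, min(cap, base * 2**attempt)] before retry number `attempt`."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With 1,000 clients throttled at once and a 60-second Retry-After, `retry_delay(60)` spreads their retries over a 24-second span instead of a single instant.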
Pitfall 5: Not Monitoring Throttling as a Business Metric Throttling is often treated as purely technical, but high throttling rates indicate lost revenue, poor user experience, or insufficient capacity. Teams don’t notice until customers complain. Why it happens: Throttling metrics are buried in technical dashboards, not exposed to product or business teams. How to avoid: Track throttling as a key business metric alongside error rates and latency. Alert on unusual throttling spikes. Correlate throttling with revenue impact. Expose throttling rates in user-facing dashboards so customers can self-diagnose. Twilio provides real-time throttling dashboards showing which customers are hitting limits and estimated revenue impact, allowing proactive capacity planning.
Thundering Herd After Throttling
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant API as API Gateway
participant Service as Backend Service
Note over C1,C3: 1000 clients hit rate limit simultaneously
C1->>API: Request at t=0
C2->>API: Request at t=0
C3->>API: Request at t=0
API-->>C1: 429 Retry-After: 60s
API-->>C2: 429 Retry-After: 60s
API-->>C3: 429 Retry-After: 60s
Note over C1,C3: All clients wait exactly 60 seconds
rect rgb(255, 243, 205)
Note over C1,C3: ⚠️ Thundering Herd at t=60s
C1->>API: Retry at t=60
C2->>API: Retry at t=60
C3->>API: Retry at t=60
Note over API,Service: 1000 simultaneous requests<br/>overwhelm system again
API-->>C1: 503 Service Unavailable
API-->>C2: 503 Service Unavailable
API-->>C3: 503 Service Unavailable
end
Note over C1,C3: Better: Add jitter to retry timing
rect rgb(212, 237, 218)
Note over C1,C3: ✓ With Jitter (±20%)
API-->>C1: 429 Retry-After: 58s (jittered)
API-->>C2: 429 Retry-After: 60s (jittered)
API-->>C3: 429 Retry-After: 67s (jittered)
C1->>API: Retry at t=58
API->>Service: Process
Service-->>C1: 200 OK
C2->>API: Retry at t=60
API->>Service: Process
Service-->>C2: 200 OK
C3->>API: Retry at t=67
API->>Service: Process
Service-->>C3: 200 OK
Note over C1,C3: Requests spread over 9 seconds<br/>System handles load smoothly
end
When many clients are throttled simultaneously, they often retry at the same time (when Retry-After expires), creating a thundering herd that overwhelms the system again. Adding jitter (random variance) to retry timing spreads out the load and prevents synchronized retries.
Math & Calculations
Token Bucket Capacity Planning
Formula for token bucket sizing:
- Burst capacity (B) = Maximum requests allowed in a burst
- Refill rate (R) = Sustained requests per second
- Bucket size (S) = B tokens
- Refill interval (I) = 1/R seconds per token
Example: API Gateway Throttling
Suppose you want to allow:
- Sustained rate: 1,000 requests per second
- Burst allowance: 5,000 requests in first second
- After burst, sustain 1,000 RPS
Configuration:
- Bucket size S = 5,000 tokens
- Refill rate R = 1,000 tokens/second
- Each request consumes 1 token
Scenario 1: Burst then sustain
- t=0s: Bucket full (5,000 tokens), client sends 5,000 requests → all succeed, bucket empty
- t=1s: Bucket refilled with 1,000 tokens, client sends 1,000 requests → all succeed
- t=2s: Bucket refilled with 1,000 tokens, client sends 1,000 requests → all succeed
- Result: Client can sustain 1,000 RPS indefinitely after initial burst
Scenario 2: Exceeding sustained rate
- t=0s: Bucket full (5,000 tokens), client sends 2,000 requests → all succeed, 3,000 tokens remain
- t=1s: Bucket has 4,000 tokens (3,000 + 1,000 refill), client sends 2,000 requests → all succeed, 2,000 remain
- t=2s: Bucket has 3,000 tokens (2,000 + 1,000 refill), client sends 2,000 requests → all succeed, 1,000 remain
- t=3s: Bucket has 2,000 tokens (1,000 + 1,000 refill), client sends 2,000 requests → all succeed, 0 remain
- t=4s: Bucket has 1,000 tokens (0 + 1,000 refill), client sends 2,000 requests → 1,000 succeed, 1,000 throttled
- Result: Client can exceed sustained rate temporarily using burst capacity, then gets throttled
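Scenario 2 can be replayed with a small discrete-time simulation (a sketch using per-second ticks rather than continuous refill, matching the second-by-second accounting above):

```python
class Bucket:
    """Discrete per-second token bucket used to replay Scenario 2."""

    def __init__(self, size: int, refill: int):
        self.size = size      # bucket capacity (burst)
        self.refill = refill  # tokens added after each second
        self.tokens = size    # start full

    def tick(self, demand: int) -> int:
        """One second of traffic: serve up to `demand` requests, then refill."""
        served = min(demand, self.tokens)
        self.tokens = min(self.size, self.tokens - served + self.refill)
        return served

b = Bucket(size=5000, refill=1000)
served = [b.tick(2000) for _ in range(5)]  # 2,000 RPS demand for 5 seconds
print(served)  # [2000, 2000, 2000, 2000, 1000] -- throttling begins at t=4s
```

The simulation reproduces the hand calculation: the 5,000-token burst reserve absorbs the 1,000 RPS excess for four seconds, after which only the 1,000 RPS refill rate is served.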
Calculating Throttle Rate
Throttle rate = (Requests throttled / Total requests) × 100%
If your system receives 100,000 requests/minute and throttles 5,000:
- Throttle rate = (5,000 / 100,000) × 100% = 5%
Target throttle rates:
- < 0.1%: Normal operation, sufficient capacity
- 0.1-1%: Acceptable, monitor for trends
- 1-5%: Warning, investigate capacity or client behavior
- > 5%: Critical, immediate action needed
Distributed Throttling Accuracy
With N instances doing local throttling at limit L:
- Actual global limit = N × L (worst case)
- Example: 10 instances, 100 RPS limit each = 1,000 RPS global (10× intended)
With distributed throttling and coordination overhead C:
- Effective limit = L - C
- Example: a 1,000 RPS limit where coordination (a 50ms round trip to the shared counter) costs the equivalent of 20 requests/second yields 1,000 − 20 = 980 RPS effective
Cost-Benefit of Throttling
Calculate the cost of throttling vs over-provisioning:
- Over-provisioning cost = (Peak capacity - Average capacity) × Unit cost
- Throttling cost = Throttled requests × Revenue per request
Example (e-commerce API):
- Average load: 10,000 RPS, Peak load: 50,000 RPS
- Server cost: $100/month per 1,000 RPS capacity
- Revenue per request: $0.01
Option A (No throttling, provision for peak):
- Cost = 50 servers × $100 = $5,000/month
- Revenue loss = $0
Option B (Throttle at 20,000 RPS):
- Cost = 20 servers × $100 = $2,000/month
- Peak throttling = 30,000 requests/second × 3,600 seconds/hour × 2 hours/day × 30 days = 6.48B requests/month
- Revenue loss = 6.48B × $0.01 = $64.8M/month (unrealistic—assumes all throttled requests are lost sales)
Realistic calculation (80% of throttled users retry successfully):
- Actual revenue loss = 6.48B × 0.20 × $0.01 = $12.96M/month
- Net cost = $2,000 + $12,960,000 = $12,962,000/month
This shows throttling is only cost-effective when:
- Throttle rate is low (< 1%)
- Most throttled users retry successfully
- Peak traffic is rare and short-lived
For sustained high traffic, provisioning more capacity is cheaper than losing revenue to throttling.