Rate Limiting in System Design: Token Bucket & Sliding Window

intermediate 15 min read Updated 2026-02-11

After this topic, you will be able to:

  • Implement token bucket, leaky bucket, fixed window, and sliding window rate limiting algorithms
  • Compare rate limiting algorithms and select appropriate approach based on traffic patterns and requirements
  • Design distributed rate limiting solutions using Redis or similar shared state stores
  • Calculate rate limit parameters (bucket size, refill rate) for given QPS and burst requirements

TL;DR

Rate limiting controls the number of requests a client can make to a service within a time window, protecting against abuse, ensuring fair resource allocation, and controlling costs. Core algorithms include token bucket (allows bursts), leaky bucket (smooths traffic), fixed window (simple but has boundary issues), and sliding window (accurate but memory-intensive). In distributed systems, rate limiting typically uses Redis with atomic operations to maintain shared state across multiple servers.

Cheat Sheet: Token bucket for APIs needing burst tolerance (Stripe: 100/sec, burst 1000). Leaky bucket for strict output rate (video streaming). Fixed window for simple quotas (Twitter: 300 tweets/3hrs). Sliding window for precise enforcement (Shopify: prevents boundary gaming). Distributed: Redis INCR + EXPIRE for counters, Lua scripts for atomicity.

The Problem It Solves

Without rate limiting, services face three critical problems. First, malicious or misbehaving clients can overwhelm your infrastructure through denial-of-service attacks, either intentional or accidental. A single buggy client with an infinite retry loop can consume all available connections and bring down your entire service. Second, you lack fair resource allocation—a handful of power users can monopolize resources, degrading experience for everyone else. Third, you have no cost control when using metered third-party APIs or cloud resources. If a client triggers 10 million calls to your payment processor at $0.01 per call, you’re suddenly facing a $100,000 bill.

The problem becomes exponentially harder in distributed systems. With multiple API servers, each server only sees a fraction of total traffic. A client making 1000 requests per second might send 100 requests to each of 10 servers. Without coordination, each server thinks the client is well within limits while the aggregate violates your policy. You need a mechanism that works across server boundaries while maintaining low latency—adding 50ms to every request for rate limit checks defeats the purpose of having multiple servers. The solution must handle race conditions when multiple servers check limits simultaneously and gracefully degrade when the rate limiting infrastructure itself has issues.

Solution Overview

Rate limiting solves these problems by tracking request counts per client (identified by user ID, API key, or IP address) within time windows and rejecting requests that exceed defined thresholds. The pattern operates in two layers: the algorithm that decides whether to allow or reject a request, and the storage mechanism that maintains state. For single-server deployments, in-memory data structures suffice. For distributed systems, you need shared state in a fast data store like Redis.

The core workflow is straightforward: when a request arrives, extract the client identifier, check current usage against the limit, increment the counter if allowed, and return the decision. The devil is in the details—specifically, how you define “time window” and handle edge cases. A naive implementation using fixed one-minute windows creates a boundary problem: a client can make 1000 requests at 12:00:59 and another 1000 at 12:01:00, achieving 2000 requests in two seconds while technically staying within a “1000 per minute” limit. Different algorithms solve this problem with varying trade-offs in accuracy, memory usage, and burst handling.

Modern implementations return rate limit information in HTTP headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so clients can self-regulate. When limits are exceeded, the service returns HTTP 429 (Too Many Requests) with a Retry-After header indicating when the client can try again. This creates a feedback loop where well-behaved clients back off automatically, reducing the load on your rate limiting infrastructure itself.
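A minimal sketch of that feedback loop, with hypothetical limit values (the function names are illustrative; the header names are the X-RateLimit-* convention described above):

```python
def rate_limit_headers(limit: int, used: int, reset_at: int) -> dict:
    """Standard informational headers so clients can self-regulate."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - used)),
        "X-RateLimit-Reset": str(reset_at),  # Unix timestamp when the window resets
    }

def respond(limit: int, used: int, reset_at: int, now: int) -> tuple:
    """Return (status_code, headers): 429 plus Retry-After once the limit is hit."""
    headers = rate_limit_headers(limit, used, reset_at)
    if used >= limit:
        headers["Retry-After"] = str(max(0, reset_at - now))
        return 429, headers
    return 200, headers
```

A well-behaved client reads Retry-After and sleeps for that many seconds before retrying, which is what closes the feedback loop.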

Distributed Rate Limiting Architecture with Redis

graph TB
    Client["Client<br/>API Key: abc123"]
    
    subgraph API Gateway Cluster
        Server1["API Server 1<br/>us-east-1a"]
        Server2["API Server 2<br/>us-east-1b"]
        Server3["API Server 3<br/>us-east-1c"]
    end
    
    subgraph Redis Cluster - Shared State
        Redis1[("Redis Primary<br/>ratelimit:abc123:2024-01-15-12<br/>count: 847")]
        Redis2[("Redis Replica<br/>Read-only")]
    end
    
    Client --"1. POST /api/process<br/>100 req/sec"--> Server1
    Client --"2. POST /api/process<br/>100 req/sec"--> Server2
    Client --"3. POST /api/process<br/>100 req/sec"--> Server3
    
    Server1 --"4. EVAL lua_script<br/>INCR + check atomically"--> Redis1
    Server2 --"5. EVAL lua_script<br/>INCR + check atomically"--> Redis1
    Server3 --"6. EVAL lua_script<br/>INCR + check atomically"--> Redis1
    
    Redis1 --"7. Replication"--> Redis2
    
    Redis1 --"8. count=847<br/>✓ Allow (limit: 1000)"--> Server1
    Redis1 --"9. count=848<br/>✓ Allow"--> Server2
    Redis1 --"10. count=1001<br/>✗ Reject 429"--> Server3
    
    Server1 --"11. 200 OK<br/>X-RateLimit-Remaining: 153"--> Client

    Server3 --"12. 429 Too Many Requests<br/>Retry-After: 60"--> Client
    
    subgraph Fallback Strategy
        RedisDown["❌ Redis Unavailable"]
        LocalLimit["Local In-Memory<br/>Rate Limit<br/>10× normal limit<br/>Degraded mode"]
        RedisDown --> LocalLimit
    end

Distributed rate limiting uses Redis as shared state across multiple API servers. Each server atomically increments the counter using Lua scripts to prevent race conditions. Without coordination, 3 servers each seeing 100 req/sec would incorrectly allow 300 req/sec aggregate. Redis ensures accurate enforcement while adding only 1-5ms latency. If Redis fails, servers fall back to local rate limiting at 10× normal limits.

How It Works

Let’s walk through the four core algorithms, building from simplest to most sophisticated.

Fixed Window Counter is the simplest approach. Divide time into fixed windows (e.g., 12:00-12:01, 12:01-12:02) and count requests in each window. Implementation: use a key like ratelimit:user123:2024-01-15-12:00 with a counter. When a request arrives at 12:00:30, increment the counter for the 12:00 window. If the counter exceeds the limit (say, 100), reject the request. Set the key to expire after the window duration to automatically clean up old data. This works well for simple quotas like “1000 API calls per day” but suffers from the boundary problem described earlier. A client can game the system by clustering requests at window boundaries.
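A minimal in-memory sketch of this approach (a Redis version would replace the dict with INCR plus EXPIRE; the class and parameter names are illustrative):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Clock-aligned window counters; Redis would INCR a per-window key
    and EXPIRE it so old windows clean themselves up."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (client, window index) -> count

    def allow(self, client_id, now=None) -> bool:
        now = time.time() if now is None else now
        key = (client_id, int(now // self.window))  # e.g. the 12:00 window
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True
```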

Sliding Window Log fixes the boundary problem by maintaining a log of request timestamps. For a limit of 100 requests per minute, store the timestamp of each request in a sorted set. When a new request arrives, remove all timestamps older than one minute, count the remaining entries, and allow the request if count < 100. This provides perfect accuracy—the limit applies to any 60-second window, not just clock-aligned minutes. The downside is memory: storing 100 timestamps per user per minute adds up quickly. For 1 million active users with 100 req/min limits, you’re storing 100 million timestamps.
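A sketch using an in-memory deque; a Redis version would keep the same log in a sorted set using ZADD, ZREMRANGEBYSCORE, and ZCARD:

```python
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    """One timestamp per request: exact over any rolling window, at the
    cost of O(limit) memory per client."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client -> timestamps, oldest first

    def allow(self, client_id, now) -> bool:
        log = self.logs[client_id]
        while log and log[0] <= now - self.window:  # evict expired entries
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```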

Sliding Window Counter optimizes memory by approximating the sliding window using two fixed windows. For a request at 12:00:30 with a 1-minute limit, calculate: (count_previous_window * 0.5) + count_current_window. The 0.5 factor represents that we’re 30 seconds into the current minute, so we weight the previous minute’s count by 50%. This approximation uses constant memory (two counters) while mostly preventing boundary gaming. The accuracy is good enough for most use cases—you might allow 105 requests instead of 100 in edge cases, but you won’t allow 2000.
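The weighted calculation from the paragraph above, as a small sketch:

```python
def sliding_window_count(prev_count, curr_count,
                         elapsed_in_window, window_seconds) -> float:
    """Approximate rolling count: the previous window is weighted by the
    fraction of it still covered by the sliding window."""
    prev_weight = 1.0 - (elapsed_in_window / window_seconds)
    return prev_count * prev_weight + curr_count
```

Thirty seconds into the current minute with 80 requests in the previous window and 60 so far in this one, the estimate is 80 × 0.5 + 60 = 100, so against a 100-request limit the next request is rejected.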

Token Bucket models rate limiting as a bucket that holds tokens. Each request consumes one token. The bucket has a maximum capacity (burst size) and refills at a constant rate. For a limit of 10 requests per second with burst of 100, you have a bucket with capacity 100 that adds 10 tokens per second. A client can make 100 requests instantly (draining the bucket), then must wait for refills. Implementation: store {tokens: 95, last_refill: 1642262400}. On each request, calculate elapsed time, add elapsed_seconds * refill_rate tokens (capped at capacity), decrement by 1 if available. This elegantly handles burst traffic—legitimate clients can occasionally spike above the steady-state rate.
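A lazy-refill sketch of the {tokens, last_refill} scheme just described:

```python
import time

class TokenBucket:
    """Tokens accrue at `rate` per second up to `capacity`; each request
    consumes one. Refill is computed lazily from the elapsed time."""
    def __init__(self, capacity: float, rate: float, now=None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start full: a fresh client can burst immediately
        self.last_refill = time.time() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```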

Leaky Bucket inverts the model: requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, new requests are rejected. This smooths bursty traffic into a steady stream, making it ideal for protecting downstream services that can’t handle spikes. A video transcoding service might use a leaky bucket to ensure it processes exactly 10 videos per second, regardless of how many users upload simultaneously. Implementation is more complex—you need an actual queue and a background worker processing at the fixed rate.
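A minimal sketch of the queue side; in a real deployment `drain` would run in a background worker ticking at the fixed rate:

```python
from collections import deque

class LeakyBucket:
    """Bounded queue drained at a fixed rate: bursts fill the queue,
    output stays constant, overflow is rejected."""
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()

    def submit(self, request) -> bool:
        """Enqueue if there is room; otherwise reject (HTTP 429)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, elapsed_seconds: float) -> list:
        """Process up to leak_rate * elapsed requests, in arrival order."""
        n = min(len(self.queue), int(self.leak_rate * elapsed_seconds))
        return [self.queue.popleft() for _ in range(n)]
```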

For distributed systems, the implementation shifts to Redis. A token bucket implementation using Redis might use a Lua script to ensure atomicity:

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])   -- tokens added per second
local now = tonumber(ARGV[3])    -- caller-supplied timestamp in seconds

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity   -- new clients start full
local last_refill = tonumber(bucket[2]) or now

-- Lazy refill: credit tokens for the time elapsed since the last request
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + (elapsed * rate))

if new_tokens >= 1 then
  redis.call('HSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
  -- Expire idle buckets once they would have fully refilled anyway
  redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
  return 1
else
  return 0
end

This script runs atomically on Redis, preventing race conditions when multiple servers check the same user’s rate limit simultaneously.

Token Bucket Algorithm with Refill Mechanism

graph LR
    subgraph Token Bucket State
        Bucket["Token Bucket<br/>Capacity: 100<br/>Current: 45 tokens<br/>Refill: 10 tokens/sec"]
    end
    
    subgraph Time Flow
        T0["t=0s<br/>45 tokens"] --> T1["t=1s<br/>55 tokens"]
        T1 --> T2["t=2s<br/>65 tokens"]
        T2 --> T3["t=3s<br/>75 tokens"]
    end
    
    Request1["Request 1<br/>arrives"] --"1. Check tokens"--> Bucket
    Bucket --"2. 45 tokens available"--> Allow1["✓ Allow<br/>44 tokens remain"]
    
    Request2["Request 2<br/>arrives 0.1s later"] --"3. Check tokens"--> Bucket
    Bucket --"4. 44 tokens available"--> Allow2["✓ Allow<br/>43 tokens remain"]
    
    Request3["43 more requests<br/>in quick succession"] --"5. Drain bucket"--> Bucket
    Bucket --"6. 0 tokens left<br/>for next request"--> Reject["✗ Reject 429<br/>Retry-After: 0.1s"]
    
    Time["⏱ Time passes<br/>0.1 second"] --"7. Refill 1 token"--> Bucket
    Bucket --"8. 1 token available"--> Allow3["✓ Next request<br/>allowed"]

Token bucket allows burst traffic by accumulating unused tokens up to capacity. Requests consume tokens instantly, while the bucket refills at a constant rate. This elegantly handles legitimate traffic spikes—a client can make 100 requests immediately if they haven’t used the API recently, then must wait for refills.

Fixed Window Boundary Problem

graph TB
    subgraph Window 1: 12:00:00 - 12:00:59
        W1Start["12:00:00<br/>Counter: 0"]
        W1Mid["12:00:30<br/>Counter: 500"]
        W1End["12:00:59<br/>Counter: 1000<br/>✓ At limit"]
        W1Start --> W1Mid --> W1End
    end
    
    subgraph Window 2: 12:01:00 - 12:01:59
        W2Start["12:01:00<br/>Counter: 0<br/>RESET"]
        W2Spike["12:01:01<br/>Counter: 1000<br/>✓ At limit"]
        W2Start --> W2Spike
    end
    
    W1End -."Window boundary".-> W2Start
    
    Problem["⚠️ PROBLEM<br/>2000 requests in 2 seconds<br/>Both windows show ≤1000<br/>Limit: 1000 req/min violated!"]
    
    W1End --> Problem
    W2Spike --> Problem
    
    subgraph Solution: Sliding Window Counter
        SlidingCalc["At 12:01:01<br/>Weighted count:<br/>(1000 × ~98%) + current window<br/>≈ 983 before any new request<br/>✓ Burst rejected, no boundary gaming"]
    end
    
    Problem -."Fixed by".-> SlidingCalc

Fixed window counters suffer from a boundary problem where clients can make 2× the intended rate by clustering requests at window edges. At 12:00:59, the client makes 1000 requests (allowed). At 12:01:00, the counter resets and they make another 1000 requests (allowed), achieving 2000 requests in 2 seconds. Sliding window counter solves this by weighting the previous window’s count based on time elapsed.

Variants

Per-User Rate Limiting tracks limits by authenticated user ID. This is the most common variant for APIs where users have accounts. Stripe uses this approach: each API key has a rate limit of 100 requests per second. The key design is ratelimit:user:{user_id}:{window}. This provides fair allocation—one user’s traffic doesn’t affect others. However, it doesn’t protect against attacks from unauthenticated endpoints or scenarios where attackers create many accounts.

Per-IP Rate Limiting uses the client’s IP address as the identifier, useful for unauthenticated endpoints like login or signup. The challenge is handling legitimate users behind NAT or corporate proxies, where thousands of users share one IP. You need higher limits for these scenarios or risk blocking entire offices. Cloudflare uses IP-based rate limiting for DDoS protection but combines it with fingerprinting techniques to distinguish individual clients behind shared IPs.

Per-API-Key Rate Limiting is common for third-party API platforms. Each API key gets its own quota, often with different tiers (free: 1000/day, pro: 100,000/day). Twitter’s API uses this model extensively. Implementation is identical to per-user limiting but with different key prefixes and limit values pulled from a subscription database.

Hierarchical Rate Limiting applies multiple limits at different scopes. For example, Shopify enforces: 2 requests per second per store, 40 requests per second per app across all stores, and 10,000 requests per hour per app. A request must pass all three checks. This prevents both individual store abuse and aggregate abuse across an app’s entire user base. Implementation requires checking multiple Redis keys per request, which adds latency but provides comprehensive protection.
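A sketch of the multi-scope check; the rollback matters so a request rejected by one scope doesn't burn the quota it already consumed in the others (class and function names are illustrative):

```python
class ScopeLimit:
    """One counter for one scope (per-store, per-app, per-hour, ...)."""
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def try_consume(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

def hierarchical_allow(scopes) -> bool:
    """A request passes only if every scope admits it."""
    consumed = []
    for scope in scopes:
        if not scope.try_consume():
            for s in consumed:   # roll back quota taken from earlier scopes
                s.count -= 1
            return False
        consumed.append(scope)
    return True
```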

Adaptive Rate Limiting dynamically adjusts limits based on system load. When CPU usage exceeds 80%, reduce all rate limits by 50%. When error rates spike, tighten limits further. This variant requires integration with monitoring systems and careful tuning to avoid oscillation. Netflix uses adaptive rate limiting in their API gateway to protect during partial outages—if one availability zone fails, the remaining zones automatically reduce their rate limits to prevent overload.

Algorithm Comparison Matrix

Choosing the right algorithm depends on your specific requirements across five key dimensions.

Burst Handling: Token bucket excels here, allowing clients to accumulate unused quota and spend it in bursts. A client with 10 req/sec limit can save up tokens and make 100 requests instantly if they haven’t used the API in 10 seconds. Leaky bucket is the opposite—it enforces strict output rate regardless of input bursts, making it ideal for protecting downstream services. Fixed and sliding windows fall in between, allowing bursts up to the window limit but no accumulation across windows.

Memory Usage: Fixed window is most efficient, requiring only a single counter per client per window. Sliding window counter needs two counters. Token bucket needs two values (token count and timestamp). Sliding window log is the memory hog, storing individual timestamps for every request within the window. For 1 million users with 100 req/min limits, fixed window uses ~8MB (1M counters × 8 bytes), while sliding window log uses ~800MB (1M users × 100 timestamps × 8 bytes).
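The arithmetic behind those estimates, assuming 8 bytes per stored value:

```python
users = 1_000_000
limit = 100            # requests per minute
bytes_per_value = 8

fixed_window = users * 1 * bytes_per_value        # one counter per client
sliding_counter = users * 2 * bytes_per_value     # two counters per client
token_bucket = users * 2 * bytes_per_value        # token count + timestamp
sliding_log = users * limit * bytes_per_value     # a timestamp per request

print(fixed_window // 10**6, "MB vs", sliding_log // 10**6, "MB")
```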

Accuracy: Sliding window log is perfectly accurate—it enforces the limit over any rolling time period. Sliding window counter is approximately accurate (typically within 1-2% of the true limit). Token bucket and leaky bucket are accurate for their specific models but define “rate” differently than simple request counting. Fixed window has the boundary problem, potentially allowing 2× the intended rate in edge cases.

Implementation Complexity: Fixed window is trivial—a single INCR and EXPIRE in Redis. Sliding window counter requires two counters and weighted calculation. Token bucket needs timestamp math and refill calculations. Sliding window log requires sorted set operations (ZADD, ZREMRANGEBYSCORE, ZCARD). Leaky bucket is most complex, requiring queue management and background processing.

Use Cases: Fixed window works for simple daily/hourly quotas where precision doesn’t matter (“1000 emails per day”). Sliding window counter is the sweet spot for API rate limiting—good accuracy with reasonable memory. Token bucket fits APIs where burst tolerance is important (Stripe’s payment API allows bursts for batch processing). Leaky bucket suits scenarios where you need to smooth traffic to a constant rate (video processing queues, database write throttling). Sliding window log is for high-value scenarios where perfect accuracy justifies the memory cost (premium API tiers, financial transactions).

Rate Limiting Algorithm Selection Decision Tree

flowchart TB
    Start["Choose Rate Limiting<br/>Algorithm"]
    
    Start --> Q1{"Need to handle<br/>burst traffic?"}
    
    Q1 -->|Yes| Q2{"Accumulate unused<br/>quota over time?"}
    Q1 -->|No| Q3{"Strict output rate<br/>required?"}
    
    Q2 -->|Yes| TokenBucket["✓ Token Bucket<br/><br/>Use Case: Stripe API<br/>100 req/sec, burst 1000<br/><br/>Memory: 2 values<br/>Accuracy: Good<br/>Complexity: Medium"]
    
    Q2 -->|No| Q4{"Perfect accuracy<br/>required?"}
    
    Q3 -->|Yes| LeakyBucket["✓ Leaky Bucket<br/><br/>Use Case: Video transcoding<br/>Smooth 10 videos/sec<br/><br/>Memory: Queue size<br/>Accuracy: Perfect<br/>Complexity: High"]
    
    Q3 -->|No| Q4
    
    Q4 -->|Yes| SlidingLog["✓ Sliding Window Log<br/><br/>Use Case: Financial APIs<br/>Premium tier enforcement<br/><br/>Memory: High (100× counters)<br/>Accuracy: Perfect<br/>Complexity: Medium"]
    
    Q4 -->|No| Q5{"Simple daily/hourly<br/>quota sufficient?"}
    
    Q5 -->|Yes| FixedWindow["✓ Fixed Window<br/><br/>Use Case: Email limits<br/>1000 emails/day<br/><br/>Memory: 1 counter<br/>Accuracy: up to 2× limit at boundaries<br/>Complexity: Low"]
    
    Q5 -->|No| SlidingCounter["✓ Sliding Window Counter<br/><br/>Use Case: General API limiting<br/>Shopify, Twitter APIs<br/><br/>Memory: 2 counters<br/>Accuracy: ~1-2% error<br/>Complexity: Low-Medium"]

Algorithm selection depends on burst handling needs, accuracy requirements, and memory constraints. Token bucket is ideal for APIs needing burst tolerance (Stripe). Leaky bucket smooths traffic for downstream protection. Fixed window suits simple quotas where precision doesn’t matter. Sliding window counter is the sweet spot for most API rate limiting—good accuracy with reasonable memory usage.

Trade-offs

Accuracy vs. Memory: Sliding window log provides perfect accuracy but consumes 10-100× more memory than fixed window counters. For most applications, the 1-2% error rate of sliding window counter is acceptable and saves significant infrastructure cost. Choose perfect accuracy only when rate limit violations have serious consequences (financial APIs, security-critical endpoints). The decision framework: if violating the limit by 5% costs you less than the Redis memory to prevent it, use sliding window counter.

Centralized vs. Distributed State: Centralized rate limiting (Redis) provides consistent enforcement across all servers but adds latency (1-5ms per request) and creates a single point of failure. Local rate limiting (in-memory on each server) is faster (microseconds) and more resilient but only enforces limits per-server. For a limit of 1000 req/sec across 10 servers, local limiting effectively allows 10,000 req/sec. The hybrid approach: use local rate limiting for coarse-grained protection (10× the intended limit) and centralized for precise enforcement. If Redis is down, fall back to local limits—degraded but not broken.

Strict vs. Permissive Limits: Strict enforcement (reject at exactly 100 requests) protects your infrastructure but creates poor user experience—a client at 99 requests doesn’t know if their next request will succeed. Permissive enforcement (allow small overages, use soft limits with warnings) provides better UX but requires more capacity headroom. Stripe uses a permissive model: they allow brief overages and send warning headers before hard enforcement. The decision depends on your cost model—if excess traffic costs you real money (third-party API calls), be strict. If it’s just CPU cycles, be permissive.

Synchronous vs. Asynchronous Enforcement: Synchronous rate limiting checks limits before processing the request, providing immediate feedback but adding latency to every request. Asynchronous rate limiting processes the request first and checks limits in the background, then blocks future requests if limits are exceeded. This reduces latency but allows temporary violations. High-throughput systems like Twitter use asynchronous enforcement for non-critical endpoints—they’d rather serve a few extra tweets than add 2ms to every request.

Global vs. Regional Limits: Global rate limiting enforces limits across all data centers, requiring cross-region state synchronization. Regional limiting enforces per-region, allowing higher aggregate throughput but risking uneven load. For a global limit of 1000 req/sec with users in US and EU, you need to synchronize state between regions (adding 50-100ms latency) or accept that users could achieve 2000 req/sec by splitting traffic. Most companies use regional limits with monitoring to detect abuse patterns.

When to Use (and When Not To)

Rate limiting is essential for any API exposed to external clients, whether public or partner-facing. If you’re building a REST API, GraphQL endpoint, or webhook receiver, you need rate limiting from day one. The pattern is also critical for protecting expensive operations—database writes, third-party API calls, compute-intensive tasks like image processing or AI inference. Even internal services benefit from rate limiting to prevent cascading failures when one service misbehaves.

Use token bucket when your API serves batch operations or clients with bursty traffic patterns. Stripe chose token bucket because merchants often process payments in batches—they need to handle 1000 payment requests in a few seconds, then nothing for hours. Use leaky bucket when you’re protecting a downstream service with strict throughput limits, like a database that can handle exactly 1000 writes per second. Use fixed window for simple quotas where precision doesn’t matter and you want minimal complexity—daily email limits, monthly API call quotas. Use sliding window counter for general-purpose API rate limiting where you need good accuracy without excessive memory usage.

Avoid rate limiting for internal service-to-service communication in trusted environments—it adds latency without providing value if you control both services. Instead, use circuit breakers and load shedding (see reliability module). Don’t use rate limiting as your only defense against DDoS attacks—it’s one layer, but you need network-level protection (Cloudflare, AWS Shield) for volumetric attacks. Don’t apply the same rate limits to all endpoints—a GET request for user profile is cheap, while a POST to process a video is expensive. Use different limits based on operation cost.

Anti-pattern: implementing rate limiting in application code without using atomic operations. This creates race conditions where multiple threads or servers increment counters simultaneously, allowing limit violations. Always use atomic operations (Redis INCR, database transactions) or accept that your limits are approximate. Another anti-pattern: returning generic 500 errors instead of 429 with Retry-After headers. Clients need actionable information to implement proper backoff strategies.

Real-World Examples

Stripe uses token bucket rate limiting with a base limit of 100 requests per second and burst capacity of 1,000 requests. Their API returns detailed rate limit headers: Stripe-RateLimit-Limit: 100, Stripe-RateLimit-Remaining: 87, Stripe-RateLimit-Reset: 1642262460. The interesting detail is their tiered approach—different API keys have different limits based on business volume. High-volume merchants get 1,000 req/sec base limits. Stripe also implements hierarchical limiting: per-API-key limits and per-account aggregate limits. This prevents a single account with multiple API keys from overwhelming the system. They use Redis Cluster for distributed rate limiting across their global infrastructure, with regional clusters to minimize latency.

Shopify employs a sophisticated leaky bucket implementation for their API, processing requests at a fixed rate of 2 requests per second per store. The bucket has a capacity of 40 requests, allowing bursts up to that size. What makes Shopify’s approach interesting is their “cost-based” rate limiting—different API endpoints consume different amounts of the bucket. A simple GET request costs 1 unit, while a complex query costs 10 units. This prevents clients from gaming the system by making cheap requests to stay under limits while occasionally making expensive ones. They return a custom header X-Shopify-Shop-Api-Call-Limit: 32/40 showing bucket usage. Their implementation uses Redis with Lua scripts to ensure atomic bucket updates across their distributed infrastructure.

Twitter uses a combination of fixed window and sliding window rate limiting depending on the endpoint. For tweet creation, they enforce 300 tweets per 3-hour window using fixed windows—precision doesn’t matter much here, and the simplicity reduces infrastructure cost. For their API endpoints, they use sliding window counter with 15-minute windows. The interesting detail is their multi-level limiting: per-user limits (300 tweets/3hrs), per-app limits (different for each OAuth app), and per-IP limits for unauthenticated endpoints. They also implement “velocity limiting” that detects unusual spikes in activity and temporarily reduces limits for suspicious accounts. Their rate limit headers follow the standard format but add x-rate-limit-reset as a Unix timestamp, making it easy for clients to calculate wait times.

Shopify Cost-Based Rate Limiting with Leaky Bucket

graph LR
    subgraph Client Requests
        R1["GET /products<br/>Cost: 1 unit"]
        R2["GET /orders<br/>Cost: 1 unit"]
        R3["POST /products/search<br/>Complex query<br/>Cost: 10 units"]
        R4["GET /customers<br/>Cost: 1 unit"]
    end
    
    subgraph Leaky Bucket - Per Store
        Bucket["Bucket Capacity: 40 units<br/>Current: 28 units used<br/>Leak Rate: 2 units/sec<br/><br/>Processing Queue"]
    end
    
    R1 --"1. Consume 1 unit"--> Bucket
    R2 --"2. Consume 1 unit"--> Bucket
    R3 --"3. Consume 10 units"--> Bucket
    R4 --"4. Consume 1 unit"--> Bucket
    
    Bucket --"5. 28/40 units used<br/>✓ All allowed"--> Response1["200 OK<br/>X-Shopify-Shop-Api-Call-Limit:<br/>28/40"]
    
    R5["POST /products/bulk<br/>Cost: 15 units"] --"6. Would exceed capacity<br/>(28 + 15 = 43 > 40)"--> Bucket
    
    Bucket --"7. Bucket full"--> Response2["429 Too Many Requests<br/>X-Shopify-Shop-Api-Call-Limit:<br/>40/40<br/>Retry-After: 6"]
    
    Time["⏱ 6 seconds pass"] --"8. Leak 12 units<br/>(6 sec × 2 units/sec)"--> Bucket
    
    Bucket --"9. 16/40 units used<br/>Space available"--> Response3["✓ Retry succeeds<br/>31/40 units used"]
    
    subgraph Cost Assignment
        Simple["Simple GET: 1 unit"]
        Complex["Complex Query: 10 units"]
        Bulk["Bulk Operation: 15 units"]
    end

Shopify’s cost-based rate limiting assigns different costs to API endpoints based on resource consumption. Simple GET requests cost 1 unit, while complex queries cost 10+ units. The leaky bucket processes requests at a fixed rate of 2 units per second with a capacity of 40 units. This prevents clients from gaming the system by making many cheap requests while occasionally making expensive ones, ensuring fair resource allocation based on actual load.


Interview Essentials

Mid-Level

At the mid-level, demonstrate understanding of the core algorithms and their trade-offs. Explain token bucket and fixed window clearly, including the boundary problem with fixed windows. Implement a simple rate limiter using Redis: INCR key, check if count exceeds limit, EXPIRE key 60. Discuss why you need atomic operations—two servers checking simultaneously could both see count=99 and both allow the request, resulting in 101 total. Explain the standard HTTP response: 429 status code with Retry-After header. Calculate basic parameters: for 100 requests per minute, you need a refill rate of 100/60 ≈ 1.67 tokens per second. Show awareness that rate limiting alone doesn’t prevent DDoS—you need multiple layers of defense.
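The race described above is easy to demonstrate with a non-atomic check-then-increment (an in-memory stand-in for separate read and write calls; names are illustrative):

```python
class NaiveLimiter:
    """Separate read and write steps: between check() and record(), another
    server can act on the same stale count. Redis INCR (or a Lua script)
    collapses the two into one atomic operation."""
    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def check(self) -> bool:   # step 1: read
        return self.count < self.limit

    def record(self):          # step 2: write
        self.count += 1

# Two servers interleave at count=99 against a limit of 100:
lim = NaiveLimiter(limit=100)
lim.count = 99
server_a_ok = lim.check()  # sees 99 -> allow
server_b_ok = lim.check()  # also sees 99 -> allow
lim.record()
lim.record()
```

Both servers allow, leaving the count at 101: the limit is violated, which is exactly why the increment and the check must be atomic.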

Senior

Senior engineers should compare algorithms systematically and justify choices based on requirements. When asked to design rate limiting for an API, clarify: What’s the traffic pattern—steady or bursty? What’s the cost of violations—does exceeding limits cost real money or just CPU? What’s the scale—millions of users or thousands? Then choose the algorithm: token bucket for burst tolerance, sliding window counter for general-purpose API limiting, leaky bucket for protecting downstream services. Design the distributed implementation: Redis with Lua scripts for atomicity, key design like ratelimit:{user_id}:{window}, TTL for automatic cleanup. Discuss the failure mode: if Redis is down, do you fail open (allow all requests) or fail closed (reject all)? Most systems fail open with local rate limiting as backup. Address the thundering herd problem: if 1 million users all hit rate limit reset at exactly 12:00:00, you get a massive spike. Solution: add jitter to reset times or use sliding windows. Explain hierarchical limiting: per-user, per-IP, per-API-key, and how to efficiently check multiple limits (parallel Redis calls, pipelining).

Staff+

Staff+ engineers should discuss system-wide implications and edge cases. How do you handle rate limiting across multiple regions? Options: global state with cross-region replication (adds latency), regional limits with monitoring (allows higher aggregate), or hybrid with eventual consistency. Discuss the CAP theorem implications—you can’t have both perfect consistency and low latency for global rate limiting. Address clock synchronization issues in distributed systems: if servers have clock skew of ±1 second, token bucket refill calculations become inaccurate. Solution: use logical clocks or accept approximate enforcement.

Explain how to implement adaptive rate limiting that responds to system load: integrate with metrics systems, reduce limits when error rates spike, implement gradual recovery to avoid oscillation. Discuss the economics: for 1 billion requests per day with 1ms Redis latency, you’re spending 1 million seconds of waiting time daily. Is it worth optimizing to local caching with eventual consistency? Design a rate limiting system that handles 1 million requests per second: you need Redis Cluster with 100+ nodes, client-side caching to reduce Redis load, and probabilistic algorithms (counting Bloom filters) to trade perfect accuracy for massive scale.

Explain how to debug rate limiting issues in production: logging every rate limit decision is too expensive, so you need sampling and aggregate metrics. Discuss the user experience implications: how do you communicate rate limits to developers (documentation, headers, error messages), and how do you handle the support burden when legitimate users hit limits?

Common Interview Questions

How would you implement rate limiting for 1 million requests per second? (Answer: Redis Cluster with client-side caching, Lua scripts for atomicity, consider probabilistic algorithms like counting Bloom filters to reduce state size)

What happens if your Redis cluster goes down? (Answer: Fail open with local in-memory rate limiting at 10× the normal limit, monitor for abuse, have runbooks for manual intervention)

How do you prevent the boundary problem in fixed window rate limiting? (Answer: Use sliding window counter or sliding window log, or accept approximate enforcement if the use case allows it)

How would you rate limit GraphQL APIs where different queries have different costs? (Answer: Implement cost-based rate limiting—assign complexity scores to queries, consume tokens proportional to cost, similar to Shopify’s approach)

A client complains they’re being rate limited but their code only makes 90 requests per minute against a 100/min limit. What could be wrong? (Answer: Multiple instances of their code running, clock skew causing window misalignment, or they’re hitting a hierarchical limit at a different scope)

Red Flags to Avoid

Implementing rate limiting without atomic operations, leading to race conditions

Using the same rate limit for all endpoints regardless of operation cost

Not returning proper HTTP 429 status codes and Retry-After headers

Failing to consider the distributed systems implications—designing only for single-server deployment

Not having a fallback strategy when the rate limiting infrastructure fails

Implementing rate limiting in a blocking way that adds excessive latency to every request


Key Takeaways

Rate limiting protects against abuse, ensures fair resource allocation, and controls costs. It’s essential for any externally-facing API and for protecting expensive operations even in internal systems.

Choose your algorithm based on requirements: token bucket for burst tolerance (Stripe), leaky bucket for strict output rate (video processing), fixed window for simple quotas (daily limits), sliding window counter for general-purpose API limiting with good accuracy and reasonable memory usage.

Distributed rate limiting requires shared state (Redis) with atomic operations (Lua scripts) to prevent race conditions. Design for failure—if Redis is down, fail open with local rate limiting rather than rejecting all traffic.

Implement hierarchical rate limiting (per-user, per-IP, per-API-key) to prevent abuse at multiple levels. Use cost-based limiting for APIs where different operations have different resource costs.

Return proper HTTP 429 responses with rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After) so clients can implement intelligent backoff strategies and self-regulate their traffic.