Caching in System Design: Complete Overview

intermediate 16 min read Updated 2026-02-11

After this topic, you will be able to:

  • Explain the fundamental purpose and benefits of caching in distributed systems
  • Identify the different layers where caching can be applied in system architecture
  • Describe the key metrics used to evaluate cache performance (hit ratio, latency reduction)
  • Compare the trade-offs between cache complexity and performance gains

TL;DR

Caching stores frequently accessed data in fast, temporary storage to dramatically reduce latency and load on backend systems. It’s the single most effective performance optimization in distributed systems, appearing at every layer from browsers to databases. Understanding where to cache, what to cache, and how to keep cached data fresh is fundamental to designing scalable systems.

Cheat Sheet: Cache hit ratio = (hits / (hits + misses)) × 100%. Aim for 80%+ hit ratio. Common layers: client (browser), CDN (static assets), application (Redis/Memcached), database (query cache). Key trade-off: freshness vs performance. TTL controls staleness. Invalidation keeps data consistent.

Why This Matters

Every system design interview eventually asks “how would you make this faster?” and the answer almost always starts with caching. When Twitter’s timeline loads in 200ms instead of 2 seconds, that’s caching. When Netflix serves 250 million users without melting their origin servers, that’s caching. When Stripe processes payments with 99.99% uptime while handling massive traffic spikes, caching plays a critical role.

The numbers tell the story: reading from memory (cache) takes nanoseconds, while reading from disk takes milliseconds—a difference of six orders of magnitude. Reading from a remote database over the network? Add another order of magnitude. For a system serving millions of requests per second, this difference determines whether you need 10 servers or 1,000 servers. At Netflix scale, caching isn’t just an optimization—it’s the difference between a $10 million and $100 million infrastructure bill.

But caching isn’t just about speed. It’s about resilience. When your database goes down, a well-designed cache can keep your application running in degraded mode. When traffic spikes 10x during a product launch, your cache absorbs the load. When you’re paying per database query, caching directly reduces your cloud bill. Understanding caching deeply means understanding how to build systems that are fast, reliable, and cost-effective—exactly what interviewers want to see.

The challenge is that caching introduces complexity. You now have two sources of truth: the cache and the database. Keeping them synchronized is hard. Deciding what to cache, where to cache it, and when to invalidate it requires careful thought. This module teaches you to navigate these trade-offs with confidence, using patterns proven at companies like Amazon, Google, and Facebook.

Fundamental Concepts

Before diving into caching strategies and architectures, let’s establish the core concepts that underpin all caching systems. These terms will appear throughout the module, so understanding them now creates a solid foundation.

Cache Hit and Cache Miss: When a request arrives, the system first checks the cache. A cache hit means the data exists in the cache and can be returned immediately—this is the happy path. A cache miss means the data isn’t cached, forcing the system to fetch it from the slower backend (database, API, disk). The system typically then stores this data in the cache for future requests. Every caching strategy is fundamentally about maximizing hits and minimizing misses.

Hit Ratio: This is your primary cache performance metric, calculated as (cache hits / total requests) × 100%. A hit ratio of 80% means 80% of requests are served from cache, while 20% hit the backend. In production, you want hit ratios above 80% for most use cases—anything below 70% suggests your caching strategy needs work. Companies like Reddit monitor hit ratios per cache layer and per endpoint, treating drops as production incidents. A 90% hit ratio means you’ve reduced backend load by 10x, which translates directly to cost savings and improved latency.
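The hit-ratio formula is worth wiring into your cache client as a first-class metric. A minimal sketch in Python (the class name is illustrative, not from any library):

```python
class HitRatioTracker:
    """Counts cache hits and misses and reports the hit ratio."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_ratio(self) -> float:
        """Hit ratio as a percentage: hits / (hits + misses) * 100."""
        total = self.hits + self.misses
        return 0.0 if total == 0 else 100 * self.hits / total


tracker = HitRatioTracker()
for hit in [True] * 90 + [False] * 10:
    tracker.record(hit)
print(tracker.hit_ratio())  # 90.0, i.e. roughly 10x less backend load
```

In production you would export these counters to your metrics system rather than compute them in-process, but the arithmetic is the same.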

Time-To-Live (TTL): Every cached item needs an expiration policy. TTL defines how long data remains in the cache before it’s considered stale and removed. A TTL of 300 seconds (5 minutes) means cached data expires after 5 minutes, forcing the next request to fetch fresh data from the backend. TTL is your primary tool for balancing freshness and performance. Short TTLs (seconds) keep data fresh but reduce hit ratios. Long TTLs (hours or days) maximize performance but risk serving stale data. Choosing the right TTL requires understanding your data’s tolerance for staleness—product prices might need 1-minute TTLs, while user profile data might tolerate 1-hour TTLs.
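TTL mechanics can be shown with a toy in-process cache; with Redis the same behavior is a single command (`SET key value EX 300`). A simplified sketch:

```python
import time


class TTLCache:
    """Toy in-process cache where each entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: never cached
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # stale: evict and report a miss
            return None
        return value


cache = TTLCache(ttl_seconds=0.05)
cache.set("price:sku-1", 19.99)
print(cache.get("price:sku-1"))  # 19.99 (fresh)
time.sleep(0.06)
print(cache.get("price:sku-1"))  # None (expired, next read must refetch)
```

The short TTL here is only for demonstration; in practice you would tune it to each data type's staleness tolerance, as described above.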

Cache Coherence: In distributed systems with multiple cache servers, coherence refers to keeping cached data consistent across all nodes. When you update a user’s profile, should all cache servers immediately reflect this change? Perfect coherence is expensive and often unnecessary. Most systems accept eventual consistency—caches might be slightly out of sync for a brief period. The challenge is deciding which data requires strong coherence (payment information) versus which can tolerate brief inconsistencies (social media likes count).

Cache Warming: A cold cache (empty cache) provides no benefit—every request is a miss. Cache warming is the practice of pre-populating the cache with likely-needed data before traffic arrives. When Netflix deploys a new cache server, they don’t wait for organic traffic to fill it. They proactively load popular movie metadata and user preferences. This prevents the cold start problem where initial requests are slow while the cache fills. In interviews, mentioning cache warming shows you think about operational realities, not just steady-state performance.

Cold Start Problem: This occurs when a cache is empty (after restart, deployment, or eviction) and every request misses, hammering the backend. Imagine a cache server restarts at 2 PM. For the next 10 minutes, your database sees 10x normal load as the cache rebuilds. This can cascade into a cache stampede—so many misses that the database falls over, preventing the cache from ever warming up. Solutions include gradual traffic ramp-up, cache warming, and request coalescing (covered in cache-invalidation). Understanding this problem shows maturity—you’re thinking about failure modes, not just happy paths.
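Request coalescing, one of the mitigations named above, can be sketched with a per-key lock so that N concurrent misses produce exactly one backend fetch. This is a simplified in-process sketch; a distributed version would need a distributed lock or a single-flight layer:

```python
import threading


class CoalescingLoader:
    """On a miss, only one thread loads from the backend per key;
    concurrent requests for the same key wait and reuse the result."""

    def __init__(self, load_fn):
        self.load_fn = load_fn         # slow backend fetch (e.g. a DB query)
        self.cache = {}
        self.locks = {}
        self.guard = threading.Lock()  # protects the lock table
        self.backend_calls = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:                      # first thread in loads; others wait
            if key not in self.cache:   # re-check: a waiter may find it filled
                self.backend_calls += 1
                self.cache[key] = self.load_fn(key)
        return self.cache[key]


loader = CoalescingLoader(load_fn=lambda k: f"row-for-{k}")
threads = [threading.Thread(target=loader.get, args=("user:123",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(loader.backend_calls)  # 1: twenty concurrent misses, one database hit
```

Without the per-key lock and re-check, all twenty threads would miss simultaneously and issue twenty database queries, which is exactly the stampede scenario described above.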

Cache Hit vs Cache Miss Flow

graph LR
    Client["Client Application"]
    Cache[("Cache Layer<br/><i>Redis/Memcached</i>")]
    DB[("Database<br/><i>PostgreSQL</i>")]
    
    Client --"1. Request data<br/>(key: user:123)"--> Cache
    Cache --"2a. Cache HIT<br/>Return data (1ms)"--> Client
    Cache --"2b. Cache MISS<br/>Data not found"--> DB
    DB --"3. Query database<br/>(50ms)"--> Cache
    Cache --"4. Store in cache<br/>(TTL: 300s)"--> Cache
    Cache --"5. Return data"--> Client

Cache hit returns data in ~1ms from memory, while cache miss requires a database query (~50ms) plus cache population. The system stores fetched data in cache with a TTL to serve future requests faster. Hit ratio = hits / (hits + misses) determines overall performance gain.
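The hit/miss flow above is the classic cache-aside read path. A minimal Python sketch, assuming a dict-like cache and an illustrative `fetch_from_db` callable:

```python
def get_user(cache: dict, user_id: int, fetch_from_db) -> dict:
    """Cache-aside read: check the cache first, fall back to the database,
    then populate the cache so future requests hit."""
    key = f"user:{user_id}"
    value = cache.get(key)          # steps 1-2: cache lookup
    if value is not None:
        return value                # HIT: the fast (~1ms) path
    value = fetch_from_db(user_id)  # MISS: the slow (~50ms) database query
    cache[key] = value              # populate for future requests
    return value                    # (a real cache would also set a TTL)


cache = {}
calls = []
fetch = lambda uid: calls.append(uid) or {"id": uid, "name": "Ada"}
get_user(cache, 123, fetch)   # miss: hits the database
get_user(cache, 123, fetch)   # hit: served from cache
print(len(calls))  # 1
```

With Redis in place of the dict, the lookup and populate steps become `GET`/`SET` calls, and the populate step would include the TTL from the diagram.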

The Landscape

Caching isn’t a single technology—it’s an architectural pattern that appears at every layer of modern systems. Understanding the landscape means knowing where caching happens, what technologies power each layer, and how these layers work together.

The Multi-Layer Reality: Production systems at scale use caching at 5-7 different layers simultaneously. Your browser caches static assets. A CDN caches content at edge locations worldwide. Your load balancer might cache SSL session data. Your application server runs an in-memory cache like Redis. Your database has its own query cache. Each layer serves a different purpose and operates at a different scale. The art of caching is knowing which layer to use for which data.

Client-Side Caching: This is the fastest cache because it eliminates network requests entirely. Browsers automatically cache static assets (images, CSS, JavaScript) based on HTTP headers like Cache-Control and ETag. Modern single-page applications use service workers to cache API responses and enable offline functionality. Mobile apps cache user data locally using SQLite or key-value stores. The trade-off is control—once data is cached on a million client devices, you can’t invalidate it instantly. You rely on TTLs and versioned URLs (app.js?v=123) to force refreshes.
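The two client-cache levers mentioned here, HTTP caching headers and versioned URLs, can be sketched as small helpers. The function names are illustrative, not from any framework:

```python
import hashlib


def static_asset_headers(body: bytes, max_age_seconds: int = 86400) -> dict:
    """Build HTTP caching headers for a static asset.

    Cache-Control tells the browser how long it may reuse its copy;
    ETag lets it revalidate cheaply (If-None-Match -> 304) after expiry.
    """
    etag = hashlib.sha256(body).hexdigest()[:16]
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "ETag": f'"{etag}"',
    }


def versioned_url(path: str, version: str) -> str:
    """Version the URL so a deploy busts client caches instantly:
    clients request a brand-new URL instead of reusing a stale copy."""
    return f"{path}?v={version}"


headers = static_asset_headers(b"body { color: red }", max_age_seconds=3600)
print(headers["Cache-Control"])         # public, max-age=3600
print(versioned_url("/app.js", "123"))  # /app.js?v=123
```

Versioned URLs are why long `max-age` values are safe for static assets: you never need to invalidate the old URL, you just stop referencing it.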

CDN Caching: Content Delivery Networks like Cloudflare, Fastly, and AWS CloudFront cache content at edge locations near users. When a user in Tokyo requests an image, the CDN serves it from a Tokyo data center instead of routing to your origin server in Virginia. This reduces latency from 200ms to 20ms and shields your origin from traffic spikes. CDNs excel at caching static content (images, videos, JavaScript bundles) and increasingly cache dynamic content with short TTLs. Companies like Shopify serve 90% of traffic from CDN cache, hitting their origin servers only for checkout and admin operations.

Application-Level Caching: This is where most interview discussions focus. Technologies like Redis and Memcached provide distributed, in-memory key-value stores that sit between your application and database. Redis is the dominant choice today, offering data structures (lists, sets, sorted sets), persistence options, and pub/sub capabilities. Memcached is simpler and slightly faster for pure key-value workloads. Application caches store database query results, computed values (recommendation scores), session data, and rate-limiting counters. This layer typically achieves sub-millisecond latency and handles millions of operations per second per node.

Database Caching: Databases have built-in caching at multiple levels. MySQL’s query cache stores result sets for identical queries. InnoDB’s buffer pool caches frequently accessed pages in memory. PostgreSQL’s shared buffers serve the same purpose. These caches are transparent to applications but critical for performance. However, database caches are limited by the database server’s memory and don’t scale horizontally easily—this is why application-level caches exist. Understanding that databases already cache helps you avoid redundant caching and focus on what databases can’t cache effectively (cross-table aggregations, computed values).

Specialized Caching: Beyond these standard layers, specialized caches exist for specific use cases. Full-page caching stores entire HTML pages (used by WordPress, Drupal). Object caches store serialized application objects. GPU caches accelerate machine learning inference. DNS caching reduces name resolution latency. Each has its own trade-offs and invalidation challenges. In interviews, mentioning specialized caches when relevant (“for this read-heavy blog, I’d use full-page caching with Varnish”) demonstrates breadth of knowledge.

Multi-Layer Caching Architecture

graph TB
    User["User<br/><i>Browser/Mobile</i>"]
    
    subgraph Client Layer
        Browser["Browser Cache<br/><i>HTTP Cache, Service Workers</i><br/>TTL: Hours-Days"]
    end
    
    subgraph Edge Layer
        CDN["CDN<br/><i>CloudFront, Cloudflare</i><br/>TTL: Minutes-Hours"]
    end
    
    subgraph Application Layer
        LB["Load Balancer"]
        App1["App Server 1"]
        App2["App Server 2"]
        Redis[("Redis Cluster<br/><i>Application Cache</i><br/>TTL: Seconds-Minutes")]
    end
    
    subgraph Data Layer
        DB[("Database<br/><i>Query Cache, Buffer Pool</i><br/>TTL: Automatic")]
    end
    
    User --"1. Request"--> Browser
    Browser --"2. If miss"--> CDN
    CDN --"3. If miss"--> LB
    LB --> App1 & App2
    App1 & App2 --"4. Check cache"--> Redis
    Redis --"5. If miss"--> DB
    
    DB --"6. Return data"--> Redis
    Redis --"7. Cache & return"--> App1
    App1 --"8. Return"--> CDN
    CDN --"9. Cache & return"--> Browser
    Browser --"10. Cache & display"--> User

Production systems use caching at every layer, each optimized for different data types and latency requirements. Browser cache eliminates network requests entirely. CDN serves static assets from edge locations. Application cache (Redis) handles dynamic data. Database cache is transparent but limited. Each layer multiplicatively reduces the load reaching the systems below it.

Key Areas

The caching landscape divides into five key areas, each addressing different aspects of building effective caching systems. These areas map directly to the topics in this module.

1. Cache Placement and Architecture: Where you place caches in your system architecture fundamentally determines their effectiveness. Inline caches (cache-aside, read-through, write-through) sit between the application and database, intercepting requests. Sidecar caches run alongside application instances, providing local, low-latency access. Distributed caches span multiple servers, offering horizontal scalability but adding network hops. The choice depends on your consistency requirements, latency targets, and scale. Instagram uses a multi-tier cache architecture: local in-process caches for hot data, regional Redis clusters for shared data, and a global cache layer for truly universal data like celebrity profiles. Understanding these patterns (covered in cache-strategies) helps you design caching that matches your system’s needs.

2. Cache Eviction Policies: Caches have finite memory, so they must decide what to keep and what to discard. LRU (Least Recently Used) evicts the items accessed longest ago, working well for recency-based access patterns. LFU (Least Frequently Used) evicts rarely-accessed items, better for popularity-based patterns. FIFO (First In, First Out) is simple but naive. Random eviction is surprisingly effective for certain workloads. The wrong eviction policy can tank your hit ratio—imagine using FIFO for a cache where long-lived items like user profiles are still accessed constantly; they get evicted purely for being old. Redis offers approximated LRU and LFU policies (its default, noeviction, simply rejects writes once memory is full), but understanding when to use alternatives (covered in cache-eviction-policies) is crucial. At Spotify, they use different eviction policies for different cache tiers: LRU for song metadata, LFU for playlist data.
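LRU is simple enough to sketch in a few lines with an ordered dictionary; this is the textbook mechanism, not Redis's sampled approximation:

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a", so "b" is now least recently used
cache.set("c", 3)      # evicts "b", not "a"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

Swapping the eviction line for a frequency counter turns this into LFU, which is the whole point of the policy discussion above: same cache, different answer to "what do we throw away?"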

3. Cache Invalidation and Consistency: Phil Karlton famously said, “There are only two hard things in Computer Science: cache invalidation and naming things.” Keeping cached data synchronized with the source of truth is genuinely difficult. Time-based invalidation (TTL) is simple but imprecise—data might be stale for the entire TTL period. Event-based invalidation (invalidate on write) is precise but complex, requiring coordination between writers and caches. Write-through and write-behind strategies offer different consistency guarantees. The challenge multiplies in distributed systems where multiple caches might hold the same data. Facebook’s TAO system uses a sophisticated invalidation protocol to keep social graph data consistent across thousands of cache servers. This topic (covered in cache-invalidation) is where senior engineers separate themselves—showing you understand the trade-offs between consistency, performance, and complexity.
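Invalidate-on-write, the event-based pattern described above, can be sketched in a few lines. In a real distributed system the delete would be an invalidation message published to a queue and fanned out to every cache node; the function name and key format here are illustrative:

```python
def update_profile(db: dict, cache: dict, user_id: int, profile: dict) -> None:
    """Invalidate-on-write: update the source of truth, then drop the
    cached copy so the next read repopulates with fresh data."""
    db[user_id] = profile               # 1. write to the database
    cache.pop(f"user:{user_id}", None)  # 2. invalidate the cache entry


db = {}
cache = {"user:123": {"name": "old"}}
update_profile(db, cache, 123, {"name": "new"})
print("user:123" in cache)  # False: stale entry gone; next read fetches "new"
```

Note the ordering: writing the database first means a crash between the two steps leaves a stale cache entry (bounded by its TTL) rather than a cache entry for data that was never persisted.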

4. Cache Warming and Resilience: A cache is only useful if it contains the right data. Cache warming strategies proactively populate caches to avoid cold start problems. Predictive warming uses analytics to identify likely-needed data. Lazy warming lets organic traffic fill the cache but risks initial slowness. Background warming runs batch jobs to refresh cache contents. Resilience patterns handle cache failures gracefully—circuit breakers prevent cascading failures when caches go down, fallback strategies serve stale data rather than failing, and request coalescing prevents thundering herds during cache misses. Netflix’s EVCache system includes sophisticated warming and resilience features, allowing them to lose entire cache clusters without user-visible impact. These operational concerns (covered in cache-warming-strategies) show you think beyond the happy path.

5. Layer-Specific Caching: Each caching layer has unique characteristics, technologies, and best practices. Browser caching uses HTTP headers and service workers. CDN caching requires understanding edge computing and cache purging APIs. Application caching with Redis involves data structure selection, clustering strategies, and persistence trade-offs. Database query caching requires understanding query patterns and cache invalidation triggers. Choosing the right layer for each type of data is a key skill. Static assets belong in CDNs. User session data belongs in application caches. Computed aggregations might belong in both (CDN for popular items, application cache for long-tail). The topics covering specific cache types (client-side-caching, cdn-caching, application-caching) provide deep dives into each layer’s nuances.

Cache Invalidation Strategies Comparison

graph TB
    subgraph TTL-Based Invalidation
        Write1["Write Operation"]
        Cache1[("Cache<br/>TTL: 300s")]
        DB1[("Database")]
        Read1["Read Operation"]
        
        Write1 --"1. Update"--> DB1
        Read1 --"2. Read (may be stale)"--> Cache1
        Cache1 -."3. Expires after TTL".-> Cache1
        Cache1 --"4. Fetch fresh data"--> DB1
    end
    
    subgraph Event-Based Invalidation
        Write2["Write Operation"]
        Cache2[("Cache")]
        DB2[("Database")]
        Queue["Message Queue<br/><i>Kafka/RabbitMQ</i>"]
        Read2["Read Operation"]
        
        Write2 --"1. Update"--> DB2
        Write2 --"2. Publish event"--> Queue
        Queue --"3. Invalidate"--> Cache2
        Read2 --"4. Read (always fresh)"--> Cache2
        Cache2 --"5. If miss, fetch"--> DB2
    end
    
    subgraph Write-Through
        Write3["Write Operation"]
        Cache3[("Cache")]
        DB3[("Database")]
        Read3["Read Operation"]
        
        Write3 --"1. Update cache"--> Cache3
        Cache3 --"2. Update DB (sync)"--> DB3
        Read3 --"3. Read (always fresh)"--> Cache3
    end
    
    

TTL-based invalidation is simple but allows stale data for the entire TTL period. Event-based invalidation provides immediate consistency but requires coordination infrastructure. Write-through guarantees consistency by updating cache and database synchronously but adds write latency. Choose based on consistency requirements and operational complexity tolerance.

How Things Connect

Caching topics aren’t isolated—they form an interconnected system of decisions and trade-offs. Understanding how these pieces fit together is what separates memorized knowledge from true expertise.

Start with cache placement (cache-strategies): where you put caches determines everything else. An inline cache-aside pattern requires explicit invalidation logic. A write-through cache guarantees consistency but adds write latency. This decision cascades into your invalidation strategy (cache-invalidation)—write-through caches invalidate synchronously, while cache-aside requires asynchronous invalidation or TTL-based expiration. Your invalidation strategy then influences your eviction policy (cache-eviction-policies)—if you’re using TTL-based invalidation, your eviction policy matters less because items expire naturally. But if you’re using event-based invalidation with long TTLs, eviction policy becomes critical for managing memory.

Cache warming (cache-warming-strategies) connects to both eviction and invalidation. If you’re using aggressive eviction (short TTLs, aggressive LRU), you need robust warming to maintain hit ratios. If you’re using event-based invalidation, warming becomes simpler because you’re not fighting TTL expiration. Warming also connects to resilience patterns—a well-warmed cache can serve stale data during database outages, but only if your invalidation strategy allows serving slightly stale data.

The layer-specific topics (client-side-caching, cdn-caching, application-caching, database-caching) all implement these core patterns differently. Browser caching uses TTL-based invalidation exclusively (you can’t push invalidations to millions of browsers). CDN caching uses a mix of TTL and explicit purging. Application caching with Redis offers the most flexibility—you can implement any invalidation strategy. Database caching is mostly transparent, using internal heuristics.

In interviews, demonstrating these connections shows systems thinking. When discussing cache-aside, mention “this requires careful invalidation—I’d use a write-through pattern for critical data and TTL-based expiration for less critical data, which connects to our eviction policy choice.” When proposing Redis for application caching, mention “we’ll need cache warming after deployments to avoid cold start problems, and I’d use LRU eviction with a memory limit to handle traffic spikes.” These connections show you’re not just listing technologies—you’re designing a coherent system.

The meta-pattern is this: caching is always a trade-off between performance (hit ratio, latency), consistency (freshness, coherence), and complexity (operational overhead, failure modes). Every decision you make shifts the balance. Understanding how the pieces connect lets you navigate these trade-offs deliberately rather than accidentally.

Caching Decision Flow and Trade-offs

graph TB
    Start["Design Decision:<br/>Add Caching"]
    
    Start --> Placement{"Cache Placement<br/>Strategy?"}
    
    Placement -->|"Cache-Aside"| CacheAside["Application manages<br/>cache explicitly"]
    Placement -->|"Write-Through"| WriteThrough["Writes go to<br/>cache + DB sync"]
    Placement -->|"Read-Through"| ReadThrough["Cache loads<br/>data automatically"]
    
    CacheAside --> Invalidation1{"Invalidation<br/>Strategy?"}
    WriteThrough --> Invalidation2{"Invalidation<br/>Strategy?"}
    ReadThrough --> Invalidation3{"Invalidation<br/>Strategy?"}
    
    Invalidation1 -->|"TTL-Based"| TTL1["Simple, eventual<br/>consistency"]
    Invalidation1 -->|"Event-Based"| Event1["Complex, strong<br/>consistency"]
    
    Invalidation2 -->|"Automatic"| Auto["Built-in consistency<br/>Higher write latency"]
    
    Invalidation3 -->|"TTL-Based"| TTL2["Transparent to app<br/>Possible staleness"]
    
    TTL1 --> Eviction{"Eviction<br/>Policy?"}
    Event1 --> Eviction
    Auto --> Eviction
    TTL2 --> Eviction
    
    Eviction -->|"LRU"| LRU["Time-based access<br/>Good for recency"]
    Eviction -->|"LFU"| LFU["Frequency-based<br/>Good for popularity"]
    Eviction -->|"TTL"| TTLEvict["Automatic expiration<br/>Predictable memory"]
    
    LRU --> Warming{"Need Cache<br/>Warming?"}
    LFU --> Warming
    TTLEvict --> Warming
    
    Warming -->|"High Traffic"| WarmYes["Proactive warming<br/>Avoid cold start"]
    Warming -->|"Low Traffic"| WarmNo["Lazy loading<br/>Simpler ops"]
    
    WarmYes --> Monitor["Monitor:<br/>Hit ratio 80%+<br/>Latency p99<br/>Eviction rate"]
    WarmNo --> Monitor

Caching decisions cascade through interconnected trade-offs. Cache placement strategy (cache-aside vs write-through) determines invalidation complexity. Invalidation strategy (TTL vs event-based) influences eviction policy importance. High-traffic systems require cache warming to avoid cold starts. Each decision shifts the balance between performance, consistency, and operational complexity. Monitor hit ratios and latency to validate your choices.

Real-World Context

Let’s ground these concepts in how real companies use caching at scale. These examples illustrate the patterns, trade-offs, and operational realities you’ll encounter in production systems.

Netflix’s Multi-Tier Caching: Netflix serves 250 million subscribers with a sophisticated caching architecture. At the edge, they use AWS CloudFront (CDN) to cache video chunks and static assets, achieving 90%+ offload from origin servers. Behind the CDN, they run EVCache, a distributed memcached-based system that caches movie metadata, user preferences, and recommendation scores. EVCache uses multiple cache tiers: a small, fast L1 cache on each application server (in-process, sub-millisecond latency) and a larger L2 cache in dedicated EVCache server clusters (1-2ms latency). They use TTL-based invalidation for most data (5-60 minute TTLs) and event-based invalidation for critical data like subscription status. During peak hours, their caching system handles 50+ million requests per second, reducing database load by 100x. Their cache warming strategy is particularly clever—when deploying new cache servers, they gradually route traffic while monitoring hit ratios, automatically adjusting the ramp-up rate to prevent cold start problems.

Facebook’s TAO: Facebook’s social graph (friendships, likes, comments) is cached in TAO, a distributed cache sitting in front of MySQL. TAO uses a write-through caching pattern—writes go to both cache and database synchronously, ensuring consistency. They use a sophisticated invalidation protocol: when a user likes a post, the write updates the database and broadcasts invalidation messages to all cache servers holding that post’s data. This ensures all caches see the update within milliseconds. TAO handles trillions of requests per day with a 99%+ hit ratio. The key insight is that social graph data has strong locality—users interact with a small subset of the graph repeatedly (their friends, favorite pages). This makes caching extremely effective. They use LRU eviction but with a twist: they track access patterns and proactively evict data that’s unlikely to be accessed again (one-time viral posts).

Stripe’s Payment Caching: Stripe caches aggressively despite handling financial data where consistency is critical. They cache customer data, payment methods, and fraud scores in Redis with short TTLs (30-60 seconds). For truly critical data like account balances, they don’t cache at all—every request hits the database. This selective caching strategy balances performance and correctness. They use cache-aside pattern with event-based invalidation for customer updates—when a customer updates their payment method, Stripe invalidates the cache entry immediately and lets the next request repopulate it. During traffic spikes (Black Friday), their cache absorbs 95% of read load, allowing their database to focus on writes. They monitor cache hit ratios per endpoint and alert when ratios drop below 85%, treating it as a potential performance regression.

Twitter’s Timeline Caching: Twitter’s home timeline is one of the most cache-intensive features in tech. When you load your timeline, Twitter doesn’t query the database for your 1,000 followed accounts’ latest tweets—that would be impossibly slow. Instead, they maintain a pre-computed, cached timeline for each user in Redis. When someone you follow tweets, Twitter’s fanout service updates your cached timeline (and millions of others). This is write-heavy caching—they accept high write costs to make reads instant. For celebrity accounts with 100 million followers, they use a hybrid approach: cache the celebrity’s tweets separately and merge them at read time, avoiding the fanout explosion. They use TTL-based eviction (timelines expire after 7 days of inactivity) and aggressive cache warming—when you log in after a long absence, they rebuild your timeline in the background while showing you a partial, cached version.
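The fanout-on-write approach described here can be sketched in miniature. The data structures and the 100-entry cap are illustrative, not Twitter's actual values:

```python
from collections import defaultdict, deque

# Follower graph and per-user cached timelines (bounded deques stand in
# for the Redis lists a real system would use).
followers = defaultdict(set)                         # author -> follower ids
timelines = defaultdict(lambda: deque(maxlen=100))   # user -> cached timeline


def post_tweet(author: int, tweet: str) -> None:
    """Write-heavy fanout: push the tweet into every follower's cached
    timeline so their reads become a single cache lookup."""
    for follower in followers[author]:
        timelines[follower].appendleft((author, tweet))


followers[1] = {2, 3}
post_tweet(1, "hello")
print(list(timelines[2]))  # [(1, 'hello')]
print(list(timelines[3]))  # [(1, 'hello')]
```

The celebrity problem is visible in the loop: with 100 million followers, one tweet means 100 million cache writes, which is why the hybrid merge-at-read approach exists for high-fanout accounts.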

These examples share common patterns: multi-tier caching (edge + application + database), selective caching (cache what matters, skip what doesn’t), monitoring and alerting (hit ratios are production metrics), and operational sophistication (warming, gradual rollouts, failure handling). In interviews, referencing these real-world examples demonstrates you understand caching at scale, not just in theory.

Netflix Multi-Tier Cache Architecture (Simplified)

graph LR
    User["User Device"]
    
    subgraph Edge - CloudFront CDN
        CDN["CDN Cache<br/><i>Video chunks, static assets</i><br/>Hit ratio: 90%+"]
    end
    
    subgraph Application Tier
        App["App Server"]
        L1["L1 Cache<br/><i>In-process</i><br/>Latency: <1ms"]
    end
    
    subgraph EVCache Tier
        L2[("L2 Cache<br/><i>EVCache Cluster</i><br/>Latency: 1-2ms<br/>Metadata, preferences")]
    end
    
    subgraph Data Tier
        DB[("Database<br/><i>Cassandra</i><br/>Latency: 10-50ms")]
    end
    
    User --"1. Request movie"--> CDN
    CDN --"2. If miss (10%)"--> App
    App --"3. Check L1"--> L1
    L1 --"4. If miss"--> L2
    L2 --"5. If miss (1%)"--> DB
    
    DB --"6. Return"--> L2
    L2 --"7. Cache (5-60min TTL)"--> L2
    L2 --"8. Return"--> L1
    L1 --"9. Cache (1-5min TTL)"--> L1
    L1 --"10. Return"--> App
    App --"11. Return"--> CDN
    CDN --"12. Cache & stream"--> User

Netflix uses three cache tiers to serve 250M+ subscribers: CDN for video content (90%+ hit ratio), L1 in-process cache for hot data (<1ms latency), and an L2 EVCache (memcached) cluster for shared metadata (1-2ms latency). Each tier has different TTLs optimized for data freshness requirements. This architecture reduces database load by 100x and handles 50M+ requests/second during peak hours.


Interview Essentials

Mid-Level

At the mid-level, interviewers expect you to know that caching exists and to articulate its basic benefits. You should be able to propose adding a cache (usually Redis) to reduce database load and explain cache hits versus misses. You should understand TTL conceptually and mention it when discussing cache freshness. When asked “how would you make this system faster?”, caching should be your first answer. You should know that caches sit between the application and database and that they store frequently accessed data. You don’t need to design sophisticated invalidation strategies or discuss eviction policies in depth—just show you understand caching’s role in reducing latency and load. A common mid-level question is “Design a URL shortener”—you should propose caching popular short URLs to avoid database lookups. Mention a specific technology (Redis) and a rough TTL (“maybe 1 hour”). That’s sufficient at this level.

Senior

Senior engineers must demonstrate depth in caching strategy and trade-offs. You should articulate the difference between cache-aside, read-through, and write-through patterns and choose the right one for the use case. You should discuss cache invalidation explicitly—when asked about consistency, you should mention TTL-based versus event-based invalidation and explain the trade-offs. You should know common eviction policies (LRU, LFU) and when to use each. You should propose multi-tier caching when appropriate (“CDN for static assets, Redis for API responses”) and explain why each layer matters. You should mention cache warming for high-traffic systems and discuss cold start problems. When discussing scale, you should mention cache hit ratios as a key metric and propose monitoring them. A senior-level question might be “Design Instagram”—you should propose caching user profiles and feed data in Redis, explain how you’d invalidate profile caches when users update their bio, discuss using CDN for images, and mention cache warming for celebrity profiles. You should also discuss failure modes: what happens if Redis goes down? (Fallback to database, serve stale data, circuit breakers.)

Staff+

Staff+ engineers must show mastery of caching at scale, including operational concerns and advanced patterns. You should discuss cache coherence in distributed systems and explain how to keep multiple cache servers synchronized. You should mention specific technologies and their trade-offs (Redis Cluster vs. Redis Sentinel, Memcached vs. Redis). You should discuss cache stampede problems and solutions (request coalescing, probabilistic early expiration). You should propose sophisticated invalidation strategies like write-behind caching with async replication or event-driven invalidation using message queues. You should discuss cache observability—not just hit ratios but also latency percentiles, eviction rates, and memory pressure. You should mention capacity planning: how much cache memory do you need for a given hit ratio? You should discuss cache security (encrypting sensitive data in cache, preventing cache poisoning). When discussing global systems, you should mention geo-distributed caching and the challenges of keeping caches consistent across regions. A staff-level question might be “Design a global e-commerce platform”—you should propose regional Redis clusters with cross-region replication for product catalogs, discuss how to handle inventory updates (strong consistency required, so maybe no caching or very short TTLs), explain cache warming strategies for product launches, and discuss how to handle cache failures during peak traffic (serve stale data with staleness indicators, gradual degradation). You should also discuss cost: caching reduces database costs but increases memory costs—how do you optimize the trade-off?
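The probabilistic early expiration mentioned above can be sketched as follows. This is a simplified version of the published "XFetch" idea; the parameter names are illustrative:

```python
import math
import random
import time


def should_refresh(expires_at: float, delta: float, beta: float = 1.0) -> bool:
    """Probabilistic early expiration: occasionally refresh a hot key
    *before* its TTL ends, so one worker recomputes while the rest keep
    serving the cached value, instead of a stampede at the moment of expiry.

    expires_at: absolute expiry timestamp of the cached value
    delta: how long the recompute takes
    beta: > 1 refreshes more eagerly, < 1 less eagerly
    """
    now = time.time()
    # -log(U) for U in (0, 1] is an exponential random variable: the closer
    # we are to expiry (and the slower the recompute), the likelier a refresh.
    jitter = -math.log(1.0 - random.random())
    return now + delta * beta * jitter >= expires_at


# Far from expiry: almost never refresh. Past expiry: always refresh.
print(should_refresh(time.time() + 10000, delta=0.001))  # False
print(should_refresh(time.time() - 1, delta=1.0))        # True
```

Each cache reader calls this on a hit; a True result means "recompute now and reset the TTL", which spreads refreshes out randomly before the deadline rather than synchronizing every client on the instant of expiry.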

Common Interview Questions

How would you cache data in [system X]? (Expect to specify cache layer, technology, TTL, and invalidation strategy)

What happens if your cache goes down? (Discuss fallback strategies, circuit breakers, serving stale data)

How do you keep your cache consistent with the database? (Discuss TTL vs. event-based invalidation, write-through vs. cache-aside)

What’s your cache eviction policy and why? (Discuss LRU vs. LFU vs. TTL-based, match to access patterns)

How do you prevent cache stampede / thundering herd? (Discuss request coalescing, cache warming, probabilistic expiration)

How do you decide what to cache? (Discuss read-heavy vs. write-heavy data, cost of cache miss, tolerance for staleness)

Explain the difference between cache-aside and write-through caching. (Discuss consistency, latency, complexity trade-offs)

How do you monitor cache performance? (Discuss hit ratio, latency, eviction rate, memory usage)
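For the monitoring question, it helps to show that the key metrics are cheap to collect. A minimal counter sketch (the `CacheStats` name and fields are illustrative; real systems would export these to a metrics backend):

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Hit/miss/eviction counters; hit ratio is derived, not stored."""
    hits: int = 0
    misses: int = 0
    evictions: int = 0

    def record_hit(self):
        self.hits += 1

    def record_miss(self):
        self.misses += 1

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Latency percentiles and memory pressure need a real metrics pipeline, but hit ratio and eviction rate fall straight out of counters like these.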

Red Flags to Avoid

Proposing caching without mentioning invalidation or TTL (shows lack of understanding of staleness problem)

Caching everything indiscriminately (shows no judgment about what benefits from caching)

Not mentioning cache failures or fallback strategies (shows lack of operational thinking)

Confusing caching layers (e.g., proposing CDN for database query results)

Not knowing the difference between Redis and Memcached (shows lack of practical experience)

Ignoring cache warming for high-traffic systems (shows lack of understanding of cold start problems)

Not discussing hit ratio or any performance metrics (shows lack of data-driven thinking)

Proposing complex invalidation strategies without justifying the complexity (over-engineering)


Key Takeaways

Caching is the highest-leverage performance optimization: It reduces latency by orders of magnitude (milliseconds → microseconds) and offloads backend systems by 10-100x. Every scalable system uses caching at multiple layers—client, CDN, application, and database.

The core trade-off is freshness vs. performance: Longer TTLs mean better hit ratios but staler data. Shorter TTLs mean fresher data but more cache misses. Choose TTLs based on your data’s tolerance for staleness—seconds for prices, hours for user profiles, days for static content.

Cache invalidation is the hard part: TTL-based invalidation is simple but imprecise. Event-based invalidation is precise but complex. Write-through caching guarantees consistency but adds latency. Cache-aside is flexible but requires careful invalidation logic. Choose based on your consistency requirements.

Monitor hit ratios religiously: A cache with a 60% hit ratio might be worse than no cache (added complexity without enough benefit). Aim for 80%+ hit ratios. Track hit ratios per cache layer and per endpoint. Treat drops as production incidents.
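The claim that a 60% hit ratio can barely pay for itself is easy to check with back-of-envelope arithmetic. A sketch, assuming illustrative latencies of 1 ms for a cache probe and 10 ms for a database read (a miss pays for both):

```python
def effective_latency_ms(hit_ratio, cache_ms=1.0, db_ms=10.0):
    """Expected read latency for a cache-aside lookup.

    A hit costs one cache probe; a miss costs the probe plus the
    database read. Latency figures are assumptions, not benchmarks.
    """
    return hit_ratio * cache_ms + (1 - hit_ratio) * (cache_ms + db_ms)

# At 60%: 0.60 * 1 + 0.40 * 11 = 5.0 ms, only a 2x win over
# the 10 ms database read, before counting operational complexity.
# At 95%: 0.95 * 1 + 0.05 * 11 = 1.5 ms, closer to a 7x win.
low = effective_latency_ms(0.60)
high = effective_latency_ms(0.95)
```

The gap between those two numbers is why hit ratio, not cache size, is the metric to alert on.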

Think operationally: Caches fail, need warming, and can cause cascading failures (cache stampede, cold start). Design for failure with circuit breakers, fallback strategies, and gradual traffic ramp-up. In interviews, discussing these operational concerns separates senior engineers from junior ones.