Application-Level Caching: In-Process & Redis
TL;DR
Application caching uses in-memory data stores (like Redis or Memcached) positioned between your application servers and databases to dramatically reduce latency and database load. By storing frequently accessed data in RAM, you can serve requests in microseconds instead of milliseconds, enabling systems to handle 10-100x more traffic with the same database infrastructure.
Cheat Sheet: Redis for rich data structures + persistence, Memcached for pure speed. Use TTL + LRU eviction. Avoid file-based caching in distributed systems. The cache-aside pattern is most common. Warm the cache on deployment to avoid a thundering herd.
The Analogy
Think of application caching like keeping your most-used kitchen ingredients on the counter instead of in the basement pantry. Sure, you could walk downstairs every time you need salt, but keeping it within arm’s reach makes cooking 10x faster. Your counter space is limited (RAM), so you only keep what you use daily. Items you rarely use stay in the basement (database). When your counter gets full, you move the least-used items back downstairs (LRU eviction). The key insight: the fastest data store is the one closest to where you’re working.
Why This Matters in Interviews
Application caching appears in virtually every system design interview because it’s the most impactful performance optimization you can make. Interviewers want to see that you understand when to cache (read-heavy workloads), what to cache (hot data, expensive computations), and how to handle the complexity it introduces (cache invalidation, consistency). Strong candidates discuss specific cache technologies, eviction policies, and real-world tradeoffs. Weak candidates just say “add a cache” without explaining the implications. This topic often leads into discussions about cache invalidation strategies, distributed caching, and CAP theorem tradeoffs.
Core Concept
Application caching sits at the heart of modern high-performance systems, acting as a high-speed buffer between your application logic and slower data sources. Unlike CDN caching (which serves static content at the edge) or database query caching (which operates inside the database), application caching gives you explicit control over what data lives in memory and how long it stays there. This is where you cache user sessions, API responses, computed results, and frequently accessed database records.
The fundamental principle is simple: RAM is 100-1000x faster than disk, but also 10-100x more expensive and limited. A typical database query might take 10-50ms, while a cache lookup takes 0.1-1ms. This 10-50x speedup compounds across millions of requests. At Netflix scale, caching reduces database load from potentially 100,000 queries per second to just 1,000, allowing the same database infrastructure to serve 100x more users. The tradeoff is complexity: you now have two sources of truth, and keeping them synchronized becomes your problem.
Application caches are typically deployed as separate services (Redis, Memcached) that your application servers connect to over the network. This architecture allows multiple application servers to share the same cache, avoiding duplicate data and enabling consistent views across your fleet. The cache becomes a critical dependency—if it fails, your database gets hammered with the full request load, potentially causing a cascading failure. This is why cache architecture decisions matter so much in interviews.
How It Works
Step 1: Request Arrives - A user request hits your application server asking for data (e.g., “get user profile for user_id=12345”). Your application code first checks if this data exists in the cache before touching the database.
Step 2: Cache Lookup - The application sends a GET request to the cache service (Redis/Memcached) with a key like “user:12345”. This network call typically takes 0.5-2ms on a local network. The cache performs an O(1) hash table lookup in RAM.
Step 3: Cache Hit Path - If the data exists (cache hit), the cache returns the serialized value immediately. Your application deserializes it and returns the response to the user. Total time: 1-3ms. You’ve avoided a database query entirely. This is the happy path that handles 80-99% of requests in well-cached systems.
Step 4: Cache Miss Path - If the data doesn’t exist (cache miss), your application queries the database. This takes 10-50ms depending on query complexity and database load. After getting the result, you write it to the cache with a TTL (time-to-live) like 300 seconds. Future requests for the same data will hit the cache.
Step 5: Eviction and Expiration - The cache has limited memory, so it must decide what to keep. When memory fills up, the LRU (Least Recently Used) algorithm evicts cold data to make room for hot data. Additionally, TTLs cause entries to expire automatically, ensuring stale data doesn’t live forever. A cache with 32GB RAM and 1KB average value size can hold ~32 million entries.
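The LRU eviction in Step 5 can be sketched in a few lines. This is a minimal, illustrative version; real Redis approximates LRU by sampling keys (the `maxmemory-policy allkeys-lru` setting) rather than maintaining an exact ordering like this:

```python
from collections import OrderedDict

class LRUCache:
    # Minimal exact LRU: move a key to the end on access, evict from the
    # front when capacity is exceeded.
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```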
Step 6: Cache Invalidation - When data changes (user updates their profile), you must invalidate or update the cache entry. This is the hardest problem in caching. You can delete the key (forcing a cache miss on next read), update it immediately (requires knowing what changed), or rely on TTL expiration (accepts temporary staleness).
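The full cache-aside flow (Steps 1-6) looks roughly like this in Python. `FakeCache` and `fetch_user_from_db` are stand-ins so the sketch runs anywhere; in production you would use a client such as redis-py (`r.get(key)` / `r.set(key, value, ex=ttl)`) against a real Redis and your actual database query:

```python
import json
import time

class FakeCache:
    # Dict-based stand-in for Redis with TTL-based lazy expiration.
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:        # expired: treat as a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, ex):
        self._store[key] = (value, time.time() + ex)

cache = FakeCache()
USER_TTL_SECONDS = 300  # the TTL from Step 4

def fetch_user_from_db(user_id):
    # Placeholder for the real query (e.g., SELECT * FROM users WHERE id = %s)
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"                  # Step 2: namespaced key
    cached = cache.get(key)
    if cached is not None:                   # Step 3: cache hit
        return json.loads(cached)
    user = fetch_user_from_db(user_id)       # Step 4: cache miss -> database
    cache.set(key, json.dumps(user), ex=USER_TTL_SECONDS)
    return user
```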
Cache-Aside Pattern: Read Flow with Cache Miss and Hit
graph LR
Client["Client Request<br/><i>GET /user/12345</i>"]
App["Application Server<br/><i>Cache-Aside Logic</i>"]
Cache[("Redis Cache<br/><i>In-Memory Store</i>")]
DB[("PostgreSQL<br/><i>Source of Truth</i>")]
Client --"1. Request user data"--> App
App --"2. Check cache<br/>GET user:12345"--> Cache
Cache --"3a. Cache HIT<br/>Return data (1-3ms)"--> App
Cache -. "3b. Cache MISS<br/>Key not found" .-> App
App -. "4. Query database<br/>SELECT * FROM users (10-50ms)" .-> DB
DB -. "5. Return user data" .-> App
App -. "6. Populate cache<br/>SET user:12345 TTL=300s" .-> Cache
App --"7. Return response"--> Client
linkStyle 2 stroke:#28a745,stroke-width:3px
linkStyle 3,4,5,6 stroke:#ffc107,stroke-width:2px,stroke-dasharray: 5 5
The cache-aside pattern shows two paths: cache hit (solid green, 1-3ms) serves data directly from cache, while cache miss (dashed yellow, 10-50ms) fetches from database and populates cache for future requests. This pattern gives 10-50x speedup for cached data.
Key Principles
Principle 1: Cache What’s Expensive, Not Everything - Not all data deserves caching. Cache data that’s expensive to compute or fetch (complex joins, aggregations, external API calls) and accessed frequently. Caching rarely-accessed data wastes memory and adds complexity for no benefit. Netflix caches movie metadata and user viewing history but doesn’t cache every user’s billing address—it’s cheap to fetch and rarely accessed. The decision framework: if fetching from the database takes <5ms and happens <10 times per second, caching probably isn’t worth the complexity.
Principle 2: Treat Cache as Volatile - Your cache can fail, restart, or evict entries at any time. Your application must function correctly with an empty cache, just slower. This means every cache read must have a fallback to the source of truth (database). Never store data exclusively in the cache—that’s a data store, not a cache. Stripe’s payment system caches merchant configurations but can always reconstruct them from the database. This principle prevents cache failures from becoming data loss incidents.
Principle 3: Shorter TTLs for Consistency, Longer for Performance - TTL (time-to-live) is your primary consistency knob. A 5-second TTL means data can be stale for up to 5 seconds, but you’ll have high cache hit rates. A 5-minute TTL accepts more staleness for even better hit rates. The right TTL depends on your consistency requirements. Twitter caches tweet counts with 30-second TTLs—users don’t notice if the count is slightly off. Banking systems use 1-second TTLs or active invalidation because stale account balances are unacceptable. Calculate your acceptable staleness window and set TTL accordingly.
Principle 4: Namespace Your Keys - Cache keys must be globally unique across your application. Use prefixes like “user:profile:12345” or “product:inventory:SKU-789”. This prevents collisions (user ID 100 vs product ID 100) and enables bulk operations (delete all keys matching “session:*”). Airbnb uses hierarchical keys like “listing:123:availability:2024-01” to cache listing availability by month. Good key design also helps with debugging—you can inspect the cache and immediately understand what each key represents.
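A tiny helper keeps namespaced keys consistent across a codebase; the specific prefixes below are illustrative:

```python
def cache_key(*parts):
    # Join namespace parts with ':' so keys are globally unique and
    # greppable, e.g. cache_key("user", "profile", 12345).
    return ":".join(str(p) for p in parts)
```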
Principle 5: Monitor Cache Hit Rate Religiously - Cache hit rate (hits / (hits + misses)) is your primary health metric. A 95% hit rate means 95% of requests avoid the database. If hit rate drops from 95% to 80%, your database load increases 3x. Monitor hit rate per cache key pattern to identify problems. Uber noticed their “driver location” cache hit rate dropped to 60% during peak hours because drivers were moving too fast—locations expired before they could be reused. They increased TTL and added predictive caching. Set alerts for hit rate drops below your baseline.
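Hit-rate tracking can be as simple as a counter pair around your cache calls. A minimal sketch follows; in practice you would usually read these numbers from the cache server itself (Redis reports keyspace_hits and keyspace_misses via INFO) and export them to your monitoring system:

```python
class CacheStats:
    # Track hits and misses to compute hit rate = hits / (hits + misses).
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```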
Cache Hit Rate Impact on Database Load
graph TB
subgraph "Without Cache"
Traffic1["10,000 req/s"]
DB1[("Database<br/>10,000 queries/s<br/><i>Overloaded</i>")]
Traffic1 --> DB1
end
subgraph "90% Hit Rate"
Traffic2["10,000 req/s"]
Cache2[("Cache<br/>9,000 hits/s")]
DB2[("Database<br/>1,000 queries/s<br/><i>10x reduction</i>")]
Traffic2 --> Cache2
Cache2 -. "1,000 misses/s" .-> DB2
end
subgraph "95% Hit Rate"
Traffic3["10,000 req/s"]
Cache3[("Cache<br/>9,500 hits/s")]
DB3[("Database<br/>500 queries/s<br/><i>20x reduction</i>")]
Traffic3 --> Cache3
Cache3 -. "500 misses/s" .-> DB3
end
subgraph "99% Hit Rate"
Traffic4["10,000 req/s"]
Cache4[("Cache<br/>9,900 hits/s")]
DB4[("Database<br/>100 queries/s<br/><i>100x reduction</i>")]
Traffic4 --> Cache4
Cache4 -. "100 misses/s" .-> DB4
end
Cache hit rate reduces database load nonlinearly. Going from 90% to 95% hit rate doubles the load reduction (10x to 20x), while 99% hit rate achieves 100x reduction. Every percentage point matters at scale—this is why monitoring hit rate is critical.
Deep Dive
Types / Variants
Redis: The Swiss Army Knife - Redis is an in-memory data structure store that supports strings, hashes, lists, sets, sorted sets, and more. Unlike simple key-value stores, Redis lets you manipulate data structures directly (e.g., increment a counter, add to a sorted set) without round-tripping through your application. It offers optional persistence (snapshots + append-only logs) so you can recover data after restarts. Use Redis when you need rich data structures, atomic operations, or persistence. Instagram uses Redis sorted sets to store user feeds—each user’s feed is a sorted set of post IDs ordered by timestamp. The tradeoff: Redis is single-threaded per instance, so CPU becomes a bottleneck at extreme scale. It’s also more complex to operate than Memcached. Typical latency: 0.5-2ms for simple operations.
Memcached: Pure Speed - Memcached is a distributed memory caching system focused on simplicity and raw performance. It only supports simple key-value storage with TTL-based expiration. No persistence, no complex data structures, no atomic operations beyond set/get. This simplicity makes it extremely fast and easy to scale horizontally. Use Memcached when you need pure caching speed and don’t require Redis’s advanced features. Facebook uses Memcached extensively for caching database query results—they run thousands of Memcached servers storing terabytes of data. The tradeoff: you lose data on restart, and you must handle all data structure logic in your application. Typical latency: 0.2-1ms for get operations.
Application-Level In-Process Caching - Instead of a separate cache service, you can cache data directly in your application’s memory (e.g., a HashMap in Java, a dictionary in Python). This is the fastest option—no network latency, just a memory lookup. Use in-process caching for small, read-only data that’s identical across all application instances (configuration, feature flags, static lookup tables). Shopify caches product category hierarchies in-process because they’re small (<10MB) and change rarely. The tradeoff: each application instance has its own cache, so you waste memory with duplicates. Cache invalidation is harder because you must notify all instances. If your data changes frequently or is user-specific, use a shared cache instead.
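An in-process TTL cache is often just a decorator around a loader function. A minimal sketch follows (libraries like cachetools offer hardened versions; `load_category_tree` is a hypothetical example of the small, slow-changing data this suits):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds):
    # Minimal per-process TTL cache decorator. Each application instance
    # keeps its own copy, so reserve this for small, rarely-changing data.
    def decorator(fn):
        store = {}  # args -> (value, expires_at)

        @wraps(fn)
        def wrapper(*args):
            entry = store.get(args)
            if entry is not None and time.time() < entry[1]:
                return entry[0]                      # fresh: serve from memory
            value = fn(*args)                        # stale or missing: reload
            store[args] = (value, time.time() + ttl_seconds)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=60)
def load_category_tree():
    # Placeholder for a rarely-changing lookup (config, feature flags, ...)
    return {"electronics": ["phones", "laptops"]}
```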
Distributed Cache with Consistent Hashing - For massive scale, you can shard your cache across multiple servers using consistent hashing. Each cache key is hashed to determine which server stores it. This allows you to scale cache capacity horizontally by adding more servers. Both Redis and Memcached support this pattern through client libraries or proxy layers (Twemproxy, Redis Cluster). Use distributed caching when your cache data exceeds what a single server can hold (>100GB) or when you need higher throughput than one server can provide. Pinterest shards their cache across 100+ Redis instances, each handling a subset of pins based on pin ID. The tradeoff: complexity increases significantly—you need to handle server failures, rebalancing, and hotspot keys that overload individual shards.
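Consistent hashing can be sketched as a sorted ring of hash values. This toy version uses virtual nodes for balance; real deployments rely on client libraries, proxies like Twemproxy, or Redis Cluster's slot mapping rather than hand-rolled rings:

```python
import bisect
import hashlib

class HashRing:
    # Minimal consistent-hash ring. Each physical node appears at many
    # points (virtual nodes) so keys spread evenly; a key belongs to the
    # first node clockwise from its hash.
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._hashes, h) % len(self._ring)
        return self._ring[idx][1]
```

Because only the keys that hashed to a removed node's ring positions move, adding or removing a cache server invalidates a small fraction of keys instead of reshuffling everything.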
Write-Through vs Write-Behind Caching - Write-through caching updates both cache and database synchronously on every write. This ensures cache and database are always consistent but adds latency to writes. Write-behind (write-back) caching updates the cache immediately and asynchronously updates the database later. This makes writes fast but risks data loss if the cache fails before persisting to the database. Use write-through for critical data where consistency matters (financial transactions). Use write-behind for high-volume, less critical data where you can tolerate occasional loss (analytics events, logs). LinkedIn uses write-behind caching for user activity tracking—losing a few “profile view” events is acceptable for the performance gain.
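The two write policies differ only in when the database write happens. A minimal in-memory sketch, with dicts standing in for a real cache and database (a production write-behind worker would flush continuously in a background thread, not on demand):

```python
import queue

db = {}     # stand-in for the database (source of truth)
cache = {}  # stand-in for the cache

def write_through(key, value):
    # Synchronous: cache and database are updated together. Writes are
    # slower, but the two stores never diverge.
    db[key] = value
    cache[key] = value

_write_queue = queue.Queue()

def write_behind(key, value):
    # Fast path: update the cache now, persist to the DB asynchronously.
    # Risk: queued writes are lost if the process dies before flushing.
    cache[key] = value
    _write_queue.put((key, value))

def flush_writes():
    # In production this loop runs continuously in a background worker.
    while not _write_queue.empty():
        key, value = _write_queue.get()
        db[key] = value
```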
Redis vs Memcached: Architecture and Use Cases
graph TB
subgraph "Memcached Architecture"
MC_Client["Application<br/><i>Multi-threaded client</i>"]
MC1["Memcached Node 1<br/><i>Multi-threaded</i><br/>Simple key-value<br/>No persistence"]
MC2["Memcached Node 2<br/><i>Multi-threaded</i><br/>Simple key-value<br/>No persistence"]
MC3["Memcached Node 3<br/><i>Multi-threaded</i><br/>Simple key-value<br/>No persistence"]
MC_Client --"Consistent hashing<br/>distributes keys"--> MC1
MC_Client --> MC2
MC_Client --> MC3
MC_Use["Use Cases:<br/>• Pure caching speed<br/>• Simple key-value<br/>• High throughput<br/>• Facebook query cache"]
end
subgraph "Redis Architecture"
R_Client["Application<br/><i>Redis client</i>"]
R1["Redis Node 1<br/><i>Single-threaded</i><br/>Data structures<br/>Optional persistence<br/>Pub/Sub, Lua scripts"]
R2["Redis Node 2<br/><i>Single-threaded</i><br/>Data structures<br/>Optional persistence<br/>Pub/Sub, Lua scripts"]
R3["Redis Node 3<br/><i>Single-threaded</i><br/>Data structures<br/>Optional persistence<br/>Pub/Sub, Lua scripts"]
R_Client --"Redis Cluster<br/>or client-side sharding"--> R1
R_Client --> R2
R_Client --> R3
R_Use["Use Cases:<br/>• Rich data structures<br/>• Atomic operations<br/>• Persistence needed<br/>• Instagram feeds (sorted sets)"]
end
Memcached excels at simple, high-throughput key-value caching with multi-threading, while Redis offers rich data structures, atomic operations, and persistence at the cost of single-threaded performance per instance. Choose based on your complexity needs and performance requirements.
Trade-offs
Consistency vs Performance - Stronger consistency requires more aggressive cache invalidation, which reduces cache hit rates and performance. Option A: Eventual consistency with long TTLs (5-10 minutes) gives you 95%+ hit rates but users might see stale data. Option B: Strong consistency with immediate invalidation gives you always-fresh data but hit rates drop to 70-80% because you’re constantly invalidating. Decision framework: What’s the business impact of stale data? For social media likes, staleness is fine. For inventory counts, it causes overselling. Most systems choose eventual consistency with TTLs tuned to acceptable staleness windows (30-60 seconds). Use active invalidation only for critical data paths.
Cache-Aside vs Read-Through - Cache-aside (lazy loading) means your application manages the cache—it checks the cache, and on a miss, fetches from the database and populates the cache. Read-through means the cache sits in front of the database and automatically populates itself on misses. Option A: Cache-aside gives you full control and works with any database, but requires more application code. Option B: Read-through means simpler application code but requires database integration and makes the cache a critical dependency. Decision framework: Use cache-aside for most applications—it’s more flexible and fails gracefully. Use read-through only when you have a caching layer tightly integrated with your database (like MySQL query cache or DynamoDB DAX).
Single Large Cache vs Multiple Specialized Caches - You can use one cache for everything or separate caches for different data types. Option A: Single cache is simpler to operate and allows memory to be shared across data types. Option B: Multiple caches allow different eviction policies and TTLs per data type, and failures are isolated. Decision framework: Start with a single cache for simplicity. Split into multiple caches when you have conflicting requirements—for example, session data (never evict, short TTL) vs computed results (evict freely, long TTL). Twitter uses separate Redis clusters for timelines, user data, and tweet content because they have different access patterns and consistency requirements.
Local Cache + Remote Cache (L1/L2) - You can combine in-process caching (L1) with a shared remote cache (L2) for the best of both worlds. L1 gives you sub-millisecond latency for hot data, L2 gives you shared state across instances. The tradeoff is complexity—you now have two caches to invalidate and two places to check. Decision framework: Use L1/L2 when you have extremely hot data (accessed thousands of times per second per instance) and can tolerate slightly stale data. Google uses L1 caching in frontend servers for user preferences—each server caches the preferences of its active users, with a shared L2 cache for the full dataset.
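An L1/L2 lookup is a straightforward two-tier fallthrough. A minimal sketch, with dicts standing in for the in-process cache and the shared Redis cluster (invalidating both tiers is omitted here, and is the hard part in practice):

```python
l1 = {}  # in-process dict (one per application server)
l2 = {}  # stand-in for the shared remote cache (e.g., Redis)

def get_l1_l2(key, load_from_db):
    if key in l1:                 # L1: sub-millisecond, per-instance
        return l1[key]
    if key in l2:                 # L2: shared across the fleet
        l1[key] = l2[key]         # promote hot data into L1
        return l2[key]
    value = load_from_db(key)     # miss in both tiers: source of truth
    l2[key] = value
    l1[key] = value
    return value
```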
Proactive Warming vs Reactive Caching - Reactive caching (cache on demand) is simple but causes cache misses on cold starts and after deployments. Proactive warming pre-populates the cache with predicted hot data. Option A: Reactive is simpler and works for most cases. Option B: Proactive warming prevents thundering herd problems but requires predicting what data will be hot. Decision framework: Use reactive caching by default. Add proactive warming for critical paths that can’t tolerate cold cache performance. Netflix pre-warms caches with popular movie metadata before peak hours to ensure consistent performance.
Common Pitfalls
Pitfall 1: Cache Stampede (Thundering Herd) - When a popular cache entry expires, hundreds of requests simultaneously discover it’s missing and all query the database at once, potentially overloading it. This happens because cache misses aren’t coordinated. Why it happens: A viral tweet’s cache entry expires, and suddenly 10,000 requests try to fetch it from the database simultaneously. How to avoid: Use request coalescing (only one request fetches, others wait) or probabilistic early expiration (randomly expire entries slightly before TTL to spread out the load). A distributed lock enforces single-flight fetches—for example, Redis SET with the NX flag and an expiry, or the Redlock algorithm when the lock must survive the failure of a single Redis node. Alternatively, use a background job to refresh popular entries before they expire.
Pitfall 2: Caching Large Objects - Storing multi-megabyte objects in the cache wastes memory and causes network timeouts. A 5MB user object might seem fine until you realize you can only cache 6,000 of them in 32GB of RAM. Why it happens: Developers cache entire API responses or database rows without considering size. How to avoid: Cache only the data you need. Instead of caching a full user object with profile, settings, and history, cache just the profile. Break large objects into smaller cacheable pieces. Spotify caches individual song metadata (1KB) rather than entire playlists (100KB+). Set a max value size limit (e.g., 100KB) and reject larger entries.
Pitfall 3: Cache Invalidation Bugs - The cache contains stale data because invalidation logic is wrong or incomplete. A user updates their email, but the cached profile still shows the old email. Why it happens: Invalidation logic is scattered across the codebase, or developers forget to invalidate related keys. How to avoid: Centralize cache invalidation logic. When data changes, invalidate all related cache keys. Use a write-through cache pattern where writes go through a single code path that handles both database and cache. Document which cache keys depend on which database tables. Uber uses a cache invalidation service that subscribes to database change logs and automatically invalidates affected keys.
Pitfall 4: Ignoring Cache Failure Modes - The application crashes or returns errors when the cache is unavailable, even though the database is healthy. Why it happens: Developers treat cache failures as fatal errors instead of degraded performance. How to avoid: Wrap all cache operations in try-catch blocks and fall back to the database on cache failures. Set aggressive timeouts on cache operations (50-100ms) so a slow cache doesn’t block requests. Monitor cache availability separately from application availability. Airbnb’s services continue operating at reduced performance when Redis is down—they just hit the database directly.
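The fallback pattern is a try-except around every cache read. A minimal sketch with a stand-in client (`FlakyCache` and `CacheError` are hypothetical; a real Redis client raises connection and timeout errors that you would catch the same way, with a short timeout configured on the client):

```python
import json

class CacheError(Exception):
    pass

class FlakyCache:
    # Stand-in client so the sketch runs anywhere.
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.store = {}

    def get(self, key):
        if not self.healthy:
            raise CacheError("cache unavailable")
        return self.store.get(key)

def fetch_from_db(key):
    # Placeholder for the real database query.
    return {"key": key, "source": "db"}

def get_with_fallback(cache, key):
    # Treat any cache failure as a miss: degrade to the source of truth
    # instead of failing the request.
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except CacheError:
        pass  # degraded mode: log the failure, serve from the database
    return fetch_from_db(key)
```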
Pitfall 5: Caching Non-Deterministic Data - Caching data that includes timestamps, random values, or user-specific tokens causes subtle bugs—for example, a cached API response includes “generated_at: 2024-01-15 10:30:00” and users see inconsistent timestamps. Why it happens: Developers cache fully rendered responses without stripping the per-request values embedded in them. How to avoid: Only cache deterministic data that’s the same for all users requesting it. If you must cache user-specific data, include the user ID in the cache key. Remove timestamps and random values before caching, or generate them at read time. Stripe caches API responses but strips out request IDs and timestamps that would differ per request.
Cache Stampede (Thundering Herd) Problem and Solution
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant App as Application
participant Cache as Redis Cache
participant DB as Database
Note over Cache: Popular entry expires at T=0
rect rgb(255, 243, 205)
Note over C1,DB: Problem: All requests hit DB simultaneously
C1->>App: Request data
C2->>App: Request data
C3->>App: Request data
App->>Cache: GET key (miss)
App->>Cache: GET key (miss)
App->>Cache: GET key (miss)
App->>DB: Query (10,000 concurrent)
App->>DB: Query (10,000 concurrent)
App->>DB: Query (10,000 concurrent)
Note over DB: Database overload!
DB-->>App: Data
DB-->>App: Data
DB-->>App: Data
end
rect rgb(212, 237, 218)
Note over C1,DB: Solution: Request coalescing with lock
C1->>App: Request data
C2->>App: Request data
C3->>App: Request data
App->>Cache: GET key (miss)
App->>Cache: Acquire lock (success)
App->>Cache: GET key (miss, wait)
App->>Cache: GET key (miss, wait)
App->>DB: Single query
DB-->>App: Data
App->>Cache: SET key + release lock
App->>Cache: GET key (hit)
App->>Cache: GET key (hit)
Note over App: Only 1 DB query instead of 10,000
end
Cache stampede occurs when a popular cache entry expires and thousands of requests simultaneously query the database. Request coalescing uses distributed locks to ensure only one request fetches from the database while others wait, preventing database overload.
Math & Calculations
Cache Hit Rate and Database Load Reduction
Formula: Database Load Reduction = 1 / (1 - Hit Rate)
Variables:
- Hit Rate: Percentage of requests served from cache (0.0 to 1.0)
- Database Load Reduction: How many times fewer database queries you make
Worked Example: Your application receives 10,000 requests per second. Without caching, all 10,000 hit the database. With a 90% cache hit rate:
- Cache hits: 10,000 × 0.90 = 9,000 requests/sec (served from cache)
- Cache misses: 10,000 × 0.10 = 1,000 requests/sec (hit database)
- Database load reduction: 1 / (1 - 0.90) = 10x
You’ve reduced database load from 10,000 to 1,000 queries per second—a 10x improvement. If you improve hit rate to 95%:
- Database load reduction: 1 / (1 - 0.95) = 20x
- Database queries: 10,000 × 0.05 = 500 requests/sec
Going from 90% to 95% hit rate doubles your database load reduction. This is why every percentage point matters at scale.
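The load-reduction arithmetic above as code:

```python
def db_load_reduction(hit_rate):
    # 1 / (1 - hit_rate): how many times fewer queries reach the database.
    return 1 / (1 - hit_rate)

def db_queries_per_sec(total_rps, hit_rate):
    # Only cache misses reach the database.
    return total_rps * (1 - hit_rate)
```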
Cache Memory Sizing
Formula: Required Memory ≈ Total Entries × Hit Rate Target × Average Entry Size × Overhead Factor (assuming roughly uniform access, so caching X% of entries yields roughly an X% hit rate)
Worked Example: You want to cache user profiles. You have 10 million active users, each profile is 2KB serialized, and you want a 95% hit rate (meaning you can cache 95% of users):
- Entries to cache: 10,000,000 × 0.95 = 9,500,000 profiles
- Memory required: 9,500,000 × 2KB ≈ 19GB
- Add 20% overhead for Redis metadata: 19GB × 1.2 ≈ 23GB
You need a 32GB Redis instance to safely cache 95% of user profiles. If you only have 16GB, you can cache ~7 million profiles (70% hit rate), reducing database load by only 3.3x instead of 20x.
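The sizing arithmetic above as code, under the same uniform-access assumption (caching X% of entries yields roughly an X% hit rate):

```python
def cache_memory_gb(total_entries, hit_rate_target, avg_entry_kb, overhead=1.2):
    # overhead=1.2 budgets ~20% extra for per-key metadata.
    # avg_entry_kb is in KB; dividing by 1e6 converts KB to (decimal) GB.
    return total_entries * hit_rate_target * avg_entry_kb * overhead / 1e6

# 10M user profiles at 2KB each, targeting a 95% hit rate:
needed_gb = cache_memory_gb(10_000_000, 0.95, 2)
```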
TTL and Staleness Window
Formula: Maximum Staleness = TTL (worst case: the cache entry is written immediately before the underlying data changes)
Worked Example: You set a 60-second TTL on product inventory counts. A product’s inventory changes at time T=0:
- If the cache entry was written at T=-30, it lives until T=30 (60 seconds after it was written), so users see the stale count for up to 30 seconds after the change.
- Worst case, the entry was written just before the change, so it serves the stale count until T=60—a full TTL of staleness.
Users might see inventory counts that are up to 60 seconds out of date. If this is unacceptable, reduce the TTL to 10 seconds, capping staleness at 10 seconds. The tradeoff: shorter TTL means more cache misses and higher database load.
Real-World Examples
Netflix: Multi-Tier Caching for Streaming Metadata - Netflix uses EVCache (a wrapper around Memcached) to cache movie metadata, user viewing history, and personalization data across multiple regions. They run thousands of Memcached instances storing terabytes of data, with cache hit rates exceeding 99% for popular content. The interesting detail: Netflix uses a multi-tier caching strategy where edge servers have local caches (L1) for extremely hot data like homepage recommendations, backed by regional EVCache clusters (L2) for the full catalog. When a new show launches, they proactively warm caches in all regions to prevent thundering herd problems. They also use probabilistic early expiration—cache entries expire randomly within a window (e.g., 290-310 seconds for a 300-second TTL) to spread out database load. This architecture allows Netflix to serve 200+ million users with relatively modest database infrastructure.
Twitter: Timeline Caching with Redis - Twitter caches user timelines (the list of tweets you see on your homepage) in Redis using sorted sets. Each user’s timeline is a sorted set of tweet IDs ordered by timestamp. When you load Twitter, the app fetches your timeline from Redis (typically 0.5ms), then fetches the actual tweet content for those IDs (also cached). This two-level approach allows them to cache timelines efficiently—a timeline is just a list of IDs (a few KB), not full tweet objects (hundreds of KB). The interesting detail: Twitter uses a write-through cache for timelines. When someone you follow tweets, Twitter immediately updates your cached timeline by adding the new tweet ID to your sorted set. This is expensive (requires updating millions of timelines for popular accounts) but ensures users see new tweets instantly. For extremely popular accounts (celebrities with 100M+ followers), they use a hybrid approach: only update timelines for active users, and rebuild timelines on-demand for inactive users.
Stripe: Caching Merchant Configurations - Stripe caches merchant account configurations (API keys, webhook URLs, payment settings) in Redis with 5-minute TTLs. These configurations are read on every API request to validate the merchant and determine routing rules, making them extremely hot data. The interesting detail: Stripe uses a cache-aside pattern with request coalescing to prevent stampedes. When a cache entry expires, the first request to discover the miss acquires a distributed lock (using Redis) and fetches from the database. Other concurrent requests wait for the lock holder to populate the cache, then read the fresh value. This prevents 1,000 simultaneous requests from all hitting the database. They also use hierarchical caching—each API server has a small in-process cache (L1) for the hottest 1,000 merchants, backed by a shared Redis cluster (L2) for all merchants. This reduces Redis load by 80% while keeping latency under 1ms for the most active merchants.
Netflix Multi-Tier Caching Architecture
graph TB
subgraph "Edge Servers (Regional)"
User["User Request<br/><i>Homepage load</i>"]
L1["L1 Cache<br/><i>In-process</i><br/>Hot recommendations<br/>1-10ms TTL<br/>99.9% hit rate"]
end
subgraph "Regional EVCache Cluster"
L2_1["EVCache Node 1<br/><i>Memcached</i>"]
L2_2["EVCache Node 2<br/><i>Memcached</i>"]
L2_3["EVCache Node 3<br/><i>Memcached</i>"]
L2["L2 Cache Layer<br/><i>Full catalog metadata</i><br/>300s TTL<br/>99% hit rate"]
L2 -.-> L2_1
L2 -.-> L2_2
L2 -.-> L2_3
end
subgraph "Data Layer"
DB[("Cassandra<br/><i>Source of truth</i><br/>Movie metadata<br/>User history")]
Warm["Cache Warming Service<br/><i>Proactive population</i><br/>Pre-warm popular content<br/>Probabilistic expiration"]
end
User --"1. Request"--> L1
L1 --"2. L1 miss<br/>(0.1% of requests)"--> L2
L2 --"3. L2 miss<br/>(1% of L1 misses)"--> DB
DB --"4. Return data"--> L2
L2 --"5. Populate L2"--> L2
L2 --"6. Return to L1"--> L1
L1 --"7. Populate L1"--> L1
Warm --"Pre-warm before<br/>new show launch"--> L2
Netflix uses L1 in-process caching for extremely hot data (homepage recommendations) with sub-millisecond latency, backed by L2 regional EVCache clusters for the full catalog. Cache warming service proactively populates caches before new content launches to prevent thundering herd, achieving 99%+ combined hit rate.
Interview Expectations
Mid-Level
What You Should Know: Explain the basic cache-aside pattern (check cache, on miss fetch from database and populate cache). Understand Redis vs Memcached differences at a high level (Redis has data structures and persistence, Memcached is simpler and faster). Know what TTL means and why you need eviction policies like LRU. Be able to calculate cache hit rate and explain why it matters. Understand that caching introduces consistency challenges—the cache can be stale.
Bonus Points: Mention cache warming strategies to avoid cold start problems. Discuss how to choose appropriate TTLs based on data freshness requirements. Explain the thundering herd problem and suggest request coalescing as a solution. Show awareness that cache failures should degrade performance, not break the application. Mention monitoring cache hit rates as a key operational metric.
Senior
What You Should Know: Design a complete caching layer including key naming conventions, TTL strategies per data type, and invalidation approaches. Explain the tradeoffs between cache-aside, read-through, and write-through patterns with specific use cases for each. Calculate cache memory requirements based on data size and hit rate targets. Discuss distributed caching with consistent hashing and how to handle hotspot keys. Understand cache coherence in multi-region deployments and eventual consistency tradeoffs.
Bonus Points: Propose L1/L2 caching architectures and explain when the added complexity is worth it. Discuss probabilistic early expiration to prevent synchronized cache misses. Explain how to handle cache invalidation in microservices architectures (event-driven invalidation, CDC patterns). Show experience with cache failure modes and circuit breaker patterns. Mention specific Redis features like pipelining, Lua scripts for atomic operations, or Redis Cluster for sharding. Discuss cache observability—what metrics to track beyond hit rate (latency percentiles, eviction rate, memory fragmentation).
Staff+
What You Should Know: Design caching strategies for global-scale systems with multiple regions and consistency requirements. Explain how to migrate from one caching technology to another without downtime (dual-write patterns, gradual rollout). Discuss cache economics—when caching saves money vs when it adds cost without benefit. Design cache invalidation strategies for complex data dependencies (graph invalidation, dependency tracking). Understand the CAP theorem implications for distributed caches and how to make conscious consistency tradeoffs.
Distinguishing Signals: Propose novel caching strategies for specific workloads (e.g., predictive caching using ML, adaptive TTLs based on access patterns). Discuss cache security concerns (cache poisoning, timing attacks) and mitigation strategies. Explain how to debug cache-related production issues (cache key analysis, hit rate anomaly detection). Show experience designing caching for specific domains (financial systems requiring strong consistency, social systems optimizing for eventual consistency). Discuss organizational aspects—how to prevent cache misuse across teams, establishing caching guidelines and best practices. Mention cutting-edge approaches like RDMA-based caches or persistent memory caching.
Common Interview Questions
Q1: How would you design a caching layer for a social media feed?
60-second answer: Use Redis sorted sets to store feed entries (post IDs) ordered by timestamp. Cache the feed structure (list of IDs) separately from post content. Set 30-60 second TTLs to balance freshness with hit rate. Use write-through caching to update feeds immediately when new posts arrive. Implement pagination by fetching ranges from the sorted set.
2-minute answer: I’d use a two-tier approach: cache feed structure (post IDs) in Redis sorted sets with user_id as the key, and cache individual post content separately with post_id as the key. The feed structure would use write-through caching—when someone posts, we immediately update the sorted sets for all their followers (or at least active followers). This is expensive but ensures real-time updates. For post content, I’d use cache-aside with 5-minute TTLs since post content rarely changes after publishing. To handle celebrities with millions of followers, I’d use a hybrid approach: maintain cached feeds only for active users (logged in within 24 hours) and rebuild feeds on-demand for inactive users. I’d also implement pagination by fetching ranges from the sorted set (ZRANGE command) rather than loading entire feeds. Monitor cache hit rates per user segment—active users should have 95%+ hit rates, while inactive users will have lower rates. The key tradeoff is write amplification (updating millions of feeds per post) vs read performance (instant feed loads).
Red flags: Suggesting caching entire feed objects instead of just IDs (wastes memory). Not considering write-through vs cache-aside tradeoffs. Ignoring the celebrity problem (popular accounts with millions of followers). Not mentioning pagination or how to handle feed updates.
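The feed design above can be sketched with an in-memory stand-in that mimics the Redis sorted-set commands the answer relies on (ZADD for fan-out writes, ZREVRANGE for newest-first pagination, ZREMRANGEBYRANK for trimming). The class and key names are illustrative.

```python
import bisect

class FeedStore:
    """In-memory stand-in for per-user feed sorted sets: each feed is a
    list of (timestamp, post_id) kept in ascending timestamp order."""

    def __init__(self, max_len=800):
        self._feeds = {}                 # user_id -> [(timestamp, post_id)]
        self._max_len = max_len

    def push_post(self, follower_ids, post_id, ts):
        # Fan-out on write: write-through into each follower's cached feed.
        for uid in follower_ids:
            feed = self._feeds.setdefault(uid, [])
            bisect.insort(feed, (ts, post_id))   # ~ ZADD feed:{uid} ts post_id
            if len(feed) > self._max_len:
                del feed[0]                      # ~ ZREMRANGEBYRANK trims oldest

    def page(self, user_id, offset=0, limit=20):
        # Newest-first pagination, ~ ZREVRANGE feed:{uid} offset offset+limit-1
        feed = self._feeds.get(user_id, [])
        newest_first = feed[::-1]
        return [post_id for _, post_id in newest_first[offset:offset + limit]]
```

Only post IDs live in the feed structure; post content is fetched separately (and cached under its own keys), which is exactly the memory-saving split the answer recommends.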
Q2: Your cache hit rate suddenly drops from 95% to 60%. How do you debug this?
60-second answer: First, check if cache servers are healthy (memory usage, eviction rate, network latency). Look for changes in traffic patterns—did a new feature launch or did a marketing campaign drive unusual traffic? Analyze cache key patterns to see if specific key types are missing more often. Check if TTLs were recently changed. Look for deployment changes that might have altered cache key generation logic.
2-minute answer: I’d follow a systematic debugging process. First, verify cache infrastructure health: check memory usage (are we evicting aggressively due to memory pressure?), CPU usage, network latency, and error rates. High eviction rates suggest we need more memory or have a memory leak. Next, segment hit rate by cache key pattern to identify which data types are missing. If “user:profile:” keys have 95% hit rate but “product:inventory:” keys have 30%, the problem is isolated to inventory data. Then, correlate with recent changes: deployments, configuration changes, traffic spikes, or new features. A common culprit is cache key generation logic changing—if a deployment changed how keys are formatted, old cache entries become unreachable. Check application logs for cache errors or timeouts. Analyze traffic patterns: did we get a sudden influx of new users (cold cache for new user IDs) or is a specific feature driving unusual access patterns? Finally, check if TTLs were inadvertently shortened or if a cache flush was triggered. I’d also look at the time-series graph of hit rate to see if the drop was sudden (suggests a deployment or infrastructure issue) or gradual (suggests traffic pattern change or memory pressure).
Red flags: Immediately assuming it’s a cache infrastructure problem without checking application changes. Not segmenting hit rate by key pattern to isolate the issue. Suggesting to “just add more cache servers” without understanding root cause. Not considering that the drop might be expected (e.g., Black Friday traffic with many new users).
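Segmenting hit rate by key pattern, the key diagnostic step above, is a small aggregation over cache access logs. This sketch assumes keys follow a `type:subtype:id` naming convention and that each logged event records whether it was a hit; the function name is illustrative.

```python
from collections import defaultdict

def hit_rate_by_prefix(events):
    """Aggregate (key, was_hit) cache events into per-prefix hit rates,
    assuming keys follow a 'type:subtype:id' naming convention."""
    stats = defaultdict(lambda: [0, 0])          # prefix -> [hits, total]
    for key, was_hit in events:
        prefix = ":".join(key.split(":")[:2])    # e.g. 'user:profile'
        stats[prefix][0] += int(was_hit)
        stats[prefix][1] += 1
    return {p: hits / total for p, (hits, total) in stats.items()}
```

A result like `{"user:profile": 0.95, "product:inventory": 0.30}` immediately localizes the regression to one data type instead of the whole cache tier.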
Q3: When would you NOT use caching?
60-second answer: Don’t cache when data changes frequently (every second) and must be strongly consistent—caching adds complexity without benefit. Don’t cache data that’s rarely accessed (cold data)—you waste memory. Don’t cache data that’s cheap to compute or fetch (<5ms). Don’t cache when you have strict compliance requirements that prohibit data duplication. Don’t cache user-specific data if you have millions of users and each user’s data is accessed infrequently.
2-minute answer: Several scenarios make caching counterproductive. First, highly volatile data with strong consistency requirements—if data changes every second and users must see changes immediately, caching adds complexity without performance benefit. Financial account balances are an example: the cost of cache invalidation overhead exceeds the benefit of caching. Second, cold data that’s rarely accessed—caching your entire database when only 5% of data is hot wastes memory and reduces hit rates for actually hot data. Third, cheap operations—if a database query takes 2ms and caching adds 1ms of network latency plus serialization overhead, you’ve gained nothing. Fourth, compliance and security constraints—some industries prohibit caching sensitive data (PII, health records) due to data residency or encryption requirements. Fifth, write-heavy workloads—if you’re writing more than reading, you spend all your time invalidating cache entries. Finally, when operational complexity exceeds benefit—if your team lacks experience with caching and your scale doesn’t justify it (e.g., 10 requests/second), the operational burden of running and debugging a cache outweighs the performance gain. The key principle: caching is an optimization that adds complexity. Only add it when the performance benefit clearly justifies the cost.
Red flags: Saying “always cache everything” without considering tradeoffs. Not mentioning consistency requirements or data access patterns. Suggesting caching for write-heavy workloads without discussing the invalidation complexity. Not considering operational overhead.
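The "cheap operations" argument above comes down to back-of-envelope math: expected latency with a cache is a hit-rate-weighted average, and a miss pays the cache lookup on top of the source fetch. The numbers below are hypothetical.

```python
def expected_latency_ms(hit_rate, cache_ms, source_ms):
    """Expected read latency with a cache in front of a data source.
    A miss pays the cache lookup plus the source fetch (cache-population
    cost is ignored for simplicity)."""
    return hit_rate * cache_ms + (1 - hit_rate) * (cache_ms + source_ms)

# Expensive query (50ms): caching wins by a wide margin (6.0ms expected).
slow = expected_latency_ms(hit_rate=0.90, cache_ms=1.0, source_ms=50.0)

# Cheap query (2ms): expected latency only drops to 1.2ms, while adding a
# new dependency, serialization overhead, and a consistency problem.
cheap = expected_latency_ms(hit_rate=0.90, cache_ms=1.0, source_ms=2.0)
```

The asymmetry is the whole argument: a 50ms query improves ~8x, a 2ms query improves ~1.7x at best, and at low hit rates the cache can make the cheap query slower than no cache at all.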
Q4: How do you handle cache invalidation in a microservices architecture?
60-second answer: Use event-driven invalidation where services publish events when data changes, and other services subscribe to invalidate their caches. Alternatively, use short TTLs (10-30 seconds) to accept eventual consistency. For critical paths, use a cache invalidation service that subscribes to database change logs (CDC) and invalidates affected cache keys across all services.
2-minute answer: Cache invalidation in microservices is challenging because multiple services might cache the same data, and you need to invalidate all copies when data changes. I’d use an event-driven approach: when a service modifies data, it publishes an event (e.g., “UserProfileUpdated”) to a message bus (Kafka, RabbitMQ). Services that cache user profiles subscribe to this event and invalidate their local cache entries. This decouples services—the data owner doesn’t need to know who’s caching its data. For more complex scenarios, I’d implement a cache invalidation service that subscribes to database change logs (using CDC tools like Debezium) and publishes invalidation events. This centralizes invalidation logic and ensures consistency. Another approach is hierarchical caching with ownership: only the service that owns the data caches it, and other services call that service’s API rather than caching themselves. This reduces invalidation complexity but increases inter-service traffic. For less critical data, I’d use short TTLs (10-30 seconds) and accept eventual consistency—this is simpler and works for many use cases. The key is choosing the right strategy per data type: use active invalidation for critical data (user auth tokens), TTL-based invalidation for less critical data (user preferences), and no caching for highly volatile data (real-time inventory).
Red flags: Suggesting each service directly invalidates other services’ caches (tight coupling). Not considering event-driven architectures. Proposing a single shared cache for all services without discussing ownership and invalidation complexity. Not mentioning TTL-based invalidation as a simpler alternative for non-critical data.
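The event-driven invalidation approach above can be sketched with an in-memory bus standing in for Kafka or RabbitMQ. The event name `UserProfileUpdated` comes from the answer; the class names and key format are illustrative.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a message bus like Kafka or RabbitMQ."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subs[topic]:
            handler(payload)

class ProfileCachingService:
    """A consumer service that caches user profiles locally and drops
    its copies when the owning service announces a change."""
    def __init__(self, bus):
        self.local_cache = {}
        bus.subscribe("UserProfileUpdated", self._on_profile_updated)

    def _on_profile_updated(self, event):
        # Invalidate, don't update: the next read repopulates via
        # cache-aside, so this service never needs the new value pushed.
        self.local_cache.pop(f"user:profile:{event['user_id']}", None)
```

Note the decoupling the answer emphasizes: the publisher never knows which services cache profiles, and each subscriber owns only its local invalidation logic.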
Q5: Explain the difference between Redis and Memcached and when to use each.
60-second answer: Memcached is a simple, fast key-value cache with no persistence—great for pure caching where you just need speed. Redis is more feature-rich with data structures (lists, sets, sorted sets), persistence options, and atomic operations—use it when you need more than simple key-value storage or when you want to recover cache data after restarts.
2-minute answer: Memcached is designed for one thing: fast key-value caching. It stores strings only, has no persistence, and uses a simple LRU eviction policy. It’s multi-threaded, so it can utilize multiple CPU cores efficiently. Use Memcached when you need pure caching speed, your data is simple key-value pairs, and you don’t care about losing cache data on restart. Facebook uses Memcached for caching database query results—they run thousands of instances and accept that restarts mean cache misses. Redis is more versatile: it supports rich data structures (hashes, lists, sets, sorted sets, bitmaps), atomic operations on these structures, optional persistence (RDB snapshots and AOF logs), pub/sub messaging, and Lua scripting. It’s single-threaded per instance, so CPU can become a bottleneck, but you can run multiple instances per server. Use Redis when you need data structures (e.g., sorted sets for leaderboards), atomic operations (e.g., incrementing counters), or persistence (e.g., session storage where losing data is unacceptable). Redis is also better for complex caching scenarios like caching parts of objects or maintaining relationships between cached data. The tradeoff: Redis is more complex to operate and has higher memory overhead per key. For simple caching at massive scale, Memcached’s simplicity and multi-threading can be advantageous. For anything more complex, Redis’s features justify the added complexity.
Red flags: Saying Redis is always better because it has more features (ignoring Memcached’s simplicity and multi-threading). Not mentioning specific use cases for each. Claiming Redis is slower without context (it’s single-threaded per instance, but that’s a design choice for simplicity). Not discussing persistence as a key differentiator.
Red Flags to Avoid
Red Flag 1: “Caching solves all performance problems” - Why it’s wrong: Caching is one optimization technique, but it doesn’t help with write-heavy workloads, CPU-bound operations, or network bottlenecks. It also introduces complexity (consistency, invalidation, operational overhead) that can create new problems. What to say instead: “Caching is highly effective for read-heavy workloads with hot data, but we should profile the system first to confirm that database latency is the bottleneck. If the problem is CPU-bound computation or write throughput, caching won’t help. We also need to consider the operational complexity of maintaining cache consistency.”
Red Flag 2: “Just set TTL to infinity so we never have cache misses” - Why it’s wrong: Infinite TTLs mean your cache will serve stale data forever. When data changes in the database, users will see outdated information until you manually invalidate the cache. This also wastes memory on data that’s no longer relevant. What to say instead: “TTL is our primary consistency knob. We should set TTLs based on how stale data can be for each use case. For user profiles, 5 minutes might be acceptable. For inventory counts, we might need 10 seconds or active invalidation. We can also use probabilistic early expiration to prevent thundering herd problems when popular entries expire.”
Red Flag 3: “Cache everything from the database to maximize hit rate” - Why it’s wrong: Caching cold data wastes memory and reduces hit rates for actually hot data. If only 10% of your data is frequently accessed, caching 100% means 90% of your cache is wasted. This also increases eviction rates and operational complexity. What to say instead: “We should identify hot data through access pattern analysis and cache only what’s frequently accessed. The Pareto principle often applies: 20% of data accounts for 80% of traffic. We can monitor access patterns and cache the top 10-20% of most-accessed keys. This maximizes hit rate while minimizing memory usage and operational complexity.”
Red Flag 4: “Cache failures should return errors to users” - Why it’s wrong: The cache is an optimization, not a critical dependency. If the cache fails but the database is healthy, the application should continue working at reduced performance. Treating cache failures as fatal errors causes unnecessary outages. What to say instead: “Cache operations should be wrapped in try-catch blocks with fallback to the database. If Redis is down, we log the error, increment a metric, and fetch from the database. The user experience is slower but functional. We should also set aggressive timeouts (50-100ms) on cache operations so a slow cache doesn’t block requests. This is called graceful degradation.”
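The graceful-degradation pattern above is a thin wrapper around the cache-aside read path. This sketch uses a deliberately broken cache client to show the fallback; the names are illustrative, and a real client would also configure the short (50-100ms) socket timeout mentioned above.

```python
class UnavailableCache:
    """Stand-in for a cache client whose backend is down."""
    def get(self, key):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")

def resilient_get(cache, db_fetch, key, on_error=None):
    """Cache-aside read that treats the cache as optional: any cache
    failure is reported via on_error (log + metric) and the request
    falls through to the database instead of failing."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except Exception as exc:
        if on_error:
            on_error(exc)            # record the failure, then degrade
    value = db_fetch(key)            # slower but functional
    try:
        cache.set(key, value)        # best effort: never fail the request
    except Exception:
        pass
    return value
```

The `on_error` hook is where the metric increment goes; an alert on its rate is usually how you find out the cache tier is unhealthy before users notice the latency.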
Red Flag 5: “Use cache-through pattern for everything” - Why it’s wrong: Cache-through (where the cache automatically populates itself on misses) requires tight integration with your database and makes the cache a critical dependency. It also doesn’t work well with complex queries or multiple data sources. Most applications use cache-aside (application manages the cache) for flexibility. What to say instead: “Cache-aside is more common because it’s flexible and works with any database. The application checks the cache, and on miss, fetches from the database and populates the cache. This gives us full control over what to cache and how to serialize it. Cache-through is useful for specific scenarios like DynamoDB DAX or when you have a caching layer tightly integrated with your database, but it’s not a general-purpose pattern.”
Key Takeaways
- Application caching uses in-memory stores (Redis, Memcached) between application servers and databases to reduce latency from 10-50ms to 0.1-1ms, enabling 10-100x database load reduction with 90-95% cache hit rates.
- The cache-aside pattern (check cache → on miss, fetch from database and populate cache) is most common because it’s flexible and fails gracefully. Always treat the cache as volatile—your application must function correctly with an empty cache.
- Choose Redis for rich data structures, atomic operations, and persistence; choose Memcached for pure caching speed and simplicity. TTL is your primary consistency knob—shorter TTLs give fresher data but lower hit rates.
- Cache invalidation is the hardest problem: use event-driven invalidation for critical data, TTL-based expiration for less critical data, and monitor cache hit rates religiously to detect issues early.
- Common pitfalls include cache stampede (thundering herd), caching large objects, invalidation bugs, not handling cache failures gracefully, and caching non-deterministic data. Always calculate cache memory requirements based on data size and target hit rate.
Related Topics
Prerequisites: Database Fundamentals - Understanding database query performance and indexes is essential before optimizing with caching. API Design - Knowing how APIs work helps you understand what to cache at the application layer.
Related Topics in This Module: CDN Caching - Caching at the edge for static content, complementary to application caching. Cache Invalidation Strategies - Deep dive into solving the hardest problem in caching. Distributed Caching - Scaling caches horizontally with consistent hashing and sharding.
Next Steps: Load Balancing - After optimizing individual servers with caching, learn to distribute load across multiple servers. Database Replication - Caching and replication are complementary strategies for handling read-heavy workloads.