Cache Invalidation

Intermediate · 14 min read · Updated 2026-02-11

After this topic, you will be able to:

  • Explain why cache invalidation is considered one of the hardest problems in computer science
  • Differentiate between TTL-based, event-based, and manual invalidation strategies
  • Analyze the cache stampede problem and evaluate mitigation techniques
  • Design an invalidation strategy that balances consistency and performance

TL;DR

Cache invalidation is the process of removing or updating stale data from a cache to maintain consistency with the source of truth. It’s famously one of the hardest problems in computer science because it requires balancing data freshness, system performance, and consistency guarantees across distributed systems. The core challenge: how do you know when cached data is outdated, and how do you update it without causing cascading failures?

Cheat Sheet: TTL-based (time expiration), write-through invalidation (update on write), event-driven (message queue triggers), manual purge (explicit API calls). Watch for cache stampede when many requests hit an expired key simultaneously.

Mental Model

Think of cache invalidation like expiration dates on grocery store milk. The store (cache) stocks milk (data) to serve customers quickly without going to the dairy farm (database) every time. But milk spoils, so the store needs a strategy: print expiration dates on cartons (TTL), remove milk when the dairy delivers fresh batches (write-through invalidation), or have the dairy send alerts when recipes change (event-driven). The hard part? If everyone rushes to buy milk the moment it expires and the shelf is empty, you get a stampede to the dairy farm. Cache invalidation is about keeping the milk fresh without causing chaos when it runs out.

Why This Matters

Phil Karlton’s famous quote—“There are only two hard things in computer science: cache invalidation and naming things”—exists because cache invalidation sits at the intersection of consistency, availability, and performance trade-offs. In interviews, this topic separates candidates who understand distributed systems from those who just memorize patterns. Companies like Meta have published papers on achieving 10-nines consistency (99.99999999%) in their TAO cache specifically because getting invalidation wrong means users see stale friend lists, incorrect inventory counts, or outdated pricing.

Real-world impact: Uber’s surge pricing cache must invalidate within seconds or riders see wrong prices. Stripe’s payment cache must stay consistent or merchants get duplicate charges. Netflix’s metadata cache affects what 200M+ users see on their homepage. Every major outage investigation includes “did we invalidate the cache correctly?” as a root cause question. Mastering invalidation means you can reason about consistency guarantees and design systems that don’t lie to users.

Core Concept

Cache invalidation is the mechanism for removing or updating cached data when the underlying source of truth changes. The fundamental problem: caches are copies of data, and copies can become stale. Unlike cache eviction (removing data due to memory pressure—see Cache Eviction Policies), invalidation is about correctness: ensuring users don’t see outdated information.

The challenge compounds in distributed systems. When you update a database record, how do you notify all cache nodes across multiple data centers? What happens if the invalidation message arrives before the database write completes? What if the cache node is temporarily unreachable? These race conditions make invalidation notoriously difficult. The core trade-off: aggressive invalidation improves consistency but increases database load; lenient invalidation improves performance but risks serving stale data.

Request Coalescing Pattern

sequenceDiagram
    participant R1 as Request 1<br/>(First)
    participant R2 as Request 2
    participant R3 as Request 3
    participant Cache
    participant Lock as Distributed Lock<br/>(Redis SETNX)
    participant DB as Database
    
    R1->>Cache: GET homepage_feed
    Cache-->>R1: Cache miss
    R1->>Lock: SETNX lock:homepage_feed<br/>TTL=10s
    Lock-->>R1: Lock acquired ✓
    
    par Concurrent Requests
        R2->>Cache: GET homepage_feed
        Cache-->>R2: Cache miss
        R2->>Lock: SETNX lock:homepage_feed
        Lock-->>R2: Lock exists ✗
        
        R3->>Cache: GET homepage_feed
        Cache-->>R3: Cache miss
        R3->>Lock: SETNX lock:homepage_feed
        Lock-->>R3: Lock exists ✗
    end
    
    R1->>DB: SELECT * FROM feed<br/>(Only one query)
    DB-->>R1: Feed data
    R1->>Cache: SET homepage_feed<br/>TTL=3600s
    R1->>Lock: DEL lock:homepage_feed
    R1-->>R1: Return data
    
    loop Poll for data
        R2->>Cache: GET homepage_feed
        Cache-->>R2: Cache hit ✓
        R2-->>R2: Return data
        
        R3->>Cache: GET homepage_feed
        Cache-->>R3: Cache hit ✓
        R3-->>R3: Return data
    end
    
    Note over R1,DB: Only 1 database query<br/>for N concurrent requests

Request coalescing uses distributed locks to ensure only one request fetches data during a cache miss. Subsequent requests wait and poll the cache until the first request completes. This reduces database load from N concurrent queries to 1, preventing stampede at the cost of increased latency for waiting requests.
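The flow above can be sketched in-process under stated assumptions: a distributed deployment would swap the per-key `threading.Lock` for Redis SETNX with a TTL, exactly as in the diagram; `CoalescingCache` and its member names are illustrative, not from any real library:

```python
import threading
import time

class CoalescingCache:
    """Cache-aside with per-key request coalescing (single-flight).

    The first request to miss becomes the "leader" and fetches from the
    source of truth; concurrent misses for the same key block on the
    leader's lock instead of querying the database themselves.
    """

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn       # loads a key from the source of truth
        self._cache = {}
        self._locks = {}                # one lock per in-flight key
        self._guard = threading.Lock()  # protects _cache and _locks
        self.db_queries = 0             # instrumentation: count real fetches

    def get(self, key):
        with self._guard:
            if key in self._cache:
                return self._cache[key]        # cache hit
            lock = self._locks.get(key)
            leader = lock is None
            if leader:                         # first miss: register our lock
                lock = threading.Lock()
                lock.acquire()
                self._locks[key] = lock
        if leader:
            try:
                self.db_queries += 1
                value = self._fetch_fn(key)    # only the leader hits the DB
                with self._guard:
                    self._cache[key] = value
            finally:
                with self._guard:
                    del self._locks[key]
                lock.release()                 # wake the waiting followers
            return value
        with lock:                             # follower: block until leader
            pass                               # releases, then re-read cache
        with self._guard:
            return self._cache.get(key)
```

With 20 concurrent requests for the same missing key, only one database fetch executes; the rest wait briefly and read the populated cache, matching the "1 query for N requests" note in the diagram.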

How It Works

Cache invalidation operates through four primary strategies, each with different consistency and performance characteristics:

1. Time-To-Live (TTL) Expiration: Set an expiration timestamp when caching data. After TTL expires, the cache treats the entry as invalid and fetches fresh data on the next request. Simple and predictable, but you’re guaranteed stale data for up to TTL duration. Netflix uses 24-hour TTLs for movie metadata because catalog changes are infrequent. For TTL basics, see Caching Overview.

2. Write-Through Invalidation: When writing to the database, immediately invalidate (or update) the corresponding cache entry. Provides strong consistency but adds latency to write operations. The cache must be updated synchronously, so a slow cache can block writes. See Write-Through for the full pattern.

3. Event-Driven Invalidation: The database publishes change events to a message queue (Kafka, RabbitMQ), and cache nodes subscribe to invalidate affected keys. Decouples writes from invalidation, allowing asynchronous updates. Meta’s TAO uses this approach with a custom invalidation pipeline. The risk: message delays mean temporary inconsistency.

4. Manual/API-Driven Invalidation: Application code explicitly calls cache.delete() or cache.invalidate() when it knows data changed. Common in Cache Aside patterns where the application controls cache lifecycle. Flexible but error-prone—developers must remember to invalidate every relevant key.

Each strategy answers three questions differently: (1) When do we invalidate? (2) Who triggers invalidation? (3) What consistency guarantees do we provide?
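Strategies 1 and 4 can be combined in a few lines. A minimal in-memory sketch (the class and method names are illustrative; a production system would use Redis `SETEX`/`DEL` instead):

```python
import time

class TTLCache:
    """Minimal cache-aside store with TTL expiration (strategy 1)
    plus manual invalidation (strategy 4) as an explicit API call."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None                       # miss: caller fetches from DB
        value, expires_at = entry
        if time.monotonic() >= expires_at:    # TTL expired: treat as invalid
            del self._store[key]
            return None
        return value

    def invalidate(self, key):
        """Manual invalidation: called by application code on writes."""
        self._store.pop(key, None)
```

The TTL acts as the safety net; `invalidate()` is what write-through or event-driven pipelines would call when the source of truth changes.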

Cache Invalidation Strategies Comparison

graph TB
    subgraph TTL-Based
        T1["Client Request"]
        T2["Cache Check"]
        T3{"Expired?"}
        T4["Return Cached"]
        T5["Query Database"]
        T6["Update Cache<br/>with new TTL"]
        T1 --> T2
        T2 --> T3
        T3 --"No"--> T4
        T3 --"Yes"--> T5
        T5 --> T6
        T6 --> T4
    end
    
    subgraph Write-Through
        W1["Write Request"]
        W2["Update Database"]
        W3["Invalidate Cache"]
        W4["Return Success"]
        W1 --> W2
        W2 --> W3
        W3 --> W4
    end
    
    subgraph Event-Driven
        E1["Database Write"]
        E2["Publish Event<br/>to Queue"]
        E3["Cache Nodes<br/>Subscribe"]
        E4["Invalidate<br/>Locally"]
        E1 --> E2
        E2 --> E3
        E3 --> E4
    end

Three primary cache invalidation strategies with different consistency-performance trade-offs. TTL provides eventual consistency with predictable expiration, write-through offers strong consistency at the cost of write latency, and event-driven decouples invalidation from writes but introduces message delay.

Key Principles

Invalidate, Don’t Update (Usually). When data changes, deleting the cache entry is often safer than updating it in place. Deletion is idempotent—deleting twice has the same effect as deleting once. Updates can create race conditions where you overwrite newer data with older data if messages arrive out of order. The next read will fetch fresh data from the database. Example: Twitter’s timeline cache deletes entries when new tweets arrive rather than trying to insert tweets into cached timelines. The next timeline request rebuilds from the database, ensuring consistency even if invalidation messages arrive out of order.

Invalidation Must Be Faster Than Writes. If invalidation messages lag behind database writes, readers can fetch stale data from the database and re-cache it after the invalidation arrives. This creates a window where the cache is correct, then becomes stale again. Invalidation pipelines must have lower latency than database replication lag. Example: Meta’s TAO ensures invalidation messages propagate to all cache nodes within milliseconds, faster than their database replication lag of ~100ms. This ordering guarantee prevents re-caching stale data.

Scope Invalidation Carefully. Invalidating too broadly (e.g., clearing entire cache regions) wastes the cache’s value. Invalidating too narrowly (missing dependent keys) causes inconsistency. You must understand data dependencies: when a user profile changes, which cached queries become invalid? Example: when a user changes their profile photo on Facebook, TAO invalidates the user object, the user’s profile page cache, any friend lists that include that user’s photo, and notification previews. Missing any of these leaves stale images visible to other users.

Plan for Invalidation Failures. Invalidation messages can fail: network partitions, full message queues, cache nodes down for maintenance. Your system must degrade gracefully. Combine invalidation with TTLs as a safety net—even if invalidation fails, data eventually expires. Example: Stripe uses 5-minute TTLs on its payment method cache even though it has event-driven invalidation. If the invalidation pipeline fails, stale payment methods automatically expire within 5 minutes rather than persisting indefinitely.

Measure Invalidation Lag. The time between database write and cache invalidation is your consistency window. Instrument this metric. If invalidation lag spikes, you’re serving stale data. Track p50, p99, and p99.9 latencies separately—tail latencies matter for user-facing caches. Example: Uber monitors invalidation lag for their driver location cache. If lag exceeds 2 seconds, riders see outdated driver positions on the map. They alert on p99 lag > 1 second to catch issues before users notice.

Worked Example: Profile Update

Let’s walk through a concrete invalidation scenario: updating a user’s profile in a social network.

Step 1: Write to Database. Application receives a request to update user 12345’s display name. It writes the new name to the primary database. At this moment, the cache still contains the old name—we have inconsistency.

Step 2: Trigger Invalidation. Depending on strategy: (a) Write-through: application immediately calls cache.delete("user:12345") before returning success to the client. (b) Event-driven: database triggers a change event to Kafka topic “user-updates” with payload {user_id: 12345, field: "display_name"}. (c) Manual: application code explicitly invalidates after the database write.

Step 3: Propagate Invalidation. For event-driven systems, cache nodes consume from Kafka and execute local deletes. For write-through, the application waits for cache acknowledgment. This is where distributed systems complexity emerges: what if one cache node is down? Do we retry? Do we proceed?

Step 4: Handle Dependent Keys. The user’s profile isn’t the only affected cache entry. Friend lists, search results, notification previews—all contain the display name. The invalidation logic must identify and clear these dependent keys. Some systems use “tag-based invalidation” where cache entries are tagged with user IDs, allowing bulk invalidation of all entries related to user 12345.
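The tag-based approach in Step 4 can be sketched with a reverse index from tags to keys (a sketch under stated assumptions; `TaggedCache` is an illustrative name, though Redis sets and frameworks like Django's cache tagging libraries implement the same idea):

```python
from collections import defaultdict

class TaggedCache:
    """Cache entries carry tags (e.g. user IDs); invalidating a tag
    clears every dependent entry in one call."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = defaultdict(set)  # tag -> keys that depend on it

    def set(self, key, value, tags=()):
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def invalidate_tag(self, tag):
        """Bulk-invalidate every cache entry tagged with `tag`."""
        for key in self._keys_by_tag.pop(tag, set()):
            self._values.pop(key, None)
```

When user 12345's display name changes, one `invalidate_tag("user:12345")` clears the profile entry and every friend-list or preview entry that was tagged with that user at write time.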

Step 5: Serve Fresh Data. Next request for user 12345 finds no cache entry (cache miss), fetches from database, and populates the cache with the new name. Subsequent requests hit the cache.

The Race Condition: Between steps 1 and 2, a reader might fetch the old name from the database and cache it. Then invalidation arrives and deletes it. Then another reader fetches and caches the old name again before database replication completes. This is why invalidation must be faster than replication lag, and why some systems use versioning (cache entries include a version number; invalidation includes the version; only invalidate if versions match).
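One way to realize the versioning guard is to store a version alongside each entry and apply a delete or update only when the incoming version is at least the cached one, so delayed out-of-order messages become no-ops. A minimal sketch (`VersionedCache` is an illustrative name, not a real library):

```python
class VersionedCache:
    """Cache entries carry a monotonically increasing version (e.g. a
    database sequence number). Stale messages—older versions arriving
    late—are ignored instead of clobbering newer state."""

    def __init__(self):
        self._store = {}  # key -> (value, version)

    def set(self, key, value, version):
        current = self._store.get(key)
        if current is None or version >= current[1]:
            self._store[key] = (value, version)   # newer or equal: accept
        # older version: out-of-order message, drop it

    def invalidate(self, key, version):
        current = self._store.get(key)
        if current is not None and version >= current[1]:
            del self._store[key]                  # only delete if not stale

    def get(self, key):
        entry = self._store.get(key)
        return entry[0] if entry else None
```

A stale invalidation (version 1 arriving after version 2 was cached) leaves the newer value untouched, closing the re-cache window described above.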

User Profile Update with Event-Driven Invalidation

sequenceDiagram
    participant Client
    participant App as Application Server
    participant DB as Primary Database
    participant Queue as Kafka Topic
    participant Cache1 as Cache Node 1
    participant Cache2 as Cache Node 2
    participant Reader as Read Request
    
    Client->>App: POST /users/12345<br/>{name: "New Name"}
    App->>DB: UPDATE users SET name='New Name'<br/>WHERE id=12345
    DB-->>App: Write confirmed
    DB->>Queue: Publish event<br/>{user_id: 12345, field: "name"}
    App-->>Client: 200 OK
    
    Note over Cache1,Cache2: Old name still cached
    
    Queue->>Cache1: Consume event<br/>{user_id: 12345}
    Cache1->>Cache1: DELETE key "user:12345"
    Queue->>Cache2: Consume event<br/>{user_id: 12345}
    Cache2->>Cache2: DELETE key "user:12345"<br/>DELETE key "friends:*:12345"
    
    Reader->>Cache1: GET user:12345
    Cache1-->>Reader: Cache miss
    Reader->>DB: SELECT * FROM users<br/>WHERE id=12345
    DB-->>Reader: {name: "New Name"}
    Reader->>Cache1: SET user:12345<br/>{name: "New Name"}
    Reader-->>Reader: Return fresh data

Event-driven invalidation flow showing the race condition window between database write and cache invalidation. The consistency window (time between DB write and cache delete) is bounded by message queue latency, typically 10-100ms in production systems.

The Cache Stampede Problem

What It Is

Cache stampede (also called thundering herd) occurs when a popular cache entry expires and multiple requests simultaneously discover it’s missing. All requests hit the database concurrently to regenerate the value, creating a sudden spike in database load. For expensive queries (complex joins, aggregations), this can overwhelm the database and cause cascading failures.

The scenario: imagine a homepage feed cache with 1-hour TTL serving 10,000 requests/second. When it expires, the next 10,000 requests in that second all miss the cache and query the database. If the query takes 2 seconds, you’ve just sent 20,000 concurrent queries to the database—far exceeding its capacity.

Cache Stampede Scenario

graph TB
    subgraph "t=0: Cache Valid"
        C1["Cache Entry<br/>key: homepage_feed<br/>TTL: 3600s<br/>Serving 10K QPS"]
    end
    
    subgraph "t=3600s: Cache Expires"
        R1["Request 1"]
        R2["Request 2"]
        R3["Request 3"]
        R4["Request ..."]
        R5["Request 10,000"]
        C2["Cache<br/>(empty)"]
        
        R1 & R2 & R3 & R4 & R5 --> C2
    end
    
    subgraph "t=3600.1s: Stampede"
        DB[("Database<br/>Overwhelmed")]
        Q1["Query 1<br/>2s latency"]
        Q2["Query 2<br/>2s latency"]
        Q3["Query 3<br/>2s latency"]
        Q4["Query ...<br/>2s latency"]
        Q5["Query 10,000<br/>2s latency"]
        
        C2 -."All miss".-> Q1 & Q2 & Q3 & Q4 & Q5
        Q1 & Q2 & Q3 & Q4 & Q5 --> DB
    end
    
    subgraph "Impact"
        I1["❌ DB CPU: 100%"]
        I2["❌ Query latency: 2s → 30s"]
        I3["❌ Connection pool exhausted"]
        I4["❌ Cascading failures"]
    end
    
    DB --> I1
    I1 --> I2
    I2 --> I3
    I3 --> I4

Cache stampede occurs when a popular cache entry expires and thousands of concurrent requests simultaneously query the database. With 10K QPS and 2-second query time, this creates 20K concurrent database queries, overwhelming capacity and causing cascading failures across the entire system.

Why It Happens

Stampedes happen because cache expiration is deterministic and synchronized. If you cache a value at 10:00 AM with 1-hour TTL, it expires at 11:00 AM sharp. Every request after 11:00 AM sees a miss until someone repopulates it. In high-traffic systems, “every request” might be thousands per second. The problem compounds with probabilistic eviction policies—if memory pressure causes eviction of hot keys, the stampede is even worse because there’s no TTL warning.

Impact

The impact cascades: (1) Database CPU spikes to 100%. (2) Query latency increases from milliseconds to seconds. (3) More requests time out, causing retries. (4) Retry storms amplify the load. (5) Database becomes unresponsive, affecting all queries, not just the stampeded key. (6) Application servers queue up waiting for database responses, exhausting connection pools. (7) Load balancers mark application servers as unhealthy. (8) The entire service goes down.

Real incident: A major e-commerce site had their product catalog cache expire during a flash sale. The stampede sent 50,000 concurrent queries to their MySQL database, which locked up. The site was down for 12 minutes, costing millions in lost sales.

Mitigation Strategies

Request Coalescing (Locking). When the first request discovers a cache miss, it acquires a lock on that key and fetches from the database. Subsequent requests for the same key see the lock and wait for the first request to complete and populate the cache. Only one database query executes per key. Libraries like Groupcache (used at Google) implement this pattern. Implementation: use a distributed lock (Redis SETNX, Memcached add) with the cache key as the lock name. The first request to acquire the lock fetches data; others poll the cache until the value appears. Set a lock timeout (e.g., 10 seconds) to handle failures—if the fetching request crashes, the lock expires and another request can try. Trade-off: adds latency for waiting requests (they must poll or block). If the database query is slow, many requests queue up. But it’s better than overwhelming the database.

Probabilistic Early Expiration. Instead of waiting for the TTL to expire, randomly refresh the cache before expiration, with a probability that increases as expiration approaches. The canonical formula (XFetch): refresh early when now - delta * beta * ln(rand()) >= expiry, where delta is the time the recomputation takes and rand() is uniform on (0, 1). Because ln(rand()) is negative, the left-hand side overshoots the current time by a random, beta-scaled amount, so refreshes spread out over time rather than synchronizing at expiration. Implementation: when reading from cache, evaluate the formula; if it fires, trigger a background refresh while still serving the cached value. The beta parameter controls aggressiveness—higher beta means earlier and more frequent refreshes. Typical values: beta = 1.0 to 2.0. Trade-off: increases the cache refresh rate (more database load) but spreads it evenly—you’re trading spiky high load for steady low load. Works best for read-heavy workloads where the extra refreshes are negligible.
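The XFetch check (a standard formulation of probabilistic early expiration) reduces to a single predicate; function and parameter names here are illustrative, not from any particular library:

```python
import math
import random
import time

def should_refresh_early(cached_at, ttl, recompute_cost, beta=1.0, now=None):
    """XFetch-style early-expiration check.

    Returns True with probability that rises toward 1 as expiry nears.
    -ln(rand) is a positive random factor, so a costlier recompute
    (recompute_cost) or a larger beta triggers refreshes earlier.
    """
    if now is None:
        now = time.monotonic()
    expiry = cached_at + ttl
    r = max(random.random(), 1e-12)  # guard against log(0)
    return now - recompute_cost * beta * math.log(r) >= expiry
```

Callers evaluate this on every cache read; when it returns True they kick off a background refresh while still serving the cached value, so no two clients synchronize on the same expiry instant.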

Stale-While-Revalidate. Serve stale cached data while asynchronously fetching fresh data in the background. Users get instant responses (even if slightly outdated), and the cache updates without blocking. HTTP Cache-Control headers support this with stale-while-revalidate=&lt;seconds&gt;. Implementation: store both data and expiration timestamp in the cache. On read, if expired but within the revalidation window, return stale data immediately and trigger an async refresh. The next request gets fresh data. Requires accepting eventual consistency. Trade-off: users may see stale data for one request after expiration. Not suitable for financial data or inventory counts where staleness causes real problems. Perfect for social feeds, recommendations, or content where slight staleness is acceptable.

External Cache Warming. Proactively refresh cache entries before they expire using a background job. A separate service monitors cache TTLs and refreshes popular keys before expiration, decoupling cache refresh from user requests entirely. Implementation: maintain a priority queue of cache keys sorted by expiration time. A worker process continuously pops keys nearing expiration and refreshes them. Track access frequency to prioritize hot keys. Uber uses this for driver location caches. Trade-off: requires infrastructure for the warming service, and you must track which keys are hot (it’s not worth warming cold keys). Adds complexity but provides the smoothest user experience—no stampedes, no stale data.
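One pass of the warming worker's loop can be sketched with a heap keyed on expiry time (a sketch under stated assumptions; `warm_cycle` and the `(expiry, key, ttl)` tuple layout are illustrative choices):

```python
import heapq

def warm_cycle(heap, refresh_fn, now, lead_time):
    """One pass of a cache-warming worker.

    heap holds (expiry, key, ttl) tuples ordered by expiry. Every key
    whose expiry falls within lead_time of now is refreshed and
    re-scheduled with a fresh expiry; returns the refreshed keys.
    """
    due = []
    while heap and heap[0][0] - now <= lead_time:
        due.append(heapq.heappop(heap))        # collect all due keys first
    for _expiry, key, ttl in due:
        refresh_fn(key)                        # repopulate the cache entry
        heapq.heappush(heap, (now + ttl, key, ttl))
    return [key for _expiry, key, _ttl in due]
```

Collecting all due keys before re-pushing avoids an infinite loop when a key's TTL is shorter than the lead time; a real worker would run this on a timer and skip keys whose access frequency has dropped.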

Common Misconceptions

“Just use short TTLs to keep cache fresh.” Why it’s wrong: short TTLs mean frequent cache misses, which defeats the purpose of caching. If your TTL is 10 seconds and queries take 100ms, you’re hitting the database 10x more often than necessary. The database load increases proportionally as TTL decreases. The truth: TTL should match your staleness tolerance, not your desired freshness. If you can tolerate 5 minutes of stale data, use a 5-minute TTL. If you need fresher data, use event-driven invalidation instead of shortening TTL. Combine long TTLs (for performance) with explicit invalidation (for consistency).

“Invalidation guarantees consistency.” Why it’s wrong: invalidation is asynchronous in distributed systems. Even with event-driven invalidation, there’s a window between database write and cache invalidation where readers see stale data. Network delays, message queue lag, and cache node failures all extend this window. The truth: invalidation provides eventual consistency, not strong consistency. If you need strong consistency (read-your-writes), don’t cache, or use write-through patterns where writes block until the cache updates. For most applications, eventual consistency with bounded staleness (TTL as a safety net) is acceptable.

“Deleting cache entries is always safe.” Why it’s wrong: aggressive invalidation can cause cache stampedes and increase database load beyond capacity. If you invalidate a hot key serving 10,000 QPS, you’ve just sent 10,000 queries to your database. Also, invalidating dependent keys incorrectly can cause cascading invalidations that clear your entire cache. The truth: invalidation must be rate-limited and scoped carefully. Use request coalescing to prevent stampedes. Understand your data dependencies to avoid over-invalidation. Monitor cache hit rates—if they drop suddenly after deploying invalidation logic, you’re invalidating too aggressively.

“Cache invalidation is a caching problem.” Why it’s wrong: invalidation is a distributed systems consistency problem that happens to involve caches. The hard parts—ordering guarantees, handling failures, race conditions—are the same challenges you face with distributed databases, message queues, and replicated state machines. The truth: solving invalidation requires understanding distributed systems primitives: causality, happens-before relationships, idempotency, and failure modes. This is why it’s considered hard. You’re not just managing a cache; you’re maintaining consistency across multiple independent systems with network delays and partial failures.

“Updating cache is better than invalidating.” Why it’s wrong: updates are not idempotent and create race conditions. If two invalidation messages arrive out of order (an update to value B arrives before an update to value A), you cache the wrong value. Deletions are idempotent—deleting twice has the same effect as deleting once. The truth: invalidate (delete) by default. Only update cache in place if: (1) you have strict ordering guarantees (version numbers, sequence IDs), (2) the update is idempotent, and (3) the performance benefit justifies the complexity. Most systems delete and let the next read repopulate.

Real-World Usage

Meta TAO

Meta’s TAO (The Associations and Objects) cache serves trillions of requests per day for Facebook’s social graph. They’ve published extensively on their invalidation system, which achieves 10-nines consistency (99.99999999%). Their approach: database writes trigger invalidation messages to a dedicated pipeline that propagates to all cache nodes within milliseconds. They use versioning to handle out-of-order messages and have fallback TTLs (hours) as a safety net. The key insight: invalidation messages must propagate faster than database replication lag to prevent re-caching stale data.

Meta TAO Invalidation Architecture

graph LR
    subgraph "Write Path"
        Client["Client Request<br/>(Update Profile)"]
        AppServer["Application<br/>Server"]
        MySQL[("MySQL<br/>Primary DB")]
        Client --"1. POST /profile"--> AppServer
        AppServer --"2. UPDATE"--> MySQL
    end
    
    subgraph "Invalidation Pipeline"
        Trigger["DB Trigger<br/>(on write)"]
        InvQueue["Invalidation<br/>Queue<br/><i>Custom pipeline</i>"]
        MySQL --"3. Trigger event"--> Trigger
        Trigger --"4. Publish<br/>{user_id, version}"--> InvQueue
    end
    
    subgraph "Cache Layer - Region 1"
        Cache1["TAO Cache<br/>Node 1"]
        Cache2["TAO Cache<br/>Node 2"]
        InvQueue --"5. Propagate<br/>(< 10ms)"--> Cache1
        InvQueue --"5. Propagate<br/>(< 10ms)"--> Cache2
    end
    
    subgraph "Cache Layer - Region 2"
        Cache3["TAO Cache<br/>Node 3"]
        Cache4["TAO Cache<br/>Node 4"]
        InvQueue --"5. Propagate<br/>(< 10ms)"--> Cache3
        InvQueue --"5. Propagate<br/>(< 10ms)"--> Cache4
    end
    
    subgraph "Consistency Guarantees"
        V1["✓ Versioned deletes<br/>(handle out-of-order)"]
        V2["✓ Invalidation < replication lag<br/>(< 100ms)"]
        V3["✓ Fallback TTL: hours<br/>(safety net)"]
    end
    
    Cache1 & Cache2 -.-> V1
    InvQueue -.-> V2
    Cache3 & Cache4 -.-> V3
    
    ReadReq["Read Request"] --"6. Cache miss"--> Cache1
    Cache1 --"7. Fetch fresh"--> MySQL

Meta’s TAO cache invalidation architecture achieves 10-nines consistency by ensuring invalidation messages propagate faster than database replication lag (< 10ms vs ~100ms). Versioned deletes handle out-of-order messages, and fallback TTLs provide a safety net if the invalidation pipeline fails.

Uber Geospatial

Uber’s driver location cache must invalidate within 1-2 seconds or riders see outdated driver positions. They use a hybrid approach: drivers publish location updates to a stream (Kafka), which triggers cache invalidation. But they also use 5-second TTLs as a safety net. For surge pricing cache, they use write-through invalidation because pricing must be consistent—showing wrong prices causes customer support nightmares. They’ve shared that cache stampede mitigation (request coalescing) is critical during surge events when millions of riders check prices simultaneously.

Stripe Payments

Stripe caches payment method data (credit cards, bank accounts) with strict consistency requirements. They use event-driven invalidation: when a customer updates their payment method, the database publishes an event that invalidates all related cache entries (customer object, payment method list, default payment method). They combine this with 5-minute TTLs because payment data staleness can cause duplicate charges or failed payments. Their invalidation pipeline has sub-second latency and includes retry logic with exponential backoff for failed invalidations.

Netflix Metadata

Netflix caches movie/show metadata (titles, descriptions, artwork) with 24-hour TTLs because catalog changes are infrequent. They use manual invalidation via internal APIs when content teams update metadata. For personalized recommendations, they use shorter TTLs (minutes) because recommendation models update frequently. They’ve shared that cache warming is critical—they pre-populate cache with popular titles before traffic spikes (new season releases) to avoid stampedes. Their CDN cache uses stale-while-revalidate to serve slightly outdated artwork while refreshing in the background.


Interview Essentials

Mid-Level

Explain the three main invalidation strategies (TTL, write-through, event-driven) and when to use each. Describe cache stampede and at least one mitigation (request coalescing or probabilistic early expiration). Understand the consistency trade-off: aggressive invalidation improves freshness but increases database load. Be able to calculate: if a cache entry serves 1000 QPS with 1-hour TTL, how many database queries per hour? (Answer: ~1, plus stampede risk at expiration). Recognize that invalidation is about correctness, not performance.

Senior

Design an invalidation strategy for a specific system (e.g., e-commerce product catalog, social media feed). Justify your choice of strategy based on consistency requirements, traffic patterns, and failure modes. Explain how to handle dependent keys (when product price changes, invalidate product page, search results, cart totals). Discuss race conditions: what if invalidation arrives before database write completes? How do you prevent re-caching stale data? Describe monitoring: what metrics indicate invalidation problems? (Cache hit rate drops, invalidation lag spikes, database load increases). Explain how to test invalidation logic (chaos engineering: delay messages, drop messages, reorder messages).

Staff+

Architect an invalidation system for global scale (multiple data centers, millions of QPS). Discuss cross-region consistency: how do you invalidate caches in US-East when database writes in US-West? Explain versioning schemes to handle out-of-order messages. Design for failure: what if the invalidation pipeline is down for 10 minutes? How do you recover? Discuss the CAP theorem implications: you can’t have strong consistency and high availability with network partitions—how do you choose? Explain how Meta achieves 10-nines consistency (faster invalidation than replication lag, versioning, idempotent deletes). Discuss alternative approaches: CRDT-based caches, consensus-based invalidation (Raft/Paxos). Quantify the business impact: if invalidation lag increases from 100ms to 1 second, what’s the user impact? How do you measure it?

Common Interview Questions

How would you invalidate cache when a user updates their profile?

What’s the difference between cache invalidation and cache eviction?

How do you prevent cache stampede?

When would you use TTL-based invalidation vs. event-driven invalidation?

How do you handle invalidation failures?

What metrics would you monitor for cache invalidation?

How do you invalidate dependent cache entries?

Explain a time when cache invalidation caused a production issue.

Red Flags to Avoid

Confusing invalidation with eviction (they’re different problems)

Suggesting “just use short TTLs” without considering database load

Not mentioning cache stampede when discussing TTL expiration

Claiming invalidation provides strong consistency in distributed systems

Ignoring race conditions between writes and invalidation

Not considering failure modes (what if invalidation message is lost?)

Over-engineering (suggesting Paxos for a simple cache when TTL would work)

No mention of monitoring or measuring invalidation lag


Key Takeaways

Cache invalidation is hard because it’s a distributed consistency problem, not just a caching problem. You must handle race conditions, message delays, and partial failures while balancing consistency and performance.

Choose invalidation strategy based on consistency requirements: TTL for eventual consistency, write-through for strong consistency, event-driven for decoupled systems. Combine strategies (event-driven + TTL safety net) for production robustness.

Cache stampede occurs when popular cache entries expire and multiple requests hit the database simultaneously. Mitigate with request coalescing, probabilistic early expiration, or stale-while-revalidate depending on your consistency tolerance.

Invalidate (delete) rather than update cache entries—deletions are idempotent and avoid race conditions. Only update in place if you have strict ordering guarantees and the complexity is justified.

Always have a fallback: combine explicit invalidation with TTLs as a safety net. Monitor invalidation lag, cache hit rates, and database load to detect problems before users notice stale data.