Write-Behind Cache: Async DB Writes Explained
After this topic, you will be able to:
- Implement the write-behind pattern with appropriate queuing and batching strategies
- Evaluate the data loss risks and mitigation strategies for write-behind caching
- Compare write-behind with write-through and recommend the appropriate choice
- Design a write-behind system that balances performance and durability
TL;DR
Write-behind (also called write-back) is a caching pattern where writes go to cache first and are asynchronously propagated to the database later, dramatically improving write performance at the cost of potential data loss if the cache fails before persistence. The pattern excels in write-heavy workloads where you can tolerate eventual consistency and brief windows of data vulnerability.
Cheat Sheet: Cache write → Queue → Async DB write | Best for: high-frequency updates (location, counters) | Risk: data loss on cache failure | Mitigation: write-ahead logs, replication | Used by: Uber (location updates), gaming leaderboards
The Problem It Solves
Database writes are expensive. Every synchronous write to disk involves seek time, write amplification, and transaction overhead. When you have write-heavy workloads—think Uber drivers updating their location every few seconds, or gaming systems updating player scores thousands of times per minute—forcing every write through to the database creates crushing bottlenecks. The database becomes the limiting factor, not because it can’t handle the data volume, but because the synchronous nature of writes creates latency that cascades through your entire system.
The pain is particularly acute when you have temporal locality in your writes. If a user updates the same record multiple times in quick succession (editing a document, adjusting settings, moving on a map), you’re hammering the database with writes where only the final state matters. You’re paying the full cost of persistence for intermediate states that have no lasting value. This is wasteful and slow.
Write-behind solves this by decoupling write acknowledgment from write durability. The application gets instant confirmation that the write succeeded (from cache), while the actual database persistence happens later, asynchronously. This transforms write latency from 10-50ms (database round-trip) to sub-millisecond (cache write), a 10-100x improvement that fundamentally changes what your system can handle.
Solution Overview
Write-behind caching inverts the traditional write path. Instead of writing to the database first (or simultaneously, as in write-through), the application writes exclusively to the cache and immediately returns success. A background process—typically a queue consumer or scheduled worker—reads from the cache and persists changes to the database on a delayed schedule.
The magic happens in the gap between cache write and database write. During this window, multiple writes to the same key can be coalesced into a single database operation. If a user updates their profile three times in five seconds, only the final state hits the database. This write coalescing dramatically reduces database load, often by 10-100x in high-frequency update scenarios.
The pattern introduces a queue (logical or physical) between cache and database. This queue can be as simple as a background thread polling the cache for dirty entries, or as sophisticated as a distributed message queue like Kafka. The queue provides buffering, batching, and retry logic, transforming bursty write traffic into smooth, manageable database load. The trade-off is clear: you get exceptional write performance, but you accept a window of vulnerability where data exists only in cache, not on durable storage.
Write Coalescing: Multiple Updates Batched
sequenceDiagram
participant App as Application
participant Cache as Redis Cache
participant Worker as Background Worker
participant DB as Database
Note over App,DB: Time: T=0s
App->>Cache: Write user:123 location = (10, 20)
Cache-->>App: OK
Note over App,DB: Time: T=2s
App->>Cache: Write user:123 location = (15, 25)
Cache-->>App: OK
Note over App,DB: Time: T=4s
App->>Cache: Write user:123 location = (18, 30)
Cache-->>App: OK
Note over App,DB: Time: T=5s (Batch Flush)
Worker->>Cache: Get dirty entries
Cache-->>Worker: user:123 = (18, 30)
Note over Worker: Only latest value!<br/>3 writes coalesced to 1
Worker->>DB: UPDATE user:123 SET location = (18, 30)
DB-->>Worker: Success
Worker->>Cache: Clear dirty flag
Write coalescing in action: three location updates for the same user arrive within 5 seconds, but only the final state (18, 30) is written to the database. This reduces database load by 67% in this example, and can achieve 10-100x reduction in high-frequency update scenarios where intermediate states have no lasting value.
How It Works
Step 1: Application writes to cache. When the application needs to persist data, it writes directly to the cache (Redis, Memcached, or an in-process cache) and marks the entry as “dirty” or “pending persistence.” The write completes in microseconds, and the application immediately returns success to the user. At this point, the data exists only in volatile memory.
Step 2: Cache tracks dirty entries. The cache maintains metadata about which entries need database persistence. This might be a separate “dirty set” data structure, timestamps on cache entries, or flags in the cache value itself. For example, Redis might use a sorted set where scores are timestamps, allowing workers to efficiently find entries that have been dirty for more than N seconds.
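The timestamp-based dirty tracking described above can be modeled in plain Python. This is a hypothetical sketch: the `DirtyTracker` class name is mine, and the dict stands in for a Redis sorted set (in Redis you would use ZADD to record the first-dirty timestamp and ZRANGEBYSCORE to find entries older than N seconds).

```python
import time

class DirtyTracker:
    """In-memory stand-in for a Redis sorted set tracking dirty keys.
    Scores are first-dirty timestamps, so workers can efficiently find
    entries that have been awaiting persistence for more than N seconds."""

    def __init__(self):
        self._scores = {}  # key -> timestamp of first unflushed write

    def mark_dirty(self, key, now=None):
        now = time.time() if now is None else now
        # setdefault keeps the EARLIEST dirty time: repeated writes to the
        # same key must not reset the clock, or a hot key could wait forever.
        self._scores.setdefault(key, now)

    def dirty_for(self, min_age, now=None):
        """Return keys that have been dirty for at least min_age seconds."""
        now = time.time() if now is None else now
        return [k for k, t in self._scores.items() if now - t >= min_age]

    def clear(self, key):
        """Called by the worker after the entry is persisted."""
        self._scores.pop(key, None)
```

The earliest-timestamp choice matters: if each write refreshed the score, a key updated every second would never age past the threshold and never be flushed.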
Step 3: Background workers poll for dirty entries. Asynchronous workers (threads, processes, or separate services) periodically scan the cache for dirty entries. The polling interval is a critical tuning parameter: too frequent and you lose batching benefits; too infrequent and you increase the data loss window. Typical intervals range from 100ms to several seconds, depending on your durability requirements.
Step 4: Batching and coalescing. Workers collect multiple dirty entries and batch them into efficient database operations. If the same key was updated multiple times, only the latest value is written—this is write coalescing. A batch might contain 100-1000 writes, transformed into a single bulk insert or update statement. This is where write-behind achieves its massive efficiency gains.
Step 5: Database persistence. The worker writes the batch to the database. If the write succeeds, the dirty flag is cleared from cache. If it fails, the worker retries with exponential backoff. Some implementations remove the cache entry after successful persistence (treating cache as a write buffer), while others keep it for reads (treating cache as both write buffer and read cache).
Step 6: Handling concurrent updates. If new writes arrive while a batch is being persisted, they’re marked dirty again and picked up in the next cycle. Version numbers or timestamps prevent older writes from overwriting newer ones. For example, if the cache has version 5 and the database has version 3, the worker writes version 5. If version 6 arrives during persistence, it’s queued for the next batch.
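The six steps above can be sketched end to end in a few dozen lines. This is a minimal in-memory model, not a production implementation: the `WriteBehindCache` name is mine, a dict stands in for Redis, and `persist_batch` is a caller-supplied stand-in for the bulk database write. A real deployment would run `flush()` from a background worker on a timer.

```python
import threading

class WriteBehindCache:
    """Minimal sketch of steps 1-6: write to cache, mark dirty, then flush
    coalesced batches to a persistence callback (the 'database')."""

    def __init__(self, persist_batch):
        self._cache = {}           # stands in for Redis
        self._dirty = set()        # stands in for the dirty set
        self._lock = threading.Lock()
        self._persist_batch = persist_batch

    def write(self, key, value):
        # Steps 1-3: cache write plus dirty mark, then return immediately.
        # At this point the data exists only in volatile memory.
        with self._lock:
            self._cache[key] = value
            self._dirty.add(key)

    def flush(self):
        # Step 4: snapshot and clear the dirty set; reading the cache once
        # per key means only the LATEST value survives (write coalescing).
        with self._lock:
            keys = list(self._dirty)
            self._dirty.clear()
            batch = {k: self._cache[k] for k in keys}
        if not batch:
            return
        try:
            self._persist_batch(batch)  # step 5: one bulk write
        except Exception:
            # Step 6: re-mark on failure so the next cycle retries; the
            # cache still holds the newest value, so no update is lost.
            with self._lock:
                self._dirty.update(keys)
```

Note that the dirty set is cleared before persistence begins: writes arriving mid-flush re-enter the set and are picked up next cycle, matching step 6.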
Write-Behind Flow with Background Worker
graph LR
App["Application<br/><i>Service</i>"]
Cache[("Redis Cache<br/><i>In-Memory</i>")]
DirtySet["Dirty Set<br/><i>Pending Writes</i>"]
Worker["Background Worker<br/><i>Async Process</i>"]
DB[("PostgreSQL<br/><i>Persistent Storage</i>")]
App --"1. Write data<br/>(sub-ms)"---> Cache
App --"2. Mark as dirty"---> DirtySet
App --"3. Return success<br/>immediately"---> App
Worker --"4. Poll every 5s"---> DirtySet
Worker --"5. Batch dirty entries"---> Worker
Worker --"6. Persist batch<br/>(10-50ms)"---> DB
Worker --"7. Clear dirty flag"---> DirtySet
Write-behind flow showing the complete cycle: application writes to cache and returns immediately (steps 1-3), while a background worker asynchronously polls, batches, and persists dirty entries to the database (steps 4-7). The key insight is the decoupling of write acknowledgment (sub-millisecond) from write durability (seconds later).
Variants
Time-based batching: Workers flush dirty entries on a fixed schedule (every 5 seconds, every minute). This is simple to implement and provides predictable database load, but it means some writes might sit in cache for nearly a full interval before persistence. Best for workloads where you can tolerate several seconds of data loss and want smooth, predictable database traffic. Used in analytics systems where eventual consistency is acceptable.
Size-based batching: Workers flush when the dirty set reaches a certain size (1000 entries, 10MB of data). This optimizes for database efficiency—you’re always writing full batches—but it means flush timing is unpredictable and depends on write volume. During low-traffic periods, data might sit in cache for extended periods. Best for high-throughput systems where write volume is consistently high. Gaming leaderboards often use this approach.
Hybrid batching: Combines time and size triggers—flush every 5 seconds OR when 1000 entries are dirty, whichever comes first. This balances predictable durability (time bound) with efficiency (size bound). Most production systems use this variant because it handles both high and low traffic gracefully.
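The hybrid trigger logic is small enough to show directly. This is an illustrative sketch (the `HybridFlushTrigger` name is mine); the worker would call `should_flush` on each poll and `record_flush` after each successful flush.

```python
import time

class HybridFlushTrigger:
    """Flush every max_interval seconds OR at max_entries dirty keys,
    whichever comes first -- the hybrid policy described above."""

    def __init__(self, max_entries=1000, max_interval=5.0):
        self.max_entries = max_entries
        self.max_interval = max_interval
        self.last_flush = time.time()

    def should_flush(self, dirty_count, now=None):
        now = time.time() if now is None else now
        # Size bound keeps batches efficient under high traffic;
        # time bound keeps the data loss window bounded under low traffic.
        return (dirty_count >= self.max_entries
                or now - self.last_flush >= self.max_interval)

    def record_flush(self, now=None):
        self.last_flush = time.time() if now is None else now
```

Under heavy load the size bound dominates (full batches every few hundred milliseconds); under light load the time bound guarantees no write waits longer than `max_interval`.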
Write-behind with write-ahead log: Before acknowledging the write to the application, the cache appends it to a durable write-ahead log (WAL). This adds a few milliseconds of latency but dramatically reduces data loss risk—even if cache crashes, the WAL can replay writes. This is how databases themselves work internally. Best when you need write-behind performance but can’t tolerate data loss. Cassandra’s commit log is an example of this pattern.
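The append-before-acknowledge idea can be sketched with a local file. This is a simplified illustration (the `WriteAheadLog` name is mine): production WALs batch fsyncs, handle partial-line corruption on replay, and may write to distributed storage rather than a local disk.

```python
import json
import os

class WriteAheadLog:
    """Append each write to a durable log before acknowledging it, and
    replay the log after a cache crash to recover unflushed writes."""

    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durability point: the write now survives
                                  # a crash; this fsync is the added latency

    def replay(self):
        """Return all logged writes, oldest first, for cache recovery."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def truncate(self):
        # Called only after the worker confirms the batch is in the database.
        open(self.path, "w").close()
```

The ordering is the whole point: append and fsync happen before the application acknowledges the write, and truncation happens only after database persistence succeeds.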
Batching Strategies Comparison
graph TB
subgraph Time-Based Batching
T1["Flush every 5 seconds<br/><i>Predictable timing</i>"]
T2["Pros: Smooth DB load<br/>Bounded data loss window"]
T3["Cons: Writes may wait<br/>full interval"]
end
subgraph Size-Based Batching
S1["Flush at 1000 entries<br/><i>Optimal batch size</i>"]
S2["Pros: Maximum DB efficiency<br/>Always full batches"]
S3["Cons: Unpredictable timing<br/>Long waits in low traffic"]
end
subgraph Hybrid Batching
H1["Flush every 5s OR 1000 entries<br/><i>Whichever comes first</i>"]
H2["Pros: Bounded latency + efficiency<br/>Handles variable traffic"]
H3["Cons: More complex logic<br/>Slightly higher overhead"]
end
Best["Production Choice:<br/>Hybrid Batching"] --> H1
Three batching strategies for write-behind systems: time-based provides predictable durability windows, size-based optimizes database efficiency, and hybrid combines both benefits. Most production systems use hybrid batching because it gracefully handles both high and low traffic patterns while maintaining bounded data loss windows.
Trade-offs
Write latency vs. durability: Write-behind gives you sub-millisecond write acknowledgment, but data isn’t durable until it hits the database (seconds to minutes later). Write-through gives you 10-50ms writes, but data is immediately durable. Choose write-behind when user experience depends on write speed and you can tolerate brief data loss windows. Choose write-through when every write must be durable before acknowledgment (financial transactions, inventory updates).
Database load vs. complexity: Write-behind can reduce database write load by 10-100x through batching and coalescing, but you must implement queue management, retry logic, failure handling, and monitoring. Write-through is architecturally simpler—just write to both cache and database—but your database must handle peak write load. Choose write-behind when database capacity is your bottleneck and you have engineering resources to manage complexity. Choose write-through when simplicity and operational safety matter more than peak performance.
Consistency vs. performance: Write-behind creates eventual consistency—reads might see stale data if they hit the database before writes are flushed. Write-through provides strong consistency—reads always see the latest write. Choose write-behind when your read path goes through cache (so reads see the latest writes anyway) or when eventual consistency is acceptable. Choose write-through when you need read-after-write consistency across all clients.
Data loss risk vs. throughput: Write-behind accepts that cache failure means data loss for unflushed writes. Write-through guarantees no data loss because writes are synchronously persisted. Mitigation strategies (WAL, replication) reduce but don’t eliminate write-behind’s risk. Choose write-behind when the data is reconstructible (location updates, session state) or when you’ve implemented robust durability strategies. Choose write-through when data loss is unacceptable (user account changes, payment records).
Durability and Data Loss Prevention
Write-ahead logging: Before marking a cache entry as successfully written, append the write to a durable log on disk or distributed storage. If the cache crashes, replay the log to recover unflushed writes. This is the gold standard for durability but adds 2-5ms to write latency. Redis AOF (Append-Only File) and Cassandra’s commit log use this approach. The log can be local (fast but single point of failure) or distributed (slower but survives node failures).
Cache replication: Run multiple cache replicas and synchronously replicate writes across them before acknowledging success. If one cache node fails, others have the data. This adds network round-trip latency (1-2ms within a datacenter) but provides high availability. Redis replication follows this pattern when made quasi-synchronous with the WAIT command (replication is asynchronous by default), with Redis Sentinel handling failover between replicas. The trade-off is increased infrastructure cost and complexity.
Periodic snapshots: Periodically dump the entire cache state to durable storage (every 5 minutes, every hour). If the cache crashes, restore from the latest snapshot and accept data loss for writes since the last snapshot. This is cheaper than WAL (less I/O) but has a larger data loss window. Redis RDB snapshots work this way. Best when you can tolerate losing several minutes of writes.
Acknowledgment strategies: Instead of immediately returning success, wait for the write to be queued in a durable queue (Kafka, RabbitMQ) before acknowledging. This adds 5-10ms but ensures writes survive cache failures. The application still gets fast acknowledgment (compared to database writes), but you’ve moved the durability boundary forward. Uber’s location updates use Kafka as a durable buffer between cache and database.
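The key design decision in the acknowledgment strategy is ordering: enqueue durably first, acknowledge second. A toy sketch of that ordering, with the class name mine and a plain list standing in for a Kafka-style producer send with acknowledgments enabled:

```python
class DurableQueueWriter:
    """Acknowledge a write only after it is in the cache AND in a durable
    queue. The dict models Redis; the list models a durable queue (e.g. a
    Kafka topic with producer acks) that survives a cache crash."""

    def __init__(self, cache, queue):
        self.cache = cache
        self.queue = queue

    def write(self, key, value):
        self.cache[key] = value          # fast cache write for readers
        self.queue.append((key, value))  # durability boundary: if this
                                         # raises, the client gets no ack
                                         # and knows to retry
        return "OK"                      # ack only after both succeed
```

If the queue append fails, the caller never sees success, so nothing is silently lost; the durability boundary has moved from the database to the queue.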
Recovery procedures: Design your system to detect and recover from data loss. Maintain checksums or version numbers so you can identify gaps. Implement reconciliation jobs that compare cache and database state and repair inconsistencies. For example, if a user’s profile shows version 100 in cache but version 95 in the database, you know versions 96-100 were lost and can trigger recovery (re-request from the user, restore from backup, or accept the loss). The key is making data loss visible and recoverable rather than silent and corrupting.
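The version-gap check from the example above is a one-liner worth making explicit. The function name is mine, for illustration:

```python
def find_lost_versions(cache_version, db_version):
    """Reconciliation check: if the cache shows version 100 but the
    database shows version 95, versions 96-100 were never persisted
    and need recovery (replay, restore from backup, or accepted loss)."""
    if cache_version <= db_version:
        return []  # database is caught up (or ahead); nothing lost
    return list(range(db_version + 1, cache_version + 1))
```

A reconciliation job would run this per key, emit metrics on non-empty results, and trigger the recovery path, turning silent data loss into a visible, actionable event.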
Write-Behind with Write-Ahead Log (WAL)
graph TB
App["Application"]
Cache[("Redis Cache<br/><i>Volatile Memory</i>")]
WAL["Write-Ahead Log<br/><i>Durable Storage</i>"]
Worker["Background Worker"]
DB[("Database")]
Recovery["Recovery Process<br/><i>On Cache Crash</i>"]
App --"1. Write data"---> Cache
App --"2. Append to WAL<br/>(+2-5ms)"---> WAL
App --"3. Return success<br/>after WAL write"---> App
Worker --"4. Poll cache"---> Cache
Worker --"5. Persist batch"---> DB
Worker --"6. Truncate WAL<br/>after success"---> WAL
WAL -."7. Replay on<br/>cache failure".-> Recovery
Recovery -."8. Restore<br/>lost writes".-> Cache
Write-ahead logging adds durability to write-behind by appending writes to a durable log before acknowledging success. This adds 2-5ms latency but dramatically reduces data loss risk—if the cache crashes, the recovery process replays the WAL to restore unflushed writes. This hybrid approach provides write-behind performance with durability close to write-through.
When to Use (and When Not To)
Use write-behind when: (1) You have high-frequency writes to the same keys—location updates, counters, session state, real-time game scores. Write coalescing provides massive efficiency gains. (2) Write latency directly impacts user experience—mobile apps where every millisecond of lag is felt, real-time collaboration tools, gaming. (3) Your database is the bottleneck—you’re hitting write capacity limits and need to reduce load. (4) You can tolerate eventual consistency—reads can be slightly stale, or your read path goes through cache anyway. (5) Data is reconstructible or low-value—if you lose a few seconds of writes, you can recover or accept the loss.
Avoid write-behind when: (1) Data loss is unacceptable—financial transactions, user account changes, inventory management. Use write-through with strong durability guarantees instead. (2) You need strong consistency—reads must immediately reflect writes across all clients. Write-behind creates consistency windows. (3) Writes are infrequent or already batched—if you’re writing once per minute, write-behind adds complexity without performance benefits. (4) You lack operational maturity—write-behind requires sophisticated monitoring, alerting, and recovery procedures. If you can’t detect and respond to data loss, don’t use this pattern. (5) Compliance requires audit trails—some regulations mandate that every write is immediately durable and traceable. Write-behind’s asynchronous nature complicates compliance.
Real-World Examples
Uber’s location updates: Uber drivers send location updates every 4 seconds, generating millions of writes per minute globally. Writing each update directly to the database would be prohibitively expensive. Instead, Uber uses write-behind caching with Kafka as the durable queue. Driver locations are written to Redis, batched by Kafka, and asynchronously persisted to Cassandra. Write coalescing reduces database writes by 50-70%—if a driver sends three updates before the batch flushes, only the latest position is stored. The system tolerates brief data loss because location data is continuously refreshed; losing a few seconds of positions doesn’t impact the rider experience.
Gaming leaderboards: High-traffic games like Fortnite or League of Legends update player scores thousands of times per second. Writing every score change to the database would crush write capacity. Instead, scores are written to Redis with write-behind persistence. The system batches score updates every 10 seconds, coalescing multiple updates per player into a single database write. During peak hours, this reduces database load by 100x. The trade-off is acceptable: if the cache crashes, players might lose a few seconds of score progress, but the game state is continuously regenerated through ongoing play.
E-commerce shopping carts: Amazon-scale e-commerce systems use write-behind for shopping cart updates. When a user adds items to their cart, the change is written to a distributed cache (ElastiCache) and asynchronously persisted to DynamoDB. This provides instant feedback to the user (critical for perceived performance) while batching database writes. The system uses write-ahead logging to prevent cart data loss—each cart update is logged to Kinesis before acknowledgment. If the cache fails, Kinesis replays the log to recover carts. This hybrid approach gives write-behind performance with write-through durability for high-value data.
Uber Location Updates Architecture
graph LR
subgraph Driver App
Driver["Driver<br/><i>Updates every 4s</i>"]
end
subgraph Write-Behind Layer
API["Location API<br/><i>Service</i>"]
Redis[("Redis Cache<br/><i>Latest Positions</i>")]
Kafka["Kafka Queue<br/><i>Durable Buffer</i>"]
end
subgraph Persistence Layer
Consumer["Kafka Consumer<br/><i>Batch Writer</i>"]
Cassandra[("Cassandra<br/><i>Historical Data</i>")]
end
subgraph Rider App
Rider["Rider<br/><i>Sees real-time location</i>"]
end
Driver --"1. POST /location<br/>(lat, lng)"---> API
API --"2. Write to cache<br/>(sub-ms)"---> Redis
API --"3. Enqueue<br/>(5-10ms)"---> Kafka
API --"4. Return 200 OK"---> Driver
Kafka --"5. Consume batch<br/>(every 10s)"---> Consumer
Consumer --"6. Coalesce + persist<br/>(50-70% reduction)"---> Cassandra
Rider --"7. Read location"---> Redis
Redis --"8. Return latest<br/>position"---> Rider
Uber’s location update system uses write-behind with Kafka as a durable queue between Redis cache and Cassandra database. Drivers send updates every 4 seconds, which are written to Redis (sub-millisecond) and queued in Kafka (5-10ms) before returning success. Kafka consumers batch and coalesce updates every 10 seconds, reducing Cassandra writes by 50-70%. Riders read from Redis for real-time positions, while Cassandra stores historical data for analytics.
Interview Essentials
Mid-Level
Explain the basic write-behind flow: write to cache, mark dirty, background worker flushes to database. Describe write coalescing and why it improves performance. Discuss the primary trade-off: fast writes vs. data loss risk. Be able to compare write-behind to write-through: write-behind is faster but less durable. Know one real-world example (Uber location updates) and why write-behind fits that use case.
Senior
Design a write-behind system with specific technology choices: Redis for cache, Kafka for durable queue, PostgreSQL for database. Explain batching strategies (time-based, size-based, hybrid) and when to use each. Discuss failure scenarios: cache crash, database unavailable, network partition. Describe mitigation strategies: write-ahead logging, replication, snapshots. Calculate the data loss window: if you batch every 5 seconds and the cache crashes, you lose up to 5 seconds of writes. Explain how to monitor write-behind systems: queue depth, flush latency, dirty entry count, data loss events.
Staff+
Architect a write-behind system that balances performance, durability, and cost at scale. Discuss consistency models: eventual consistency, read-your-writes consistency (achievable if reads go through cache), and how to handle consistency requirements across microservices. Design recovery procedures: how do you detect data loss, reconcile cache and database state, and replay lost writes? Explain capacity planning: if you’re doing 100K writes/sec with 10:1 coalescing, you need 10K database writes/sec capacity plus buffer for retry storms. Discuss organizational trade-offs: write-behind requires sophisticated operational practices—is your team ready? When would you recommend against write-behind despite performance benefits? Describe how to migrate from write-through to write-behind without downtime.
Common Interview Questions
How do you prevent data loss in write-behind? (WAL, replication, snapshots, durable queues)
What happens if the cache crashes before writes are flushed? (Data loss for unflushed writes; mitigation: WAL, replication)
How do you handle write conflicts in write-behind? (Version numbers, timestamps, last-write-wins)
When would you choose write-behind over write-through? (High-frequency writes, write latency critical, eventual consistency acceptable)
How do you monitor a write-behind system? (Queue depth, flush latency, dirty entry count, data loss events, cache-database lag)
Red Flags to Avoid
Not mentioning data loss risk—this is the primary trade-off and must be addressed
Claiming write-behind provides strong consistency—it doesn’t; eventual consistency is inherent
Ignoring failure scenarios—cache crashes, database unavailability, network partitions must be discussed
Not explaining write coalescing—this is the key performance benefit and why write-behind exists
Recommending write-behind for financial transactions or other data-loss-intolerant workloads
Key Takeaways
Write-behind decouples write acknowledgment from persistence, achieving 10-100x faster writes by writing to cache first and asynchronously flushing to the database later.
Write coalescing is the killer feature: multiple writes to the same key are batched into a single database operation, dramatically reducing database load in high-frequency update scenarios.
The fundamental trade-off is performance vs. durability: you get exceptional write speed but accept a window of vulnerability where data exists only in cache and can be lost if the cache fails.
Mitigation strategies (write-ahead logs, replication, durable queues) reduce but don’t eliminate data loss risk. Choose write-behind only when you can tolerate or recover from brief data loss.
Write-behind excels in write-heavy workloads with temporal locality (location updates, counters, session state) where only the final state matters and eventual consistency is acceptable.