Write-Ahead Log (WAL) Explained: Database Durability

Advanced · 10 min read · Updated 2026-02-11

After this topic, you will be able to:

  • Explain how write-ahead logging ensures durability and crash recovery
  • Describe the relationship between WAL and database checkpointing
  • Summarize WAL’s role in replication and point-in-time recovery

TL;DR

A write-ahead log (WAL) is a fundamental database reliability mechanism: every change is recorded in a sequential log, and the log is flushed to disk before the corresponding data-page changes become durable. If a crash occurs, the database replays the WAL during recovery to restore committed transactions. WAL is the foundation for crash recovery, replication, and point-in-time recovery in production databases such as PostgreSQL and MySQL, and in distributed systems at companies like Netflix and Stripe.

Cheat Sheet: WAL = append-only log written before data pages → enables crash recovery by replay → supports replication by shipping log entries → checkpointing truncates old WAL.

Mental Model

Think of WAL as a detailed journal you keep before making changes to your bank ledger. Before you update your account balance in the main ledger (which might be slow, involve reorganizing pages, or get interrupted), you write “withdrew $50 from checking” in your journal with a timestamp. If someone knocks over your coffee mid-update and ruins the ledger, you can rebuild the correct state by replaying your journal entries from the last known good checkpoint. The journal is fast to write (just append to the end), durable (you never erase entries until you’re certain the ledger is safe), and complete (every change is recorded). This is exactly how databases use WAL: the journal is the WAL file, the ledger is your data pages in memory, and the coffee spill is a server crash.

Why This Matters

WAL is the unsung hero of database reliability—it’s why your bank balance survives a power outage and why PostgreSQL can replicate to standby servers. In interviews, understanding WAL demonstrates you know how databases actually achieve ACID durability (not just theory) and why sequential writes outperform random I/O. Senior candidates must explain the tradeoff between WAL flush frequency and performance, while staff engineers should discuss how Netflix built a distributed WAL for their data platform to handle millions of writes per second. Misunderstanding WAL leads to dangerous assumptions about durability guarantees and replication lag. Every major database (PostgreSQL, MySQL InnoDB, MongoDB, Cassandra) uses some form of WAL, making this knowledge universally applicable.

Core Concept

Write-Ahead Logging (WAL) is a protocol that guarantees atomicity and durability by ensuring that all modifications are written to a persistent log before the actual data pages are updated. The core insight is that sequential writes to a log file are orders of magnitude faster than random writes to data pages scattered across disk. When a transaction commits, the database flushes the WAL to disk (an fsync operation), making the commit durable even if the in-memory data pages haven’t been written yet. This decouples the commit latency from the expensive work of updating data structures, checkpointing, and flushing dirty pages.

WAL entries contain enough information to redo operations during crash recovery. Each entry includes a Log Sequence Number (LSN), the transaction ID, the operation type (INSERT, UPDATE, DELETE), and the before/after images of affected data. The LSN is a monotonically increasing identifier that establishes a total order of operations. During normal operation, the database buffers WAL entries in memory and flushes them in batches to amortize fsync cost. The durability guarantee is simple: if a transaction’s commit record is flushed to WAL, that transaction will survive any crash.
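A record with these fields can be sketched as a small structure. The field names here are illustrative; real formats such as PostgreSQL's XLOG records are compact binary, but the essentials are the same: a monotonically increasing LSN, a transaction ID, the operation, and before/after images.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class WalRecord:
    lsn: int        # Log Sequence Number: establishes a total order
    tx_id: int      # owning transaction
    op: str         # "INSERT" | "UPDATE" | "DELETE" | "COMMIT"
    page: int       # affected data page
    before: object  # before-image (enables undo)
    after: object   # after-image (enables redo)

_next_lsn = count(1)

def make_record(tx_id, op, page=0, before=None, after=None):
    """Assign the next LSN to each new record as it is appended."""
    return WalRecord(next(_next_lsn), tx_id, op, page, before, after)

r1 = make_record(tx_id=7, op="UPDATE", page=3, before=1000, after=950)
r2 = make_record(tx_id=7, op="COMMIT")
assert r2.lsn > r1.lsn  # later records always carry higher LSNs
```

Because LSNs only ever increase, comparing two LSNs is enough to decide which change happened first, which is what recovery and replication rely on.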

Checkpointing and WAL Truncation Lifecycle

graph TB
    subgraph "Before Checkpoint"
        WAL1["WAL Segments<br/>LSN 0-1000<br/><i>16MB each</i>"]
        WAL2["WAL Segments<br/>LSN 1000-2000"]
        WAL3["WAL Segments<br/>LSN 2000-3000"]
        Dirty1["Dirty Pages<br/>in Buffer Pool<br/><i>Not yet on disk</i>"]
    end
    
    subgraph "Checkpoint Process"
        CP["Checkpoint Triggered<br/><i>Every 5 minutes</i>"]
        Flush["Flush all dirty pages<br/>to data files"]
        Record["Write checkpoint record<br/>LSN = 3000"]
    end
    
    subgraph "After Checkpoint"
        WAL1_Old["WAL Segments<br/>LSN 0-1000<br/><i>Can be deleted</i>"]
        WAL2_Keep["WAL Segments<br/>LSN 1000-2000<br/><i>Keep for replication</i>"]
        WAL3_Keep["WAL Segments<br/>LSN 2000-3000<br/><i>Active</i>"]
        DataDisk["Data Files<br/><i>All changes up to LSN 3000<br/>are durable</i>"]
    end
    
    WAL1 & WAL2 & WAL3 --> CP
    Dirty1 --> CP
    CP --> Flush
    Flush --> Record
    Record --> WAL1_Old
    Record --> WAL2_Keep
    Record --> WAL3_Keep
    Record --> DataDisk

Checkpointing writes all dirty pages to disk and records the LSN up to which all changes are durable. WAL segments before this checkpoint LSN can be safely deleted (or archived for PITR), preventing unbounded WAL growth. However, segments may be retained longer for replication lag or archive requirements. The tradeoff: frequent checkpoints mean smaller WAL but more I/O overhead; infrequent checkpoints mean larger WAL but faster recovery.
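The truncation rule can be written in a few lines: a segment is deletable only once the checkpoint, every replica, and the archiver are all past its end. This is a simplification over hypothetical (start_lsn, end_lsn) segment tuples; real systems track segment files, not individual LSNs.

```python
def removable_segments(segments, checkpoint_lsn, replica_lsn=None, archive_lsn=None):
    """A WAL segment may be deleted only when every consumer --
    crash recovery, replicas, the archiver -- no longer needs it."""
    keep_after = min(x for x in (checkpoint_lsn, replica_lsn, archive_lsn)
                     if x is not None)
    return [seg for seg in segments if seg[1] <= keep_after]

segments = [(0, 1000), (1000, 2000), (2000, 3000)]
# Checkpoint reached LSN 3000, but a replica has only replayed to 1500,
# so only the first segment is safe to delete:
assert removable_segments(segments, 3000, replica_lsn=1500) == [(0, 1000)]
```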

How It Works

The WAL protocol follows a strict sequence. First, when a transaction modifies data, the database generates WAL records describing the changes and appends them to an in-memory WAL buffer. These records are not yet durable. Second, when the transaction commits, the database performs a WAL flush (fsync) to ensure all WAL records up to the commit record are persisted to disk. Only after this flush succeeds does the database return success to the client. Third, asynchronously and independently, a background process writes modified data pages (dirty pages) from the buffer pool to their actual locations on disk. This is called checkpointing.

The key invariant is: a data page must never be written to disk before its corresponding WAL records are flushed. This is enforced by tracking the LSN of the last WAL record that modified each page. Before writing a dirty page, the database checks that all WAL up to that page’s LSN has been flushed. This ensures that during crash recovery, the WAL contains a complete history of all changes, even for pages that were partially written or not written at all.
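The invariant itself is a one-line check, sketched here with hypothetical names:

```python
def can_flush_page(page_lsn, flushed_wal_lsn):
    """WAL-before-data invariant: a dirty page may reach disk only after
    every WAL record that modified it (up to page_lsn) is durable."""
    return page_lsn <= flushed_wal_lsn

# Page last modified at LSN 1200, but WAL is only flushed through 1100:
assert not can_flush_page(page_lsn=1200, flushed_wal_lsn=1100)
# Once WAL reaches 1250, the page write is safe:
assert can_flush_page(page_lsn=1200, flushed_wal_lsn=1250)
```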

During crash recovery, the database scans the WAL from the last checkpoint forward, replaying all committed transactions. Uncommitted transactions are rolled back using the before-images in the WAL. The recovery process is idempotent—replaying the same WAL entry multiple times produces the same result—which is critical because crashes can occur during recovery itself.

WAL Write Flow: Transaction Commit Path

graph LR
    Client["Client Application"]
    TxnMgr["Transaction Manager"]
    WALBuffer["WAL Buffer<br/><i>In-Memory</i>"]
    WALFile["WAL File<br/><i>Disk</i>"]
    BufferPool["Buffer Pool<br/><i>In-Memory Data Pages</i>"]
    DataFiles["Data Files<br/><i>Disk</i>"]
    
    Client --"1. UPDATE accounts SET balance=950"--> TxnMgr
    TxnMgr --"2. Generate WAL record<br/>(LSN, TxID, old=1000, new=950)"--> WALBuffer
    TxnMgr --"3. Modify in-memory page<br/>(mark dirty)"--> BufferPool
    TxnMgr --"4. COMMIT"--> WALBuffer
    WALBuffer --"5. fsync() - DURABILITY POINT"--> WALFile
    WALFile --"6. Success"--> Client
    BufferPool -."7. Async checkpoint<br/>(background writer)".-> DataFiles

The WAL protocol ensures durability by flushing log records to disk (step 5) before returning success to the client. Data page writes happen asynchronously in the background, decoupling commit latency from expensive random I/O. The fsync at step 5 is the durability guarantee—everything before this point could be lost in a crash.

Key Principles

Principle: Sequential writes win. WAL converts random I/O into sequential, append-only writes, which are 10-100x faster on spinning disks and still meaningfully faster on SSDs because they reduce write amplification. This is why databases can commit thousands of transactions per second even when data pages are scattered across terabytes of storage. Example: PostgreSQL's WAL can sustain 50,000+ commits/sec on modern NVMe SSDs, because each commit only appends a few KB to the WAL file while the actual data page updates happen asynchronously in the background.

Principle: Durability is decoupled from performance. By separating the commit path (a fast WAL flush) from the data page write path (slow, batched checkpointing), WAL lets databases provide strong durability guarantees without sacrificing throughput. Commit latency is determined by WAL fsync time, not by how long it takes to update indexes or reorganize B-trees. Example: MySQL InnoDB uses group commit to batch multiple transactions' WAL flushes into a single fsync, achieving sub-millisecond commit latency even under heavy write load.

Principle: Checkpointing enables WAL truncation. Without checkpointing, the WAL would grow indefinitely. A checkpoint writes all dirty pages to disk and records the LSN up to which all changes are durable; WAL entries before this LSN can be safely deleted or archived. Checkpointing is expensive (it can stall writes temporarily), so databases tune checkpoint frequency to balance WAL size against recovery time. Example: PostgreSQL's checkpoint_timeout parameter (default 5 minutes) controls how often checkpoints occur. Longer intervals mean larger WAL files but less I/O overhead; shorter intervals mean faster recovery but more frequent write stalls.

Principle: WAL is the source of truth for replication. Replicas consume the primary's WAL stream to stay synchronized. This is more efficient than replaying SQL statements because WAL contains the exact physical changes. WAL-based replication is the foundation for streaming replication in PostgreSQL, binlog replication in MySQL, and oplog replication in MongoDB. Example: Netflix's distributed WAL system ships log entries to multiple datacenters, enabling cross-region replication with exactly-once semantics and the ability to replay events for analytics pipelines.

Principle: Point-in-time recovery (PITR) via WAL archiving. By archiving WAL files to durable storage (S3, GCS), databases can restore to any point in time by replaying archived WAL on top of a base backup. This is critical for disaster recovery and compliance requirements. Example: Stripe archives PostgreSQL WAL to S3 every 60 seconds, allowing them to restore any production database to within one minute of any point in the last 30 days.

Worked Example: The Life of a Write

Let’s walk through a complete write operation with WAL:

1. A client issues UPDATE accounts SET balance = balance - 50 WHERE id = 123.
2. The database acquires locks, reads the current row into memory, and computes the new balance.
3. Before modifying the in-memory page, the database generates a WAL record containing the transaction ID, LSN, operation type (UPDATE), table/page identifier, and both the old value (balance = 1000) and new value (balance = 950).
4. This WAL record is appended to the in-memory WAL buffer.
5. The database modifies the in-memory data page, marking it as dirty.
6. When the transaction commits, the database flushes all WAL records up to and including the commit record to disk via fsync. This is the durability point.
7. The database returns success to the client.
8. Asynchronously, a background writer process eventually writes the dirty page to its location on disk; this happens after the commit has already succeeded.
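A toy version of this commit path, with in-memory lists standing in for the WAL file and data files, and the fsync simulated by copying the buffer (all names hypothetical):

```python
wal_buffer, wal_file = [], []     # in-memory WAL buffer vs. WAL "on disk"
buffer_pool, data_files = {}, {}  # dirty pages vs. checkpointed pages

def update(tx_id, key, new_value):
    """Generate the WAL record first, then modify the in-memory page."""
    old = buffer_pool.get(key, data_files.get(key))
    wal_buffer.append((tx_id, "UPDATE", key, old, new_value))  # WAL first
    buffer_pool[key] = new_value                               # then dirty page

def commit(tx_id):
    """Flush all buffered WAL through the commit record: the durability point."""
    wal_buffer.append((tx_id, "COMMIT", None, None, None))
    wal_file.extend(wal_buffer)   # simulated fsync()
    wal_buffer.clear()

update(tx_id=1, key="acct:123", new_value=950)
commit(tx_id=1)
# The commit is durable even though the page never reached the data files:
assert "acct:123" not in data_files
assert wal_file[-1][1] == "COMMIT"
```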

During crash recovery, the database starts from the last checkpoint. It scans the WAL forward, applying each record in order. For our example, it would re-apply the UPDATE operation, setting balance = 950 for account 123. If the transaction had not committed (no commit record in WAL), the recovery process would skip it or explicitly roll it back using the old value. The recovery process continues until it reaches the end of the WAL, at which point the database is consistent and ready to accept new transactions.
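The redo pass can be sketched as a single forward scan. This is a simplified redo-only recovery: uncommitted work is skipped rather than explicitly undone, which suffices as long as no uncommitted pages reached disk. The record format is hypothetical.

```python
def recover(wal_file):
    """Replay the WAL forward, applying only transactions that have a
    COMMIT record. The pass is idempotent: re-running it yields the
    same state, which matters if a crash interrupts recovery itself."""
    committed = {tx for tx, op, *_ in wal_file if op == "COMMIT"}
    state = {}
    for tx, op, key, old, new in wal_file:
        if op == "UPDATE" and tx in committed:
            state[key] = new
    return state

wal = [
    (1, "UPDATE", "acct:123", 1000, 950),
    (1, "COMMIT", None, None, None),
    (2, "UPDATE", "acct:456", 500, 400),   # tx 2 never committed
]
assert recover(wal) == {"acct:123": 950}   # tx 2's change is discarded
assert recover(wal) == recover(wal)        # idempotent
```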

Checkpointing works by periodically flushing all dirty pages to disk and writing a checkpoint record to the WAL. This checkpoint record contains the LSN up to which all changes are guaranteed to be on disk. During recovery, the database only needs to replay WAL from the last checkpoint forward, not from the beginning of time. This bounds recovery time. However, checkpointing is I/O-intensive and can cause temporary write stalls, so databases use techniques like incremental checkpointing (spreading writes over time) and double buffering to minimize impact.

Crash Recovery Process: Replaying WAL from Checkpoint

sequenceDiagram
    participant DB as Database Process
    participant WAL as WAL File
    participant Checkpoint as Checkpoint Record
    participant DataPages as Data Pages
    participant Recovery as Recovery Manager
    
    Note over DB: Server crashes during operation
    DB->>Recovery: Start crash recovery
    Recovery->>Checkpoint: Read last checkpoint LSN
    Note over Checkpoint: LSN = 1000<br/>All changes before LSN 1000<br/>are on disk
    Recovery->>WAL: Scan WAL from LSN 1000
    
    loop For each WAL record
        WAL->>Recovery: Read record (LSN 1001-1500)
        alt Has commit record
            Recovery->>DataPages: REDO: Apply changes
            Note over DataPages: Restore committed transaction
        else No commit record
            Recovery->>DataPages: UNDO: Rollback using before-image
            Note over DataPages: Discard uncommitted transaction
        end
    end
    
    Recovery->>DB: Recovery complete
    Note over DB: Database consistent and ready

During crash recovery, the database replays WAL from the last checkpoint forward. Committed transactions (those with commit records in WAL) are redone by applying their changes. Uncommitted transactions are rolled back using before-images. This process is idempotent—replaying the same record multiple times produces the same result, which is critical if crashes occur during recovery itself.

Common Misconceptions

Misconception: “WAL makes databases slower because you write everything twice.” Why it’s wrong: this ignores the performance gap between sequential and random I/O. Writing to WAL is a fast sequential append, while updating data pages involves random seeks, index updates, and page reorganization. The truth: WAL makes databases faster by converting expensive random writes into cheap sequential writes. The data page writes happen asynchronously in the background, so they don’t block commits. Benchmarks show WAL-enabled databases are 5-10x faster than naive implementations that flush data pages synchronously on commit.

Misconception: “fsync on every commit kills performance, so you should disable it.” Why it’s wrong: relaxing the commit flush (e.g., PostgreSQL’s synchronous_commit = off, or the far more dangerous fsync = off) trades durability for throughput. If the server crashes, you lose recent commits. This is acceptable for some workloads (analytics, caching) but catastrophic for financial systems. The truth: modern databases use group commit to batch multiple transactions’ fsyncs into a single system call, achieving both durability and high throughput. PostgreSQL can commit 10,000+ transactions/sec with fsync enabled using group commit. The right answer is tuning, not disabling.

Misconception: “WAL only matters for crash recovery; it’s not relevant to normal operation.” Why it’s wrong: WAL is the foundation for replication, point-in-time recovery, and even query processing in some systems. It’s not just a recovery mechanism; it’s the system’s memory of what happened. The truth: WAL is continuously used during normal operation. Replicas consume WAL to stay synchronized, backup systems archive WAL for PITR, and some databases (like PostgreSQL with logical replication) decode WAL to extract change data capture (CDC) events for downstream consumers.

Misconception: “Once a checkpoint completes, you can delete all WAL before it.” Why it’s wrong: that holds only if you don’t care about replication or PITR. Replicas may still need old WAL to catch up, and archived WAL is required for point-in-time recovery. The truth: databases maintain WAL retention policies based on replication lag and archive requirements. PostgreSQL’s wal_keep_size parameter ensures enough WAL is retained for replicas. Archived WAL is kept separately for PITR, often for days or weeks.

Misconception: “WAL is a database-specific implementation detail.” Why it’s wrong: WAL is a universal pattern used across databases, distributed systems, and even file systems. Understanding WAL teaches you a fundamental reliability technique applicable everywhere. The truth: PostgreSQL, MySQL InnoDB, MongoDB, Cassandra, Kafka (its commit log), and even ext4 (its journal) all use WAL-like mechanisms. Netflix built a custom distributed WAL for their data platform. WAL is a first-principles solution to the durability problem.

Real-World Usage

PostgreSQL’s WAL is the gold standard for understanding this pattern. Every write generates WAL records stored in 16MB segment files in pg_wal/. The wal_level parameter controls how much information is logged (minimal for crash recovery, replica for streaming replication, logical for CDC). PostgreSQL uses group commit to batch fsyncs, achieving 50,000+ commits/sec on modern hardware. Streaming replication works by shipping WAL segments to standby servers, which replay them to stay synchronized. Point-in-time recovery uses pg_basebackup plus archived WAL to restore to any moment.
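These knobs live in postgresql.conf. A sketch with illustrative values, not tuning recommendations:

```ini
# postgresql.conf -- WAL-related settings (illustrative values)
wal_level = replica              # minimal | replica | logical
checkpoint_timeout = 5min        # maximum time between checkpoints
max_wal_size = 1GB               # soft cap that also triggers a checkpoint
wal_keep_size = 512MB            # retain WAL for lagging replicas
archive_mode = on
archive_command = 'cp %p /archive/%f'   # ship each segment for PITR
synchronous_commit = on          # flush WAL before acknowledging commit
```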

MySQL InnoDB uses a similar approach with its redo log (WAL) and undo log (for rollback). The innodb_flush_log_at_trx_commit parameter controls durability: 1 (fsync on every commit, safest), 2 (fsync every second, faster but risky), or 0 (no fsync, fastest but dangerous). InnoDB’s doublewrite buffer protects against partial page writes during crashes, complementing WAL.

Netflix built a distributed WAL system called “WAL” (yes, really) to handle millions of events per second across their data platform. Their WAL is partitioned across multiple nodes, uses Kafka-like replication for durability, and supports exactly-once semantics. This powers their real-time analytics, recommendation systems, and operational dashboards. The key insight was that a centralized database WAL couldn’t scale to Netflix’s write volume, so they built a distributed, horizontally scalable WAL as a first-class service.

Stripe uses PostgreSQL’s WAL archiving to S3 for compliance and disaster recovery. Every 60 seconds, WAL segments are uploaded to S3 with versioning enabled. This allows Stripe to restore any production database to within one minute of any point in the last 30 days, meeting their RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. They also use logical replication (decoding WAL) to stream changes to their data warehouse for analytics.

WAL-Based Streaming Replication Architecture

graph LR
    subgraph "Primary Database"
        App["Application<br/><i>Write Traffic</i>"]
        Primary["Primary Server<br/><i>PostgreSQL</i>"]
        WALWriter["WAL Writer<br/>Process"]
        WALFiles["WAL Segments<br/><i>pg_wal/</i>"]
    end
    
    subgraph "Standby 1 - Same Region"
        Standby1["Standby Server<br/><i>Streaming Replica</i>"]
        WALReceiver1["WAL Receiver<br/>Process"]
        Replay1["Startup Process<br/><i>Replay WAL</i>"]
    end
    
    subgraph "Standby 2 - Remote Region"
        Standby2["Standby Server<br/><i>Async Replica</i>"]
        WALReceiver2["WAL Receiver<br/>Process"]
        Replay2["Startup Process<br/><i>Replay WAL</i>"]
    end
    
    Archive["WAL Archive<br/><i>S3 / GCS</i><br/>Point-in-Time Recovery"]
    
    App --"1. Write transactions"--> Primary
    Primary --"2. Generate WAL"--> WALWriter
    WALWriter --"3. Write to disk"--> WALFiles
    WALFiles --"4. Stream WAL<br/>(TCP connection)"--> WALReceiver1
    WALFiles --"5. Stream WAL<br/>(async, cross-region)"--> WALReceiver2
    WALFiles -."6. Archive every 60s".-> Archive
    WALReceiver1 --> Replay1
    WALReceiver2 --> Replay2
    Replay1 --"7. Apply changes"--> Standby1
    Replay2 --"8. Apply changes"--> Standby2

PostgreSQL streaming replication ships WAL from the primary to standby servers over TCP connections. Standby servers replay WAL entries to stay synchronized, enabling read scaling and high availability. Synchronous replication waits for standby acknowledgment before commit (strong consistency), while asynchronous replication doesn’t wait (better performance, eventual consistency). WAL archiving to S3 enables point-in-time recovery independent of replica availability.
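A standby’s receive-and-replay loop can be modeled as a consumer that applies records in LSN order and tracks its replay position. This is a toy model with a hypothetical record format; real standbys speak PostgreSQL’s streaming replication protocol and report their replay LSN back to the primary.

```python
class Standby:
    """Applies streamed WAL records in LSN order, tracking progress."""
    def __init__(self):
        self.state = {}
        self.replay_lsn = 0   # the primary compares this to its own
                              # flush LSN to measure replication lag

    def receive(self, record):
        lsn, op, key, value = record
        assert lsn > self.replay_lsn, "WAL must be applied in LSN order"
        if op == "UPDATE":
            self.state[key] = value
        self.replay_lsn = lsn

primary_stream = [
    (101, "UPDATE", "k", 1),
    (102, "UPDATE", "k", 2),
    (103, "COMMIT", None, None),
]
standby = Standby()
for rec in primary_stream:
    standby.receive(rec)
assert standby.state == {"k": 2} and standby.replay_lsn == 103
```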


Interview Essentials

Mid-Level

Explain the basic WAL protocol: write log before data pages, flush on commit, replay during recovery. Describe why sequential writes are faster than random writes. Walk through a simple crash recovery scenario. Understand the relationship between WAL and durability. Know that checkpointing allows WAL truncation. Be able to explain why fsync is necessary and what happens if you disable it.

Senior

Discuss the tradeoffs between WAL flush frequency and performance (group commit, batching). Explain how WAL enables replication and PITR. Describe checkpointing strategies (incremental, double buffering) and their impact on recovery time. Calculate WAL generation rate given transaction volume and average record size. Understand how WAL interacts with buffer pool management (dirty pages, LSN tracking). Explain the difference between physical WAL (page-level) and logical WAL (row-level) and when each is appropriate.
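The WAL generation rate estimate mentioned above is simple arithmetic; a sketch with illustrative workload numbers:

```python
# Assumed workload figures -- substitute your own measurements.
commits_per_sec = 5_000
records_per_txn = 4          # e.g. 3 row changes + 1 commit record
bytes_per_record = 200       # payload + header; workload-dependent

wal_bytes_per_sec = commits_per_sec * records_per_txn * bytes_per_record
segment_size = 16 * 1024 * 1024   # PostgreSQL's default 16MB segments
seconds_per_segment = segment_size / wal_bytes_per_sec

print(f"{wal_bytes_per_sec / 1e6:.0f} MB/s of WAL")          # 4 MB/s
print(f"one segment every {seconds_per_segment:.1f}s")       # ~4.2s
```

From there you can size wal_keep_size for a given tolerable replica lag, or estimate archive bandwidth for PITR.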

Group Commit Optimization: Batching fsyncs for Throughput

sequenceDiagram
    participant T1 as Transaction 1
    participant T2 as Transaction 2
    participant T3 as Transaction 3
    participant WALBuffer as WAL Buffer
    participant Leader as Group Commit Leader
    participant Disk as Disk (fsync)
    
    Note over T1,T3: Multiple transactions commit concurrently
    T1->>WALBuffer: Append commit record (LSN 100)
    T2->>WALBuffer: Append commit record (LSN 101)
    T3->>WALBuffer: Append commit record (LSN 102)
    
    T1->>Leader: Request fsync
    Note over Leader: T1 becomes group leader
    T2->>Leader: Request fsync (wait)
    T3->>Leader: Request fsync (wait)
    
    Leader->>Disk: Single fsync(LSN 100-102)
    Note over Disk: One disk operation<br/>for three transactions
    Disk->>Leader: fsync complete
    
    Leader->>T1: Commit success
    Leader->>T2: Commit success
    Leader->>T3: Commit success
    
    Note over T1,T3: 3 commits with 1 fsync<br/>Throughput: 3x<br/>Latency: ~same

Group commit batches multiple transactions’ WAL flushes into a single fsync system call, dramatically improving throughput without sacrificing durability. The first transaction to request fsync becomes the group leader and waits briefly (microseconds) for other concurrent commits. This technique allows PostgreSQL and MySQL to achieve 10,000+ commits/sec with fsync enabled. The tradeoff is slightly higher latency for transactions that wait on the current flush, but aggregate throughput improves dramatically under concurrent load.
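A deterministic sketch of the leader-based batching shown above (hypothetical class and method names; real engines coordinate waiters with condition variables and tuned wait windows):

```python
class WalLog:
    def __init__(self):
        self.buffer = []        # pending commit records, not yet durable
        self.flushed_lsn = 0
        self.fsync_calls = 0
        self.next_lsn = 0

    def append_commit(self, txn):
        """Stage a commit record in the WAL buffer and return its LSN."""
        self.next_lsn += 1
        self.buffer.append((self.next_lsn, txn))
        return self.next_lsn

    def ensure_flushed(self, lsn):
        """Group commit: if a prior flush already covered this LSN,
        return without another fsync; otherwise act as leader and
        flush the entire pending batch in one go."""
        if self.flushed_lsn >= lsn:
            return
        batch_high = self.buffer[-1][0]  # leader flushes everything staged
        self.buffer.clear()
        self.fsync_calls += 1            # one fsync() for the whole batch
        self.flushed_lsn = batch_high

log = WalLog()
lsns = [log.append_commit(t) for t in ("T1", "T2", "T3")]  # concurrent commits
log.ensure_flushed(lsns[0])   # T1 becomes leader; one fsync covers LSN 1-3
log.ensure_flushed(lsns[1])   # already durable -- no extra fsync
log.ensure_flushed(lsns[2])
assert log.fsync_calls == 1 and log.flushed_lsn == 3
```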

Staff+

Design a distributed WAL system for a multi-datacenter deployment (partitioning, replication, consistency guarantees). Discuss how Netflix’s distributed WAL handles millions of writes/sec. Explain how to optimize WAL for NVMe SSDs (alignment, parallelism). Describe advanced recovery techniques (parallel replay, incremental recovery). Understand the interaction between WAL and consensus protocols (Raft, Paxos) in distributed databases. Discuss how to implement exactly-once semantics using WAL. Explain how to build change data capture (CDC) by decoding WAL.

Common Interview Questions

How does WAL ensure durability? (Answer: By flushing commit records to disk before acknowledging the transaction)

Why is WAL faster than writing data pages directly? (Answer: Sequential writes vs. random writes)

What happens during crash recovery? (Answer: Replay WAL from last checkpoint, redo committed transactions, rollback uncommitted ones)

How does checkpointing work and why is it necessary? (Answer: Flushes dirty pages, allows WAL truncation, bounds recovery time)

How does WAL enable replication? (Answer: Replicas consume WAL stream and replay entries to stay synchronized)

Red Flags to Avoid

Confusing WAL with transaction logs or audit logs (they’re related but serve different purposes)

Not understanding the performance implications of fsync and group commit

Thinking WAL is only for crash recovery (it’s also for replication, PITR, CDC)

Ignoring the tradeoff between checkpoint frequency and recovery time

Not knowing how real databases (PostgreSQL, MySQL) implement WAL


Key Takeaways

WAL ensures durability by writing changes to a sequential log before modifying data pages, enabling fast commits and reliable crash recovery through replay.

Sequential WAL writes are 10-100x faster than random data page updates, which is why WAL improves performance rather than hurting it.

Checkpointing flushes dirty pages to disk and allows WAL truncation, bounding recovery time but requiring careful tuning to avoid write stalls.

WAL is the foundation for streaming replication (ship log to replicas), point-in-time recovery (archive log for restore), and change data capture (decode log for events).

Real systems like PostgreSQL, MySQL, and Netflix’s distributed WAL demonstrate that WAL is a universal pattern for reliability at scale, not just a database implementation detail.