Key-Value Store: Redis, DynamoDB Design Guide

Intermediate · 12 min read · Updated 2026-02-11

After this topic, you will be able to:

  • Analyze LSM-tree architecture and its write optimization benefits
  • Compare memtable, SSTable, and compaction strategies in key-value stores
  • Differentiate key-value store use cases from other NoSQL types

TL;DR

Key-value stores are the simplest NoSQL databases, optimizing for fast writes and lookups using a dictionary-like interface. Most production systems use LSM-tree architectures (memtable + SSTables + compaction) to achieve write throughput of 100K+ ops/sec while trading off read amplification. They power session stores, caching layers, and metadata systems at Netflix, Amazon, and Pinterest.

Cheat Sheet: Near-O(1) point lookups • LSM-tree = memtable → SSTable → compaction • Write-optimized (append-only) • Read amplification mitigated by bloom filters • Eventual consistency typical • Use for: sessions, counters, user preferences.

Background

Key-value stores emerged in the early 2000s as companies like Amazon and Google needed databases that could scale horizontally while maintaining predictable low latency. Traditional relational databases struggled with write-heavy workloads at internet scale—every write required updating B-tree indexes, acquiring locks, and maintaining ACID guarantees across multiple tables. Amazon’s Dynamo paper (2007) introduced the key insight: if you only need to store and retrieve data by a single key, you can eliminate most database complexity and achieve massive throughput.

The core problem key-value stores solve is write scalability. When Netflix needs to track viewing progress for 200 million users, each generating dozens of position updates per viewing session, they need a system that can handle millions of writes per second without degrading. Relational databases would bottleneck on index maintenance and transaction overhead. Key-value stores sidestep this by treating data as opaque blobs indexed by a single key, allowing them to optimize the entire storage engine for sequential writes.

Modern key-value stores like RocksDB (Facebook), LevelDB (Google), and Cassandra’s storage engine all converged on LSM-tree (Log-Structured Merge-tree) architecture because it transforms random writes into sequential disk I/O—the single most important optimization for write throughput. This architectural choice defines everything about how key-value stores behave in production.

Architecture

A production key-value store has three logical layers that work together to balance write performance, read performance, and storage efficiency.

Layer 1: Write Path (Memtable) All writes first go to an in-memory data structure called a memtable—typically a skip list or red-black tree that maintains keys in sorted order. Simultaneously, writes are appended to a write-ahead log (WAL) on disk for durability. This dual-write approach gives you both speed (memory) and safety (disk). The memtable acts as a write buffer, absorbing bursts of traffic without touching disk. When the memtable reaches a size threshold (typically 64-128MB), it’s frozen and flushed to disk as an immutable SSTable.
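The dual write described above can be sketched in a few lines of Python. This is a minimal illustration, not any real engine's API: `WriteBuffer` and its parameters are invented names, and a plain dict stands in for the sorted skip list a real memtable uses.

```python
import json
import tempfile

class WriteBuffer:
    """Sketch of the LSM write path: WAL append for durability, then a
    memtable insert for speed. A dict stands in for the skip list; real
    engines keep the memtable sorted so flushes are a linear scan."""

    def __init__(self, wal_path, flush_threshold=64 * 1024 * 1024):
        self.wal = open(wal_path, "a")
        self.memtable = {}
        self.bytes_used = 0
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # 1. Durability: append the operation to the WAL and sync.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. Speed: insert into the in-memory memtable and acknowledge.
        self.memtable[key] = value
        self.bytes_used += len(key) + len(value)
        return self.bytes_used >= self.flush_threshold  # time to freeze?

wal_path = tempfile.mkstemp()[1]
buf = WriteBuffer(wal_path, flush_threshold=1024)
full = buf.put("user:42", "session-data")
```

Note that `put` returns immediately after the memtable insert: no index pages are updated and no random disk seeks occur, which is the core of the write-throughput advantage.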

Layer 2: Storage Layer (SSTables) SSTables (Sorted String Tables) are immutable, sorted files on disk. Each SSTable contains key-value pairs in sorted order, plus a sparse index and bloom filter. Immutability is the key architectural decision here—once written, an SSTable never changes. Updates and deletes don’t modify existing SSTables; instead, they write new entries with higher timestamps. This append-only design eliminates the need for locks and enables extremely high write throughput. The trade-off is that you accumulate multiple versions of the same key across different SSTables, creating read amplification.

Layer 3: Compaction Engine Compaction runs in the background, merging SSTables to reclaim space and improve read performance. There are two main strategies: size-tiered compaction (merge SSTables of similar size) and leveled compaction (organize SSTables into levels, with each level 10x larger than the previous). RocksDB uses leveled compaction by default because it bounds read amplification—you only need to check one SSTable per level. The compaction process reads multiple SSTables, merges their sorted contents (keeping only the latest version of each key), and writes new SSTables. This is I/O intensive but happens asynchronously without blocking reads or writes.
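The merge step at the heart of compaction is a k-way merge over sorted runs. Here is a hedged sketch (hypothetical `compact` function, in-memory lists standing in for on-disk SSTables) that keeps only the newest version of each key and drops tombstones:

```python
import heapq

def compact(sstables):
    """Merge sorted SSTables into one, keeping only the newest version
    of each key and dropping tombstones (value None). `sstables` is a
    list of sorted [(key, value)] runs, ordered newest first."""
    # Tag each record with its run's age so ties resolve newest-first.
    streams = [[(key, age, value) for key, value in run]
               for age, run in enumerate(sstables)]
    merged, last_key = [], object()      # sentinel matches no real key
    for key, age, value in heapq.merge(*streams):
        if key != last_key:              # first hit = lowest age = newest
            if value is not None:        # drop deletes during the merge
                merged.append((key, value))
            last_key = key
    return merged

new_run = compact([
    [("a", "v2"), ("c", None)],                # newer run: c was deleted
    [("a", "v1"), ("b", "v1"), ("c", "v1")],   # older run
])
# new_run == [("a", "v2"), ("b", "v1")]
```

Because every input run is already sorted, the merge is a single sequential pass, which is why compaction can be I/O heavy yet still stream-friendly.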

Coordination Layer In distributed deployments, a coordination layer (often using consistent hashing) determines which nodes own which key ranges. See Consistent Hashing for partition assignment details. Each node runs its own LSM-tree storage engine independently.

LSM-Tree Architecture: Three-Layer Design

graph TB
    subgraph "Layer 1: Write Path (In-Memory)"
        Client["Client Application"]
        WAL["Write-Ahead Log<br/><i>Sequential disk writes</i>"]
        Memtable["Memtable<br/><i>Skip list/Red-black tree<br/>64-128MB</i>"]
    end
    
    subgraph "Layer 2: Storage Layer (Disk)"
        SST1["SSTable L0-1<br/><i>Immutable, sorted<br/>+ bloom filter</i>"]
        SST2["SSTable L0-2<br/><i>Immutable, sorted<br/>+ bloom filter</i>"]
        SST3["SSTable L1<br/><i>Non-overlapping ranges</i>"]
        SST4["SSTable L2<br/><i>10x larger than L1</i>"]
    end
    
    subgraph "Layer 3: Compaction Engine"
        Compaction["Background Compaction<br/><i>Merge + deduplicate<br/>Reclaim space</i>"]
    end
    
    Client --"1. Append (durability)"--> WAL
    Client --"2. put(key, value)"--> Memtable
    Memtable --"3. Flush when full<br/>(frozen memtable)"--> SST1
    Memtable --"4. Flush when full"--> SST2
    SST1 & SST2 --"5. Merge overlapping"--> Compaction
    Compaction --"6. Write merged data"--> SST3
    SST3 --"7. Level full, compact"--> Compaction
    Compaction --"8. Write to next level"--> SST4

The three-layer LSM-tree architecture showing how writes flow from in-memory memtable to immutable SSTables on disk, with background compaction merging files to reclaim space. This design achieves high write throughput by converting random writes into sequential I/O.

Internals

Understanding LSM-tree internals is critical for reasoning about key-value store behavior in production. Let’s trace a write and read through the system.

Write Path Mechanics When a client calls put(key, value), the system performs two operations atomically: (1) append the operation to the WAL on disk, and (2) insert the key-value pair into the memtable. The WAL append is a sequential write, which on modern SSDs achieves 100K+ writes/sec. The memtable insertion is O(log n) for the skip list structure, but since it’s in memory, this completes in microseconds. The write returns immediately—no disk seeks, no index updates, no locks. This is why key-value stores achieve 10-100x higher write throughput than B-tree databases.

When the memtable reaches capacity, the system freezes it (making it immutable) and starts a new memtable for incoming writes. A background thread flushes the frozen memtable to disk as an SSTable. This flush is also sequential I/O—the memtable is already sorted, so writing it to disk is a linear scan. The resulting SSTable includes: (1) the sorted key-value pairs, (2) a sparse index (every Nth key’s offset), and (3) a bloom filter for fast negative lookups.
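The flush layout described above can be sketched as follows. This is an illustrative simplification (`flush_memtable` and `sparse_every` are invented names): it emits sorted records plus a sparse index, and omits the bloom filter and checksums a real SSTable would carry.

```python
def flush_memtable(memtable, sparse_every=4):
    """Sketch of an SSTable flush: write the frozen memtable as sorted
    records, recording every Nth key's byte offset in a sparse index
    that later serves as a binary-search entry point."""
    records, sparse_index, offset = [], {}, 0
    for i, (key, value) in enumerate(sorted(memtable.items())):
        if i % sparse_every == 0:
            sparse_index[key] = offset   # seek target for point reads
        line = f"{key}\t{value}\n".encode()
        records.append(line)
        offset += len(line)
    return b"".join(records), sparse_index

data, index = flush_memtable({"b": "2", "d": "4", "a": "1", "c": "3", "e": "5"})
# index == {"a": 0, "e": 16}: only every 4th key is indexed
```

Keeping the index sparse is deliberate: it stays small enough to hold in memory while still narrowing any lookup to one block of the file.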

Read Path Mechanics Reads are more complex because data might exist in multiple places. When a client calls get(key), the system searches in order: (1) current memtable, (2) frozen memtables being flushed, (3) SSTables from newest to oldest. The memtable check is O(log n) and fast. For SSTables, the system first checks the bloom filter—a probabilistic data structure that can definitively say “this key is NOT in this SSTable” with zero false negatives. If the bloom filter says the key might exist, the system uses the sparse index to narrow down the location, then performs a binary search within that block.
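A bloom filter of the kind described above fits in a few lines. This is a minimal sketch, not a production design: `BloomFilter` is an invented class, it spends one byte per bit for readability, and it derives its k probe positions from salted SHA-256 hashes.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into a bit array. False
    positives are possible ("maybe here"); false negatives are not,
    so a miss lets the reader skip the SSTable entirely."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _probes(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means "definitely absent": no disk seek needed.
        return all(self.bits[p] for p in self._probes(key))

bf = BloomFilter()
bf.add("user:42")
```

The false-positive rate is tunable via `num_bits` and `num_hashes`, which is exactly the knob mentioned later for read-heavy tuning.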

This is where read amplification occurs. In the worst case, you might check the bloom filters of dozens of SSTables before finding your key (or confirming it doesn’t exist). If you have 50 SSTables and your key is in the oldest one, you’ve done 50 bloom filter checks and potentially multiple disk seeks. This is the fundamental trade-off of LSM-trees: write amplification (compaction rewrites data multiple times) versus read amplification (reads check multiple files).
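To make the amplification concrete, here is a sketch of the lookup order (hypothetical `get` function; a Python set stands in for each SSTable's bloom filter, though unlike a set a real filter can also answer "maybe" for absent keys):

```python
def get(key, memtable, sstables):
    """Read path sketch: memtable first, then SSTables newest to oldest.
    Returns (value, files_read) so the read amplification is visible.
    Each SSTable is modeled as (bloom_keys, records)."""
    if key in memtable:
        return memtable[key], 0          # hot path: no disk touched
    files_read = 0
    for bloom_keys, records in sstables:
        if key not in bloom_keys:        # "definitely absent": skip file
            continue
        files_read += 1                  # bloom said maybe: pay a read
        if key in records:
            return records[key], files_read
    return None, files_read

sstables = [
    ({"x"}, {"x": "new"}),               # newest SSTable wins
    ({"x", "y"}, {"x": "old", "y": "1"}),
]
```

With 50 SSTables, `files_read` in the worst case approaches 50; bloom filters keep the common case far lower by turning most candidates into zero-cost skips.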

Compaction Strategies Leveled compaction organizes SSTables into levels L0, L1, L2, etc. L0 contains newly flushed SSTables (potentially overlapping key ranges). L1 contains non-overlapping SSTables, each covering a distinct key range. Each level is 10x larger than the previous (L1 = 100MB, L2 = 1GB, L3 = 10GB). When L1 fills up, compaction selects an SSTable from L1 and all overlapping SSTables from L2, merges them, and writes the result to L2. This bounds read amplification to O(log n) levels, but creates write amplification—each key might be rewritten 10+ times as it moves through levels.
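The level sizing above follows directly from the fanout. A small sketch (invented `level_capacities` helper) shows why total capacity is dominated by the last level while a point read consults at most one SSTable per level:

```python
def level_capacities(l1_bytes, fanout=10, num_levels=3):
    """Leveled-compaction sizing: each level holds `fanout`x the
    previous one, so capacity grows geometrically while the number
    of levels (and thus per-read SSTable checks) grows only log n."""
    return [l1_bytes * fanout ** i for i in range(num_levels)]

MB = 1024 ** 2
caps = level_capacities(100 * MB)
# caps == [100MB, 1000MB (~1GB), 10000MB (~10GB)], matching L1-L3 above
```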

Size-tiered compaction is simpler: when you have N SSTables of similar size, merge them into one larger SSTable. This reduces write amplification (less rewriting) but increases read amplification (more files to check). RocksDB at Facebook uses leveled compaction because their workloads are read-heavy; Cassandra offers both strategies as tuning options.

Write Path: Memtable to SSTable Flush

sequenceDiagram
    participant Client
    participant WAL as Write-Ahead Log
    participant Memtable as Memtable<br/>(Skip List)
    participant BG as Background Thread
    participant Disk as SSTable on Disk
    
    Client->>WAL: 1. Append operation<br/>(sequential write, sync)
    Note over WAL: Durability: survives crash
    Client->>Memtable: 2. Insert key-value<br/>(O(log n) in memory)
    Note over Memtable: Fast: microseconds
    Memtable-->>Client: 3. Write complete<br/>(p99 < 1ms)
    
    Note over Memtable: Memtable reaches 64-128MB
    Memtable->>Memtable: 4. Freeze (immutable)
    Memtable->>BG: 5. Signal flush needed
    
    BG->>Disk: 6. Write sorted data<br/>(sequential I/O)
    BG->>Disk: 7. Write sparse index<br/>(every Nth key offset)
    BG->>Disk: 8. Write bloom filter<br/>(probabilistic lookup)
    Note over Disk: SSTable: immutable,<br/>never modified
    BG->>WAL: 9. Truncate old entries<br/>(data now durable on disk)

The write path showing dual-write to WAL (durability) and memtable (speed), followed by background flush to SSTable. Writes return immediately after memtable insertion, achieving 100K+ ops/sec through sequential I/O and in-memory operations.

Read Path: Multi-Level Lookup with Bloom Filters

graph TB
    Client["Client: get(key)"] --> Memtable["1. Check Memtable<br/><i>O(log n), in-memory</i>"]
    Memtable -->|Found| Return1["Return value<br/><i>p99 < 1ms</i>"]
    Memtable -->|Not found| Frozen["2. Check Frozen Memtables<br/><i>Being flushed</i>"]
    Frozen -->|Found| Return2["Return value"]
    Frozen -->|Not found| L0
    
    subgraph "SSTable Levels (Newest to Oldest)"
        L0["3. Level 0 SSTables<br/><i>Check bloom filters</i>"]
        L0 --> BF0{"Bloom filter:<br/>might exist?"}
        BF0 -->|No| L1
        BF0 -->|Maybe| Read0["Binary search<br/>in SSTable"]
        Read0 -->|Found| Return3["Return value"]
        Read0 -->|Not found| L1
        
        L1["4. Level 1 SSTables<br/><i>Non-overlapping ranges</i>"]
        L1 --> BF1{"Bloom filter:<br/>might exist?"}
        BF1 -->|No| L2
        BF1 -->|Maybe| Read1["Binary search<br/>in SSTable"]
        Read1 -->|Found| Return4["Return value"]
        Read1 -->|Not found| L2
        
        L2["5. Level 2 SSTables<br/><i>10x larger</i>"]
        L2 --> BF2{"Bloom filter:<br/>might exist?"}
        BF2 -->|No| NotFound
        BF2 -->|Maybe| Read2["Binary search<br/>in SSTable"]
        Read2 -->|Found| Return5["Return value"]
        Read2 -->|Not found| NotFound
    end
    
    NotFound["Key not found<br/><i>Checked all levels</i>"]

The read path demonstrates read amplification: checking memtable, then multiple SSTable levels from newest to oldest. Bloom filters eliminate unnecessary disk seeks by definitively saying ‘key not in this SSTable’, but worst-case reads still check dozens of files.

Leveled Compaction: Merging SSTables Across Levels

graph LR
    subgraph "Before Compaction"
        L1A["L1 SSTable A<br/><i>keys: 100-200</i>"]
        L1B["L1 SSTable B<br/><i>keys: 300-400</i>"]
        L1C["L1 SSTable C<br/><i>keys: 150-250</i>"]
        L2A["L2 SSTable 1<br/><i>keys: 100-300</i>"]
        L2B["L2 SSTable 2<br/><i>keys: 301-500</i>"]
    end
    
    subgraph "Compaction Process"
        Select["1. L1 full: select<br/>SSTable C (150-250)"]
        Identify["2. Find overlapping<br/>L2 SSTable 1 (100-300)"]
        Merge["3. Merge-sort both<br/>Keep latest version<br/>of each key"]
        Write["4. Write new<br/>L2 SSTable<br/><i>keys: 100-300</i>"]
    end
    
    subgraph "After Compaction"
        L1A2["L1 SSTable A<br/><i>keys: 100-200</i>"]
        L1B2["L1 SSTable B<br/><i>keys: 300-400</i>"]
        L2New["L2 SSTable (new)<br/><i>keys: 100-300<br/>merged + deduplicated</i>"]
        L2B2["L2 SSTable 2<br/><i>keys: 301-500</i>"]
    end
    
    L1C --> Select
    Select --> Identify
    L2A --> Identify
    Identify --> Merge
    Merge --> Write
    Write --> L2New
    
    L1C -."Delete old".-> X1[" "]
    L2A -."Delete old".-> X2[" "]

Leveled compaction merges overlapping SSTables from adjacent levels, keeping only the latest version of each key. This bounds read amplification (one SSTable per level to check) but creates write amplification—each key is rewritten 10-30x as it moves through levels.

Performance Characteristics

Key-value stores deliver predictable performance at scale, but the numbers vary dramatically based on workload and configuration.

Write Performance Single-node write throughput typically ranges from 50K-500K ops/sec depending on value size and durability settings. RocksDB benchmarks show 200K writes/sec for 1KB values on NVMe SSDs with synchronous WAL writes. If you disable WAL syncing (accepting risk of data loss on crash), throughput jumps to 1M+ writes/sec because you eliminate the disk sync bottleneck. Write latency is typically p99 < 1ms for in-memory memtable writes, though WAL sync can add 1-5ms depending on disk.

Write amplification is the hidden cost. With leveled compaction, each key might be rewritten 10-30x as it moves through levels. If you write 1TB of data, you might generate 10-30TB of actual disk writes. This matters for SSD wear and I/O capacity planning. The formula is: write_amplification = (bytes_written_to_disk) / (bytes_written_by_application).
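The formula rearranges into a simple capacity-planning calculation. A sketch (hypothetical helper name) applied to the 1TB example above:

```python
def disk_writes(app_bytes, write_amplification):
    """From write_amplification = bytes_written_to_disk / bytes_written_
    by_application: estimate the physical write load the disks must
    absorb for a given application ingest volume."""
    return app_bytes * write_amplification

TB = 1024 ** 4
physical = disk_writes(1 * TB, 20)   # 1TB ingested, 20x amplification
# physical == 20TB of actual disk writes, counted against SSD endurance
```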

Read Performance Read performance depends heavily on whether data is cached. Hot keys served from memtable or OS page cache deliver p99 < 1ms latency at 100K+ reads/sec per node. Cold reads that hit disk are slower—p99 10-50ms depending on read amplification. If your key requires checking 20 SSTables, you might do 5-10 disk seeks even with bloom filters.

Block cache hit rate is the critical metric. RocksDB at Facebook tunes for 95%+ block cache hit rates, keeping frequently accessed SSTable blocks in memory. With good caching, read throughput approaches write throughput. Without it, you’re limited by disk IOPS (typically 10K-50K random reads/sec for SSDs).

Scalability Key-value stores scale horizontally by partitioning the key space across nodes. A 100-node cluster can handle 10M+ writes/sec and store 100TB+ of data. The limiting factor is usually network bandwidth (10Gbps NICs saturate at ~1M small packets/sec) or compaction I/O (background compaction can consume 50%+ of disk bandwidth). Netflix runs Cassandra clusters with 1000+ nodes handling petabytes of data.

Trade-offs

Key-value stores make specific trade-offs that make them excellent for certain workloads and poor for others.

What They Excel At Write-heavy workloads are where key-value stores shine. If you’re ingesting time-series data, tracking user sessions, or maintaining counters, the append-only LSM-tree architecture delivers 10-100x better write throughput than B-tree databases. The lack of complex query support is actually a feature—by eliminating joins, secondary indexes, and query optimization, the system can optimize purely for key-based access patterns. This simplicity also enables horizontal scaling without complex distributed query planning.

Predictable latency is another strength. Because operations are O(1) or O(log n) with bounded read amplification, you can reliably hit p99 < 10ms SLAs even under heavy load. There are no slow queries or lock contention issues that plague relational databases.

Where They Fall Short Range queries are inefficient compared to B-tree databases. While LSM-trees store keys in sorted order (enabling range scans), you still need to merge-read from multiple SSTables. A range query touching 1000 keys might require reading from 10+ SSTables, creating significant I/O amplification. B-trees handle range queries more efficiently because all data lives in a single tree structure.

Read-heavy workloads with poor cache locality suffer from read amplification. If you’re randomly accessing cold keys across a large dataset, you’ll do multiple disk seeks per read. Relational databases with B-tree indexes can often serve these reads in 1-2 seeks.

Complex queries are impossible. You can’t join data, filter on non-key attributes, or run aggregations without pulling data into the application layer. If your access patterns evolve beyond simple key lookups, you’ll need to denormalize data into multiple key-value pairs or move to a different database type. See SQL vs NoSQL for when relational databases are better fits.

Consistency models are typically eventual rather than strong. Most distributed key-value stores prioritize availability over consistency (AP in CAP theorem), meaning concurrent writes to the same key might create conflicts that require application-level resolution.

Write Amplification vs Read Amplification Trade-off

graph TB
    subgraph "Size-Tiered Compaction"
        ST_Write["Write Amplification: LOW<br/><i>Each key rewritten 2-5x<br/>Less frequent compaction</i>"]
        ST_Read["Read Amplification: HIGH<br/><i>Check 20-50 SSTables<br/>Overlapping key ranges</i>"]
        ST_Use["Best for: Write-heavy<br/><i>Time-series, logs, metrics</i>"]
        ST_Write --> ST_Read --> ST_Use
    end
    
    subgraph "Leveled Compaction"
        LC_Write["Write Amplification: HIGH<br/><i>Each key rewritten 10-30x<br/>Frequent compaction</i>"]
        LC_Read["Read Amplification: LOW<br/><i>Check O(log n) SSTables<br/>Non-overlapping ranges</i>"]
        LC_Use["Best for: Read-heavy<br/><i>User profiles, sessions</i>"]
        LC_Write --> LC_Read --> LC_Use
    end
    
    Decision{"Workload<br/>Characteristics"}
    Decision -->|"Writes > Reads<br/>Sequential access"| ST_Write
    Decision -->|"Reads > Writes<br/>Random access"| LC_Write
    
    subgraph "Impact on Production"
        Disk["Disk I/O Capacity<br/><i>Write amp consumes bandwidth</i>"]
        SSD["SSD Lifespan<br/><i>Write amp increases wear</i>"]
        Latency["Read Latency<br/><i>Read amp adds seeks</i>"]
    end
    
    ST_Write -."High I/O".-> Disk
    LC_Write -."Very high I/O".-> Disk
    LC_Write -."Faster wear".-> SSD
    ST_Read -."Higher p99".-> Latency

The fundamental trade-off in LSM-tree design: size-tiered compaction minimizes write amplification (better for write-heavy workloads) while leveled compaction minimizes read amplification (better for read-heavy workloads). RocksDB defaults to leveled; Cassandra offers both as tuning options.

When to Use (and When Not To)

Choose a key-value store when your access patterns are dominated by single-key lookups and updates, and you need high write throughput or horizontal scalability.

Ideal Use Cases

  • Session storage: Web applications storing user session data keyed by session ID. Each request does a single get/put operation, writes are frequent (every request), and you need millisecond latency. Pinterest uses Redis (in-memory key-value store) for 100M+ active sessions.
  • User preferences and profiles: Storing user settings, preferences, or profile data keyed by user ID. Access is always by user ID, updates are frequent, and you don’t need to query across users. Netflix uses Cassandra to store viewing history and preferences for 200M+ users.
  • Counters and metrics: Tracking page views, likes, or real-time analytics where you increment counters by key. The write-optimized nature handles high update rates. Amazon uses DynamoDB for shopping cart state and inventory counters.
  • Metadata stores: Storing file metadata, object attributes, or configuration data where each item is accessed by a unique identifier. The simplicity and performance make key-value stores ideal for these supporting systems.

When to Choose Alternatives If you need complex queries, joins, or filtering on non-key attributes, use a relational database or document store. If your data has rich relationships and you need graph traversals, use a graph database. If you need strong consistency guarantees and can tolerate lower write throughput, consider a relational database with ACID transactions. See Document Databases for semi-structured data with query flexibility.

If your workload is read-heavy with poor cache locality (random access to cold data), B-tree databases might deliver better read performance despite lower write throughput. The read amplification in LSM-trees becomes costly when you can’t keep hot data in cache.

Real-World Examples

Netflix: Viewing History and Recommendations Netflix uses Cassandra (LSM-tree based key-value store) to track viewing progress for 200M+ subscribers. Each viewing session generates dozens of position updates as users pause, rewind, or skip. The key is user_id:content_id, and the value contains timestamp, position, and device info. This workload is write-heavy (millions of updates/sec) with occasional reads when users resume watching. Cassandra’s LSM-tree architecture handles the write volume while maintaining p99 < 5ms latency. Netflix runs multi-region Cassandra clusters with 1000+ nodes, storing petabytes of viewing data. The interesting detail: they tune compaction aggressively because viewing data has temporal locality—recent positions are accessed frequently, while old data is rarely read. This allows them to optimize for write throughput without suffering read amplification on hot data.

Amazon: DynamoDB for Shopping Carts Amazon uses DynamoDB (proprietary key-value store) for shopping cart state, where the key is user_id or session_id. Every item addition, quantity change, or removal triggers a write. During peak shopping periods (Prime Day, Black Friday), DynamoDB handles 20M+ requests/sec across all AWS customers. The system uses consistent hashing for partitioning and maintains three replicas per partition for availability. The interesting architectural choice: DynamoDB uses a hybrid storage engine that combines in-memory caching with LSM-tree persistence, achieving single-digit millisecond p99 latency even during traffic spikes. They also implement adaptive capacity—automatically redistributing hot keys across more partitions when they detect skew.

Pinterest: User Graph and Feed Metadata Pinterest uses HBase (Hadoop-based LSM-tree store) to maintain the user graph—who follows whom, board memberships, and pin metadata. Keys are structured as user_id:relationship_type:target_id, enabling efficient lookups of “all users following X” or “all boards user Y owns.” The workload is read-heavy (users browse feeds constantly) but with significant write volume (new follows, pins, repins). Pinterest keeps 95%+ of hot data in HBase’s block cache, achieving p99 < 10ms read latency. The interesting optimization: they use bloom filters aggressively and tune compaction to minimize read amplification on frequently accessed user data, while accepting higher write amplification for less active users.


Interview Essentials

Mid-Level

At the mid-level, interviewers expect you to explain the basic LSM-tree architecture and why it’s write-optimized. You should be able to describe the memtable → SSTable → compaction flow and explain the trade-off between write amplification and read amplification. Be ready to compare key-value stores to relational databases, focusing on when each is appropriate. A common question is “Why are key-value stores faster for writes?”—the answer involves explaining how append-only writes avoid disk seeks and index maintenance. You should also understand bloom filters conceptually (probabilistic data structure for fast negative lookups) and why they’re critical for read performance.

Senior

Senior candidates must demonstrate deep understanding of LSM-tree internals and production trade-offs. Expect questions like “How would you tune a key-value store for a read-heavy workload?” (increase block cache size, use leveled compaction, tune bloom filter false positive rate). You should know the difference between size-tiered and leveled compaction, including their write amplification and read amplification characteristics. Be prepared to discuss consistency models—why most distributed key-value stores are eventually consistent and how to handle conflicts (last-write-wins, vector clocks, CRDTs). You should also understand how to estimate capacity: given 1M writes/sec with 1KB values and 10x write amplification, you need 10GB/sec write bandwidth. Interviewers often ask about hot key problems and how to mitigate them (caching, key splitting, consistent hashing with virtual nodes).

Staff+

Staff+ candidates must demonstrate systems thinking about key-value store architecture decisions and their business impact. You should be able to design a distributed key-value store from scratch, making explicit trade-offs between consistency, availability, and partition tolerance (CAP theorem). Expect deep dives into compaction strategies—why RocksDB chose leveled compaction, when size-tiered makes sense, and how to tune compaction for specific workloads. You should understand write stalls (when compaction can’t keep up with write rate) and how to prevent them (throttling, compaction parallelism, faster disks). Be ready to discuss operational challenges: how to perform zero-downtime schema changes (you can’t—key-value stores are schemaless), how to handle data corruption (checksums, replica repair), and how to balance cost vs. performance (SSD vs. HDD, replication factor, compression). A common staff+ question: “How would you migrate 100TB from a relational database to a key-value store?” This tests your understanding of data modeling differences, migration strategies (dual writes, backfill, cutover), and rollback plans.

Common Interview Questions

Explain how LSM-trees achieve high write throughput. What’s the trade-off?

Walk me through the read path in a key-value store. Where does read amplification come from?

How do bloom filters work, and why are they important for LSM-trees?

When would you choose a key-value store over a relational database?

What is compaction, and why is it necessary? Compare size-tiered vs. leveled compaction.

How would you handle a hot key that’s receiving 10x more traffic than other keys?

Explain write amplification. How does it affect SSD lifespan and I/O capacity?

Design a distributed key-value store. How do you partition data and handle node failures?

Why are most distributed key-value stores eventually consistent? What are the alternatives?

How would you tune a key-value store for a read-heavy workload with poor cache locality?

Red Flags to Avoid

Claiming key-value stores are always faster than relational databases (ignores read amplification and cache effects)

Not understanding the write path (memtable + WAL) or why it’s fast (sequential I/O, no index updates)

Confusing in-memory key-value stores (Redis) with persistent LSM-tree stores (RocksDB, Cassandra)

Not knowing what bloom filters are or why they matter for read performance

Suggesting key-value stores for workloads requiring complex queries, joins, or secondary indexes

Not understanding compaction or claiming it doesn’t impact performance (it’s often the bottleneck)

Ignoring consistency models—assuming strong consistency when most systems are eventually consistent

Not considering operational aspects: how to monitor write amplification, detect hot keys, or handle compaction stalls


Key Takeaways

LSM-trees optimize for writes: Memtable (in-memory) + SSTable (immutable disk files) + compaction (background merging) enables 100K+ writes/sec by transforming random writes into sequential I/O. The trade-off is read amplification—reads must check multiple SSTables.

Write amplification vs. read amplification: Leveled compaction minimizes read amplification (O(log n) SSTables to check) but increases write amplification (10-30x rewriting). Size-tiered compaction does the opposite. Choose based on your read/write ratio.

Bloom filters are critical: These probabilistic data structures enable fast negative lookups (“key definitely not in this SSTable”), reducing disk seeks. Without them, read performance degrades significantly on cold data.

Use for single-key access patterns: Key-value stores excel at session storage, user profiles, counters, and metadata where all access is by key. Avoid for complex queries, joins, or filtering on non-key attributes—the simplicity that enables performance also limits query flexibility.

Production tuning matters: Block cache hit rate, compaction strategy, and write amplification directly impact performance. Netflix and Pinterest achieve p99 < 10ms by keeping hot data cached and tuning compaction for their specific workloads. Monitor write stalls and hot keys in production.