Distributed Locking: Redis, ZooKeeper & Redlock

advanced 15 min read Updated 2026-02-11

After this topic, you will be able to:

  • Compare distributed locking algorithms (Redlock, Chubby, ZooKeeper) for correctness guarantees
  • Analyze failure scenarios in distributed locks (network partitions, clock skew, process pauses)
  • Evaluate fencing token mechanisms for preventing split-brain in distributed locks

TL;DR

Distributed locks coordinate exclusive access to shared resources across multiple processes or machines, preventing race conditions in distributed systems. Unlike local locks, they must handle network partitions, clock skew, and process failures—making correctness surprisingly difficult. Fencing tokens are essential for safety: without them, even sophisticated algorithms like Redlock can violate mutual exclusion guarantees.

Cheat Sheet:

  • Purpose: Mutual exclusion across distributed processes
  • Key Challenge: Network delays and failures can create split-brain scenarios
  • Safety Mechanism: Fencing tokens prevent stale lock holders from corrupting state
  • Common Implementations: Redis (Redlock), ZooKeeper (ephemeral nodes), etcd, Chubby
  • Interview Focus: Understanding why time-based locks alone are insufficient for correctness

Background

Distributed locking emerged from the need to coordinate access to shared resources in distributed systems where multiple processes compete for the same resource—whether that’s updating a database record, processing a job exactly once, or electing a leader. In monolithic applications, language-level mutexes or database row locks suffice. But when your system spans multiple machines, you need a coordination service that all nodes can consult.

The problem is deceptively hard. Early distributed systems often used simple time-based locks in Redis or Memcached: “acquire lock with 10-second TTL, do work, release lock.” This works until network partitions, garbage collection pauses, or clock skew cause a process to hold a lock longer than intended while another process acquires it. Martin Kleppmann’s 2016 critique of Redlock crystallized the community’s understanding that distributed locks require more than timeouts—they need fencing tokens to guarantee safety.

Google’s Chubby (2006) was one of the first production systems to get this right, using Paxos-based consensus to provide strongly consistent locks with sequencer numbers (fencing tokens). ZooKeeper (2010) brought similar guarantees to open-source systems. Today, distributed locks are fundamental to leader election, distributed cron jobs (ensuring only one instance runs), and preventing duplicate processing in event-driven architectures. The key insight: a lock is only as safe as the mechanism preventing stale holders from acting on it.

Architecture

A distributed locking system consists of three key components: lock servers, clients, and protected resources. The lock servers form a coordination service (single Redis instance, Redis cluster, ZooKeeper ensemble, or etcd cluster) that stores lock state and mediates acquisition. Clients request locks by writing to the coordination service, typically with a unique identifier and timeout. The protected resource (database, file, job queue) is what the lock guards—and critically, it must validate fencing tokens to reject stale lock holders.

Single-Server Approach (Redis): The simplest architecture uses a single Redis instance. Clients execute SET resource_name my_random_value NX PX 30000 to acquire a lock with a 30-second expiration. The random value ensures only the lock holder can release it (preventing accidental release by another client). This works for efficiency use cases (preventing duplicate work) but fails for correctness: if Redis crashes, all locks are lost. If a client pauses (GC, network delay), its lock expires while it’s still working, and another client acquires the lock—classic split-brain.
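The acquire/release pattern above can be sketched with an in-memory stand-in for Redis (the class and key names are illustrative). A real client would issue `SET key token NX PX ttl` and run the compare-and-delete release as a Lua script; the model below mirrors those semantics, including why the random token matters:

```python
import time
import uuid

class SingleNodeLock:
    """In-memory model of the Redis SET NX PX lock pattern.

    Real Redis equivalents:
      acquire -> SET key token NX PX ttl_ms
      release -> Lua script that deletes the key only if it
                 still holds the caller's token
    """

    def __init__(self):
        self._store = {}  # key -> (token, expiry on the monotonic clock)

    def acquire(self, key, ttl_ms):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return None  # lock held and not yet expired: NX fails
        token = uuid.uuid4().hex  # random value proves ownership
        self._store[key] = (token, now + ttl_ms / 1000.0)
        return token

    def release(self, key, token):
        entry = self._store.get(key)
        # Compare-and-delete: never remove another client's lock
        if entry is not None and entry[0] == token:
            del self._store[key]
            return True
        return False

lock = SingleNodeLock()
t_a = lock.acquire("job:42", ttl_ms=30_000)
t_b = lock.acquire("job:42", ttl_ms=30_000)       # second client is blocked
assert t_a is not None and t_b is None
assert not lock.release("job:42", "wrong-token")  # foreign/stale token is a no-op
assert lock.release("job:42", t_a)
```

Without the token check in `release`, a client whose lock had already expired could delete the lock a different client now holds, which is exactly the accidental-release bug the Lua script guards against.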

Multi-Server Approach (Redlock): Redlock attempts to improve fault tolerance by running N independent Redis instances (typically 5). A client acquires the lock by:

  1. Getting the current time
  2. Trying to acquire the lock on all N instances sequentially, with a small timeout for each attempt
  3. Computing elapsed time; if locks were acquired on a majority (≥3 of 5) and elapsed time is less than lock validity time, the lock is considered acquired
  4. If acquisition fails, releasing all locks
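The four steps above can be sketched as follows. Each Redis instance is modeled as a plain dict so the sketch is self-contained; `try_acquire` stands in for `SET lock NX PX` against one instance, and expiry is omitted from the model:

```python
import time
import uuid

def try_acquire(instance, key, token, ttl_ms):
    # Stand-in for SET key token NX PX ttl_ms on one Redis instance.
    # Expiry is omitted; a real instance would drop the key after ttl_ms.
    if key in instance:
        return False
    instance[key] = token
    return True

def redlock_acquire(instances, key, ttl_ms, drift_factor=0.01):
    """Sketch of Redlock's acquisition steps, not production code."""
    token = uuid.uuid4().hex
    start = time.monotonic()                       # step 1: note the time
    acquired = sum(try_acquire(inst, key, token, ttl_ms)
                   for inst in instances)          # step 2: try all N instances
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * drift_factor + 2           # bounded clock-drift margin
    validity_ms = ttl_ms - elapsed_ms - drift_ms   # step 3: remaining validity
    if acquired >= len(instances) // 2 + 1 and validity_ms > 0:
        return token, validity_ms                  # majority within the window
    for inst in instances:                         # step 4: roll back on failure
        if inst.get(key) == token:
            del inst[key]
    return None, 0

instances = [{} for _ in range(5)]
token, validity_ms = redlock_acquire(instances, "res", ttl_ms=10_000)
assert token is not None and validity_ms > 0
```

Note that `validity_ms` is computed once, at acquisition time, from the wall clock; nothing rechecks it while the client works, which is precisely the gap the process-pause critique exploits.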

Redlock assumes independent failures and bounded clock drift. However, as Kleppmann demonstrated, this still fails under realistic conditions: a client can acquire a majority, pause for GC, resume after its locks expire, and proceed to corrupt state while another client holds the lock.

Consensus-Based Approach (ZooKeeper/Chubby): The robust solution uses consensus algorithms (Paxos, Raft) to maintain strongly consistent lock state. In ZooKeeper, clients create ephemeral sequential nodes under a lock path (e.g., /locks/resource/lock-0000000001). The client with the lowest sequence number holds the lock. If the client disconnects, ZooKeeper’s session timeout deletes the ephemeral node, automatically releasing the lock. Critically, ZooKeeper provides monotonically increasing sequence numbers that serve as fencing tokens. The protected resource (database, storage system) must reject operations with lower sequence numbers, preventing stale lock holders from causing corruption. See Distributed Consensus for how Paxos and Raft provide the foundation for this approach.

Distributed Lock Architecture: Single Redis vs. Consensus-Based

graph TB
    subgraph Single Redis Architecture
        C1["Client A"]
        C2["Client B"]
        R["Redis<br/><i>Single Instance</i>"]
        DB1[("Protected<br/>Resource")]
        C1 --"1. SET lock NX PX 30000"--> R
        C2 --"2. SET lock NX PX 30000<br/>(blocked)"--> R
        C1 --"3. Write data"--> DB1
        C1 --"4. DEL lock"--> R
    end
    
    subgraph ZooKeeper Consensus Architecture
        C3["Client A"]
        C4["Client B"]
        subgraph ZK Ensemble
            Z1["ZK Node 1<br/><i>Leader</i>"]
            Z2["ZK Node 2"]
            Z3["ZK Node 3"]
        end
        DB2[("Protected<br/>Resource<br/><i>Validates Tokens</i>")]
        C3 --"1. Create ephemeral<br/>sequential node"--> Z1
        C4 --"2. Create ephemeral<br/>sequential node"--> Z1
        Z1 -."Consensus<br/>replication".-> Z2
        Z1 -."Consensus<br/>replication".-> Z3
        C3 --"3. Write with<br/>token=33"--> DB2
    end

Single Redis provides fast lock acquisition but no fault tolerance or fencing tokens. ZooKeeper uses consensus across multiple nodes to provide strongly consistent locks with sequence numbers that serve as fencing tokens, which the protected resource validates to prevent stale lock holders from corrupting state.

Internals

Redis Single-Instance Implementation: Redis uses a simple key-value store with TTL (time-to-live) for lock expiration. The SET NX PX command atomically sets a key only if it doesn’t exist, with millisecond precision expiration. Internally, Redis maintains an expiration dictionary mapping keys to expiration timestamps, checked lazily on access and actively by a background thread. Lock release uses a Lua script to ensure atomicity: if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end. This prevents client A from releasing client B’s lock if A’s lock expired and B acquired it.

Redlock Algorithm Details: Redlock’s core assumption is that clock drift is bounded (e.g., 1% drift on a 10-second lock means ±100ms uncertainty). The algorithm uses wall-clock time to compute lock validity: validity_time = lock_ttl - elapsed_time - clock_drift_margin. If a client acquires locks on 3 of 5 Redis instances in 2 seconds with a 10-second TTL, it has ~7.8 seconds of valid lock time (10 - 2 - 0.2 for clock drift). The client must complete its work and release locks before this window expires.

The problem: process pauses break this model. A client can acquire the lock, pause for 15 seconds (GC, CPU starvation, network partition), resume, and proceed with stale locks. Redlock has no mechanism to detect this. Even worse, if the pause happens after acquiring 3 locks but before checking elapsed time, the client thinks it holds a valid lock when all instances have already expired it.

ZooKeeper Ephemeral Nodes: ZooKeeper stores data in a hierarchical namespace (like a file system). Ephemeral nodes exist only while the client’s session is active—if heartbeats stop and the session times out (the timeout is negotiated per session; with default server settings it can be at most 40 seconds), ZooKeeper deletes all ephemeral nodes created by that session. For distributed locks, clients create ephemeral sequential nodes: create /locks/resource/lock- EPHEMERAL SEQUENTIAL. ZooKeeper appends a monotonically increasing 10-digit sequence number (e.g., lock-0000000001).

To acquire the lock:

  1. Create ephemeral sequential node, get back full path with sequence number
  2. Call getChildren on /locks/resource to list all lock nodes
  3. If your node has the lowest sequence number, you hold the lock
  4. Otherwise, set a watch on the next-lowest node and wait for it to be deleted
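The decision logic in steps 3 and 4 reduces to a sort and a predecessor lookup. Here is a sketch of just that logic (a real client would use a ZooKeeper library for the create/getChildren/watch calls; the node names are illustrative):

```python
def zk_lock_decision(my_node, children):
    """Given the children of the lock path and our own node name,
    decide whether we hold the lock or which node to watch.
    Node names end in ZooKeeper's 10-digit sequence number."""
    ordered = sorted(children, key=lambda n: int(n.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return ("holder", None)           # lowest sequence: lock acquired
    return ("waiting", ordered[idx - 1])  # watch only our predecessor

children = ["lock-0000000034", "lock-0000000033", "lock-0000000035"]
assert zk_lock_decision("lock-0000000033", children) == ("holder", None)
assert zk_lock_decision("lock-0000000035", children) == ("waiting", "lock-0000000034")
```

Because each waiter watches only the node directly ahead of it, releasing the lock wakes exactly one client instead of all of them.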

Watching only the immediate predecessor avoids the “herd effect”: when a lock is released, only one client wakes up rather than all waiting clients thundering to check. The sequence number serves as a fencing token: when the lock holder writes to the protected resource, it includes its sequence number. The resource rejects writes with lower sequence numbers, preventing stale holders from corrupting state even if they’re unaware their session expired.

Chubby’s Sequencer Mechanism: Google’s Chubby (closed-source, but described in their 2006 paper) uses a similar approach with “sequencers”—opaque byte strings that include a lock generation number. When a client acquires a lock, Chubby returns a sequencer. The client passes this sequencer to any service it interacts with while holding the lock. Services check with Chubby that the sequencer is still valid before honoring requests. This prevents the scenario where a client holds a lock, gets partitioned, loses the lock, another client acquires it, and the original client’s delayed requests arrive and corrupt state.

Redlock Algorithm: Multi-Instance Lock Acquisition

sequenceDiagram
    participant Client
    participant Redis1
    participant Redis2
    participant Redis3
    participant Redis4
    participant Redis5
    
    Note over Client: T=0: Start acquisition
    Client->>Redis1: SET lock NX PX 10000
    Redis1-->>Client: OK (acquired)
    Client->>Redis2: SET lock NX PX 10000
    Redis2-->>Client: OK (acquired)
    Client->>Redis3: SET lock NX PX 10000
    Redis3-->>Client: (timeout/fail)
    Client->>Redis4: SET lock NX PX 10000
    Redis4-->>Client: OK (acquired)
    Client->>Redis5: SET lock NX PX 10000
    Redis5-->>Client: (timeout/fail)
    
    Note over Client: T=2s: Check validity<br/>3/5 acquired ✓<br/>Elapsed: 2s<br/>Valid time: 10s - 2s - 0.2s = 7.8s
    Note over Client: Lock acquired!<br/>Must complete work in 7.8s
    
    Note over Client: Problem: If client pauses<br/>for GC here, lock expires<br/>while client thinks it's valid

Redlock attempts to acquire locks on a majority of independent Redis instances. The client computes remaining validity time by subtracting elapsed time and clock drift margin from the TTL. However, process pauses (GC, network delays) can cause the client to hold expired locks without detection, violating mutual exclusion.

ZooKeeper Lock Acquisition with Ephemeral Sequential Nodes

graph TB
    subgraph ZooKeeper Namespace
        Root["/locks/payment-123"]
        Lock1["lock-0000000033<br/><i>Client A - EPHEMERAL</i>"]
        Lock2["lock-0000000034<br/><i>Client B - EPHEMERAL</i>"]
        Lock3["lock-0000000035<br/><i>Client C - EPHEMERAL</i>"]
        Root --> Lock1
        Root --> Lock2
        Root --> Lock3
    end
    
    ClientA["Client A<br/><i>Lock Holder</i>"]
    ClientB["Client B<br/><i>Waiting</i>"]
    ClientC["Client C<br/><i>Waiting</i>"]
    
    ClientA -."1. Create ephemeral<br/>sequential node".-> Lock1
    ClientB -."2. Create ephemeral<br/>sequential node".-> Lock2
    ClientC -."3. Create ephemeral<br/>sequential node".-> Lock3
    
    Lock1 --"Lowest sequence<br/>= Lock acquired"--> ClientA
    Lock2 --"Watch on<br/>lock-0000000033"--> ClientB
    Lock3 --"Watch on<br/>lock-0000000034"--> ClientC
    
    Note1["When Client A releases<br/>or session expires,<br/>lock-0000000033 deleted<br/>→ Client B wakes up"]

ZooKeeper implements distributed locks using ephemeral sequential nodes. Each client creates a node and checks if it has the lowest sequence number. If not, it sets a watch on the next-lowest node to avoid the thundering herd problem. The sequence number serves as a fencing token. When a client’s session expires, ZooKeeper automatically deletes its ephemeral node, releasing the lock.

Correctness Analysis

Distributed locks fail in subtle ways that violate mutual exclusion—the fundamental guarantee that only one client holds the lock at a time. Understanding these failure modes is critical for system design interviews and production systems.

Network Partition Scenario: Consider a Redis-based lock with a 10-second TTL. Client A acquires the lock and begins processing. A network partition isolates Client A from Redis for 15 seconds. Redis expires the lock after 10 seconds. Client B acquires the lock and starts processing. The partition heals. Both Client A and Client B believe they hold the lock and write to the shared resource concurrently—mutual exclusion is violated. Time-based expiration alone cannot prevent this.

Clock Skew in Redlock: Redlock assumes bounded clock drift, but real systems experience clock jumps. Imagine 5 Redis instances with synchronized clocks. Client A acquires locks on 3 instances at time T=0 with 10-second TTL. At T=2, an NTP update jumps the clock forward 15 seconds on 2 of those instances, expiring their locks. Client B now acquires locks on those 2 instances plus 1 other, achieving a majority. Both clients hold “valid” locks according to Redlock’s rules. Martin Kleppmann’s analysis showed that Redlock’s safety depends on synchronized clocks, which distributed systems cannot guarantee.

Process Pause (GC/CPU Starvation): The most insidious failure: Client A acquires a lock, then pauses for 20 seconds due to a stop-the-world GC. The lock expires. Client B acquires the lock and modifies the resource. Client A resumes, unaware it paused, and proceeds to modify the resource with stale assumptions. This violates mutual exclusion even with perfect network and clocks. No timeout-based lock can prevent this without fencing tokens.

Fencing Tokens Provide Safety: The solution is monotonically increasing fencing tokens (sequence numbers). When Client A acquires lock generation 33, it includes “33” in every write to the protected resource. When Client B acquires lock generation 34, it includes “34”. If Client A’s delayed write arrives after Client B’s, the resource sees token 33 < 34 and rejects it. This requires the resource to:

  1. Store the highest token it has seen
  2. Reject any write with a lower token
  3. Accept and update the highest token on writes with higher tokens
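The resource-side rules above amount to a one-field check, sketched here as a toy class (names are illustrative; in practice this check lives in the database or storage layer):

```python
class FencedResource:
    """Resource-side fencing: accept a write only if its token is at
    least as high as the highest token seen so far."""

    def __init__(self):
        self.highest_token = -1   # rule 1: remember the highest token
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False          # rule 2: reject stale lock holders
        self.highest_token = token  # rule 3: advance on higher tokens
        self.data = value
        return True

db = FencedResource()
assert db.write(34, "client B")      # current holder accepted
assert not db.write(33, "client A")  # delayed write from stale holder rejected
assert db.data == "client B"
```

This is the scenario from the section above: Client A's delayed write with token 33 arrives after Client B's write with token 34, and the resource refuses it.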

ZooKeeper’s sequence numbers and Chubby’s sequencers implement this pattern. Redlock does not provide fencing tokens, making it unsuitable for correctness-critical use cases. Redis can be extended with a counter to generate tokens, but this requires additional coordination. For a deeper discussion of fencing mechanisms, see Split Brain & Fencing.

Fencing Token Preventing Split-Brain Corruption

sequenceDiagram
    participant ClientA as Client A
    participant Lock as Lock Service<br/>(ZooKeeper)
    participant ClientB as Client B
    participant DB as Protected Resource<br/>(Database)
    
    ClientA->>Lock: Acquire lock
    Lock-->>ClientA: Lock granted<br/>Token=33
    Note over ClientA: Client A starts work
    
    Note over ClientA: ⚠️ Client A pauses<br/>(GC, network partition)<br/>for 20 seconds
    
    Note over Lock: Lock expires after<br/>session timeout
    
    ClientB->>Lock: Acquire lock
    Lock-->>ClientB: Lock granted<br/>Token=34
    ClientB->>DB: Write data<br/>with Token=34
    DB-->>ClientB: ✓ Accepted<br/>(34 > highest seen)
    Note over DB: Highest token: 34
    
    Note over ClientA: Client A resumes,<br/>unaware it paused
    ClientA->>DB: Write data<br/>with Token=33
    DB-->>ClientA: ✗ Rejected<br/>(33 < 34)
    Note over DB: Corruption prevented!<br/>Stale holder rejected

Fencing tokens prevent split-brain corruption by ensuring the protected resource rejects operations from stale lock holders. When Client A pauses and its lock expires, Client B acquires a lock with a higher token (34). Even though Client A resumes and attempts to write with its stale token (33), the database rejects it because 33 < 34, preventing data corruption.

Network Partition Causing Split-Brain Without Fencing

sequenceDiagram
    participant ClientA as Client A
    participant Redis
    participant ClientB as Client B
    participant Resource as Protected Resource<br/>(No Token Validation)
    
    ClientA->>Redis: SET lock NX PX 10000
    Redis-->>ClientA: OK (lock acquired)
    Note over ClientA: T=0: Start processing
    
    Note over ClientA,Redis: ⚠️ Network partition<br/>Client A isolated for 15s
    
    Note over Redis: T=10s: Lock expires<br/>(TTL reached)
    
    ClientB->>Redis: SET lock NX PX 10000
    Redis-->>ClientB: OK (lock acquired)
    ClientB->>Resource: Write data
    Resource-->>ClientB: ✓ Success
    
    Note over ClientA,Redis: Network partition heals
    
    ClientA->>Resource: Write data<br/>(believes it still holds lock)
    Resource-->>ClientA: ✓ Success
    
    Note over Resource: ⚠️ CORRUPTION!<br/>Both clients wrote<br/>Mutual exclusion violated<br/><br/>Solution: Fencing tokens<br/>would reject Client A

Time-based lock expiration cannot prevent split-brain scenarios. When Client A is partitioned from Redis, the lock expires and Client B acquires it. After the partition heals, both clients believe they hold the lock and write to the resource concurrently, violating mutual exclusion. Fencing tokens would prevent this by allowing the resource to reject Client A’s stale operations.

Performance Characteristics

Distributed lock performance varies dramatically by implementation and use case. The key metrics are lock acquisition latency, throughput (locks per second), and availability (can clients acquire locks during failures).

Redis Single-Instance:

  • Latency: 1-2ms for local network, 10-50ms cross-region. Single round-trip to Redis.
  • Throughput: 50,000-100,000 lock acquisitions/sec on modern hardware (limited by Redis single-threaded event loop).
  • Availability: Zero tolerance for Redis failure. If Redis crashes, all locks are lost and clients cannot acquire new locks until Redis restarts.
  • Use Case: High-throughput, low-latency scenarios where brief unavailability is acceptable (rate limiting, preventing duplicate work).

Redlock (5 Redis instances):

  • Latency: 5-10ms local, 50-250ms cross-region (sequential requests to 5 instances, though can be parallelized).
  • Throughput: ~10,000 locks/sec (limited by need to contact majority).
  • Availability: Tolerates 2 of 5 Redis failures. Clients can still acquire locks with 3 healthy instances.
  • Caveat: As discussed, Redlock’s correctness guarantees are questionable. Use only for efficiency, not correctness.

ZooKeeper/etcd (Consensus-Based):

  • Latency: 10-20ms local (requires consensus round-trip: leader receives request, replicates to majority, acknowledges). Cross-region: 100-500ms depending on quorum placement.
  • Throughput: 5,000-10,000 locks/sec for a 3-node cluster (limited by consensus overhead and leader bottleneck).
  • Availability: Tolerates minority failures (1 of 3, 2 of 5). Requires majority quorum to make progress.
  • Use Case: Correctness-critical scenarios (leader election, distributed transactions, preventing duplicate job execution).

Scalability Considerations: Distributed locks don’t scale horizontally—adding more lock servers doesn’t increase throughput for a single lock. Each lock is a serialization point. If 1,000 clients contend for the same lock, they queue regardless of infrastructure. The solution is lock granularity: instead of one global lock, use per-resource locks (e.g., lock per user ID, per shard). Netflix’s distributed cron system uses per-job locks in ZooKeeper, allowing thousands of jobs to run concurrently while ensuring each job runs on exactly one instance.

Lock Duration Impact: Longer lock hold times reduce throughput. If a lock is held for 100ms on average and acquisition takes 10ms, maximum throughput is ~9 locks/sec per resource (1000ms / 110ms). Short-lived locks (1-10ms hold time) achieve higher throughput but increase coordination overhead. Google’s Chubby paper recommends coarse-grained locks (seconds to hours) for leader election, not fine-grained locks (milliseconds) for transaction coordination.
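The throughput bound above is simple arithmetic: each operation serializes behind the lock for the hold time plus the acquisition time, so per-lock throughput can never exceed one second divided by that sum.

```python
def max_lock_throughput(hold_ms, acquire_ms):
    """Upper bound on serialized operations per second for one lock."""
    return 1000 / (hold_ms + acquire_ms)

# The figures from the paragraph above: 100ms hold + 10ms acquire
assert round(max_lock_throughput(100, 10)) == 9
```

Anything above this ceiling requires finer lock granularity, not more lock servers.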

Trade-offs

Efficiency vs. Correctness: The fundamental trade-off in distributed locking is between performance (low latency, high availability) and safety (guaranteed mutual exclusion). Redis single-instance locks are fast and simple but provide no fault tolerance. Redlock adds fault tolerance but still lacks correctness guarantees under realistic failure conditions. ZooKeeper/etcd provide strong correctness via consensus but pay a latency and complexity cost. Choose based on consequences of failure: if two clients briefly holding the same lock causes duplicate work but no corruption, Redis suffices. If it causes data corruption or financial loss, use consensus-based locks with fencing tokens.

Availability vs. Consistency: Consensus-based locks require a majority quorum, meaning they become unavailable during majority failures (e.g., 2 of 3 nodes down). This is inherent to the CAP theorem: you cannot have both strong consistency (mutual exclusion) and availability during partitions. Redis-based locks favor availability—clients can always acquire locks as long as they reach any Redis instance—but sacrifice consistency (split-brain possible). For critical systems, unavailability is safer than inconsistency: better to fail to acquire a lock than to violate mutual exclusion.

Lock Granularity: Coarse-grained locks (one lock for an entire service) are simple but create contention bottlenecks. Fine-grained locks (per-resource, per-user) improve concurrency but increase coordination overhead and complexity. Uber’s payment system uses per-transaction locks in etcd to prevent duplicate charges, accepting the coordination cost for correctness. In contrast, their dispatch system uses optimistic concurrency (version numbers in the database) to avoid locks entirely for high-throughput matching.

Timeout Selection: Short timeouts (1-5 seconds) reduce the window where a failed client holds a lock, improving availability. But they increase false positives: a slow client loses its lock and causes duplicate work. Long timeouts (30-60 seconds) reduce false positives but mean a crashed client holds the lock longer, delaying recovery. ZooKeeper’s default maximum session timeout of 40 seconds sits between these extremes. Fencing tokens eliminate the need to tune this trade-off: the timeout can be long (reducing false positives) because stale holders are rejected by the resource.

Operational Complexity: Redis is operationally simple: deploy a single instance or Redis Cluster, use SET NX. ZooKeeper/etcd require running a consensus cluster (3-5 nodes), monitoring quorum health, and handling leader elections. For small teams or non-critical use cases, this complexity may not be justified. Many teams use Redis for distributed locking despite its limitations, accepting occasional duplicates in exchange for operational simplicity.

When to Use (and When Not To)

Use Distributed Locks When:

  1. Multiple processes compete for exclusive access to a shared resource (database record, file, job) across machines. If all processes run on one machine, use local locks (mutexes, semaphores).
  2. You need mutual exclusion guarantees stronger than optimistic concurrency. If your database supports compare-and-swap or version numbers, prefer that—it’s simpler and faster.
  3. The resource cannot enforce exclusion itself. Distributed locks are a coordination mechanism external to the resource. If the resource can enforce exclusion (e.g., database unique constraints, Kafka consumer groups), use that instead.

Choose Redis Single-Instance When:

  • Efficiency, not correctness, is the goal. Preventing duplicate work (cron jobs, cache warming) where occasional duplicates are acceptable.
  • Latency is critical (sub-5ms lock acquisition).
  • Operational simplicity matters more than fault tolerance.
  • Example: Rate limiting API requests. If two requests briefly bypass the limit during a Redis restart, it’s annoying but not catastrophic.

Choose Redlock When:

  • Honestly, avoid Redlock for new systems. It’s more complex than single-instance Redis but doesn’t provide the correctness guarantees of consensus-based locks. If you need fault tolerance, use ZooKeeper/etcd. If you don’t, use single-instance Redis.

Choose ZooKeeper/etcd When:

  • Correctness is non-negotiable. Financial transactions, leader election, preventing duplicate job execution where duplicates cause data corruption.
  • You can implement fencing tokens in the protected resource (database, storage system).
  • You’re already running ZooKeeper/etcd for other coordination (service discovery, configuration). Adding locks is incremental complexity.
  • Example: Distributed cron ensuring exactly-once job execution. Netflix uses ZooKeeper locks for their scheduler—if two instances run the same job, it could charge customers twice.

Alternatives to Consider:

  • Optimistic Concurrency: Use version numbers or ETags in your database. Cheaper and simpler than distributed locks. Stripe’s payment API uses idempotency keys instead of locks.
  • Database Row Locks: If all processes access the same database, use SELECT FOR UPDATE. Simpler than external coordination.
  • Message Queue Semantics: Kafka consumer groups and SQS visibility timeouts provide at-least-once or exactly-once semantics without explicit locks.
  • Leader Election: If you need one active instance, use leader election (see Leader Election) rather than locks for every operation.
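The optimistic-concurrency alternative can be sketched as a version-number compare-and-swap, the pattern a database implements with a `version` column or ETag (the class here is an illustrative in-memory stand-in):

```python
class VersionedRow:
    """Optimistic concurrency: an update succeeds only if the caller
    read the current version (compare-and-swap on a version number)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def update(self, new_value, expected_version):
        if expected_version != self.version:
            return False       # someone else wrote first: caller retries
        self.value = new_value
        self.version += 1
        return True

row = VersionedRow("pending")
_, v = row.read()
assert row.update("charged", v)      # first writer wins
assert not row.update("charged", v)  # concurrent writer with stale version fails
```

No lock service is involved: conflicting writers simply detect the conflict at write time and retry, which is why this is often simpler and faster than distributed locking when the resource supports it.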

Real-World Examples

Google (Chubby Lock Service): Google built Chubby in 2006 to provide distributed locking and leader election for internal systems like GFS (Google File System) and Bigtable. Chubby uses Paxos consensus across 5 replicas to maintain strongly consistent lock state. Each lock acquisition returns a sequencer (fencing token) that clients pass to storage systems. GFS master election uses Chubby locks: when a master acquires the lock, it gets a sequencer and includes it in all operations. If the master is partitioned and a new master is elected with a higher sequencer, storage nodes reject the old master’s operations. This prevents split-brain corruption.

Interesting detail: Chubby’s design explicitly optimizes for coarse-grained locks held for hours or days (leader election, configuration) rather than fine-grained locks held for milliseconds (transaction coordination). This reduces load on the Chubby cluster and simplifies client caching. Clients cache lock state and receive invalidation callbacks when locks change, reducing round-trips. Google’s paper notes that Chubby handles millions of clients across their fleet with just a handful of Chubby cells (clusters).

Netflix (Distributed Cron, Fenzo Scheduler): Netflix runs thousands of scheduled jobs (data pipelines, cache warming, report generation) across their microservices. To ensure exactly-once execution, each job uses a distributed lock in ZooKeeper. When a job’s schedule triggers, all instances attempt to acquire the lock at /jobs/{job_id}/lock. The instance that creates the lowest ephemeral sequential node wins and executes the job. If that instance crashes, ZooKeeper’s session timeout releases the lock and another instance acquires it, ensuring the job eventually runs.

Interesting detail: Netflix’s scheduler includes a “fencing token” mechanism where the lock holder’s sequence number is passed to downstream services. If a job writes to S3 or a database, it includes its sequence number in metadata. This prevents a partitioned job instance from corrupting state after its lock expired and another instance acquired it. They also use lock timeouts aligned with job SLAs: a 5-minute job gets a 10-minute lock to tolerate slow execution, but not so long that a crashed instance blocks the job for hours.

Uber (Payment Processing): Uber’s payment system uses distributed locks in etcd to prevent duplicate charges when processing ride payments. When a ride ends, multiple services (trip completion, surge pricing, promotions) may trigger payment processing. Each service attempts to acquire a lock on /payments/{ride_id} before charging the user. The lock holder processes the payment, writes the result to the database with the lock’s sequence number (fencing token), and releases the lock. If a service is partitioned and its lock expires, the database rejects its payment write because the sequence number is stale.

Interesting detail: Uber chose etcd over ZooKeeper for operational reasons: etcd’s gRPC API and HTTP endpoints simplified integration with their polyglot microservices (Go, Java, Node.js). They run a 5-node etcd cluster per region with cross-region replication for disaster recovery. Lock acquisition latency is 10-15ms in the same region, which is acceptable for payment processing (not latency-critical). They considered Redis but rejected it due to lack of fencing tokens—duplicate charges are a regulatory and customer trust issue, so correctness trumps performance.


Interview Essentials

Mid-Level

Explain how a distributed lock differs from a local mutex. What new failure modes exist in distributed systems?

Walk through how you’d implement a simple distributed lock using Redis SET NX. What are its limitations?

Describe a scenario where two clients could both believe they hold the same lock. How does this violate mutual exclusion?

What is a fencing token and why is it necessary? Give an example of how it prevents corruption.

Compare Redis-based locks and ZooKeeper-based locks. When would you choose each?

Senior

Analyze the Redlock algorithm. What assumptions does it make about clocks and network? Under what conditions does it fail?

Design a distributed lock system for a payment processing service where duplicate charges are unacceptable. What guarantees do you need?

Explain how ZooKeeper’s ephemeral sequential nodes provide both lock acquisition and fencing tokens. Walk through the algorithm step-by-step.

A client acquires a lock, pauses for GC, resumes after the lock expires, and writes to the database. How do fencing tokens prevent corruption here?

Your team wants to use Redis locks for a critical system. What questions would you ask to evaluate if this is safe? What alternatives would you propose?

Staff+

Critique Martin Kleppmann’s analysis of Redlock. Do you agree with his conclusions? Under what conditions (if any) is Redlock safe?

Design a distributed lock service for a multi-region system where lock acquisition must complete in <10ms. What trade-offs do you make?

How would you implement fencing tokens in a legacy system where the protected resource (database) cannot be modified to check tokens?

Compare distributed locks vs. optimistic concurrency (version numbers) for preventing duplicate job execution. What are the latency, throughput, and correctness implications?

Your distributed lock system is causing outages: clients hold locks too long, causing cascading failures. How do you debug this? What architectural changes would prevent it?

Common Interview Questions

What is a distributed lock and why do we need it?

Explain the difference between efficiency and correctness in distributed locking.

How does clock skew affect distributed locks?

What is a fencing token and how does it work?

When would you use Redis vs. ZooKeeper for distributed locking?

Red Flags to Avoid

Claiming Redlock is safe for correctness-critical use cases without mentioning fencing tokens or Kleppmann’s critique.

Not understanding that time-based lock expiration alone cannot prevent split-brain scenarios.

Suggesting distributed locks for every coordination problem without considering alternatives (optimistic concurrency, database constraints, message queue semantics).

Ignoring the operational complexity of running a consensus-based coordination service (ZooKeeper, etcd).

Not recognizing that the protected resource must validate fencing tokens—the lock service alone is insufficient for safety.


Key Takeaways

Distributed locks coordinate exclusive access across machines, but time-based expiration alone is insufficient for correctness. Network partitions, clock skew, and process pauses can cause split-brain scenarios where multiple clients hold the same lock.

Fencing tokens (monotonically increasing sequence numbers) are essential for safety. The protected resource must reject operations from stale lock holders by checking token ordering. ZooKeeper and Chubby provide this; Redis-based locks do not.

Choose your locking strategy based on consequences of failure. Redis single-instance is fast and simple for efficiency use cases (preventing duplicate work). ZooKeeper/etcd with fencing tokens is necessary for correctness-critical scenarios (financial transactions, leader election).

Redlock attempts to provide fault tolerance without consensus, but its correctness guarantees are questionable under realistic failure conditions. Prefer single-instance Redis for efficiency or consensus-based locks for correctness; avoid Redlock for new systems.

Consider alternatives to distributed locks: optimistic concurrency (version numbers), database row locks, and message queue semantics are often simpler and faster. Distributed locks are a coordination primitive, not a universal solution.