Distributed Consensus: Raft & Paxos Explained
After this topic, you will be able to:
- Analyze Raft and Paxos consensus algorithms for correctness and performance trade-offs
- Compare consensus protocol characteristics (leader-based vs leaderless, message complexity)
- Evaluate consensus algorithm selection based on consistency requirements and failure scenarios
TL;DR
Distributed consensus ensures multiple nodes agree on a single value or sequence of operations despite failures and network partitions. Algorithms like Raft and Paxos guarantee safety (agreement, validity) and, under normal network conditions, liveness (termination) while tolerating up to (N-1)/2 node failures (rounded down) in a cluster of N nodes. Modern systems like etcd (Kubernetes), Consul (HashiCorp), and ZooKeeper (Apache) use consensus to coordinate distributed state, leader election, and configuration management.
Cheat Sheet: Raft = understandable leader-based consensus with explicit log replication. Paxos = theoretically elegant but complex multi-phase protocol. Both require a majority quorum (e.g., 3 of 5 nodes) for progress. Typical latency: 1-5ms for local clusters, 50-200ms cross-region. Choose Raft for new systems (better tooling, clearer mental model); choose Paxos only if it’s already deployed (Google Chubby, Spanner).
Background
The distributed consensus problem emerged from the need to build reliable systems from unreliable components. In 1982, Leslie Lamport formalized the challenge: how can independent processes agree on a single value when messages can be delayed, lost, or delivered out of order, and processes can crash? This isn’t just academic—every distributed database, coordination service, and replicated state machine faces this problem.
The naive approach of “just pick a leader and follow it” fails catastrophically during network partitions. Imagine a payment system where two data centers lose connectivity. If both independently process transactions, you get split-brain: inconsistent state that’s impossible to reconcile. Banks learned this the hard way in the 1980s when ATM networks experienced partitions and issued duplicate withdrawals.
Lamport’s Paxos algorithm (circulated in 1989, formally published in 1998 as “The Part-Time Parliament”) was the first practical solution, showing that consensus is achievable with a majority quorum even when up to (N-1)/2 nodes (rounded down) fail. However, Paxos gained a reputation for being notoriously difficult to understand and implement correctly. Google’s Chubby lock service (2006) used Paxos internally but required years of engineering effort to get right. This complexity tax led Diego Ongaro and John Ousterhout to design Raft (2014) with explicit understandability as a goal. Raft decomposes consensus into leader election, log replication, and safety properties, making it teachable to undergraduates and implementable by practitioners.
Today, consensus algorithms power critical infrastructure: Kubernetes uses etcd (Raft) to store cluster state, Consul coordinates service discovery across data centers, and Kafka has historically used ZooKeeper (whose ZAB protocol is closely related to Paxos) for broker coordination, with newer Kafka versions replacing ZooKeeper with the Raft-based KRaft. Understanding consensus is essential because it’s the foundation for any system that needs strong consistency guarantees across multiple machines.
Architecture
Consensus systems follow a replicated state machine architecture. The core idea: if multiple nodes start in the same state and apply the same operations in the same order, they’ll end up in the same final state. The consensus algorithm ensures agreement on the operation sequence despite failures.
A typical consensus cluster has three architectural layers. The consensus layer manages agreement on log entries (operations). Each node maintains a replicated log where entries are numbered sequentially. The consensus protocol ensures all nodes agree on what goes in each log position. The state machine layer applies committed log entries to the application state (a key-value store, configuration database, etc.). Once an entry is committed (replicated to a majority), it’s applied deterministically. The client interface layer handles read/write requests, forwarding writes to the leader and potentially serving reads from followers (with consistency trade-offs).
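The replicated state machine idea is small enough to sketch directly. The following Go toy (the Entry type, operation names, and Apply function are illustrative, not taken from any real system) shows why deterministic application of an agreed log yields identical replicas:

```go
package main

import "fmt"

// Entry is a committed log entry: an operation on a key-value state
// machine. Names here are illustrative, not from any implementation.
type Entry struct {
	Op    string // "SET" or "DEL"
	Key   string
	Value string
}

// Apply replays committed entries in log order. Because Apply is
// deterministic, any node that applies the same prefix of the log
// reaches the same state.
func Apply(state map[string]string, log []Entry) {
	for _, e := range log {
		switch e.Op {
		case "SET":
			state[e.Key] = e.Value
		case "DEL":
			delete(state, e.Key)
		}
	}
}

func main() {
	log := []Entry{
		{"SET", "x", "5"},
		{"SET", "y", "10"},
		{"DEL", "x", ""},
	}
	a, b := map[string]string{}, map[string]string{}
	Apply(a, log)
	Apply(b, log) // a second replica applying the same log
	fmt.Println(a["y"], b["y"]) // both replicas converge: 10 10
}
```

Any two replicas fed the same committed prefix end in the same state, which is why the consensus layer only has to agree on the log, never on the state itself.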
Nodes in the cluster have distinct roles. In leader-based protocols like Raft, one node is elected leader and handles all writes. The leader appends entries to its log, then replicates them to followers. Followers passively accept log entries and vote in elections. If the leader fails, followers detect the timeout and start a new election. In leaderless protocols like basic Paxos, any node can propose values, but Multi-Paxos typically elects a stable leader for efficiency.
The quorum requirement is fundamental. With N nodes, consensus requires responses from floor(N/2) + 1 nodes for any decision. This means a 5-node cluster tolerates 2 failures and a 3-node cluster tolerates 1. The math is deliberate: any two quorums must overlap in at least one node, ensuring that new decisions see previously committed values. This is why consensus clusters are typically deployed with odd numbers of nodes (3, 5, or 7)—even numbers don’t improve fault tolerance but add coordination overhead.
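The quorum arithmetic is simple enough to verify directly. A short Go sketch (the function names are ours, chosen for illustration):

```go
package main

import "fmt"

// QuorumSize returns the number of acknowledgments needed for a
// decision in an N-node cluster: floor(N/2) + 1 (Go integer division
// floors). Any two quorums of this size overlap in at least one node.
func QuorumSize(n int) int { return n/2 + 1 }

// FaultTolerance returns how many node failures the cluster survives
// while still able to assemble a quorum: floor((N-1)/2).
func FaultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 6, 7} {
		fmt.Printf("N=%d quorum=%d tolerates=%d\n",
			n, QuorumSize(n), FaultTolerance(n))
	}
}
```

Running this shows that a 4-node cluster tolerates the same single failure as a 3-node cluster, and 6 nodes the same two failures as 5, which is exactly why even-sized clusters are avoided.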
For cross-references: Leader election mechanics (how nodes campaign, vote, and handle split votes) are covered in detail in Leader Election. This topic focuses on how consensus algorithms use leader election as one component of a larger protocol.
Consensus Cluster Architecture with Replicated State Machine
graph TB
subgraph Client Layer
C1["Client 1"]
C2["Client 2"]
end
subgraph Consensus Cluster
subgraph Node 1 - Leader
L_CI["Client Interface"]
L_CL["Consensus Layer<br/><i>Raft/Paxos</i>"]
L_LOG[("Replicated Log<br/>1: SET x=5 (T1)<br/>2: SET y=10 (T1)<br/>3: DEL z (T2)")]
L_SM["State Machine<br/><i>Key-Value Store</i>"]
end
subgraph Node 2 - Follower
F1_CI["Client Interface"]
F1_CL["Consensus Layer"]
F1_LOG[("Replicated Log<br/>1: SET x=5 (T1)<br/>2: SET y=10 (T1)<br/>3: DEL z (T2)")]
F1_SM["State Machine"]
end
subgraph Node 3 - Follower
F2_CI["Client Interface"]
F2_CL["Consensus Layer"]
F2_LOG[("Replicated Log<br/>1: SET x=5 (T1)<br/>2: SET y=10 (T1)")]
F2_SM["State Machine"]
end
end
C1 --"1. Write Request"--> L_CI
C2 --"Read Request"--> F1_CI
L_CI --"2. Append to Log"--> L_LOG
L_CL --"3. Replicate (AppendEntries)"--> F1_CL
L_CL --"3. Replicate (AppendEntries)"--> F2_CL
F1_CL --"4. Append to Log"--> F1_LOG
F2_CL --"4. Append to Log"--> F2_LOG
F1_CL --"5. ACK"--> L_CL
F2_CL --"5. ACK"--> L_CL
L_CL --"6. Commit (majority reached)"--> L_LOG
L_LOG --"7. Apply committed entries"--> L_SM
F1_LOG --"7. Apply committed entries"--> F1_SM
A 3-node consensus cluster showing the three architectural layers: client interface (handles requests), consensus layer (manages log replication), and state machine (applies committed entries). The leader replicates entries to followers, and once a majority acknowledges (2 of 3 nodes), the entry is committed and applied to all state machines in the same order.
Internals
Raft and Paxos solve consensus through different approaches, but both provide the same core guarantees: agreement (all nodes decide the same value), validity (the decided value was proposed by some node), and termination (all non-faulty nodes eventually decide). The first two are safety properties and hold unconditionally; termination is a liveness property that holds only under reasonable timing assumptions (the FLP impossibility result shows it cannot be guaranteed in a fully asynchronous system). Let’s examine how each achieves this.
Raft Internals: Raft divides time into terms, each beginning with an election. Terms are monotonically increasing integers that act as logical clocks. Each log entry is tagged with the term in which it was created. When a node becomes leader for term T, it only appends entries with term T.
Log replication works through AppendEntries RPCs. The leader sends its log entries to followers, including the index and term of the preceding entry. Followers check if their log matches at that position. If not, they reject the RPC, and the leader decrements the index and retries. This process finds the point where logs diverge and overwrites the follower’s conflicting entries. The key insight: the leader’s log is always authoritative for its term.
Commitment happens when an entry is replicated to a majority. The leader tracks the highest index replicated to each follower. Once an entry reaches majority replication, it’s committed. Crucially, the leader only commits entries from its current term directly. Entries from previous terms are committed indirectly when a current-term entry is committed. This prevents a subtle bug where an old entry could be committed, then overwritten by a new leader.
Raft’s safety property ensures that if a log entry is committed at a given index, no different entry will ever be committed at that index. This is enforced through election restrictions: a candidate can only win an election if its log is at least as up-to-date as that of every voter in the majority that elects it. “At least as up-to-date” means having a higher last term, or the same last term with an equal or higher last index. This prevents nodes with stale logs from becoming leader and overwriting committed entries.
Paxos Internals: Basic Paxos operates in two phases for each value. In Phase 1 (Prepare), a proposer selects a proposal number N (higher than any it’s seen) and sends Prepare(N) to acceptors. Acceptors respond with a promise not to accept proposals numbered less than N, and include the highest-numbered proposal they’ve already accepted (if any). If the proposer receives promises from a majority, it proceeds to Phase 2.
In Phase 2 (Accept), the proposer sends Accept(N, V) to acceptors, where V is either the value from the highest-numbered proposal returned in Phase 1, or the proposer’s own value if no proposals were returned. Acceptors accept the proposal unless they’ve since promised to a higher-numbered proposal. If a majority accepts, the value is chosen.
The genius of Paxos is in Phase 1’s promise mechanism. By promising not to accept lower-numbered proposals, acceptors prevent old proposals from overwriting newer ones. By returning previously accepted values, acceptors ensure that once a value is chosen by a majority, all future proposals will propose that same value.
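The two phases can be condensed into acceptor handlers plus the proposer’s value-selection rule. This Go sketch (types and names are illustrative, not from any real Paxos library) mirrors the promise mechanism described above:

```go
package main

import "fmt"

// Proposal is a numbered value; higher N wins ties.
type Proposal struct {
	N int
	V string
}

// Acceptor holds the persistent Paxos state: the highest proposal
// number promised and the highest-numbered proposal accepted so far.
type Acceptor struct {
	promisedN int
	accepted  *Proposal
}

// HandlePrepare implements Phase 1: promise to ignore proposals below n
// and report any previously accepted proposal back to the proposer.
func (a *Acceptor) HandlePrepare(n int) (ok bool, prev *Proposal) {
	if n <= a.promisedN {
		return false, nil // already promised a higher-numbered proposal
	}
	a.promisedN = n
	return true, a.accepted
}

// HandleAccept implements Phase 2: accept unless a higher-numbered
// Prepare has been promised since.
func (a *Acceptor) HandleAccept(p Proposal) bool {
	if p.N < a.promisedN {
		return false
	}
	a.promisedN = p.N
	a.accepted = &p
	return true
}

// ChooseValue applies the proposer's rule: adopt the value of the
// highest-numbered proposal any acceptor reported, else use our own.
func ChooseValue(own string, reported []*Proposal) string {
	var best *Proposal
	for _, p := range reported {
		if p != nil && (best == nil || p.N > best.N) {
			best = p
		}
	}
	if best != nil {
		return best.V
	}
	return own
}

func main() {
	// Three acceptors that have each seen different earlier proposals.
	a1 := &Acceptor{promisedN: 3, accepted: &Proposal{3, "X"}}
	a2 := &Acceptor{promisedN: 4, accepted: &Proposal{4, "Y"}}
	a3 := &Acceptor{promisedN: 2}

	// Phase 1 with proposal number N=5.
	var reported []*Proposal
	for _, a := range []*Acceptor{a1, a2, a3} {
		if ok, prev := a.HandlePrepare(5); ok {
			reported = append(reported, prev)
		}
	}
	fmt.Println(ChooseValue("Z", reported)) // must carry forward Y (from N=4)
}
```

Note that the proposer abandons its own value “Z”: once a value may have been chosen by a majority, every later proposal is forced to re-propose it.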
Multi-Paxos optimizes basic Paxos for multiple values by electing a stable leader. The leader skips Phase 1 for subsequent proposals, sending only Accept messages. This cuts the cost per value from two round-trips (Prepare plus Accept) to one, matching Raft’s efficiency. However, Multi-Paxos requires careful handling of leader changes and log gaps, which is where implementation complexity creeps in.
Data Structures: Both algorithms maintain persistent state. Raft nodes store currentTerm, votedFor, and log[] on disk before responding to RPCs. This ensures that after a crash, a node doesn’t violate promises made before the crash (like voting for two different candidates in the same term). Paxos acceptors persist the highest promised proposal number and all accepted proposals. These disk writes are on the critical path, making fsync latency a key performance factor.
Paxos Two-Phase Protocol for Single Value Consensus
sequenceDiagram
participant P as Proposer<br/>(Node 1)
participant A1 as Acceptor 1
participant A2 as Acceptor 2
participant A3 as Acceptor 3
participant L as Learner<br/>(All nodes)
Note over P,A3: Phase 1: Prepare (Promise not to accept lower proposals)
P->>P: Select proposal number N=5
P->>A1: Prepare(N=5)
P->>A2: Prepare(N=5)
P->>A3: Prepare(N=5)
Note over A1: Highest promised: N=3<br/>Highest accepted: (N=3, V=X)
A1->>P: Promise(N=5, acceptedProposal=(3,X))
Note over A2: Highest promised: N=4<br/>Highest accepted: (N=4, V=Y)
A2->>P: Promise(N=5, acceptedProposal=(4,Y))
Note over A3: Highest promised: N=2<br/>No accepted proposal
A3->>P: Promise(N=5, acceptedProposal=null)
Note over P: Received majority promises (3 of 3)<br/>Must propose highest accepted value: Y<br/>(from proposal N=4)
Note over P,A3: Phase 2: Accept (Replicate chosen value)
P->>A1: Accept(N=5, V=Y)
P->>A2: Accept(N=5, V=Y)
P->>A3: Accept(N=5, V=Y)
Note over A1: N=5 ≥ promised N=5 ✓
A1->>A1: Accept (N=5, V=Y)
A1->>P: Accepted(N=5)
A1->>L: Accepted(N=5, V=Y)
Note over A2: N=5 ≥ promised N=5 ✓
A2->>A2: Accept (N=5, V=Y)
A2->>P: Accepted(N=5)
A2->>L: Accepted(N=5, V=Y)
Note over A3: N=5 ≥ promised N=5 ✓
A3->>A3: Accept (N=5, V=Y)
A3->>P: Accepted(N=5)
A3->>L: Accepted(N=5, V=Y)
Note over L: Majority accepted (3 of 3)<br/>Value Y is chosen
Paxos two-phase protocol showing how consensus is reached on a single value. Phase 1 (Prepare) ensures the proposer learns about any previously accepted values and gets promises from acceptors not to accept lower-numbered proposals. Phase 2 (Accept) replicates the value to a majority. The key insight: by proposing the highest-numbered previously accepted value (Y from N=4), Paxos ensures that once a value is chosen by a majority, all future proposals will propose that same value.
Raft Protocol Deep Dive
Let’s walk through Raft’s protocol with concrete message flows to understand how it achieves consensus in practice.
Leader Election: Nodes start as followers, expecting heartbeats from a leader. Each follower has an election timeout (randomized between 150-300ms). If the timeout expires without hearing from a leader, the follower transitions to candidate, increments its term, votes for itself, and sends RequestVote RPCs to all other nodes.
The RequestVote RPC includes the candidate’s term, its last log index, and last log term. A node grants its vote if: (1) the candidate’s term is at least as high as the voter’s current term, (2) the voter hasn’t already voted in this term, and (3) the candidate’s log is at least as up-to-date as the voter’s log. The “up-to-date” check compares last log term first (a higher term wins), then last log index (an equal or higher index suffices when terms are equal).
If a candidate receives votes from a majority, it becomes leader and immediately sends heartbeat AppendEntries RPCs to all followers to establish authority. If it receives an AppendEntries from another leader with a term ≥ its own, it reverts to follower. If the election times out (split vote), the candidate increments its term and starts a new election. Randomized timeouts make split votes unlikely—typically one node times out first and wins before others start campaigning.
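The vote-granting rules above condense into a few lines. A Go sketch (struct and field names are illustrative, not from any particular Raft implementation):

```go
package main

import "fmt"

// NodeState is the state a Raft node consults when it receives a
// RequestVote RPC. In a real implementation, CurrentTerm and VotedFor
// are persisted to disk before replying.
type NodeState struct {
	CurrentTerm  int
	VotedFor     string // "" means no vote cast this term
	LastLogIndex int
	LastLogTerm  int
}

// logUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: compare last terms first, then indexes.
func (s *NodeState) logUpToDate(candLastIndex, candLastTerm int) bool {
	if candLastTerm != s.LastLogTerm {
		return candLastTerm > s.LastLogTerm
	}
	return candLastIndex >= s.LastLogIndex
}

// GrantVote applies the three checks from the text: term, prior vote,
// and the election restriction on log freshness.
func (s *NodeState) GrantVote(candID string, term, lastIndex, lastTerm int) bool {
	if term < s.CurrentTerm {
		return false // stale candidate
	}
	if term > s.CurrentTerm {
		s.CurrentTerm = term // newer term observed: reset our vote
		s.VotedFor = ""
	}
	if s.VotedFor != "" && s.VotedFor != candID {
		return false // already voted for someone else this term
	}
	if !s.logUpToDate(lastIndex, lastTerm) {
		return false // candidate's log is behind ours
	}
	s.VotedFor = candID
	return true
}

func main() {
	voter := &NodeState{CurrentTerm: 1, LastLogIndex: 5, LastLogTerm: 1}
	fmt.Println(voter.GrantVote("n1", 2, 5, 1)) // true: new term, log matches
	fmt.Println(voter.GrantVote("n2", 2, 9, 1)) // false: already voted this term
}
```

The one-vote-per-term rule is what makes two leaders in the same term impossible: two candidates cannot both assemble majorities from the same voters.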
Log Replication: When a client sends a write request, the leader appends the entry to its local log with the current term, then sends AppendEntries RPCs to followers in parallel. Each AppendEntries includes: the leader’s term, the index and term of the entry immediately preceding the new entries, the new entries themselves, and the leader’s commit index.
Followers perform a consistency check: does the entry at prevLogIndex have term prevLogTerm? If yes, the follower appends the new entries (deleting any conflicting entries) and responds success. If no, the follower responds failure. On failure, the leader decrements nextIndex for that follower and retries. This process walks backward through the log until finding the point of agreement, then sends all subsequent entries.
Once the leader receives success responses from a majority, it updates its commitIndex to the highest index replicated to a majority. On the next AppendEntries (or heartbeat), the leader includes the updated commitIndex. Followers then apply all entries up to commitIndex to their state machine.
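The follower-side consistency check can be sketched compactly. In this Go sketch (illustrative names; index 0 of the slice is a sentinel so the log is effectively 1-indexed), a rejected call models the leader backing up nextIndex and retrying:

```go
package main

import "fmt"

// Entry is a log entry tagged with the term in which it was created.
type Entry struct{ Term int }

// appendEntries is a sketch of the follower-side consistency check.
// Index 0 is a sentinel with term 0, so prevLogIndex=0 always matches.
func appendEntries(log []Entry, prevLogIndex, prevLogTerm int, entries []Entry) ([]Entry, bool) {
	// Reject unless we hold an entry at prevLogIndex with the same term.
	if prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm {
		return log, false
	}
	// Truncate any conflicting suffix, then append the leader's entries.
	log = append(log[:prevLogIndex+1], entries...)
	return log, true
}

func main() {
	// Follower log diverged: entry 6 has term 2 where the leader has term 3.
	log := []Entry{{0}, {1}, {1}, {2}, {2}, {2}, {2}}

	// Leader's first try: prevLogIndex=6, prevLogTerm=3 -> mismatch.
	log, ok := appendEntries(log, 6, 3, []Entry{{3}})
	fmt.Println(ok) // false: leader will decrement nextIndex and retry

	// Retry: prevLogIndex=5, prevLogTerm=2 -> match; overwrite entry 6.
	log, ok = appendEntries(log, 5, 2, []Entry{{3}, {3}})
	fmt.Println(ok, log[6].Term, log[7].Term) // true 3 3
}
```

This mirrors the F2 exchange in the replication diagram below: one rejection, one backed-up retry, and the follower’s log converges to the leader’s. (A production implementation avoids truncating entries that already match, but the sketch captures the invariant.)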
Safety Guarantees: Raft’s safety comes from three key restrictions. First, the election restriction (described above) ensures only nodes with up-to-date logs can become leader. Second, the leader append-only property means a leader never deletes or overwrites entries in its own log. Third, the commit restriction means a leader only commits entries from its current term directly; old entries are committed indirectly.
Consider a failure scenario: Leader L1 in term 1 replicates entry E to 2 of 5 nodes, then crashes. L2 is elected in term 2 with a log missing E (E reached only a minority, so L2 can still gather a majority of votes from nodes that never saw it). L2 appends entry F in term 2 and replicates it to a majority. Now E is lost—overwritten by F. This is safe because E was never committed (it didn’t reach a majority before L1 crashed). Clients that sent E would have timed out and retried, and the retry would be appended as a new entry by L2.
The commit restriction prevents a subtle bug (Figure 8 in the Raft paper): suppose L1 replicates entry E to 2 of 5 nodes in term 1, then crashes. L2, whose log lacks E, wins term 2 with votes from the other nodes that also lack E, appends its own entry F locally, and crashes before replicating F. L1 is re-elected in term 3 and resumes replicating E, bringing it to 3 of 5 nodes. If E were considered committed at this point simply because it reached a majority, a later crash of L1 could let L2 win again (F’s term 2 beats E’s term 1 in the up-to-date check) and overwrite the “committed” E with F. Raft prevents this by requiring L1 to commit an entry from its current term 3 first, which indirectly commits E and blocks L2 from ever winning another election.
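The commit rule itself is a counting loop with one extra condition. A Go sketch (illustrative names; matchIndex holds the highest replicated index per node, with the leader counting itself):

```go
package main

import "fmt"

// Entry is a log entry tagged with its term.
type Entry struct{ Term int }

// advanceCommitIndex is a sketch of the leader's commit rule: an index
// commits only if a majority of nodes hold it AND its entry is from the
// leader's current term. Earlier entries then commit indirectly, since
// commitIndex covers every lower index.
func advanceCommitIndex(log []Entry, matchIndex []int, commitIndex, currentTerm int) int {
	for n := commitIndex + 1; n < len(log); n++ {
		count := 0
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > len(matchIndex)/2 && log[n].Term == currentTerm {
			commitIndex = n
		}
	}
	return commitIndex
}

func main() {
	// Index 1 is from old term 1; index 2 is from the leader's term 2.
	log := []Entry{{0}, {1}, {2}}

	// Entry 1 sits on a majority, but it is from a previous term:
	// the leader may NOT commit it directly.
	fmt.Println(advanceCommitIndex(log, []int{1, 1, 0}, 0, 2)) // 0

	// Once the term-2 entry at index 2 reaches a majority, committing
	// it indirectly commits entry 1 as well.
	fmt.Println(advanceCommitIndex(log, []int{2, 2, 0}, 0, 2)) // 2
}
```

The `log[n].Term == currentTerm` guard is the entire commit restriction: without it, the first call above would return 1 and recreate the Figure 8 bug.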
Raft Leader Election with Term Progression
sequenceDiagram
participant N1 as Node 1<br/>(Follower→Candidate→Leader)
participant N2 as Node 2<br/>(Follower)
participant N3 as Node 3<br/>(Follower)
participant N4 as Node 4<br/>(Follower)
participant N5 as Node 5<br/>(Follower)
Note over N1,N5: Term 1: Normal operation with existing leader
Note over N1: Election timeout expires<br/>(150-300ms)
N1->>N1: Increment term to 2<br/>Transition to Candidate<br/>Vote for self
N1->>N2: RequestVote(term=2, lastLogIndex=5, lastLogTerm=1)
N1->>N3: RequestVote(term=2, lastLogIndex=5, lastLogTerm=1)
N1->>N4: RequestVote(term=2, lastLogIndex=5, lastLogTerm=1)
N1->>N5: RequestVote(term=2, lastLogIndex=5, lastLogTerm=1)
Note over N2: Check: term ≥ currentTerm?<br/>Not voted yet?<br/>Candidate log up-to-date?
N2->>N1: VoteGranted(term=2)
N3->>N1: VoteGranted(term=2)
N4->>N1: VoteGranted(term=2)
Note over N5: Rejects: already voted<br/>or log not up-to-date
N5->>N1: VoteRejected(term=2)
Note over N1: Received 4 votes (including self)<br/>Majority = 3 of 5 nodes<br/>Transition to Leader
N1->>N2: AppendEntries(term=2, heartbeat)
N1->>N3: AppendEntries(term=2, heartbeat)
N1->>N4: AppendEntries(term=2, heartbeat)
N1->>N5: AppendEntries(term=2, heartbeat)
Note over N1,N5: Term 2: N1 is now leader, sends heartbeats every 50-100ms
Raft leader election showing how a follower transitions to candidate after election timeout, requests votes from peers, and becomes leader upon receiving majority votes. The election restriction ensures only nodes with up-to-date logs can win. Randomized timeouts (150-300ms) prevent split votes by making it unlikely multiple nodes timeout simultaneously.
Raft Log Replication with Consistency Check
sequenceDiagram
participant Client
participant Leader as Leader<br/>(Term 3)
participant F1 as Follower 1<br/>(Log matches)
participant F2 as Follower 2<br/>(Log diverged)
Client->>Leader: Write: SET key=value
Note over Leader: Append entry to local log<br/>Index=7, Term=3
Leader->>Leader: Log: [1:T1, 2:T1, 3:T2, 4:T2, 5:T2, 6:T3, 7:T3]
Leader->>F1: AppendEntries(term=3, prevLogIndex=6,<br/>prevLogTerm=3, entries=[7:T3], commitIndex=5)
Leader->>F2: AppendEntries(term=3, prevLogIndex=6,<br/>prevLogTerm=3, entries=[7:T3], commitIndex=5)
Note over F1: Check: log[6].term == 3? ✓<br/>Append entry 7
F1->>F1: Log: [1:T1, 2:T1, 3:T2, 4:T2, 5:T2, 6:T3, 7:T3]
F1->>Leader: Success(matchIndex=7)
Note over F2: Check: log[6].term == 3? ✗<br/>Log: [1:T1, 2:T1, 3:T2, 4:T2, 5:T2, 6:T2]
F2->>Leader: Failure(matchIndex=5)
Note over Leader: Decrement nextIndex[F2] to 6<br/>Retry with earlier entries
Leader->>F2: AppendEntries(term=3, prevLogIndex=5,<br/>prevLogTerm=2, entries=[6:T3, 7:T3], commitIndex=5)
Note over F2: Check: log[5].term == 2? ✓<br/>Delete conflicting entry 6:T2<br/>Append entries 6:T3, 7:T3
F2->>F2: Log: [1:T1, 2:T1, 3:T2, 4:T2, 5:T2, 6:T3, 7:T3]
F2->>Leader: Success(matchIndex=7)
Note over Leader: Majority replicated (2 of 3)<br/>Commit entry 7<br/>Update commitIndex=7
Leader->>Client: Write acknowledged
Leader->>F1: AppendEntries(heartbeat, commitIndex=7)
Leader->>F2: AppendEntries(heartbeat, commitIndex=7)
Note over F1,F2: Apply entries 6 and 7 to state machine
Raft log replication showing the consistency check mechanism. When a follower’s log diverges (F2 has entry 6:T2 instead of 6:T3), it rejects the AppendEntries. The leader decrements nextIndex and retries with earlier entries, walking backward until finding the point of agreement. Once the follower’s log matches, it deletes conflicting entries and appends the leader’s entries, ensuring all nodes converge to the same log.
Performance Characteristics
Consensus protocols have predictable performance profiles driven by network round-trips and disk writes. Understanding these numbers helps you set realistic SLAs and capacity plan.
Latency: A successful write in Raft or Multi-Paxos requires one round-trip from client to leader, then one round-trip from leader to a majority of followers. In a local data center with sub-millisecond network latency, this yields 1-5ms write latency. The variance comes from disk fsync—nodes must persist log entries before acknowledging. SSDs with battery-backed write caches can fsync in 100-500μs; spinning disks take 5-10ms. etcd benchmarks show ~2ms median write latency on modern hardware with SSDs.
Cross-region deployments face higher latency. A 5-node cluster spanning US-East and US-West (70ms RTT) sees 70ms+ write latency whenever the leader must wait a full cross-country round-trip for acknowledgment from at least one remote node to reach a majority. This is why consensus clusters are typically deployed within a single region or availability zone, with cross-region replication handled at a higher layer.
Read latency depends on consistency requirements. Linearizable reads (reading the latest committed value) require the leader to confirm it’s still leader, adding a round-trip to a majority (1-5ms). Lease-based optimizations allow the leader to serve reads locally if it recently received heartbeat acknowledgments, reducing read latency to sub-millisecond. Stale reads from followers are instant but may return old data.
Throughput: Consensus throughput is limited by the leader’s ability to replicate entries. A single leader can typically handle 10,000-50,000 writes/second depending on entry size and network bandwidth. etcd achieves ~30,000 writes/sec with 256-byte values on a 3-node cluster with 10GbE networking. Larger entries reduce throughput proportionally—1KB values drop throughput to ~8,000 writes/sec due to network serialization.
Batching improves throughput significantly. Instead of sending one AppendEntries per write, leaders batch multiple entries into a single RPC. This amortizes network overhead and reduces the number of disk fsyncs. etcd’s batching can increase throughput by 3-5x, but adds latency (typically 10-50ms) as writes wait for a batch to fill.
Scalability: Consensus clusters don’t scale horizontally for writes—all writes go through the leader. Adding nodes improves fault tolerance (5 nodes tolerate 2 failures vs. 3 nodes tolerating 1) but doesn’t increase write throughput. In fact, larger clusters can reduce throughput because the leader must replicate to more followers.
Read scalability is better. Followers can serve stale reads without involving the leader, allowing read throughput to scale linearly with cluster size. For workloads with high read/write ratios (10:1 or higher), this is acceptable. For write-heavy workloads, you must shard data across multiple consensus groups (like Spanner’s approach with thousands of Paxos groups).
Failure Recovery: When a leader fails, the cluster is unavailable for writes until a new leader is elected. Election typically completes in 150-300ms (one election timeout). During this window, all writes fail or block. Reads may continue from followers if stale reads are acceptable. Once a new leader is elected, it must replicate at least one entry from its term before committing any previous entries, adding another round-trip (1-5ms).
Network partitions are more severe. If the leader is in the minority partition, it cannot commit new entries (no majority). The majority partition elects a new leader and continues. The old leader eventually steps down when it realizes it can’t reach a majority. Clients connected to the minority partition experience write failures until they reconnect to the majority partition.
Consensus Performance Profile: Latency and Throughput
graph TB
subgraph Local Cluster - Same AZ
LC_Client["Client"]
LC_Leader["Leader"]
LC_F1["Follower 1"]
LC_F2["Follower 2"]
LC_Disk1[("Disk<br/>fsync: 100-500μs")]
LC_Disk2[("Disk<br/>fsync: 100-500μs")]
end
subgraph Cross-Region Cluster
CR_Client["Client<br/><i>US-East</i>"]
CR_Leader["Leader<br/><i>US-East</i>"]
CR_F1["Follower<br/><i>US-East</i>"]
CR_F2["Follower<br/><i>US-West</i><br/>RTT: 70ms"]
end
LC_Client --"1. Write Request"--> LC_Leader
LC_Leader --"2. Persist (100-500μs)"--> LC_Disk1
LC_Leader --"3. Replicate<br/>(sub-1ms network)"--> LC_F1
LC_Leader --"3. Replicate<br/>(sub-1ms network)"--> LC_F2
LC_F1 --"4. Persist"--> LC_Disk2
LC_F1 --"5. ACK (sub-1ms)"--> LC_Leader
LC_Leader --"6. Commit & Respond<br/><b>Total: 1-5ms</b>"--> LC_Client
CR_Client --"1. Write Request"--> CR_Leader
CR_Leader --"2. Replicate (local)"--> CR_F1
CR_Leader --"3. Replicate<br/>(35ms one-way)"--> CR_F2
CR_F2 --"4. ACK (35ms one-way)"--> CR_Leader
CR_Leader --"5. Commit & Respond<br/><b>Total: 70ms+</b>"--> CR_Client
Note1["<b>Throughput Limits</b><br/>Single Leader: 10K-50K writes/sec<br/>Batching: 3-5x improvement<br/>Entry size: 256B = 30K/s, 1KB = 8K/s<br/><br/><b>Scalability</b><br/>Writes: No horizontal scaling<br/>Reads: Scale with follower count (stale reads)"]
Write-path latency for a local cluster versus a cross-region cluster. Locally, a write completes in 1-5ms, dominated by disk fsync and sub-millisecond network hops; across regions, the round-trip to the remote follower needed for a majority dominates. The note summarizes throughput limits and why adding nodes scales reads but not writes.
Trade-offs
Consensus algorithms make deliberate trade-offs between consistency, availability, and performance. Understanding these helps you choose the right tool and set appropriate expectations.
Strong Consistency vs. Availability: Consensus prioritizes consistency over availability. During a network partition, the minority partition becomes unavailable for writes (and linearizable reads). This is a direct consequence of the CAP theorem—you can’t have both consistency and availability during partitions. Systems like Cassandra or DynamoDB choose availability, allowing both partitions to accept writes and reconciling conflicts later. Consensus systems like etcd choose consistency, rejecting writes in the minority partition to prevent split-brain.
This trade-off is appropriate for coordination tasks (leader election, configuration management, distributed locking) where correctness is paramount. It’s less suitable for high-availability user-facing services where eventual consistency is acceptable.
Write Latency vs. Fault Tolerance: Larger clusters tolerate more failures but have higher write latency. A 3-node cluster (tolerating 1 failure) needs acknowledgment from 2 nodes. A 7-node cluster (tolerating 3 failures) needs acknowledgment from 4 nodes. If nodes are geographically distributed, waiting for 4 acknowledgments is slower than waiting for 2. This is why most production deployments use 3 or 5 nodes—7+ nodes are rare unless you need to tolerate 3+ simultaneous failures.
Understandability vs. Theoretical Elegance: Raft trades theoretical elegance for understandability. Paxos is more general (it can handle arbitrary proposal numbers, concurrent proposers, and gaps in the log), but this generality makes it harder to reason about. Raft’s explicit leader election and term-based log structure are easier to implement correctly. In practice, this matters: there are dozens of correct Raft implementations (etcd, Consul, the HashiCorp Raft library) but far fewer correct Paxos implementations.
Single Leader vs. Leaderless: Leader-based protocols (Raft, Multi-Paxos) have a single point of coordination, which simplifies the protocol but creates a throughput bottleneck. Leaderless protocols (basic Paxos, EPaxos) allow any node to propose values, potentially increasing throughput. However, leaderless protocols have higher message complexity (more round-trips) and are harder to implement. In practice, most systems use leader-based protocols because the throughput of a single leader (10,000-50,000 writes/sec) is sufficient for coordination tasks.
Batching vs. Latency: Batching writes improves throughput by amortizing network and disk overhead, but increases latency as writes wait for a batch to fill. Systems like etcd expose this as a tunable parameter. For latency-sensitive workloads, disable batching and accept lower throughput. For throughput-sensitive workloads, enable batching and accept higher tail latency.
When to Use (and When Not To)
Choose consensus algorithms when you need strong consistency guarantees and can tolerate temporary unavailability during failures. Here’s a decision framework:
Use Consensus When: (1) You need linearizable consistency—every read sees the most recent write, and operations appear to execute atomically. Examples: leader election (only one leader at a time), configuration management (all nodes see the same config), distributed locking (only one holder per lock). (2) Correctness is more important than availability. A brief outage during leader election is acceptable, but split-brain or data loss is not. (3) Write volume is moderate (< 10,000 writes/sec per consensus group). Consensus doesn’t scale horizontally for writes. (4) You can deploy an odd number of nodes (3, 5, or 7) in a low-latency network (same region or AZ).
Don’t Use Consensus When: (1) You need high write throughput (> 50,000 writes/sec). Consensus is limited by the leader’s replication capacity. Consider sharding across multiple consensus groups or using a different consistency model. (2) You need high availability during network partitions. Consensus sacrifices availability for consistency. Use eventual consistency (Dynamo-style systems) if availability is paramount. (3) You can’t tolerate 1-5ms write latency. Consensus requires a network round-trip and disk fsync. Use in-memory systems or weaker consistency if you need sub-millisecond writes. (4) Your nodes are geographically distributed with high latency (> 50ms RTT). Cross-region consensus has 100ms+ write latency. Use asynchronous replication or multi-region consensus groups.
Raft vs. Paxos: For new systems, choose Raft. It’s easier to understand, has better tooling (etcd, Consul, the HashiCorp Raft library), and performs similarly to Multi-Paxos. Only choose Paxos if you’re maintaining an existing Paxos-based system (Google Chubby, Spanner) or need Paxos-specific features (like EPaxos’s leaderless operation).
Alternatives to Consider: For eventual consistency with high availability, use Dynamo-style systems (Cassandra, Riak). For single-region strong consistency with higher throughput, use primary-backup replication with synchronous replication to one replica (like PostgreSQL synchronous replication). For cross-region coordination, use a consensus-based control plane (etcd) with eventual consistency for data (Cassandra, DynamoDB). For in-memory coordination, use ZooKeeper or etcd but tune for lower durability (fewer fsyncs) if you can tolerate data loss on total cluster failure.
Real-World Examples
Kubernetes + etcd: Kubernetes stores all cluster state (pods, services, deployments) in etcd, a distributed key-value store built on Raft. Every API server operation (creating a pod, updating a service) writes to etcd. The etcd cluster (typically 3 or 5 nodes) ensures that all API servers see consistent state. When a Kubernetes master node fails, the remaining API servers continue operating because etcd maintains quorum. Interesting detail: Kubernetes uses etcd’s watch API to get real-time notifications of state changes, enabling controllers to react immediately when resources are created or modified. This watch mechanism is built on Raft’s log—each watch is essentially a subscription to log entries starting from a specific index.
HashiCorp Consul: Consul uses Raft for service discovery and configuration management across data centers. Each data center runs an independent Raft cluster (3-5 servers) that stores the service catalog, health checks, and key-value configuration. When a service registers or a health check fails, the change is replicated via Raft to all servers in the data center. Consul’s interesting twist: it uses Raft for intra-datacenter consistency but gossip protocols (Serf) for inter-datacenter communication. This hybrid approach provides strong consistency within a data center (where low latency makes consensus feasible) and eventual consistency across data centers (where high latency makes consensus impractical).
Amazon Aurora: While Aurora’s main replication uses asynchronous quorum-based replication (not consensus), its control plane uses Paxos for critical metadata operations. When you create a database cluster, resize an instance, or perform a failover, these operations go through a Paxos-based coordination service to ensure all control plane components agree on the cluster’s state. This separation is instructive: Aurora uses consensus where correctness is critical (control plane) but uses faster, weaker consistency for high-throughput data replication (data plane). The control plane can tolerate 100ms latency for rare operations; the data plane needs sub-10ms latency for every write.
Interview Essentials
Mid-Level
At the mid-level, interviewers expect you to explain why consensus is necessary and describe Raft at a high level. You should articulate the problem: “Without consensus, distributed systems can have split-brain where two nodes both think they’re the leader, leading to data corruption.” Explain Raft’s three components: leader election (nodes vote for a leader), log replication (leader sends entries to followers), and safety (committed entries are never lost). Describe the quorum requirement: “A 5-node cluster needs 3 nodes to agree, so it tolerates 2 failures.” You should be able to sketch a timeline showing a leader replicating an entry to followers and committing it once a majority acknowledges. Common mistakes: confusing consensus with simple leader election (consensus is about agreeing on a sequence of operations, not just who’s leader), or thinking all nodes must agree (only a majority is needed).
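The quorum arithmetic behind "commit once a majority acknowledges" fits in a few lines. This is a simplified sketch of how a Raft leader advances its commit index, not a full implementation (it ignores the current-term restriction covered below):

```python
def majority(n):
    """Quorum size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def commit_index(match_index, cluster_size):
    """Highest log index replicated on a majority of nodes.
    match_index: highest replicated index per node (leader included)."""
    # Sort descending; the entry at position (quorum - 1) is held
    # by at least `quorum` nodes, so it is safe to commit.
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(cluster_size) - 1]

# 5-node cluster: leader at index 7, followers at 7, 5, 3, 2.
# Index 5 is on 3 of 5 nodes (a majority), so it is committed.
print(commit_index([7, 7, 5, 3, 2], 5))  # 5
```

The same function shows the fault-tolerance claim directly: `majority(5)` is 3, so a 5-node cluster keeps committing with 2 nodes down.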
Senior
Senior engineers must explain Raft’s safety properties and handle failure scenarios. You should describe the election restriction: “A candidate can only become leader if its log is at least as up-to-date as a majority of nodes, preventing nodes with stale logs from overwriting committed entries.” Explain the commit restriction: “A leader only commits entries from its current term directly; old entries are committed indirectly when a new entry is committed.” Walk through a network partition scenario: “If the leader is in the minority partition, it can’t commit new entries. The majority partition elects a new leader and continues. When the partition heals, the old leader steps down and replicates the new leader’s log.” Compare Raft and Paxos: “Raft is easier to understand because it has explicit leader election and term-based log structure. Paxos is more general but harder to implement correctly. In practice, Multi-Paxos performs similarly to Raft.” Discuss performance: “Write latency is 1-5ms in a local cluster, dominated by network round-trip and disk fsync. Throughput is 10,000-50,000 writes/sec, limited by the leader’s replication capacity.”
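The election restriction reduces to a two-field comparison, given in section 5.4.1 of the Raft paper: logs are compared first by the term of the last entry, then by log length. A minimal sketch of the voter's check:

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                voter_last_term, voter_last_index):
    """Raft's voting check (section 5.4.1): a voter grants its vote only
    if the candidate's log is at least as up-to-date as its own.
    Compare the last entries' terms first, then log length."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# A candidate whose last entry is from an older term is rejected,
# even if its log is longer -- length never outranks term.
print(candidate_log_is_up_to_date(2, 10, 3, 4))  # False
print(candidate_log_is_up_to_date(3, 4, 3, 4))   # True
```

Since a candidate needs votes from a majority, and every committed entry is on a majority of logs, at least one voter in any winning majority holds every committed entry, which is exactly why stale logs can never win an election.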
Staff+
Staff+ engineers must make technology selection decisions and design systems that use consensus correctly. You should discuss trade-offs: “Consensus provides strong consistency but sacrifices availability during partitions. For coordination tasks like leader election, this is the right trade-off. For user-facing data, eventual consistency might be better.” Explain how to scale beyond a single consensus group: “Shard data across multiple Raft groups, like Spanner does with thousands of Paxos groups. Each group handles a key range independently.” Discuss operational challenges: “Consensus clusters need odd numbers of nodes (3, 5, 7) in a low-latency network. Cross-region consensus has 100ms+ write latency, so use asynchronous replication for cross-region data and consensus only for control plane.” Describe how real systems use consensus: “Kubernetes uses etcd for cluster state, Consul uses Raft for service discovery, Aurora uses Paxos for control plane but quorum-based replication for data plane.” You should be able to design a system that uses consensus for coordination (leader election, configuration) but eventual consistency for high-throughput data, explaining why each layer needs different consistency guarantees.
Common Interview Questions
How does Raft ensure that committed entries are never lost? (Election restriction prevents nodes with stale logs from becoming leader)
What happens during a network partition? (Minority partition becomes unavailable, majority elects new leader)
Why do consensus clusters use odd numbers of nodes? (Adding an even node doesn’t improve fault tolerance: a 4-node cluster tolerates only 1 failure, same as a 3-node cluster, but its quorum grows from 2 to 3, adding coordination overhead)
How does Raft differ from Paxos? (Raft has explicit leader election and term-based log structure, making it easier to understand)
What’s the latency and throughput of consensus? (1-5ms latency, 10,000-50,000 writes/sec)
When should you use consensus vs. eventual consistency? (Consensus for coordination tasks where correctness is critical, eventual consistency for high-throughput user-facing data)
Red Flags to Avoid
Confusing consensus with simple leader election (consensus is about agreeing on a sequence of operations)
Thinking all nodes must agree (only a majority is needed)
Not understanding the quorum requirement (N/2 + 1 nodes must respond)
Claiming consensus provides high availability (it sacrifices availability for consistency during partitions)
Not knowing the performance characteristics (1-5ms latency, 10,000-50,000 writes/sec)
Suggesting consensus for high-throughput user-facing data (it doesn’t scale horizontally for writes)
Key Takeaways
Consensus ensures multiple nodes agree on a sequence of operations despite failures, requiring majority quorum (N/2 + 1) for any decision. A 5-node cluster tolerates 2 failures; 3-node tolerates 1.
Raft decomposes consensus into leader election, log replication, and safety properties. The election restriction (only nodes with up-to-date logs can become leader) and commit restriction (only commit entries from current term directly) prevent committed entries from being lost.
Performance: 1-5ms write latency in local clusters (network round-trip + disk fsync), 10,000-50,000 writes/sec throughput (limited by leader replication capacity). Cross-region consensus has 100ms+ latency.
Use consensus for coordination tasks (leader election, configuration, distributed locking) where correctness is paramount. Don’t use it for high-throughput user-facing data—it doesn’t scale horizontally for writes.
Real systems use consensus selectively: Kubernetes (etcd for cluster state), Consul (Raft for service discovery), Aurora (Paxos for control plane, quorum replication for data plane). Separate consistency requirements by layer.