Replication in System Design: Master-Slave & Master-Master
After this topic, you will be able to:
- Compare master-slave and master-master replication strategies
- Evaluate replication lag impact on consistency and availability
- Justify replication strategy choices based on read/write patterns
TL;DR
Replication creates multiple copies of data across different servers to improve availability, performance, and disaster recovery. The two primary patterns are master-slave (one writer, many readers) and master-master (multiple writers), each trading off consistency guarantees for different availability and performance characteristics. Understanding replication lag—the delay between writes and their propagation to replicas—is critical for choosing the right strategy.
Cheat Sheet: Master-Slave = simple, consistent, single point of write failure. Master-Master = complex, available, conflict resolution needed. Synchronous = consistent but slow. Asynchronous = fast but stale reads possible.
The Analogy
Think of replication like a restaurant chain. In master-slave replication, corporate headquarters (master) creates the menu and recipes, then distributes copies to all franchise locations (slaves). Customers can order from any location (read), but only headquarters can change the menu (write). In master-master replication, multiple regional offices can all modify the menu independently, but they need a system to resolve conflicts when two offices create different versions of the same dish. Synchronous replication is like waiting for every franchise to confirm they received the new menu before announcing it publicly—safe but slow. Asynchronous replication is like sending the update and moving on—some locations might serve yesterday’s menu for a few hours.
Why This Matters in Interviews
Replication questions appear in virtually every system design interview because they test your understanding of the fundamental availability-consistency trade-off. Interviewers want to see if you can articulate why Netflix uses asynchronous master-slave replication for its viewing history (eventual consistency is acceptable, read performance matters) versus why a banking system needs synchronous replication for account balances (strong consistency required). The ability to calculate replication lag impact on user experience, explain conflict resolution strategies, and justify your replication choice based on read/write ratios separates candidates who memorize patterns from those who understand distributed systems. Expect follow-up questions about what happens during network partitions, how to handle replica failures, and how replication interacts with caching and sharding strategies.
Core Concept
Replication is the practice of maintaining multiple copies of the same data on different servers to achieve higher availability, better read performance, and fault tolerance. When a single database server handles all requests, it becomes both a performance bottleneck and a single point of failure. Replication solves both problems by distributing the data across multiple nodes, allowing the system to continue operating even when individual servers fail and enabling parallel read operations to serve more users simultaneously.
The core challenge in replication is maintaining consistency across copies while minimizing latency. Every replication strategy makes explicit trade-offs between how quickly data propagates to replicas (affecting consistency), how many replicas must acknowledge writes (affecting availability), and whether multiple nodes can accept writes (affecting complexity). These trade-offs directly map to the CAP theorem—you cannot simultaneously guarantee perfect consistency, availability, and partition tolerance.
Replication differs from related concepts in important ways. Unlike sharding (covered in partitioning), which splits data across nodes, replication duplicates the entire dataset or specific portions. Unlike caching (covered in caching-strategies), which stores frequently accessed data temporarily, replication maintains persistent, authoritative copies. Understanding when to use each pattern—or combine them—is essential for building robust distributed systems.
Synchronous vs Asynchronous Replication Trade-offs
graph TB
subgraph Synchronous Replication
SC["Client"]
SM["Master"]
SR1["Replica 1"]
SR2["Replica 2"]
SC --"1. Write request"--> SM
SM --"2. Write to disk"--> SM
SM --"3. Send to replicas<br/>(WAIT for ACK)"--> SR1
SM --"3. Send to replicas<br/>(WAIT for ACK)"--> SR2
SR1 --"4. ACK"--> SM
SR2 --"4. ACK"--> SM
SM --"5. Confirm success<br/>(after all ACKs)"--> SC
SNote["✓ Strong consistency<br/>✓ Data on multiple nodes<br/>✗ High write latency<br/>✗ Reduced availability"]
end
subgraph Asynchronous Replication
AC["Client"]
AM["Master"]
AR1["Replica 1"]
AR2["Replica 2"]
AC --"1. Write request"--> AM
AM --"2. Write to disk"--> AM
AM --"3. Confirm success<br/>(immediately)"--> AC
AM -."4. Send to replicas<br/>(background, no wait)".-> AR1
AM -."4. Send to replicas<br/>(background, no wait)".-> AR2
ANote["✓ Low write latency<br/>✓ High availability<br/>✗ Replication lag<br/>✗ Potential data loss"]
end
Synchronous replication waits for replica acknowledgments before confirming writes to clients, guaranteeing strong consistency but increasing latency. Asynchronous replication confirms writes immediately, providing low latency but creating a window where data exists only on the master, risking data loss if the master fails.
GitHub’s Multi-Tier Replication Strategy
graph TB
subgraph Availability Zone 1
Master["Master DB<br/><i>Primary Write Node</i>"]
SemiSync1["Semi-Sync Replica<br/><i>AZ1 Backup</i>"]
end
subgraph Availability Zone 2
SemiSync2["Semi-Sync Replica<br/><i>AZ2 Backup</i>"]
AsyncRead1["Async Read Replica<br/><i>Read Traffic</i>"]
end
subgraph Availability Zone 3
AsyncRead2["Async Read Replica<br/><i>Read Traffic</i>"]
AsyncRead3["Async Read Replica<br/><i>Read Traffic</i>"]
end
Client_W["Write Client<br/><i>git push</i>"]
Client_R1["Read Client<br/><i>Permission Check</i>"]
Client_R2["Read Client<br/><i>Star Count</i>"]
Client_W --"1. Write request"--> Master
Master --"2. Semi-sync replication<br/>(WAIT for 1 ACK)"--> SemiSync1
Master --"2. Semi-sync replication<br/>(WAIT for 1 ACK)"--> SemiSync2
Master -."3. Async replication<br/>(no wait)".-> AsyncRead1
Master -."3. Async replication<br/>(no wait)".-> AsyncRead2
Master -."3. Async replication<br/>(no wait)".-> AsyncRead3
Client_R1 --"Strong consistency<br/>required"--> Master
Client_R2 --"Eventual consistency<br/>acceptable"--> AsyncRead2
FailoverNote["Failover: Semi-sync replica<br/>promoted to master in 30s<br/>(guaranteed data durability)"]
FailoverNote -.-> SemiSync1
GitHub uses a hybrid replication strategy: semi-synchronous replication to one replica per availability zone ensures durability (data survives zone failures), while asynchronous read replicas scale read traffic. Critical reads (permissions) go to the master for consistency, while non-critical reads (star counts) use replicas for performance.
How It Works
Replication operates through a leader-follower model where one or more nodes accept writes and propagate changes to other nodes. The fundamental flow involves four steps: (1) a client sends a write request to a master node, (2) the master applies the change to its local storage, (3) the master sends the change to replica nodes through a replication log or stream, and (4) replicas apply the change to their local copies. The critical design decisions revolve around when to acknowledge the write to the client and how to handle failures during propagation.
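The four-step flow above can be sketched with a toy append-only replication log. This is an illustration, not any particular database's API; all class and method names are ours.

```python
class Master:
    """Accepts writes, appends each change to an ordered replication log."""
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (position, key, value)

    def write(self, key, value):
        self.data[key] = value                        # step 2: apply locally
        self.log.append((len(self.log), key, value))  # step 3: record for replicas
        return len(self.log) - 1                      # log position for this write


class Replica:
    """Applies log entries in order; tracks how far it has caught up."""
    def __init__(self):
        self.data = {}
        self.applied_up_to = -1

    def catch_up(self, log):
        for pos, key, value in log[self.applied_up_to + 1:]:
            self.data[key] = value                    # step 4: apply change locally
            self.applied_up_to = pos


master = Master()
replica = Replica()
master.write("user:123", "active")
# Before replication runs, the replica is stale -- this gap IS replication lag:
assert "user:123" not in replica.data
replica.catch_up(master.log)
assert replica.data["user:123"] == "active"
```

The gap between the master's `write` and the replica's `catch_up` is exactly where the acknowledgment-timing decision discussed below (synchronous vs asynchronous) lives.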
In master-slave replication (also called primary-replica or leader-follower), a single master node handles all write operations while multiple slave nodes handle read operations. When a write arrives, the master updates its data and then propagates the change to slaves. Clients can read from any slave, distributing the read load across multiple machines. This pattern is simple to reason about because there’s only one source of truth for writes, eliminating write conflicts. However, the master becomes a single point of failure for writes—if it goes down, the system cannot accept new writes until a slave is promoted to master (covered in fail-over).
In master-master replication (also called multi-master or active-active), multiple nodes can accept both reads and writes. Each master maintains its own copy of the data and propagates changes to other masters. This provides higher write availability—if one master fails, clients can write to another. However, it introduces the possibility of write conflicts when two masters receive conflicting updates to the same data simultaneously. For example, if Master A sets user_status='active' and Master B sets user_status='inactive' for the same user at the same time, the system needs a conflict resolution strategy. Common approaches include last-write-wins (using timestamps), application-level merge functions, or exposing conflicts to users for manual resolution.
The timing of replication introduces another critical dimension: synchronous versus asynchronous replication. Synchronous replication waits for at least one replica to acknowledge the write before confirming success to the client. This guarantees that data exists on multiple nodes before the client proceeds, providing strong durability guarantees. The downside is increased write latency—every write must wait for network round-trips to replicas. If a replica is slow or unreachable, writes block. Asynchronous replication acknowledges writes immediately after the master persists them, without waiting for replicas. This provides low write latency but creates a window where data exists only on the master. If the master fails before replicating, those writes are lost. Semi-synchronous replication strikes a middle ground: wait for one replica to acknowledge (ensuring data exists on two nodes) but don’t wait for all replicas.
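The three acknowledgment policies differ only in how many replica ACKs the master waits for before confirming the write to the client. A minimal sketch of that decision (mode names are ours):

```python
def acknowledge_write(replica_acks, total_replicas, mode):
    """Decide whether a write may be confirmed to the client.

    replica_acks: number of replicas that have acknowledged so far.
    mode: 'async' confirms immediately, 'semi-sync' waits for one
    replica, 'sync' waits for all of them.
    """
    if mode == "async":
        return True                           # confirm now, replicate in background
    if mode == "semi-sync":
        return replica_acks >= 1              # durable on at least two nodes
    if mode == "sync":
        return replica_acks == total_replicas # strongest guarantee, highest latency
    raise ValueError(f"unknown mode: {mode}")


# With 0 of 2 replicas acknowledged, only async confirms:
assert acknowledge_write(0, 2, "async") is True
assert acknowledge_write(0, 2, "semi-sync") is False
assert acknowledge_write(1, 2, "semi-sync") is True
assert acknowledge_write(1, 2, "sync") is False
assert acknowledge_write(2, 2, "sync") is True
```

Note how semi-sync decouples durability (one extra copy) from the tail latency of the slowest replica, which is why it is the common middle ground in production.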
Master-Slave Replication: Write and Read Flow
graph LR
Client1["Client 1<br/><i>Write Request</i>"]
Client2["Client 2<br/><i>Read Request</i>"]
Client3["Client 3<br/><i>Read Request</i>"]
Master["Master DB<br/><i>Write Node</i>"]
Slave1["Slave 1<br/><i>Read Replica</i>"]
Slave2["Slave 2<br/><i>Read Replica</i>"]
Slave3["Slave 3<br/><i>Read Replica</i>"]
Client1 --"1. POST /api/update"--> Master
Master --"2. Write to disk"--> Master
Master --"3. Replicate changes<br/>(async)"--> Slave1
Master --"3. Replicate changes<br/>(async)"--> Slave2
Master --"3. Replicate changes<br/>(async)"--> Slave3
Client2 --"4. GET /api/data"--> Slave1
Client3 --"4. GET /api/data"--> Slave2
Master-slave replication separates write and read paths. All writes go to a single master node, which then asynchronously propagates changes to multiple slave replicas. Read requests are distributed across slaves to scale read capacity. Note that replication lag means slaves may serve slightly stale data.
Master-Master Replication with Conflict Resolution
sequenceDiagram
participant Client_A as Client A<br/>(Region US)
participant Master_US as Master US
participant Master_EU as Master EU
participant Client_B as Client B<br/>(Region EU)
Note over Master_US,Master_EU: Both masters can accept writes
Client_A->>Master_US: 1. UPDATE user SET status='active'<br/>WHERE id=123 (t=10:00:00.000)
Client_B->>Master_EU: 2. UPDATE user SET status='inactive'<br/>WHERE id=123 (t=10:00:00.050)
Master_US->>Master_US: 3. Apply locally
Master_EU->>Master_EU: 4. Apply locally
Master_US-->>Master_EU: 5. Replicate: status='active'<br/>(async, arrives at t=10:00:00.150)
Master_EU-->>Master_US: 6. Replicate: status='inactive'<br/>(async, arrives at t=10:00:00.200)
Note over Master_US,Master_EU: CONFLICT DETECTED!<br/>Same record, different values
Master_US->>Master_US: 7. Resolve: Last-Write-Wins<br/>timestamp comparison<br/>inactive wins (t=10:00:00.050 > t=10:00:00.000)
Master_EU->>Master_EU: 8. Resolve: Last-Write-Wins<br/>timestamp comparison<br/>inactive wins
Note over Master_US,Master_EU: Eventually consistent:<br/>Both masters converge to status='inactive'
Master-master replication allows multiple nodes to accept writes simultaneously, but conflicts arise when different masters modify the same data. This sequence shows how two masters receive conflicting updates and use last-write-wins (timestamp-based) conflict resolution to eventually converge to the same state.
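The last-write-wins step in the sequence above can be expressed as a small merge function over (value, timestamp) pairs. This is a sketch, not any particular database's resolver; the deterministic tie-break is an assumption we add so both masters converge even on equal timestamps.

```python
def last_write_wins(local, remote):
    """Resolve a conflict between two versions of the same record.

    Each version is a (value, timestamp) tuple; the later timestamp
    wins. Equal timestamps fall back to a deterministic comparison so
    every master converges to the same answer regardless of the order
    in which replicated writes arrive.
    """
    local_value, local_ts = local
    remote_value, remote_ts = remote
    if local_ts != remote_ts:
        return local if local_ts > remote_ts else remote
    return min(local, remote)  # deterministic tie-break


# The conflict from the diagram: US wrote 'active' at t=0.000,
# EU wrote 'inactive' at t=0.050. Both masters converge on 'inactive'.
us_view = last_write_wins(("active", 0.000), ("inactive", 0.050))
eu_view = last_write_wins(("inactive", 0.050), ("active", 0.000))
assert us_view == eu_view == ("inactive", 0.050)
```

Note the hidden cost visible even in this toy: the losing write ("active") is silently discarded, which is the data-loss risk the pitfalls section warns about.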
Key Principles
Principle: Replication Lag Determines Consistency Guarantees
Explanation: Replication lag is the time delay between when a write commits on the master and when it becomes visible on replicas. This lag directly impacts what users see. In asynchronous replication, a user might write data to the master, then immediately read from a replica that hasn't received the update yet, seeing stale data. This is called a 'read-your-writes' consistency violation. The lag depends on network latency, replica processing speed, and replication backlog size. Under normal conditions, lag might be milliseconds, but during network issues or heavy write load, it can grow to seconds or minutes.
Example: Instagram uses asynchronous master-slave replication for user posts. When you publish a photo, it writes to the master database. If you immediately refresh your profile by hitting a replica that's 100ms behind, you might not see your own post yet. Instagram handles this by routing reads for your own profile to the master for a few seconds after writes, accepting the higher latency for consistency where it matters most to user experience.
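The mitigation described here, pinning a user's reads to the master for a few seconds after they write, can be sketched as a small read router. The 5-second window and all names are illustrative assumptions; a real router would also consult replication positions.

```python
import time

class ReadRouter:
    """Routes reads to replicas, except for users who wrote recently."""
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self.last_write_at = {}   # user_id -> timestamp of that user's last write

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def choose_target(self, user_id):
        wrote_at = self.last_write_at.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "master"       # replicas may not have this user's write yet
        return "replica"          # safe to serve from a (possibly lagging) replica


# Simulate with a fake clock so the example is deterministic:
now = [0.0]
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])
router.record_write("alice")
assert router.choose_target("alice") == "master"   # inside the pin window
assert router.choose_target("bob") == "replica"    # never wrote, no pin needed
now[0] = 6.0
assert router.choose_target("alice") == "replica"  # window expired
```

The trade-off is explicit: every pinned read gives up replica scaling in exchange for read-your-writes consistency.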
Principle: Write Patterns Dictate Replication Topology
Explanation: The ratio of reads to writes and the geographic distribution of writes determine which replication pattern makes sense. Systems with 95%+ read traffic benefit enormously from master-slave replication—you can add read replicas to scale read capacity linearly. Systems with significant write traffic in multiple regions need master-master replication to avoid forcing all writes through a single geographic location, which would add unacceptable latency for distant users. However, master-master replication only makes sense when the application can tolerate or resolve conflicts.
Example: Netflix's viewing history service is 99% reads (checking what you've watched) and 1% writes (recording new views). They use master-slave replication with dozens of read replicas distributed globally. Writes go to a single master in one region, but reads happen from the nearest replica. In contrast, Google Docs uses master-master replication because users in different regions need to edit simultaneously, and the operational transform algorithm can merge conflicting edits intelligently.
Principle: Replication Provides Availability, Not Consistency
Explanation: A common misconception is that replication automatically provides both availability and consistency. In reality, replication provides availability—the system keeps running when nodes fail. Consistency depends on your replication strategy and how you handle reads. With asynchronous replication, you get high availability but eventual consistency. With synchronous replication, you get stronger consistency but reduced availability (writes fail if replicas are unreachable). This is the CAP theorem in action: during a network partition, you must choose between consistency (reject writes until the partition heals) or availability (accept writes that might conflict).
Example: Amazon's Dynamo, the design that inspired DynamoDB, uses asynchronous multi-master replication across availability zones. During a network partition between zones, both sides continue accepting writes (choosing availability). When the partition heals, conflicts are reconciled with version vectors or last-write-wins. This means you might briefly see different values depending on which zone you read from, but the system never stops accepting writes. Banking systems make the opposite choice—they use synchronous replication and reject transactions during partitions rather than risk conflicting account balances.
Deep Dive
Types / Variants
Beyond the basic master-slave and master-master patterns, several specialized replication variants exist for specific use cases. Chain replication arranges replicas in a linear chain where each node replicates to the next. Writes go to the head of the chain, propagate through each node sequentially, and reads come from the tail. This provides strong consistency for reads (the tail has all committed writes) while distributing replication load across nodes. However, write latency increases linearly with chain length. A published production example is the stream layer of Microsoft's Azure Storage, which writes each block through a chain of extent nodes.
Quorum-based replication (detailed in quorum) allows tunable consistency by requiring W nodes to acknowledge writes and R nodes to respond to reads, where W + R > N (total replicas). This enables fine-grained trade-offs between consistency and latency. Cassandra and DynamoDB both implement quorum replication, allowing applications to choose consistency levels per operation.
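The W + R > N rule can be checked mechanically: it holds exactly when every read quorum must overlap every write quorum in at least one node, so some node answering the read has seen the latest acknowledged write. A one-line sketch:

```python
def quorum_is_consistent(n, w, r):
    """True if every read quorum overlaps every write quorum.

    n: total replicas, w: replicas that must ACK a write,
    r: replicas that must answer a read. Any w-node set and any
    r-node set share a member exactly when w + r > n.
    """
    return w + r > n


# Cassandra/DynamoDB-style configurations with N=3 replicas:
assert quorum_is_consistent(3, 2, 2)      # QUORUM writes + QUORUM reads
assert quorum_is_consistent(3, 3, 1)      # ALL writes, ONE reads
assert not quorum_is_consistent(3, 1, 1)  # ONE + ONE: stale reads possible
```

Raising W makes writes slower but reads cheaper, and vice versa, which is the per-operation tunability the paragraph above describes.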
Statement-based replication logs SQL statements (INSERT, UPDATE, DELETE) and replays them on replicas, while row-based replication logs the actual data changes. Statement-based replication produces smaller logs but can cause inconsistencies if statements use non-deterministic functions like NOW() or RAND(). Row-based replication is deterministic but generates more network traffic. MySQL supports both modes. Logical replication sends high-level change descriptions (“user 123’s email changed to x@y.com”), while physical replication sends low-level disk block changes. Logical replication allows replicas to run different database versions or schemas, enabling zero-downtime migrations.
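The non-determinism problem with statement-based replication can be demonstrated in miniature. Here Python's `random` stands in for SQL's RAND()/NOW(); the dictionaries stand in for databases. This is a didactic sketch under those stated assumptions, not real replication code.

```python
import random

def apply_statement(db, environment_seed):
    """Statement-based replication replays the statement itself, so a
    non-deterministic function re-executes in each replica's own
    environment and can produce a different result."""
    rng = random.Random(environment_seed)   # each node's environment differs
    db["token"] = rng.random()              # like: UPDATE t SET token = RAND()


master, replica = {}, {}
apply_statement(master, environment_seed=1)
apply_statement(replica, environment_seed=2)   # replayed elsewhere: diverges
assert master["token"] != replica["token"]     # silent inconsistency

# Row-based replication ships the value the master actually wrote,
# not the statement, so the replica cannot diverge:
replica["token"] = master["token"]
assert master["token"] == replica["token"]
```

This is why MySQL defaults to row-based binary logging in recent versions, despite the larger log volume.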
Chain Replication Architecture
graph LR
Client_W["Write Client"]
Client_R["Read Client"]
Head["Head Node<br/><i>Accepts Writes</i>"]
Middle1["Middle Node 1<br/><i>Propagates</i>"]
Middle2["Middle Node 2<br/><i>Propagates</i>"]
Tail["Tail Node<br/><i>Serves Reads</i>"]
Client_W --"1. Write request"--> Head
Head --"2. Apply & forward"--> Middle1
Middle1 --"3. Apply & forward"--> Middle2
Middle2 --"4. Apply & forward"--> Tail
Tail --"5. Apply & ACK"--> Middle2
Middle2 --"6. ACK"--> Middle1
Middle1 --"7. ACK"--> Head
Head --"8. Confirm to client"--> Client_W
Client_R --"Read request<br/>(always consistent)"--> Tail
Note1["Write Latency:<br/>Increases with chain length<br/>(sequential propagation)"]
Note2["Read Consistency:<br/>Tail has all committed writes<br/>(strong consistency guarantee)"]
Note1 -.-> Head
Note2 -.-> Tail
Chain replication arranges nodes in a linear sequence where writes flow from head to tail, and reads come exclusively from the tail. This provides strong consistency for reads (tail has all committed data) while distributing replication work across nodes, but write latency grows with chain length.
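The write path in the diagram, apply locally, forward to the next node, then ACK back up the chain, can be sketched recursively. This is illustrative only; real chain replication also handles node failure and chain reconfiguration, which are omitted here.

```python
class ChainNode:
    """One node in a replication chain; holds a full copy of the data."""
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.data = {}

    def write(self, key, value):
        """Apply locally, forward down the chain, then ACK upward.

        The call only returns once the tail has applied the write,
        which is why write latency grows with chain length.
        """
        self.data[key] = value
        if self.successor is not None:
            self.successor.write(key, value)  # blocks until the tail applies
        return "ack"                          # ACK flows back toward the head


# Build head -> middle -> tail and confirm the tail sees every write:
tail = ChainNode("tail")
middle = ChainNode("middle", successor=tail)
head = ChainNode("head", successor=middle)
assert head.write("repo:42", "public") == "ack"
assert tail.data["repo:42"] == "public"   # reads from the tail are consistent
```

Because `write` only returns after the recursive call completes, an ACK at the head proves every node in the chain, including the read-serving tail, has the data.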
Trade-offs
Consistency vs Latency
Synchronous replication provides strong consistency—replicas always have the latest data—but increases write latency by 2-10x depending on network distance to replicas. Asynchronous replication provides low write latency (just the master’s commit time) but creates eventual consistency—replicas lag behind by milliseconds to seconds. For user-facing writes where immediate feedback matters (posting a comment, updating a profile), asynchronous replication’s lower latency improves user experience. For financial transactions or inventory updates where stale reads cause real problems, synchronous replication’s consistency guarantees are worth the latency cost. Semi-synchronous replication offers a middle ground: wait for one replica (ensuring durability) but not all (limiting latency impact).
Availability vs Complexity
Master-slave replication is operationally simple—one source of truth, no conflict resolution needed, straightforward failure modes. However, the master is a single point of failure for writes. Promoting a slave to master during failures requires coordination (covered in fail-over) and typically causes 10-60 seconds of write downtime. Master-master replication provides higher write availability—any master can accept writes—but introduces operational complexity. You need conflict resolution logic, monitoring for replication conflicts, and careful application design to minimize conflicts. The conflict resolution itself can cause data loss if not carefully designed. Choose master-slave unless you have a specific availability requirement that justifies the complexity.
Read Scaling vs Consistency
Adding read replicas in master-slave replication scales read capacity linearly—10 replicas can handle 10x the read traffic. However, more replicas mean more nodes that might be lagging behind the master, increasing the probability that a random read hits stale data. You can mitigate this by routing reads that need consistency to the master (sacrificing read scaling for those queries) or by implementing read-your-writes consistency (tracking which replica has your recent writes). The trade-off is between maximum read throughput (distribute reads randomly across all replicas) and consistency guarantees (route some reads to the master or specific replicas).
Common Pitfalls
Pitfall: Ignoring Replication Lag in Application Logic
Why it happens: Developers assume that data written to the database is immediately visible everywhere, not accounting for replication lag in asynchronous setups. This causes bugs where users don't see their own writes or where dependent operations read stale data.
How to avoid: Implement read-your-writes consistency by routing reads to the master for a short time window after writes (e.g., 5 seconds), or by tracking the replication position and only reading from replicas that have caught up to that position. For critical workflows, use synchronous replication or read from the master. Monitor replication lag as a key metric and alert when it exceeds acceptable thresholds (typically 1-5 seconds).
Pitfall: Underestimating Master-Master Conflict Rates
Why it happens: Teams assume conflicts are rare because they seem unlikely in testing, but conflict rates scale with write volume and the number of masters. Even a 0.1% conflict rate becomes thousands of conflicts per day at scale.
How to avoid: Design your data model to minimize conflicts—use append-only logs instead of updates, partition data so different masters own different subsets, or use CRDTs (Conflict-free Replicated Data Types) that merge automatically. Implement comprehensive conflict monitoring and resolution from day one, not as an afterthought. Test conflict resolution logic under realistic load with multiple masters writing simultaneously.
Pitfall: Cascading Failures from Slow Replicas
Why it happens: In synchronous replication, a single slow or failing replica can block all writes, causing the entire system to grind to a halt. In asynchronous replication, a slow replica can cause replication lag to grow unboundedly, eventually running out of disk space for replication logs.
How to avoid: Use semi-synchronous replication with a quorum (wait for N out of M replicas) rather than waiting for all replicas. Implement timeouts for replication acknowledgments and degrade to asynchronous mode if replicas are slow. Monitor replica lag and automatically remove replicas that fall too far behind, requiring manual intervention to re-sync them. Set up alerts for replication lag exceeding thresholds before it becomes critical.
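One of the mitigations above is CRDTs: data types whose replicas merge automatically with no conflicts to resolve. A minimal sketch of the simplest one, a grow-only counter (G-counter). Each master increments only its own slot, and merging takes the element-wise maximum, so any merge order converges.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}          # node_id -> increments seen from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        """Element-wise max: commutative, associative, and idempotent,
        so replicas converge regardless of merge order or repetition."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


# Two masters count events independently, then sync in either order:
us, eu = GCounter("us"), GCounter("eu")
us.increment(3)
eu.increment(5)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 8   # both converge with no conflict
```

The restriction is the point: because the counter can only grow and merges can only take maxima, there is no pair of writes that can conflict, which is what makes the pattern safe for master-master topologies.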
Math & Calculations
Formula
Replication Lag Impact on Stale Read Probability
P(stale_read) ≈ (replication_lag × write_rate) / replica_count
Strictly, the right-hand side counts the writes still in flight (committed on the master but not yet applied) per replica at any instant; reads that touch those recently written rows are the ones at risk of returning stale data.
Variables
Replication Lag
Average time (seconds) between master write and replica visibility
Write Rate
Writes per second to the master
Replica Count
Number of read replicas
Worked Example
Scenario: E-commerce product inventory system with 1,000 writes/second, 100ms average replication lag, 5 read replicas.
in_flight_per_replica = (0.1s × 1,000 writes/s) / 5 replicas = 100 / 5 = 20 writes
At any moment, each replica has ~20 writes that have committed on the master but not yet arrived. If reads are spread uniformly over the roughly 1,000 rows written in the last second, about 20 of every 1,000 reads of recently written data (2%) hit a row that is still replicating.
Impact: With 10,000 reads/second, that can mean up to 200 stale reads/second. For inventory checks, this means users might see “in stock” for items that just sold out. The business must decide if 2% stale reads are acceptable or if they need to:
- Reduce replication lag (faster network, more powerful replicas): 50ms lag → 1% stale reads
- Add more replicas (distribute in-flight writes): 10 replicas → 1% stale reads
- Route inventory reads to master (sacrifice read scaling for consistency)
- Use synchronous replication for inventory writes (sacrifice write latency for consistency)
This calculation helps quantify the consistency-performance trade-off and justify infrastructure investments.
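The worked example above, packaged as a reusable calculation. The formula and numbers come straight from the example; the function name is ours.

```python
def in_flight_writes_per_replica(lag_s, writes_per_s, replica_count):
    """Writes committed on the master but not yet applied per replica,
    per the formula: (replication_lag x write_rate) / replica_count."""
    return (lag_s * writes_per_s) / replica_count


# The e-commerce scenario: 100 ms lag, 1,000 writes/s, 5 replicas.
baseline = in_flight_writes_per_replica(0.1, 1000, 5)
assert baseline == 20.0   # ~2% of reads of recently written rows are stale

# Re-running each mitigation from the list above through the formula:
assert in_flight_writes_per_replica(0.05, 1000, 5) == 10.0   # halve the lag
assert in_flight_writes_per_replica(0.1, 1000, 10) == 10.0   # double the replicas
```

Either mitigation halves the in-flight count, matching the 2% to 1% improvements quoted in the option list, and lets you compare their infrastructure costs on equal footing.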
Real-World Examples
Company: Twitter
System: Timeline Service
Usage: Twitter uses master-slave replication with asynchronous replication for tweet storage and timelines. Tweets are written to a master database and replicated to dozens of read replicas distributed globally. The replication lag is typically under 100ms but can spike to seconds during high-traffic events. This is acceptable because users don't expect to see their tweets instantly in their own timeline—the perceived latency of refreshing the app masks the replication lag. However, Twitter routes reads for the tweet author's own profile to the master for 5 seconds after posting to ensure read-your-writes consistency. This hybrid approach balances read scalability (replicas handle 99% of reads) with user experience (authors see their own tweets immediately).
Company: Uber
System: Schemaless (Distributed Datastore)
Usage: Uber's Schemaless uses master-master replication across multiple data centers for trip data and driver locations. Each data center runs a master that accepts writes for its region, minimizing write latency for users in that geography. Changes replicate asynchronously to other data centers. Conflicts are rare because trip data is naturally partitioned by trip_id, and different regions rarely write to the same trip simultaneously. When conflicts occur (e.g., a driver crosses regional boundaries), Schemaless uses last-write-wins based on timestamps. The system tolerates brief inconsistencies—if a driver's location is slightly stale in a distant data center, it doesn't impact operations because drivers only interact with their local region. This design provides low-latency writes globally while accepting eventual consistency across regions.
Company: GitHub
System: MySQL Database Infrastructure
Usage: GitHub uses master-slave replication with semi-synchronous replication for critical data like repository metadata and user accounts. Writes must be acknowledged by at least one replica in a different availability zone before confirming to the client, ensuring data survives single-zone failures. Read replicas use asynchronous replication and are distributed across multiple zones. For operations requiring strong consistency (like checking repository permissions before allowing a push), GitHub reads from the master. For operations tolerating eventual consistency (like displaying repository star counts), reads come from replicas. During a master failure, GitHub's orchestration system promotes a semi-synchronous replica to master within 30 seconds, minimizing write downtime. This architecture balances durability (semi-synchronous replication), read performance (many async replicas), and availability (fast failover).
Interview Expectations
Mid-Level
Mid-level candidates should explain the difference between master-slave and master-master replication, describe synchronous vs asynchronous replication, and identify when each makes sense. You should be able to draw a simple replication architecture and explain how writes propagate to replicas. Expect questions like: ‘How would you replicate a user database for a social network?’ Answer by choosing master-slave replication, explaining that writes go to the master and reads come from replicas, and noting that replication lag might cause users to not see their own posts immediately. Mention monitoring replication lag as a key metric.
Senior
Senior candidates must articulate the trade-offs between replication strategies and justify choices based on system requirements. You should calculate replication lag impact, explain conflict resolution strategies for master-master replication, and discuss failure scenarios. Expect questions like: ‘Design replication for a global e-commerce inventory system with 10,000 writes/second.’ Answer by analyzing read/write ratios, calculating acceptable replication lag based on business requirements (stale inventory reads cause overselling), choosing between synchronous replication for consistency vs asynchronous for performance, and explaining how to handle replica failures. Discuss monitoring strategies and operational complexity.
Staff+
Staff+ candidates must demonstrate deep understanding of replication internals, cross-regional replication strategies, and how replication interacts with other system components. You should explain replication log formats (statement-based vs row-based), discuss consensus protocols for master election, and design hybrid replication strategies. Expect questions like: ‘Design a globally distributed database with strong consistency for financial transactions and eventual consistency for analytics.’ Answer by proposing a multi-tier replication strategy: synchronous replication within a region for transactions, asynchronous cross-region replication for disaster recovery, and separate read replicas with longer lag for analytics. Discuss how to handle network partitions, implement read-your-writes consistency, and migrate between replication strategies with zero downtime. Reference specific technologies (MySQL Group Replication, PostgreSQL logical replication, Kafka for change data capture) and explain when to build custom replication vs use existing solutions.
Common Interview Questions
How do you handle replication lag in a system where users need to see their own writes immediately?
What happens if the master fails in master-slave replication? How do you promote a slave?
How do you resolve conflicts in master-master replication?
When would you choose synchronous over asynchronous replication?
How does replication interact with caching? Can you cache data from replicas?
How do you replicate data across geographic regions with high network latency?
What metrics would you monitor to ensure replication is healthy?
Red Flags to Avoid
Claiming replication provides both perfect consistency and availability (violates CAP theorem)
Not mentioning replication lag or assuming replicas are always up-to-date
Choosing master-master replication without explaining conflict resolution strategy
Ignoring failure scenarios (what happens when master or replica fails?)
Not considering network partitions between master and replicas
Proposing synchronous replication without acknowledging latency impact
Confusing replication with sharding or caching
Key Takeaways
Replication creates multiple copies of data for availability and performance, but introduces consistency challenges through replication lag—the delay between master writes and replica visibility.
Master-slave replication (one writer, many readers) is simple and consistent but has a single point of write failure. Master-master replication (multiple writers) provides higher write availability but requires conflict resolution.
Synchronous replication waits for replicas before acknowledging writes (strong consistency, higher latency). Asynchronous replication acknowledges immediately (low latency, eventual consistency). Choose based on whether your application can tolerate stale reads.
A useful rule of thumb: in-flight (not-yet-replicated) writes per replica = (lag × write_rate) / replica_count. Use this to quantify the consistency-performance trade-off and justify infrastructure decisions.
Design for replication lag from day one: implement read-your-writes consistency, monitor lag as a key metric, and have a plan for handling slow or failed replicas before they cause cascading failures.
Related Topics
Prerequisites
CAP Theorem - Understanding the fundamental trade-offs between consistency, availability, and partition tolerance
Database Fundamentals - Basic database concepts before diving into replication
Related
Fail-over - How to promote replicas to master when failures occur
Quorum - Advanced replication strategy with tunable consistency
Consistency Patterns - Detailed consistency models and guarantees
Next Steps
Sharding - Combining replication with data partitioning for horizontal scaling
Caching Strategies - How caching interacts with replicated data