Load Balancing Algorithms: Round Robin to Least Connections

Intermediate · 14 min read · Updated 2026-02-11

After reading this topic, you will be able to:

  • Compare trade-offs of round-robin, least connections, weighted algorithms, and consistent hashing
  • Evaluate which algorithm to choose based on workload characteristics (stateful vs stateless, long vs short connections)
  • Analyze how session affinity (sticky sessions) impacts algorithm choice and system scalability
  • Apply algorithm selection criteria to real-world scenarios

TL;DR

Load balancing algorithms determine how traffic is distributed across backend servers. Static algorithms like round-robin and IP hash use predetermined rules, while dynamic algorithms like least connections and least response time adapt to real-time server state. The choice depends on workload characteristics: stateless services work well with simple round-robin, while stateful applications may need session affinity or consistent hashing to maintain user context.

Cheat Sheet: Round-robin for stateless apps with uniform requests | Least connections for long-lived connections (WebSockets, streaming) | Consistent hashing for distributed caches | Weighted algorithms when servers have different capacities | IP hash or cookies for session affinity when state can’t be externalized.

Background

In the early days of the web, a single server handled all requests until it became overloaded. The first load balancers in the late 1990s used simple round-robin: distribute requests sequentially across servers. This worked for stateless HTTP requests but failed when applications maintained session state or when servers had different capacities. As systems scaled, engineers at companies like Yahoo and Google developed sophisticated algorithms that considered server load, connection count, and response times.

The fundamental problem is resource allocation under uncertainty. You don’t know how long a request will take or how much CPU it will consume when it arrives. Static algorithms make no attempt to measure actual load—they follow predetermined rules. Dynamic algorithms monitor server metrics and adapt distribution accordingly. The trade-off is complexity versus responsiveness: static algorithms are simple and predictable but can create imbalances, while dynamic algorithms distribute load more evenly but require health monitoring infrastructure and introduce decision-making latency.

Modern distributed systems added another dimension: data locality. When Netflix built its CDN, they needed requests for the same video to hit the same cache servers to maximize hit rates. This drove adoption of consistent hashing, which maps requests to servers based on content identifiers rather than just distributing load evenly. Today’s load balancers often combine multiple algorithms—using consistent hashing for cache affinity while falling back to least connections when the primary server is unhealthy.

Architecture

Load balancing algorithms operate within the load balancer’s request routing engine, which sits between clients and backend servers. The architecture has three key components: the algorithm selector (chooses which algorithm to apply based on configuration), the server pool manager (maintains the list of healthy backends with their current state), and the decision engine (executes the algorithm to select a target server).

Static algorithms maintain minimal state—just the server list and perhaps a counter for round-robin’s current position. The decision engine simply applies the algorithm’s rule: increment the counter, hash the client IP, or pick randomly. Dynamic algorithms require a metrics collector that continuously gathers data from backends: active connection counts, response times, CPU utilization, or custom health scores. This data flows into the decision engine, which evaluates each server’s suitability before selecting one.

Session affinity adds a persistence layer that stores client-to-server mappings. Cookie-based affinity embeds a server identifier in the HTTP response, which the client returns on subsequent requests. IP-based affinity hashes the client’s IP address to consistently select the same server. The persistence layer must handle server failures gracefully—when a backend dies, its sticky sessions must be redistributed, potentially causing session loss for affected users.

The algorithm choice is typically configured per virtual server or listener. A single load balancer might use round-robin for stateless API traffic on port 80, least connections for WebSocket traffic on port 8080, and consistent hashing for cache requests on port 6379. This flexibility allows operators to optimize each workload independently.

Internals

Round-Robin maintains a circular list of servers and an index pointer. Each request increments the pointer modulo the server count. Implementation is trivial—O(1) time and space—but naive round-robin ignores server capacity differences and current load. If Server A has 32 cores and Server B has 4 cores, they still receive equal traffic. Weighted round-robin solves this by assigning each server a weight (typically proportional to capacity) and selecting servers proportionally. The algorithm maintains a current weight counter that decrements with each selection; when it reaches zero, it moves to the next server and resets the counter to that server’s weight.
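The decrementing-counter variant described above fits in a few lines of Python. This is a minimal illustration, not any particular load balancer's implementation:

```python
class WeightedRoundRobin:
    """Burst-style weighted round-robin: each server is picked
    `weight` times in a row before the pointer advances."""

    def __init__(self, servers):
        # servers: list of (name, weight) pairs, weights >= 1
        self.servers = servers
        self.index = 0                      # circular pointer
        self.remaining = servers[0][1]      # selections left for current server

    def select(self):
        if self.remaining == 0:             # counter exhausted:
            self.index = (self.index + 1) % len(self.servers)
            self.remaining = self.servers[self.index][1]  # reset to next weight
        self.remaining -= 1
        return self.servers[self.index][0]

wrr = WeightedRoundRobin([("A", 4), ("B", 4), ("C", 1)])
[wrr.select() for _ in range(9)]   # yields A,A,A,A,B,B,B,B,C (a 4:4:1 ratio)
```

Note that this variant sends each server its quota in a burst. nginx's smooth weighted round-robin instead interleaves selections (A, B, A, B, C, ...) so no server is hammered with consecutive requests.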

Least Connections tracks active connection counts per server and selects the one with the fewest. The load balancer increments a counter when opening a connection and decrements it when closing. For HTTP/1.1 keep-alive connections, “active” means the TCP connection is open, even if no request is currently in flight. This works well for long-lived connections like database queries or streaming, where connection count correlates with load. The algorithm requires O(n) time to scan all servers unless you maintain a min-heap, which reduces selection to O(log n) but adds complexity. Weighted least connections divides active connections by server weight before comparing.
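Once the counters are tracked, weighted least connections reduces to a single comparison. A sketch of the O(n) scan (the pool contents are invented for illustration):

```python
def pick_least_connections(pool):
    """pool: dict mapping server name -> (active_connections, weight).
    Selects the server with the lowest connections-per-weight ratio;
    with all weights equal this is plain least connections."""
    return min(pool, key=lambda s: pool[s][0] / pool[s][1])

pick_least_connections({"A": (5, 1), "B": (12, 1), "C": (3, 1)})   # "C"
pick_least_connections({"A": (8, 4), "B": (3, 1)})                 # "A": 8/4 < 3/1
```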

Least Response Time extends least connections by measuring how long each server takes to respond. The load balancer timestamps each request and calculates a moving average of response times. It then selects the server with the lowest (response_time × active_connections) product. This adapts to servers that are slow due to garbage collection pauses, disk I/O, or downstream dependencies. The challenge is avoiding oscillation: if everyone routes to the fastest server, it becomes slow, causing traffic to shift elsewhere. Implementations typically use exponential moving averages with decay factors around 0.9 to smooth out transient spikes.
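A sketch of the scoring rule with an exponential moving average. The class shape and the +1 on the connection count are illustrative choices, not a standard API:

```python
class LeastResponseTime:
    """Routes to the server minimizing EMA(response_time) * active_connections."""

    def __init__(self, servers, alpha=0.9):
        self.alpha = alpha                       # decay factor: weight on history
        self.ema = {s: 0.0 for s in servers}     # smoothed response time
        self.active = {s: 0 for s in servers}    # open connections

    def record(self, server, response_time):
        # one slow request nudges the average; it does not dominate it
        self.ema[server] = self.alpha * self.ema[server] + (1 - self.alpha) * response_time

    def select(self):
        # +1 so idle servers don't all score an indistinguishable zero
        return min(self.ema, key=lambda s: self.ema[s] * (self.active[s] + 1))
```

The decay factor is the hysteresis knob: raising alpha slows the reaction to both genuine degradation and transient spikes, which is the main lever for damping the oscillation described above.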

IP Hash computes a hash of the client’s IP address and maps it to a server using modulo arithmetic: server_index = hash(client_ip) % server_count. This provides natural session affinity—the same client always hits the same server—without cookies or state. The fatal flaw is instability: adding or removing a server changes the modulo divisor, remapping most clients to different servers and invalidating their sessions. If you have 5 servers and add a 6th, 5/6 of clients get remapped.
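The remapping cost is easy to demonstrate empirically (the IP addresses here are synthetic):

```python
import hashlib

def server_for(client_ip, server_count):
    # md5 rather than Python's built-in hash(), which is salted per process
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return int(digest, 16) % server_count

clients = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
before = [server_for(ip, 5) for ip in clients]   # five servers
after = [server_for(ip, 6) for ip in clients]    # a sixth is added
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(clients):.0%} remapped")    # roughly 83%, i.e. ~5/6
```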

Consistent Hashing solves IP hash’s instability problem using a hash ring. Each server is hashed to multiple points on a 0-2^32 ring (typically 100-500 virtual nodes per physical server). Client requests are hashed to a point on the ring, then routed to the next server clockwise. When a server is added or removed, only 1/n of keys are remapped (where n is the server count), and they move to the adjacent server. This is critical for distributed caches like Memcached or Redis, where cache invalidation is expensive. The virtual node technique ensures even distribution—without it, servers might cluster on one part of the ring, creating imbalance.
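A minimal ring with virtual nodes, using MD5 for point placement as ketama-style rings do. The class name and default vnode count are illustrative:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=150):
        # each server contributes `vnodes` points on the 0..2^32 ring
        self.points = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def get(self, key):
        # binary search for the next point clockwise, wrapping past 2^32
        i = bisect.bisect(self.points, (self._hash(key),))
        return self.points[i % len(self.points)][1]
```

Adding a fourth server to a three-server ring remaps only the keys whose clockwise walk now hits one of the newcomer's virtual nodes first (roughly a quarter of them), and every remapped key lands on the new server rather than shuffling between old ones.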

Random Selection simply picks a server uniformly at random. This sounds naive but has theoretical appeal: with enough requests, the law of large numbers ensures even distribution. It requires no state and is trivially parallelizable—multiple load balancer instances can make independent decisions without coordination. Power of Two Choices improves on pure random by selecting two servers at random and choosing the one with fewer connections. Research shows this achieves near-optimal load distribution with minimal overhead.
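Power of Two Choices in full (the connection counts are invented):

```python
import random

def power_of_two_choices(connections):
    """Sample two distinct servers uniformly; route to the less loaded.
    connections: dict mapping server name -> active connection count."""
    a, b = random.sample(list(connections), 2)
    return a if connections[a] <= connections[b] else b

conns = {"A": 1, "B": 50, "C": 50}
picks = [power_of_two_choices(conns) for _ in range(300)]
# "A" wins every pairing it appears in (about 2/3 of draws)
```

In a real balancer the chosen server's count would be incremented before the next decision; with static counts, as here, the lightly loaded server simply wins every pairing it appears in.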

Round-Robin vs Weighted Round-Robin Distribution

graph LR
    subgraph Round-Robin
        RR_LB["Load Balancer<br/><i>Counter: 0→1→2→0</i>"]
        RR_S1["Server A<br/><i>32 cores</i>"]
        RR_S2["Server B<br/><i>32 cores</i>"]
        RR_S3["Server C<br/><i>4 cores</i>"]
        RR_LB --"Request 1"--> RR_S1
        RR_LB --"Request 2"--> RR_S2
        RR_LB --"Request 3"--> RR_S3
        RR_LB --"Request 4"--> RR_S1
    end
    
    subgraph Weighted Round-Robin
        WRR_LB["Load Balancer<br/><i>Weights: A=4, B=4, C=1</i>"]
        WRR_S1["Server A<br/><i>32 cores, weight=4</i>"]
        WRR_S2["Server B<br/><i>32 cores, weight=4</i>"]
        WRR_S3["Server C<br/><i>4 cores, weight=1</i>"]
        WRR_LB --"Requests 1-4"--> WRR_S1
        WRR_LB --"Requests 5-8"--> WRR_S2
        WRR_LB --"Request 9"--> WRR_S3
    end

Round-robin distributes requests equally regardless of server capacity, while weighted round-robin assigns requests proportionally to server weights. With equal weights, Server C (4 cores) receives the same traffic as Server A (32 cores), causing overload. Weighted distribution ensures Server C receives only 1/9 of traffic, matching its capacity.

Least Connections Algorithm Decision Flow

sequenceDiagram
    participant Client
    participant LB as Load Balancer<br/>Least Connections
    participant S1 as Server A<br/>5 active connections
    participant S2 as Server B<br/>12 active connections
    participant S3 as Server C<br/>3 active connections
    
    Client->>LB: 1. New request arrives
    Note over LB: 2. Query connection counts
    LB->>S1: Get active connections
    S1-->>LB: 5 connections
    LB->>S2: Get active connections
    S2-->>LB: 12 connections
    LB->>S3: Get active connections
    S3-->>LB: 3 connections
    Note over LB: 3. Select minimum (Server C)
    LB->>S3: 4. Route request
    Note over S3: Connection count: 3 → 4
    S3-->>Client: 5. Response
    Note over S3: Connection count: 4 → 3

The least connections algorithm compares the active connection counts of the backend servers and routes the request to the one with the fewest. The load balancer increments a server’s counter when it opens a connection and decrements it when the connection closes, so servers with the most spare capacity (fewest active connections) receive new requests first. In practice these counters live in the load balancer itself; the queries in the diagram illustrate the comparison step rather than actual network round trips.

Consistent Hashing Ring with Virtual Nodes

graph TB
    subgraph Ring_SG["Hash Ring: 0 to 2^32"]
        Ring["<b>Consistent Hash Ring</b>"]
        
        S1_V1["S1-v1<br/><i>hash: 100</i>"]
        S1_V2["S1-v2<br/><i>hash: 800</i>"]
        S1_V3["S1-v3<br/><i>hash: 2400</i>"]
        
        S2_V1["S2-v1<br/><i>hash: 300</i>"]
        S2_V2["S2-v2<br/><i>hash: 1200</i>"]
        S2_V3["S2-v3<br/><i>hash: 3000</i>"]
        
        S3_V1["S3-v1<br/><i>hash: 600</i>"]
        S3_V2["S3-v2<br/><i>hash: 1800</i>"]
        S3_V3["S3-v3<br/><i>hash: 3500</i>"]
        
        K1["Key: user_123<br/><i>hash: 250</i>"]
        K2["Key: video_456<br/><i>hash: 1500</i>"]
        K3["Key: cache_789<br/><i>hash: 2800</i>"]
    end
    
    K1 -."Next clockwise: S2-v1".-> S2_V1
    K2 -."Next clockwise: S3-v2".-> S3_V2
    K3 -."Next clockwise: S2-v3".-> S2_V3
    
    S2_V1 --> Server2["Physical Server 2"]
    S3_V2 --> Server3["Physical Server 3"]
    S2_V3 --> Server2
    
    Note1["<b>Adding Server 4:</b><br/>Only keys between new<br/>virtual nodes remap<br/>(~1/4 of total keys)"]
    Note2["<b>Removing Server 1:</b><br/>Its keys move to next<br/>clockwise servers<br/>(S2 and S3)"]

Consistent hashing maps both servers and keys to positions on a hash ring (0 to 2^32). Each physical server has multiple virtual nodes distributed around the ring. Keys are routed to the next server clockwise. When a server is added or removed, only the keys between affected virtual nodes are remapped (~1/n of the total), unlike modulo-based IP hash, where adding a server remaps most keys (5/6 in the earlier five-server example).

Session Affinity & Sticky Sessions

Session affinity (sticky sessions) routes all requests from a client to the same backend server, preserving application state stored in server memory. This is necessary when applications store user sessions, shopping carts, or authentication tokens locally rather than in a shared data store. Without affinity, each request might hit a different server, forcing the user to re-authenticate or losing their cart contents.

Cookie-based affinity is the most common implementation. When the load balancer routes a request to a server, it injects a cookie containing that server’s identifier (often encrypted to prevent tampering). Subsequent requests include this cookie, and the load balancer extracts the server ID to route accordingly. If the target server is unhealthy, the load balancer selects a new server and updates the cookie. The user’s session is lost, but the application remains available. This approach works at Layer 7 and requires the load balancer to inspect and modify HTTP headers.
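The routing logic reduces to a lookup with a fallback. A hypothetical sketch: the cookie name, the tuple return shape, and the injected fallback_select are all inventions for illustration, and real load balancers additionally encrypt or sign the cookie value:

```python
AFFINITY_COOKIE = "LB_SERVER"   # hypothetical cookie name

def route(request_cookies, healthy_servers, fallback_select):
    """Returns (server, cookie_to_set); cookie_to_set is None when the
    client's existing affinity still holds."""
    pinned = request_cookies.get(AFFINITY_COOKIE)
    if pinned in healthy_servers:
        return pinned, None                        # affinity preserved
    server = fallback_select(healthy_servers)      # first visit, or failover
    return server, (AFFINITY_COOKIE, server)       # (re)pin the client

healthy = ["A", "B"]
route({"LB_SERVER": "B"}, healthy, lambda s: s[0])   # ("B", None)
route({"LB_SERVER": "C"}, healthy, lambda s: s[0])   # C died: ("A", ("LB_SERVER", "A"))
```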

IP-based affinity hashes the client’s source IP to consistently select the same server, similar to the IP hash algorithm. This works at Layer 4 without inspecting application data, making it faster and applicable to non-HTTP protocols. The downside is that clients behind NAT or corporate proxies share an IP address, causing all users from that network to hit the same server. Mobile clients that switch between WiFi and cellular networks get different IPs and lose affinity.

Application-controlled affinity uses a custom header or URL parameter that the application sets to indicate which server should handle the request. This gives applications fine-grained control—for example, routing all requests for a specific tenant to dedicated servers. The application must be aware of the load balancing scheme and include the routing hint in every request.

The fundamental trade-off is operational simplicity versus scalability. Sticky sessions let you deploy stateful applications without refactoring, but they create uneven load distribution (some users generate more traffic than others), reduce fault tolerance (server failures lose sessions), and complicate autoscaling (you can’t remove a server until its sessions expire). LinkedIn’s initial architecture used sticky sessions for their Java application servers, but as they scaled, they moved session state to a distributed cache (Couchbase) and eliminated affinity, allowing any server to handle any request. This improved resilience and simplified operations but required significant application changes.

Session affinity also impacts cache efficiency. If a server caches user-specific data in memory, affinity ensures cache hits. Without affinity, each server must cache data for all users, reducing hit rates and increasing memory requirements. Facebook’s web servers use sticky sessions to maximize cache locality for user profile data, accepting the operational complexity in exchange for performance.

Cookie-Based Session Affinity Flow

sequenceDiagram
    participant Client
    participant LB as Load Balancer
    participant S1 as Server A
    participant S2 as Server B
    
    Note over Client,S2: First Request (No Cookie)
    Client->>LB: 1. GET /login<br/>(no session cookie)
    Note over LB: 2. Select server using<br/>round-robin → Server A
    LB->>S1: 3. Forward request
    Note over S1: 4. Create session<br/>store in memory
    S1-->>LB: 5. Response + session data
    LB-->>Client: 6. Response<br/>Set-Cookie: LB_SERVER=A_encrypted
    
    Note over Client,S2: Subsequent Request (With Cookie)
    Client->>LB: 7. GET /dashboard<br/>Cookie: LB_SERVER=A_encrypted
    Note over LB: 8. Extract server ID from cookie<br/>→ Route to Server A
    LB->>S1: 9. Forward request
    Note over S1: 10. Session found in memory<br/>serve from cache
    S1-->>LB: 11. Response
    LB-->>Client: 12. Response
    
    Note over Client,S2: Server Failure Scenario
    Client->>LB: 13. GET /checkout<br/>Cookie: LB_SERVER=A_encrypted
    Note over LB: 14. Server A unhealthy<br/>Select new server → Server B
    LB->>S2: 15. Forward request
    Note over S2: 16. Session NOT found<br/>user must re-authenticate
    S2-->>LB: 17. Redirect to /login
    LB-->>Client: 18. Response<br/>Set-Cookie: LB_SERVER=B_encrypted

Cookie-based session affinity embeds a server identifier in the HTTP response cookie. The load balancer extracts this cookie on subsequent requests to route to the same server, preserving in-memory session state. When the target server fails, the load balancer selects a new server and updates the cookie, but the user’s session is lost and they must re-authenticate.

Performance Characteristics

Algorithm selection overhead is negligible for static algorithms—round-robin and random selection take nanoseconds. Dynamic algorithms like least connections require scanning the server list (typically 10-100 servers), adding microseconds of latency. Least response time algorithms must maintain moving averages, adding memory overhead (8-16 bytes per server) and occasional computation spikes when averages are recalculated.

Consistent hashing has higher computational cost due to hashing: ketama-style rings use MD5 (sometimes SHA-1), taking 1-2 microseconds per request, though faster non-cryptographic hashes such as MurmurHash are also common. The hash ring lookup is O(log n) using binary search over virtual nodes. With 10 servers and 150 virtual nodes each (1,500 total points), this means about 11 comparisons per request. Implementations often cache hash values to avoid recomputation.

Session affinity adds latency for cookie operations. Inserting a cookie requires HTTP header manipulation, adding 50-100 microseconds. Extracting and validating cookies on subsequent requests takes 20-50 microseconds. IP-based affinity is faster (just hash the IP), but consistent hashing is fastest when the hash can be computed once and reused.

Load distribution quality varies significantly. Round-robin achieves perfect balance for uniform workloads but can create 2x imbalance when request costs vary. Least connections typically maintains balance within 10-20% even with heterogeneous requests. Power of Two Choices achieves balance within 5-10% with minimal overhead. Consistent hashing can create 20-30% imbalance without virtual nodes, but virtual nodes reduce this to 5-10%.

Failover behavior matters for availability. Static algorithms continue working when servers fail—they simply skip unhealthy backends. Dynamic algorithms may experience brief imbalance as they recalculate metrics after a failure. Consistent hashing gracefully redistributes load to adjacent servers, but session affinity causes session loss for affected users. LinkedIn’s load balancers use least connections with 5-second health check intervals, accepting brief overload on healthy servers during the detection window rather than implementing complex predictive algorithms.

Trade-offs

Round-robin excels at simplicity and predictability. It’s the default choice for stateless services with uniform request costs. The trade-off is rigidity—it can’t adapt to server capacity differences or varying load. Use weighted round-robin when servers have different specs, but recognize that static weights can’t respond to dynamic conditions like garbage collection pauses or noisy neighbors in cloud environments.

Least connections adapts to heterogeneous workloads where some requests take longer than others. It prevents slow servers from receiving more work while they’re struggling. The trade-off is that connection count is a proxy for load, not a direct measurement. A server with 100 idle keep-alive connections may be less loaded than one with 10 active connections processing heavy queries. It also requires state synchronization in multi-load-balancer deployments.

Least response time provides the most accurate load measurement by considering actual server performance. It automatically adapts to degraded servers, slow dependencies, or resource contention. The trade-offs are complexity (requires continuous monitoring), potential oscillation (traffic shifts can cause the problem they’re trying to solve), and vulnerability to transient spikes (one slow request shouldn’t redirect all traffic). Use this when request latency varies significantly and you have sophisticated monitoring infrastructure.

IP hash and consistent hashing provide natural affinity without cookies, making them ideal for caching layers where data locality matters more than perfect load distribution. Consistent hashing’s stability during topology changes is critical for distributed caches—Memcached and Redis clusters rely on it. The trade-off is that you can’t easily rebalance load; if one server is overloaded, you can’t just redirect some of its traffic without breaking affinity.

Session affinity lets you deploy stateful applications without refactoring, but it’s a technical debt that limits scalability. Every server failure loses sessions, autoscaling is complicated, and load distribution is uneven. Modern architectures externalize state to Redis or databases, eliminating the need for affinity. Use sticky sessions only as a temporary measure while refactoring toward stateless design.

Random selection is underrated. It’s stateless, parallelizable, and achieves good distribution with enough traffic. Power of Two Choices adds minimal overhead for significantly better balance. The trade-off is lack of determinism—you can’t predict which server will handle a request, making debugging harder. Use random selection for high-throughput stateless services where simplicity matters more than perfect distribution.

When to Use (and When Not To)

Choose round-robin for stateless HTTP APIs with uniform request costs and servers of equal capacity; this covers the large majority of web services. Upgrade to weighted round-robin when servers have different specs (common in cloud environments with mixed instance types) or when gradually rolling out new code (route 10% to canary servers, 90% to stable).

Choose least connections for long-lived connections where connection count correlates with load: WebSocket servers, streaming media, database connection pools, or SSH gateways. Avoid it for short HTTP requests where connection setup overhead dominates—round-robin is simpler and equally effective.

Choose least response time when request latency varies significantly (10x or more) and you have monitoring infrastructure to track response times reliably. This is common for services that call multiple downstream dependencies with different SLAs, or for multi-tenant systems where some tenants generate heavier queries. The complexity is only justified when simpler algorithms create visible imbalance.

Choose consistent hashing for distributed caches (Memcached, Redis), CDN origin selection, or any system where data locality matters more than perfect load distribution. The stability during server additions/removals is critical—cache hit rates drop dramatically if keys are remapped. Also use it for sharded databases where you need to route queries to the shard containing the relevant data.

Choose IP hash for simple session affinity when you control the client network (internal services, B2B APIs) and don’t need to handle NAT or mobile clients. It’s faster than cookie-based affinity and works at Layer 4. Avoid it for public-facing consumer services where NAT is common.

Choose session affinity (cookie-based) only when you cannot externalize state and need to maintain user sessions across requests. This is common for legacy applications or when migrating from monoliths to microservices. Plan to eliminate it—see Horizontal Scaling for stateless design patterns that make affinity unnecessary.

Choose random selection or Power of Two Choices for extremely high-throughput services (millions of requests per second) where algorithm overhead matters, or for distributed systems where multiple load balancers must make independent decisions without coordination. The simplicity and parallelizability outweigh the slightly worse distribution compared to least connections.

Load Balancing Algorithm Selection Decision Tree

flowchart TB
    Start["New Load Balancing<br/>Requirement"]
    Start --> Q1{"Is data locality<br/>critical?<br/><i>(caching, sharding)</i>"}
    
    Q1 -->|Yes| Q2{"Can tolerate<br/>remapping on<br/>topology changes?"}
    Q2 -->|No| CH["<b>Consistent Hashing</b><br/>✓ Stable key mapping<br/>✓ Minimal remapping<br/>Use: Memcached, Redis, CDN"]
    Q2 -->|Yes| IH["<b>IP Hash</b><br/>✓ Simple affinity<br/>✗ Breaks on server changes<br/>Use: Internal services"]
    
    Q1 -->|No| Q3{"Are connections<br/>long-lived?<br/><i>(WebSocket, streaming)</i>"}
    
    Q3 -->|Yes| Q4{"Do servers have<br/>different capacities?"}
    Q4 -->|Yes| WLC["<b>Weighted Least Connections</b><br/>✓ Adapts to capacity<br/>✓ Prevents overload<br/>Use: Mixed instance types"]
    Q4 -->|No| LC["<b>Least Connections</b><br/>✓ Balances long connections<br/>✓ Simple implementation<br/>Use: WebSocket, DB pools"]
    
    Q3 -->|No| Q5{"Does request latency<br/>vary significantly?<br/><i>(10x+ variance)</i>"}
    
    Q5 -->|Yes| Q6{"Have monitoring<br/>infrastructure?"}
    Q6 -->|Yes| LRT["<b>Least Response Time</b><br/>✓ Most accurate load measure<br/>✗ Complex, can oscillate<br/>Use: Multi-tenant, varied queries"]
    Q6 -->|No| P2C["<b>Power of Two Choices</b><br/>✓ Near-optimal balance<br/>✓ Minimal overhead<br/>Use: High-throughput stateless"]
    
    Q5 -->|No| Q7{"Need session<br/>affinity?"}
    Q7 -->|Yes| Cookie["<b>Cookie-Based Affinity</b><br/>+ Round-Robin<br/>✗ Technical debt<br/>Use: Legacy stateful apps"]
    Q7 -->|No| Q8{"Servers have<br/>different capacities?"}
    Q8 -->|Yes| WRR["<b>Weighted Round-Robin</b><br/>✓ Simple + capacity-aware<br/>✓ Predictable<br/>Use: Mixed instance types"]
    Q8 -->|No| RR["<b>Round-Robin</b><br/>✓ Simplest algorithm<br/>✓ Predictable distribution<br/>Use: Stateless uniform APIs"]

Decision tree for selecting the appropriate load balancing algorithm based on workload characteristics. Start with data locality requirements (consistent hashing for caches), then consider connection duration (least connections for long-lived), request variance (least response time for heterogeneous), and finally server capacity differences (weighted algorithms). Round-robin is the default for stateless uniform workloads.

Real-World Examples

Facebook: Web Server Load Balancing

Facebook uses a combination of consistent hashing and sticky sessions for their web tier. User requests are hashed based on user ID to route to specific web servers, maximizing cache hit rates for user profile data stored in server memory. They use weighted round-robin at the edge to distribute traffic across data centers based on capacity and latency. When a server fails, affected users are remapped to adjacent servers in the consistent hash ring, losing their cached data but maintaining availability. This hybrid approach balances cache efficiency (consistent hashing) with operational flexibility (weighted distribution across DCs).

Interesting detail: Facebook’s load balancers track per-server cache hit rates and adjust weights dynamically. Servers with higher hit rates receive slightly more traffic because they can serve requests faster from cache, creating a positive feedback loop that improves overall performance.

LinkedIn: API Gateway Load Balancing

LinkedIn’s API gateway uses least connections with health-aware routing. Each backend service registers with a service discovery system (Zookeeper), advertising its capacity and current load. The gateway tracks active connections per instance and routes new requests to the instance with the lowest (connections / capacity) ratio. They explicitly avoid session affinity, instead storing all session state in Couchbase (distributed cache) and Kafka (event log), allowing any server to handle any request. This stateless design enables aggressive autoscaling—they can add or remove servers in seconds without worrying about session loss.

Interesting detail: During their migration from monolith to microservices, LinkedIn temporarily used cookie-based affinity to route requests to either the legacy monolith or new microservices based on feature flags. As they completed the migration, they removed affinity and deleted the monolith, simplifying operations significantly.

Netflix: CDN Origin Selection

Netflix’s CDN uses consistent hashing to route requests for video chunks to origin servers. Each video file is split into chunks, and each chunk ID is hashed to a position on the hash ring. This ensures that requests for the same chunk always hit the same origin server, maximizing cache hit rates. They use virtual nodes (150 per physical server) to ensure even distribution even when servers have different capacities. When an origin server fails, only 1/n of chunks are remapped (where n is the server count), and those chunks are served from the next server on the ring, which likely already has them cached from previous requests.

Interesting detail: Netflix’s consistent hashing implementation includes a “fallback chain” where each chunk maps to a primary, secondary, and tertiary server. If the primary is unhealthy, the load balancer tries the secondary, which often already has the chunk cached because it’s the next server clockwise on the ring. This dramatically reduces cache misses during failures.


Interview Essentials

Mid-Level

Explain how round-robin works and why it might create imbalance with heterogeneous requests. Walk through an example with 3 servers where requests take 100ms, 500ms, and 1000ms.

Describe least connections algorithm. Why is it better than round-robin for long-lived connections like WebSockets?

What is session affinity? Explain cookie-based vs IP-based implementations and their trade-offs.

You have 5 servers and use IP hash for load balancing. What happens when you add a 6th server? Why is this a problem?

Senior

Compare least connections vs least response time. When would you choose each? What are the failure modes of least response time (hint: oscillation)?

Explain consistent hashing in detail. Why does it use virtual nodes? Walk through what happens when a server is added or removed.

Design a load balancing strategy for a video streaming service. Consider: long-lived connections, cache locality, and server failures. Which algorithms would you combine and why?

Your application uses sticky sessions, but you need to implement autoscaling. What are the challenges? How would you refactor to eliminate session affinity?

Explain Power of Two Choices algorithm. Why does it achieve near-optimal load distribution with minimal overhead? What’s the mathematical intuition?

Staff+

You’re designing a global load balancing system for a multi-region deployment. How do you combine geographic routing, capacity-based weighting, and consistent hashing? What are the trade-offs between optimizing for latency vs cache hit rate?

Analyze the trade-offs between client-side load balancing (each client chooses a server) vs centralized load balancing. When would you use each? How does this relate to service mesh architectures?

Your load balancer uses least response time, but you’re seeing oscillation where traffic shifts between servers every few seconds. Diagnose the root cause and propose solutions. Consider: measurement windows, hysteresis, and adaptive algorithms.

Design a load balancing algorithm for a multi-tenant system where some tenants generate 1000x more traffic than others. How do you prevent large tenants from overwhelming servers while maintaining fairness? Consider: tenant-aware routing, quota enforcement, and isolation guarantees.

Evaluate the claim that ‘session affinity is an anti-pattern.’ Under what conditions is this true? Are there scenarios where affinity is the right choice even in modern architectures?

Common Interview Questions

What’s the difference between round-robin and random selection? When would you choose each?

Why does consistent hashing use a hash ring instead of simple modulo arithmetic?

How do you handle session affinity when a server fails?

What metrics would you monitor to detect that your load balancing algorithm isn’t working well?

Red Flags to Avoid

Claiming that one algorithm is always best (context matters: workload, state, failure modes)

Not understanding the difference between static and dynamic algorithms

Thinking session affinity is required for all stateful applications (state can be externalized)

Not knowing that IP hash breaks when servers are added/removed (consistent hashing solves this)

Suggesting least response time without acknowledging oscillation risks

Not considering the impact of Layer 4 vs Layer 7 on algorithm choice (see Layer 4 Load Balancing and Layer 7 Load Balancing)


Key Takeaways

Algorithm choice depends on workload characteristics: Round-robin for stateless uniform requests, least connections for long-lived connections, consistent hashing for cache locality, and weighted algorithms when servers have different capacities.

Session affinity is a trade-off, not a requirement: Sticky sessions simplify stateful applications but limit scalability, complicate autoscaling, and reduce fault tolerance. Modern architectures externalize state to eliminate affinity (see Horizontal Scaling).

Consistent hashing solves the remapping problem: Unlike IP hash, consistent hashing minimizes key redistribution when servers are added or removed, making it essential for distributed caches where cache invalidation is expensive.

Dynamic algorithms adapt to reality but add complexity: Least response time provides the most accurate load measurement but requires monitoring infrastructure and can oscillate if not tuned carefully. Use simpler algorithms unless you have a proven need.

Layer matters for algorithm implementation: Some algorithms (IP hash, least connections) work at Layer 4, while others (cookie-based affinity, content-based routing) require Layer 7. See Layer 4 Load Balancing and Layer 7 Load Balancing for how this affects your architecture.