Reliability Patterns

Resiliency, high availability, availability, and security patterns.

32 topics, 5 sections

Resiliency Patterns

7 topics

Resiliency Patterns in Distributed Systems

intermediate

Resiliency is a system's ability to detect, absorb, and recover from failures while maintaining acceptable service levels. Unlike high availability (which focuses…

13 min read

Bulkhead Pattern: Isolate Failures in Microservices

intermediate

The bulkhead pattern isolates services into separate resource pools so one failure can't cascade. Learn thread pool isolation, semaphore bulkheads, and Netflix Hystrix examples.

14 min read
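The sketch below is a rough illustration of a semaphore bulkhead. It caps the number of concurrent calls each dependency may have, and it rejects overflow immediately instead of queueing it. The class and compartment names (`SemaphoreBulkhead`, `payments`, `recommendations`) are hypothetical. This is not Hystrix's API.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class SemaphoreBulkhead:
    """Caps concurrent calls to one dependency; excess calls are rejected at once."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: fail fast instead of piling up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn()
        finally:
            self._slots.release()


# Separate compartments (hypothetical): exhausting one never consumes the other's slots.
payments = SemaphoreBulkhead("payments", max_concurrent=2)
recommendations = SemaphoreBulkhead("recommendations", max_concurrent=5)
```

Thread-pool isolation would instead run each call on a dedicated executor, which also lets the caller time out. The semaphore variant is cheaper, but the work runs on the caller's own thread.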

Circuit Breaker Pattern: Stop Cascading Failures

intermediate

Circuit breakers prevent cascading failures by fast-failing requests to unhealthy dependencies instead of waiting for timeouts. When a service detects too many…

12 min read
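Below is a minimal, single-threaded sketch of the closed/open/half-open state machine. It assumes a consecutive-failure threshold and a fixed cool-down period; `failure_threshold` and `reset_timeout` are illustrative names. Many production libraries use sliding windows instead and limit how many trial calls run while half-open. The clock is injectable, so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # cool-down elapsed: let a trial call through
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self.state = "closed"
        self._failures = 0
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial while half-open, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```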

Compensating Transaction Pattern: Undo Distributed Work

intermediate

The Compensating Transaction pattern enables distributed rollback in microservices by executing reverse operations when multi-step workflows fail. Instead of…

11 min read
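The sketch below shows the core idea. It assumes each step is paired with a hypothetical compensating action. Steps run in order, and when one fails, the steps that already completed are undone in reverse order. The sketch also assumes every compensation succeeds. Real implementations persist their progress and retry compensations, so compensations should be idempotent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Only steps that finished are compensated; the failed step left nothing to undo.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```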

Health Endpoint Monitoring: /health API Guide

intermediate

Health endpoint monitoring exposes standardized HTTP endpoints that external systems query to verify service health, enabling automated detection of failures and…

15 min read
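This is a minimal `/health` endpoint built only on Python's standard library. It assumes two hypothetical dependency checks, `database` and `cache`. The endpoint returns HTTP 200 when every check passes and 503 otherwise. The JSON body lists the result of each check; its shape is illustrative, not a standard.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; each returns True when the dependency is reachable.
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "up" if check() else "down"
        except Exception:
            results[name] = "down"  # a check that throws counts as a failure
    healthy = all(v == "up" for v in results.values())
    # 503 signals load balancers and orchestrators to stop routing traffic here.
    return (200 if healthy else 503), {"status": "UP" if healthy else "DOWN", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_report(CHECKS)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocks forever, serving /health on the given port."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

To try it, call `serve()` and then run `curl -i localhost:8080/health`. Keep the checks cheap, because load balancers may poll this endpoint every few seconds.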

Retry Pattern: Exponential Backoff & Jitter in Practice

intermediate

The retry pattern handles transient failures by re-attempting operations with exponential backoff and jitter. Learn when to retry, when to fail fast, and how to avoid retry storms.

9 min read
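Here is a sketch of exponential backoff with full jitter. It assumes connection and timeout errors are the transient ones. `retryable` is a parameter, so each client can adjust that set. Errors outside it propagate immediately, which covers the fail-fast case. `sleep` is injectable so that tests do not actually wait.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 10.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on transient errors using exponential backoff with full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap grows as base * 2^attempt; a random delay in [0, cap] spreads clients out.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Full jitter picks a random delay between zero and the exponential cap. As a result, clients that failed together do not retry in lockstep. Bounding `max_attempts` limits how much extra load retries can add during an outage.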

Scheduler Agent Supervisor

intermediate

The Scheduler Agent Supervisor pattern coordinates distributed workflows as a single logical operation by separating concerns into three components: a Scheduler, an Agent, and a Supervisor.

11 min read

Additional Topics

9 topics

Distributed Locking: Redis, ZooKeeper & Redlock

advanced

Distributed locks prevent race conditions across multiple servers. Compare Redis Redlock, ZooKeeper, and database-based locking with tradeoffs for each approach.

15 min read
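The single-node Redis approach takes the lock with `SET key token NX PX ttl`. It releases the lock with a script that deletes the key only if the key still holds the caller's token. The sketch below mimics those semantics in memory, so it runs without a server. `LeaseLockStore` is a hypothetical stand-in, not a Redis client, and it does not implement Redlock's multi-node quorum.

```python
import threading
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class LeaseLockStore:
    """In-memory stand-in for a single lock server (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._mutex = threading.Lock()
        self._locks: Dict[str, Tuple[str, float]] = {}  # key -> (owner token, expiry time)

    def acquire(self, key: str, ttl: float) -> Optional[str]:
        """Like `SET key token NX PX ttl`: succeeds only if the key is free or expired."""
        with self._mutex:
            now = self._clock()
            held = self._locks.get(key)
            if held is not None and held[1] > now:
                return None
            token = uuid.uuid4().hex  # random token identifies this owner
            self._locks[key] = (token, now + ttl)
            return token

    def release(self, key: str, token: str) -> bool:
        """Compare-and-delete: only the current owner's token may release the lock."""
        with self._mutex:
            held = self._locks.get(key)
            if held is None or held[0] != token:
                return False
            del self._locks[key]
            return True
```

The TTL keeps a crashed client from holding the lock forever. The same TTL means a client paused past its expiry can still believe it owns the lock, which is why leases alone do not guarantee mutual exclusion.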

Distributed Consensus: Raft & Paxos Explained

advanced

Distributed consensus ensures multiple nodes agree on a single value or sequence of operations despite failures and network partitions. Algorithms like Raft and Paxos…

17 min read

Distributed Transactions: 2PC & Saga Patterns

advanced

Two-Phase Commit (2PC) is a distributed algorithm that ensures all participants in a transaction either commit or abort together, maintaining ACID properties across…

25 min read
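This toy coordinator shows the two phases, using hypothetical in-memory `Participant` objects. It deliberately omits what makes 2PC hard in practice:

- durable logging of votes and of the decision
- timeouts
- recovery when the coordinator crashes after participants have voted yes (the blocking case)

```python
from typing import List


class Participant:
    """Hypothetical participant that votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: stage the change, then vote. A "yes" promises to commit if told to.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)  # a single "no" aborts the whole transaction
    # Phase 2 (completion): broadcast the same decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```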

Gossip Protocol: Peer-to-Peer State Propagation

advanced

Gossip protocol is an epidemic-style communication pattern where nodes periodically exchange state with random peers, achieving eventual consistency across large…

9 min read
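This is a small push-pull gossip simulation. It assumes each node's state maps a key to a (version, value) pair and that a merge keeps the higher version. In every round, each node exchanges state with one random peer. The loop counts rounds until a fact known to one node has reached every node. All names are illustrative.

```python
import random
from typing import Dict, List, Tuple

State = Dict[str, Tuple[int, str]]  # key -> (version, value)


def merge(local: State, remote: State) -> None:
    """Keep the higher-versioned entry for each key."""
    for key, (version, value) in remote.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)


def gossip_round(states: List[State], rng: random.Random) -> None:
    """Each node picks one random peer (never itself) and both exchange state."""
    n = len(states)
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # shift past i so any other node is equally likely
        merge(states[j], states[i])  # push
        merge(states[i], states[j])  # pull


def rounds_to_converge(n: int, seed: int = 0) -> int:
    """Rounds until every node learns a fact first known only to node 0 (needs n >= 2)."""
    rng = random.Random(seed)
    states: List[State] = [{} for _ in range(n)]
    states[0]["node0/status"] = (1, "up")
    rounds = 0
    while any("node0/status" not in s for s in states):
        gossip_round(states, rng)
        rounds += 1
    return rounds
```

The number of informed nodes multiplies each round, so the rounds needed grow roughly logarithmically with cluster size.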

Heartbeat Mechanism: Node Health Detection

intermediate

Heartbeat mechanisms detect node failures in distributed systems by sending periodic alive signals between nodes. If a node misses several consecutive heartbeats, it is considered failed.

10 min read
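The sketch below is a minimal failure detector. It assumes a fixed heartbeat interval and a threshold of missed intervals (`missed_limit`, a hypothetical name). It records when each node was last heard from and reports nodes that have been silent for longer than `interval * missed_limit`. The clock is injectable for testing.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Marks a node failed after it misses `missed_limit` consecutive heartbeat intervals."""

    def __init__(self, interval: float, missed_limit: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.missed_limit = missed_limit
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        self._last_seen[node] = self._clock()

    def failed_nodes(self) -> List[str]:
        deadline = self.interval * self.missed_limit
        now = self._clock()
        return sorted(n for n, seen in self._last_seen.items() if now - seen > deadline)
```

Raising `missed_limit` reduces false positives from brief network hiccups, at the cost of slower detection. Adaptive detectors such as phi accrual, used by Cassandra and Akka, replace the fixed threshold with a continuous suspicion level.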

Hinted Handoff: Handle Node Failures in Cassandra

advanced

Hinted handoff is a technique in distributed systems where a temporarily unavailable node's writes are stored on a healthy neighbor node with a hint about the intended destination.

29 min read
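This toy cluster illustrates the mechanism under several simplifying assumptions. The coordinating node holds each hint. Nodes are marked down by hand. Hints never expire, whereas Cassandra limits how long it collects hints for a dead node. Class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Cluster:
    """Toy key-value cluster illustrating hinted handoff (not Cassandra's implementation)."""

    def __init__(self, nodes: List[str]) -> None:
        self.data: Dict[str, Dict[str, str]] = {n: {} for n in nodes}
        self.down: Set[str] = set()
        # hints[holder] = list of (intended_node, key, value) writes to replay later
        self.hints: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)

    def write(self, key: str, value: str, replicas: List[str], coordinator: str) -> None:
        for node in replicas:
            if node in self.down:
                # Keep the write on a healthy node, with a hint naming the intended destination.
                self.hints[coordinator].append((node, key, value))
            else:
                self.data[node][key] = value

    def recover(self, node: str) -> None:
        """When a node comes back, every holder replays the hints addressed to it."""
        self.down.discard(node)
        for holder, pending in self.hints.items():
            still_pending = []
            for target, key, value in pending:
                if target == node:
                    self.data[node][key] = value
                else:
                    still_pending.append((target, key, value))
            self.hints[holder] = still_pending
```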

Leader Election (Resiliency)

intermediate

Leader election is a coordination pattern that designates one node in a distributed system as the authoritative decision-maker, preventing conflicts when multiple…

13 min read

Queue-Based Load Leveling (Resiliency)

intermediate

Queue-based load leveling inserts a message queue between producers and consumers to absorb traffic spikes, preventing downstream service overload and timeout…

9 min read
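The following is a discrete-time simulation rather than a real message broker. Producers enqueue a burst, and the consumer drains at most `capacity_per_tick` items per tick. Once the bounded queue is full, further arrivals are shed. The function and its parameters are hypothetical; they exist only to make the leveling effect visible.

```python
from collections import deque
from typing import Deque, List, Tuple


def simulate_load_leveling(arrivals: List[int], capacity_per_tick: int,
                           max_queue: int) -> Tuple[List[int], int]:
    """Return (items processed in each tick, items rejected because the queue was full)."""
    queue: Deque[int] = deque()  # stores each item's arrival tick
    processed: List[int] = []
    rejected = 0
    tick = 0
    while tick < len(arrivals) or queue:
        incoming = arrivals[tick] if tick < len(arrivals) else 0
        for _ in range(incoming):
            if len(queue) < max_queue:
                queue.append(tick)
            else:
                rejected += 1  # queue full: shed load rather than overwhelm the consumer
        # The consumer drains at its own steady rate, regardless of the spike.
        done = 0
        while queue and done < capacity_per_tick:
            queue.popleft()
            done += 1
        processed.append(done)
        tick += 1
    return processed, rejected
```

With arrivals `[2, 10, 0, 0, 0]`, a capacity of 3 and a large queue, the consumer processes `[2, 3, 3, 3, 1]`. The spike is spread across later ticks instead of hitting the consumer all at once.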

Split-Brain & Fencing: Prevent Distributed Conflicts

advanced

Split-brain occurs when network partitions cause multiple nodes to believe they're the leader, leading to conflicting writes and data corruption. Fencing mechanisms…

17 min read
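This is a sketch of fencing tokens. It assumes a lock service that issues a strictly increasing number with every grant, and a storage layer that remembers the highest token it has seen for each key. `LockService` and `FencedStorage` are hypothetical names. A client whose lease expired during a pause still carries its old, lower token, so its late write is rejected.

```python
import itertools
from typing import Dict


class LockService:
    """Hands out a strictly increasing fencing token with each lock grant."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def grant(self) -> int:
        return next(self._counter)


class FencedStorage:
    """Rejects writes whose token is older than the newest token seen for that key."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self._highest: Dict[str, int] = {}

    def write(self, key: str, value: str, token: int) -> bool:
        if token < self._highest.get(key, 0):
            return False  # stale leader: a newer lock holder has already written
        self._highest[key] = token
        self.data[key] = value
        return True
```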