Reliability Patterns

Resiliency, high availability, availability, and security patterns.

32 topics, 5 sections

Resiliency Patterns

7 topics

Resiliency Patterns in Distributed Systems

Resiliency is a system's ability to detect, absorb, and recover from failures while maintaining acceptable service levels. Unlike high availability (which focus

13 min read

Bulkhead Pattern: Isolate Failures in Microservices

intermediate

The bulkhead pattern isolates services into separate resource pools so one failure can't cascade. Learn thread pool isolation, semaphore bulkheads, and Netflix Hystrix examples.

14 min read
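
As a rough illustration of the semaphore bulkhead mentioned in the teaser above, the sketch below caps concurrent calls to one dependency and rejects the overflow instead of queueing it. The names `SemaphoreBulkhead` and `BulkheadFull` are hypothetical and are not taken from Hystrix or any other library.

```python
import threading


class BulkheadFull(Exception):
    """Raised when every slot in the bulkhead is already in use."""


class SemaphoreBulkhead:
    """Hypothetical semaphore bulkhead: at most `max_concurrent` calls at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than tie up the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always give the slot back, even if fn raised.
            self._slots.release()
```

Giving each downstream dependency its own instance means a slow dependency can exhaust only its own slots, which is the isolation the pattern is after.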

Circuit Breaker Pattern: Stop Cascading Failures

intermediate

Circuit breakers prevent cascading failures by fast-failing requests to unhealthy dependencies instead of waiting for timeouts. When a service detects too many

12 min read

Compensating Transaction Pattern: Undo Distributed Work

intermediate

The Compensating Transaction pattern enables distributed rollback in microservices by executing reverse operations when multi-step workflows fail. Instead of lo

11 min read

Health Endpoint Monitoring: /health API Guide

intermediate

Health endpoint monitoring exposes standardized HTTP endpoints that external systems query to verify service health, enabling automated detection of failures an

15 min read

Retry Pattern: Exponential Backoff & Jitter in Practice

intermediate

The retry pattern handles transient failures by re-attempting operations with exponential backoff and jitter. Learn when to retry, when to fail fast, and how to avoid retry storms.

9 min read
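
A minimal sketch of retrying with exponential backoff and jitter, assuming the "full jitter" variant (a random delay between zero and the exponential ceiling). `TransientError`, `backoff_delays`, `retry`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one jittered delay for each retry that follows the first attempt."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield rng() * ceiling                      # full jitter: uniform in [0, ceiling)


def retry(operation, max_attempts=4, base=0.1, cap=5.0,
          sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying only on TransientError, up to max_attempts."""
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last failure
            sleep(delay)
```

Randomizing the delay spreads out clients that failed at the same moment, which is what keeps a burst of failures from turning into a synchronized retry storm. Any other exception type propagates immediately, so non-transient errors fail fast.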

Scheduler Agent Supervisor

intermediate

The Scheduler Agent Supervisor pattern coordinates distributed workflows as a single logical operation by separating concerns into three components: a Scheduler

11 min read

High Availability Patterns

6 topics

High Availability: Design for 99.99% Uptime

intermediate

High availability (HA) is the practice of designing systems to remain operational and accessible even when components fail. It's measured in "nines" (99.9%, 99.

29 min read

Deployment Stamps Pattern: Multi-Region Scale-Out

intermediate

Deployment stamps (also called scale units or cells) are independent, self-contained copies of your entire application stack that serve a subset of users or ten

36 min read

Geodes Pattern: Globally Distributed Services

intermediate

Geodes deploy backend services into multiple geographical nodes that can each serve any request from any client in any region, enabling active-active multi-regi

32 min read

Bulkhead for High Availability: Resource Isolation

intermediate

The Bulkhead pattern isolates system components into separate resource pools to prevent cascading failures. Named after ship compartments that contain flooding,

23 min read

Circuit Breaker for High Availability Systems

intermediate

Circuit breakers prevent cascading failures in distributed systems by automatically stopping requests to failing services, giving them time to recover. When err

30 min read

Health Endpoint Monitoring for High Availability

intermediate

Health endpoint monitoring exposes HTTP endpoints that external systems can poll to verify application health. Unlike passive monitoring that waits for failures

25 min read

Availability Patterns

6 topics

Availability Overview

intermediate

Availability measures the percentage of time a system is operational and accessible to users, typically expressed in "nines" (99.9%, 99.99%, etc.). It's the fou

27 min read

Deployment Stamps for Availability: Multi-Region Guide

intermediate

Deployment stamps (also called scale units or cells) are self-contained, identical copies of your entire application stack deployed independently to serve a sub

28 min read

Geodes Pattern for Availability: Global Distribution

intermediate

Geodes is an active-active multi-region deployment pattern where every geographic node can serve any request for any user, regardless of location. Unlike tradit

29 min read

Health Endpoint Monitoring for Availability

intermediate

Health endpoint monitoring exposes dedicated HTTP endpoints that external systems can probe to verify application health. Instead of waiting for failures to man

34 min read

Queue-Based Load Leveling for Availability

intermediate

Queue-based load leveling uses a message queue as a buffer between producers and consumers to absorb traffic spikes and prevent downstream service overload. Ins

25 min read

Throttling Pattern: Protecting Services from Overload

intermediate

Throttling limits operation rates to protect systems under heavy load. Learn throttling vs rate limiting, implementation strategies, and how Azure and AWS apply it.

27 min read

Security Patterns

4 topics

Security Patterns in System Design

intermediate

Security in distributed systems provides confidentiality, integrity, and availability (CIA triad) against malicious attacks. It's not a feature you add at the e

34 min read

Federated Identity Pattern: SSO & OAuth Guide

intermediate

Federated identity allows users to authenticate once with a trusted identity provider (IdP) and access multiple services without re-entering credentials. Instea

30 min read

Gatekeeper Pattern: Protect Services with a Proxy

intermediate

The Gatekeeper pattern uses a dedicated intermediary host to validate, sanitize, and broker all requests between clients and backend services. This security-foc

27 min read

Valet Key Security Pattern: Limited Access Tokens

intermediate

The Valet Key pattern grants clients time-limited, scoped direct access to cloud resources (storage, queues) using signed tokens, bypassing your application ser

36 min read

Additional Topics

9 topics

Distributed Locking: Redis, ZooKeeper & Redlock

advanced

Distributed locks prevent race conditions across multiple servers. Compare Redis Redlock, ZooKeeper, and database-based locking with tradeoffs for each approach.

15 min read

Distributed Consensus: Raft & Paxos Explained

advanced

Distributed consensus ensures multiple nodes agree on a single value or sequence of operations despite failures and network partitions. Algorithms like Raft and

17 min read

Distributed Transactions: 2PC & Saga Patterns

advanced

Two-Phase Commit (2PC) is a distributed algorithm that ensures all participants in a transaction either commit or abort together, maintaining ACID properties ac

25 min read

Gossip Protocol: Peer-to-Peer State Propagation

advanced

Gossip protocol is an epidemic-style communication pattern where nodes periodically exchange state with random peers, achieving eventual consistency across larg

9 min read

Heartbeat Mechanism: Node Health Detection

intermediate

Heartbeat mechanisms detect node failures in distributed systems by sending periodic alive signals between nodes. If a node misses several consecutive heartbeats, it is considered failed.

10 min read
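
The rule in the teaser above can be sketched as a small monitor that records the last heartbeat per node and flags any node silent for several intervals. This is a simplification that approximates "several consecutive missed heartbeats" by elapsed time; `HeartbeatMonitor`, its parameters, and the default threshold of 3 are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Hypothetical failure detector based on time since the last heartbeat."""

    def __init__(self, interval=1.0, missed_threshold=3, clock=time.monotonic):
        self.interval = interval                  # expected seconds between heartbeats
        self.missed_threshold = missed_threshold  # missed intervals before "failed"
        self.clock = clock                        # injectable so tests can control time
        self.last_seen = {}

    def record(self, node_id):
        """Note that a heartbeat from node_id has just arrived."""
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        """Return the sorted ids of nodes silent for missed_threshold intervals."""
        now = self.clock()
        limit = self.interval * self.missed_threshold
        return sorted(n for n, seen in self.last_seen.items() if now - seen >= limit)
```

A monotonic clock is used by default so that wall-clock adjustments on the monitoring host do not produce false failures.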

Hinted Handoff: Handle Node Failures in Cassandra

advanced

Hinted handoff is a technique in distributed systems where writes destined for a temporarily unavailable node are stored on a healthy node, along with a hint about the intended destination, and replayed once that node recovers.

29 min read

Leader Election (Resiliency)

intermediate

Leader election is a coordination pattern that designates one node in a distributed system as the authoritative decision-maker, preventing conflicts when multip

13 min read

Queue-Based Load Leveling (Resiliency)

intermediate

Queue-based load leveling inserts a message queue between producers and consumers to absorb traffic spikes, preventing downstream service overload and timeout c

9 min read

Split-Brain & Fencing: Prevent Distributed Conflicts

advanced

Split-brain occurs when network partitions cause multiple nodes to believe they're the leader, leading to conflicting writes and data corruption. Fencing mechan

17 min read