Resiliency, high availability, availability, and security patterns.
Resiliency is a system's ability to detect, absorb, and recover from failures while maintaining acceptable service levels. Unlike high availability (which focus
The bulkhead pattern isolates services into separate resource pools so one failure can't cascade. Learn thread pool isolation, semaphore bulkheads, and Netflix Hystrix examples.
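As a rough illustration of the semaphore variant mentioned above, the sketch below caps the number of concurrent calls to one dependency and rejects extra callers immediately instead of queueing them. The names SemaphoreBulkhead and BulkheadFullError are hypothetical and chosen for this example; this is not Hystrix's API.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot in the bulkhead is already in use."""


class SemaphoreBulkhead:
    """Hypothetical helper: limits concurrent calls to one dependency."""

    def __init__(self, max_concurrent):
        # One semaphore per dependency acts as that dependency's resource pool.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than let callers pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            # Always free the slot, even if func raised.
            self._slots.release()
```

Giving each downstream dependency its own instance means a slow dependency can exhaust only its own slots, not the caller's whole capacity.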
Circuit breakers prevent cascading failures by fast-failing requests to unhealthy dependencies instead of waiting for timeouts. When a service detects too many
The Compensating Transaction pattern enables distributed rollback in microservices by executing reverse operations when multi-step workflows fail. Instead of lo
Health endpoint monitoring exposes standardized HTTP endpoints that external systems query to verify service health, enabling automated detection of failures an
The retry pattern handles transient failures by re-attempting operations with exponential backoff and jitter. Learn when to retry, when to fail fast, and how to avoid retry storms.
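A minimal sketch of retrying with exponential backoff and jitter, under stated assumptions: TransientError is a hypothetical marker for retryable failures, the default delays are illustrative, and the "full jitter" variant (a random delay between zero and the capped exponential value) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
          sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Out of attempts: fail fast and surface the error to the caller.
            if attempt == max_attempts - 1:
                raise
            # The ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, ceiling) spreads clients out
            # so they do not all retry at the same instant.
            sleep(rng() * ceiling)
```

Any other exception propagates immediately, which keeps non-transient failures from being retried. The sleep and rng parameters exist only to make the function easy to test.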
The Scheduler Agent Supervisor pattern coordinates distributed workflows as a single logical operation by separating concerns into three components: a Scheduler
High availability (HA) is the practice of designing systems to remain operational and accessible even when components fail. It's measured in "nines" (99.9%, 99.
Deployment stamps (also called scale units or cells) are independent, self-contained copies of your entire application stack that serve a subset of users or ten
Geodes deploy backend services into multiple geographical nodes that can each serve any request from any client in any region, enabling active-active multi-regi
The Bulkhead pattern isolates system components into separate resource pools to prevent cascading failures. Named after ship compartments that contain flooding,
Circuit breakers prevent cascading failures in distributed systems by automatically stopping requests to failing services, giving them time to recover. When err
Health endpoint monitoring exposes HTTP endpoints that external systems can poll to verify application health. Unlike passive monitoring that waits for failures
Availability measures the percentage of time a system is operational and accessible to users, typically expressed in "nines" (99.9%, 99.99%, etc.). It's the fou
Deployment stamps (also called scale units or cells) are self-contained, identical copies of your entire application stack deployed independently to serve a sub
Geodes is an active-active multi-region deployment pattern where every geographic node can serve any request for any user, regardless of location. Unlike tradit
Health endpoint monitoring exposes dedicated HTTP endpoints that external systems can probe to verify application health. Instead of waiting for failures to man
Queue-based load leveling uses a message queue as a buffer between producers and consumers to absorb traffic spikes and prevent downstream service overload. Ins
Throttling limits operation rates to protect systems under heavy load. Learn throttling vs rate limiting, implementation strategies, and how Azure and AWS apply it.
Security in distributed systems provides confidentiality, integrity, and availability (CIA triad) against malicious attacks. It's not a feature you add at the e
Federated identity allows users to authenticate once with a trusted identity provider (IdP) and access multiple services without re-entering credentials. Instea
The Gatekeeper pattern uses a dedicated intermediary host to validate, sanitize, and broker all requests between clients and backend services. This security-foc
The Valet Key pattern grants clients time-limited, scoped direct access to cloud resources (storage, queues) using signed tokens, bypassing your application ser
Distributed locks prevent race conditions across multiple servers. Compare Redis Redlock, ZooKeeper, and database-based locking with tradeoffs for each approach.
Distributed consensus ensures multiple nodes agree on a single value or sequence of operations despite failures and network partitions. Algorithms like Raft and
Two-Phase Commit (2PC) is a distributed algorithm that ensures all participants in a transaction either commit or abort together, maintaining ACID properties ac
Gossip protocol is an epidemic-style communication pattern where nodes periodically exchange state with random peers, achieving eventual consistency across larg
Heartbeat mechanisms detect node failures in distributed systems by sending periodic alive signals between nodes. If a node misses several consecutive heartbeats, it is considered failed.
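The detection rule described above can be sketched as a small in-memory monitor. HeartbeatMonitor, its parameters, and the default threshold of three missed heartbeats are assumptions made for illustration; a real system would receive heartbeats over the network.

```python
import time


class HeartbeatMonitor:
    """Hypothetical failure detector driven by heartbeat timestamps."""

    def __init__(self, interval, missed_threshold=3, clock=time.monotonic):
        self.interval = interval                  # expected seconds between heartbeats
        self.missed_threshold = missed_threshold  # consecutive misses tolerated
        self._clock = clock
        self._last_seen = {}

    def record(self, node_id):
        # Call this whenever a heartbeat arrives from node_id.
        self._last_seen[node_id] = self._clock()

    def failed_nodes(self):
        # A node is considered failed once it has been silent for longer
        # than missed_threshold heartbeat intervals.
        deadline = self.interval * self.missed_threshold
        now = self._clock()
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen > deadline)
```

A monotonic clock is used so that wall-clock adjustments do not cause false failure reports.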
Hinted handoff is a technique in distributed systems where writes intended for a temporarily unavailable node are stored on a healthy neighbor node, along with a hint identifying the intended destination, so they can be replayed once that node recovers.
Leader election is a coordination pattern that designates one node in a distributed system as the authoritative decision-maker, preventing conflicts when multip
Queue-based load leveling inserts a message queue between producers and consumers to absorb traffic spikes, preventing downstream service overload and timeout c
Split-brain occurs when network partitions cause multiple nodes to believe they're the leader, leading to conflicting writes and data corruption. Fencing mechan