Event-Driven Architecture: Patterns & Trade-offs

Intermediate · 15 min read · Updated 2026-02-11

After this topic, you will be able to:

  • Evaluate event-driven architecture patterns and assess their fit for different system requirements
  • Design event-driven workflows using pub-sub, event sourcing, and saga patterns
  • Analyze trade-offs between event-driven and request-driven architectures for specific use cases
  • Justify the use of message brokers vs direct service communication for event propagation

TL;DR

Event-driven architecture (EDA) is a design pattern where components communicate through asynchronous events rather than direct calls. Producers emit events when state changes occur, and consumers react independently without knowing about each other. This decouples services, enables scalability, and supports complex workflows across distributed systems—but introduces eventual consistency, debugging complexity, and ordering challenges.

Cheat Sheet:

  • Core Pattern: Producers → Events → Message Broker → Consumers
  • Key Variants: Pub-Sub, Event Sourcing, CQRS, Saga Pattern
  • When to Use: Microservices coordination, real-time notifications, audit trails, complex workflows
  • Trade-off: Loose coupling + scalability vs. debugging difficulty + eventual consistency
  • Interview Focus: Why events over RPC? How do you handle ordering? Idempotency strategies?

The Problem It Solves

Traditional request-driven architectures create tight coupling between services. When Service A calls Service B synchronously, A must know B’s location, wait for B’s response, and handle B’s failures immediately. This creates cascading failures—if B is down, A fails. It also creates deployment coupling—you can’t change B’s API without coordinating with all callers.

Consider Uber’s ride-matching system. When a driver accepts a ride, you need to: notify the rider, update analytics, send a push notification, log the event for compliance, trigger fraud detection, and update the driver’s earnings. If you implement this with synchronous calls, the acceptance endpoint must orchestrate six different services. If the notification service is slow, ride acceptance becomes slow. If fraud detection is down, rides can’t be accepted at all.

This tight coupling kills scalability. You can’t independently scale the notification service during peak hours. You can’t deploy a new version of analytics without risking ride acceptance. You can’t add a new consumer (like a machine learning pipeline) without modifying the ride acceptance code. The system becomes a distributed monolith where every change requires coordinating multiple teams.

Event-driven architecture solves this by inverting the dependency. Instead of the ride acceptance service knowing about six downstream systems, it simply publishes a “RideAccepted” event. Each interested service subscribes independently. The acceptance endpoint completes in milliseconds, regardless of how many consumers exist or how long they take to process the event.

Solution Overview

Event-driven architecture introduces three core abstractions: events (immutable facts about state changes), producers (services that emit events), and consumers (services that react to events). A message broker sits between them, routing events from producers to all interested consumers without either side knowing about the other.

When something significant happens—a user signs up, an order is placed, a payment succeeds—the service publishes an event describing what happened (past tense: “UserRegistered”, not “RegisterUser”). The event contains enough context for consumers to react: user ID, timestamp, relevant attributes. The producer doesn’t wait for acknowledgment; it continues immediately.

Consumers subscribe to event types they care about. When a “UserRegistered” event arrives, the email service sends a welcome email, the analytics service updates dashboards, the recommendation service initializes preferences, and the fraud service runs checks—all independently, in parallel, without blocking each other. If a new consumer needs to react to registrations, it simply subscribes; no changes to the producer required.

This pattern enables temporal decoupling (producer and consumer don’t need to be online simultaneously) and logical decoupling (they don’t know about each other’s existence). The message broker handles delivery guarantees, retries, and scaling. Services become autonomous units that communicate through a shared event stream rather than direct dependencies.
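The producer/broker/consumer relationship can be sketched with a toy in-memory broker. This is illustrative only: the class and topic names are invented, and delivery here is synchronous inline calls, where a real broker delivers asynchronously and durably.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for a message broker: routes events by topic to all subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fan out to every subscriber; the producer never learns who consumed it.
        # (A real broker would deliver asynchronously, not inline like this.)
        for handler in self._subscribers[topic]:
            handler(event)

broker = InMemoryBroker()
received = []
broker.subscribe("rides", lambda e: received.append(("notify", e["rideId"])))
broker.subscribe("rides", lambda e: received.append(("analytics", e["rideId"])))

# The ride service only knows the topic name, not who is listening.
broker.publish("rides", {"eventType": "RideAccepted", "rideId": "r-42"})
```

Adding a third consumer is one more `subscribe` call; the publishing code never changes, which is the logical decoupling described above.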

Event-Driven Architecture Core Pattern

graph LR
    Producer1["Ride Service<br/><i>Producer</i>"]
    Producer2["Payment Service<br/><i>Producer</i>"]
    Broker["Message Broker<br/><i>Kafka/RabbitMQ</i>"]
    Consumer1["Notification Service<br/><i>Consumer</i>"]
    Consumer2["Analytics Service<br/><i>Consumer</i>"]
    Consumer3["Fraud Detection<br/><i>Consumer</i>"]
    Consumer4["Email Service<br/><i>Consumer</i>"]
    
    Producer1 --"1. Publish 'RideAccepted' event"--> Broker
    Producer2 --"2. Publish 'PaymentSucceeded' event"--> Broker
    Broker --"3. Route events to subscribers"--> Consumer1
    Broker --"4. Parallel delivery"--> Consumer2
    Broker --"5. Independent processing"--> Consumer3
    Broker --"6. No blocking"--> Consumer4

Producers emit events to a message broker without knowing which consumers exist. The broker routes events to all subscribers in parallel, enabling temporal and logical decoupling. New consumers can subscribe without modifying producers.

How It Works

Let’s walk through how LinkedIn’s notification system uses event-driven architecture to handle millions of user interactions.

Step 1: Event Production When someone likes your post, LinkedIn’s engagement service doesn’t directly call the notification service. Instead, it publishes a “PostLiked” event to Kafka:

{
  "eventType": "PostLiked",
  "eventId": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2024-01-15T10:30:00Z",
  "postId": "12345",
  "postAuthorId": "alice",
  "likerId": "bob",
  "likerProfile": {"name": "Bob Smith", "headline": "Engineer at Google"}
}

The engagement service writes this to Kafka and immediately returns success to Bob’s client. It doesn’t wait for notifications to be sent.

Step 2: Message Broker Routing Kafka stores the event in the “engagement-events” topic. Multiple consumer groups subscribe to this topic:

  • Notification service (to alert Alice)
  • Analytics service (to track engagement metrics)
  • Recommendation service (to update Bob’s interest graph)
  • Search indexing service (to boost the post’s ranking)

Each consumer group gets its own copy of every event. Kafka tracks each group’s offset (position in the event stream) independently.
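Independent offsets can be sketched with an append-only list and a per-group cursor. This is a simplification of Kafka's model (no partitions, no commit protocol), with invented names:

```python
class EventLog:
    """Minimal append-only topic with per-consumer-group offsets (Kafka-style, simplified)."""
    def __init__(self):
        self.events = []
        self.offsets = {}  # group name -> index of the next unread event

    def append(self, event):
        self.events.append(event)

    def poll(self, group):
        """Return this group's unread events and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

log = EventLog()
log.append({"eventType": "PostLiked", "postId": "12345"})

fast = log.poll("analytics")       # analytics reads right away: sees one event
log.append({"eventType": "PostLiked", "postId": "67890"})
slow = log.poll("notifications")   # a later reader still sees every event
```

Because each group tracks its own position, a slow consumer falling behind never affects what a fast consumer sees.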

Step 3: Parallel Consumption All four services process the event simultaneously:

  • Notification service checks Alice’s preferences, determines she wants mobile push notifications, and queues a push job. Processing time: 50ms.
  • Analytics service increments counters in Redis and updates real-time dashboards. Processing time: 20ms.
  • Recommendation service updates Bob’s engagement graph in a graph database. Processing time: 200ms.
  • Search service re-indexes the post with higher engagement score. Processing time: 500ms.

If the search service is slow or crashes, it doesn’t affect the other three. Each service processes at its own pace.

Step 4: Idempotent Processing The notification service receives the event and checks its deduplication cache: “Have I already processed event 550e8400…?” This handles duplicate deliveries (Kafka’s default guarantee is at-least-once delivery, not exactly-once). If it’s a duplicate, discard it. If new, record the event ID and process it.
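The dedup check can be sketched in a few lines. A plain `set` stands in for what would be Redis or a database table in production; the function and field names are illustrative:

```python
processed_ids = set()  # stand-in for a Redis set or DB table with a TTL

def handle_event(event):
    """At-least-once-safe handler: record the event ID, skip anything already seen."""
    if event["eventId"] in processed_ids:
        return "duplicate-skipped"
    processed_ids.add(event["eventId"])
    # ... queue the push notification here ...
    return "processed"

evt = {"eventId": "550e8400-e29b-41d4-a716-446655440000", "eventType": "PostLiked"}
first = handle_event(evt)   # processed normally
second = handle_event(evt)  # redelivery of the same event is harmless
```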

Step 5: Failure Handling If the recommendation service crashes while processing, Kafka doesn’t mark the event as consumed. When the service restarts, it resumes from its last committed offset and reprocesses the event. The service’s idempotency logic ensures reprocessing doesn’t create duplicate graph edges.

Step 6: Dead Letter Queue If an event consistently fails processing (malformed data, business logic error), after N retries, the consumer moves it to a dead letter queue for manual investigation. This prevents one bad event from blocking the entire stream.

This flow repeats millions of times per second. New consumers can subscribe without touching the engagement service. Services can be deployed independently. Slow consumers don’t block fast ones.

LinkedIn Post Like Event Flow with Parallel Processing

sequenceDiagram
    participant User as Bob's Client
    participant Engagement as Engagement Service
    participant Kafka as Kafka Broker
    participant Notify as Notification Service
    participant Analytics as Analytics Service
    participant Recommend as Recommendation Service
    participant Search as Search Indexing
    
    User->>Engagement: 1. POST /like {postId: 12345}
    Engagement->>Kafka: 2. Publish PostLiked event
    Note over Engagement,Kafka: Event: {postId: 12345, likerId: bob, timestamp: ...}
    Engagement->>User: 3. 200 OK (immediate response)
    
    par Parallel Consumption
        Kafka->>Notify: 4a. Deliver event (50ms)
        Notify->>Notify: Check dedup cache
        Notify->>Notify: Queue push notification
    and
        Kafka->>Analytics: 4b. Deliver event (20ms)
        Analytics->>Analytics: Increment Redis counters
    and
        Kafka->>Recommend: 4c. Deliver event (200ms)
        Recommend->>Recommend: Update engagement graph
    and
        Kafka->>Search: 4d. Deliver event (500ms)
        Search->>Search: Re-index post with boost
    end
    
    Note over Notify,Search: All consumers process independently<br/>Slow consumers don't block fast ones

When Bob likes Alice’s post, the engagement service publishes an event and returns immediately. Four consumers process the event in parallel at their own pace. The 500ms search indexing doesn’t slow down the 20ms analytics update.

Event Processing with Idempotency and Failure Handling

stateDiagram-v2
    [*] --> EventReceived: Kafka delivers event
    
    EventReceived --> CheckDedup: Check deduplication cache
    
    CheckDedup --> Discard: Event ID exists (duplicate)
    CheckDedup --> Process: Event ID not found (new)
    
    Process --> RecordEventID: Store event ID in cache
    RecordEventID --> BusinessLogic: Execute business logic
    
    BusinessLogic --> Success: Processing succeeds
    BusinessLogic --> TransientFailure: Network error, timeout
    BusinessLogic --> PermanentFailure: Invalid data, business rule violation
    
    Success --> CommitOffset: Mark event as consumed
    CommitOffset --> [*]
    
    TransientFailure --> RetryCount: Check retry attempts
    RetryCount --> Retry: Attempts < 5
    RetryCount --> DeadLetter: Attempts >= 5
    
    Retry --> EventReceived: Exponential backoff
    
    PermanentFailure --> DeadLetter: Move to DLQ immediately
    DeadLetter --> Alert: Trigger monitoring alert
    Alert --> [*]
    
    Discard --> [*]: Skip duplicate

Event consumers must implement idempotency by checking if an event ID has been processed before. Transient failures trigger retries with exponential backoff. After 5 attempts or on permanent failures, events move to a dead letter queue for manual investigation.
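The state machine above can be sketched as a consumer function. This is a minimal illustration, not production code: the backoff sleep is a placeholder, and the exception classes are invented to stand for the two failure modes.

```python
import time

class TransientError(Exception):
    """Network error or timeout: worth retrying."""

class PermanentError(Exception):
    """Malformed data or business-rule violation: retrying won't help."""

MAX_ATTEMPTS = 5
seen_ids = set()
dead_letter_queue = []

def consume(event, business_logic):
    if event["eventId"] in seen_ids:
        return "discarded"               # duplicate delivery, skip
    for attempt in range(MAX_ATTEMPTS):
        try:
            business_logic(event)
            seen_ids.add(event["eventId"])
            return "committed"           # offset would be committed here
        except TransientError:
            time.sleep(0)                # placeholder; real consumers back off ~2**attempt seconds
        except PermanentError:
            break                        # go straight to the DLQ
    dead_letter_queue.append(event)
    return "dead-lettered"

def always_timing_out(event):
    raise TransientError("timeout")

result_bad = consume({"eventId": "evt-1"}, always_timing_out)   # exhausts retries
result_ok = consume({"eventId": "evt-2"}, lambda e: None)       # succeeds
result_dup = consume({"eventId": "evt-2"}, lambda e: None)      # duplicate, skipped
```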

Saga Pattern for Distributed Transactions

Distributed transactions are event-driven architecture’s hardest problem. When a business workflow spans multiple services—like booking a trip (reserve flight, reserve hotel, charge payment)—you can’t use a traditional database transaction. The saga pattern solves this with a sequence of local transactions coordinated by events.

Choreography-Based Sagas

In choreography, each service knows what event to publish next, creating a chain reaction. When Uber processes a ride payment:

  1. Payment Service charges the card and publishes “PaymentSucceeded” event
  2. Wallet Service listens for “PaymentSucceeded”, credits the driver, publishes “DriverCredited”
  3. Notification Service listens for “DriverCredited”, sends receipt to rider, publishes “ReceiptSent”
  4. Analytics Service listens for “ReceiptSent”, records completed transaction

Each service is autonomous. There’s no central coordinator. The workflow emerges from services reacting to each other’s events.

Compensation Logic: If the wallet service fails to credit the driver, it publishes “DriverCreditFailed”. The payment service listens for this and publishes “RefundInitiated”, triggering a compensating transaction that reverses the charge.

Pros: No single point of failure, services are loosely coupled, easy to add new steps.

Cons: Workflow logic is scattered across services, hard to understand the full flow, difficult to debug when things go wrong, circular dependencies can emerge.
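The chain reaction (and its compensation path) can be sketched with handlers that react to events by emitting the next one. Names and the in-process dispatch are illustrative; in reality each handler lives in a different service behind a broker:

```python
from collections import defaultdict

handlers = defaultdict(list)
audit = []  # ordered record of every event emitted

def on(event_type):
    """Register a handler for an event type (stand-in for a broker subscription)."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type, payload):
    audit.append(event_type)
    for fn in handlers[event_type]:
        fn(payload)

@on("PaymentSucceeded")
def credit_driver(p):          # Wallet Service
    if p.get("wallet_down"):
        emit("DriverCreditFailed", p)   # the happy-path chain breaks here...
    else:
        emit("DriverCredited", p)

@on("DriverCreditFailed")
def refund(p):                 # Payment Service reacts with a compensating transaction
    emit("RefundInitiated", p)

@on("DriverCredited")
def send_receipt(p):           # Notification Service
    emit("ReceiptSent", p)

emit("PaymentSucceeded", {"wallet_down": True})
```

Notice there is no coordinator object anywhere: the workflow, including the refund, emerges purely from which service listens to which event.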

Orchestration-Based Sagas

In orchestration, a central saga orchestrator explicitly tells each service what to do. Uber’s trip booking orchestrator:

  1. Orchestrator sends “ReserveFlight” command to Flight Service
  2. Flight Service reserves seat, returns success
  3. Orchestrator sends “ReserveHotel” command to Hotel Service
  4. Hotel Service reserves room, returns success
  5. Orchestrator sends “ChargePayment” command to Payment Service
  6. Payment succeeds, orchestrator marks saga complete

If step 5 fails, the orchestrator explicitly sends “CancelHotel” and “CancelFlight” commands to undo previous steps.

Compensation Logic: The orchestrator maintains a compensation table: for each completed step, what’s the undo operation? When a step fails, it walks backward through completed steps, executing compensations in reverse order.
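The walk-backward behavior can be sketched by pairing each step with its undo operation. This is a toy orchestrator with invented names; a real one would also persist saga state so it survives crashes:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return "compensated"
        completed.append(compensation)
    return "completed"

log = []

def reserve_flight(): log.append("flight-reserved")
def cancel_flight():  log.append("flight-cancelled")
def reserve_hotel():  log.append("hotel-reserved")
def cancel_hotel():   log.append("hotel-cancelled")
def charge_payment(): raise RuntimeError("card declined")

result = run_saga([
    (reserve_flight, cancel_flight),
    (reserve_hotel,  cancel_hotel),
    (charge_payment, lambda: None),
])
# When the payment fails, the hotel is cancelled first, then the flight —
# compensations run in reverse order of the completed steps.
```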

Pros: Workflow is explicit in one place, easier to understand and debug, can implement complex compensation logic, can add timeouts and retries centrally.

Cons: Orchestrator is a single point of failure, services become more coupled to the orchestrator, orchestrator can become a bottleneck.

Choosing Between Them

Use choreography when:

  • Workflow is simple (3-4 steps)
  • Services are owned by different teams who want autonomy
  • You need maximum resilience (no central coordinator to fail)
  • Example: Social media notifications (post created → notify followers → update feed)

Use orchestration when:

  • Workflow is complex (5+ steps with conditional logic)
  • You need visibility into workflow state
  • Compensation logic is intricate
  • Example: E-commerce checkout (validate cart → reserve inventory → charge payment → create shipment → send confirmation)

Failure Handling Deep Dive

Sagas must handle three failure modes:

  1. Service Unavailable: Retry with exponential backoff. If the hotel service is down, the orchestrator retries for 5 minutes before compensating.

  2. Business Logic Failure: Immediate compensation. If payment is declined, don’t retry—immediately cancel flight and hotel.

  3. Partial Failure: The payment succeeded but the orchestrator crashed before recording it. On restart, the orchestrator must query the payment service to determine actual state before deciding whether to proceed or compensate.

Twitter’s tweet creation saga handles partial failures by making every step idempotent with unique saga IDs. If the orchestrator crashes and restarts, it can safely retry “AddToTimeline” because the timeline service checks: “Have I already added tweet X for saga Y?”

Saga Pattern: Choreography vs Orchestration

graph TB
    subgraph Choreography["Choreography-Based Saga (Autonomous Services)"]
        direction LR
        C1["Payment Service"]
        C2["Wallet Service"]
        C3["Notification Service"]
        C4["Analytics Service"]
        
        C1 --"1. PaymentSucceeded event"--> C2
        C2 --"2. DriverCredited event"--> C3
        C3 --"3. ReceiptSent event"--> C4
        C2 -."On failure: DriverCreditFailed".-> C1
        C1 -."Compensation: RefundInitiated".-> C1
    end
    
    subgraph Orchestration["Orchestration-Based Saga (Central Coordinator)"]
        direction TB
        O["Saga Orchestrator<br/><i>Workflow Engine</i>"]
        O1["Flight Service"]
        O2["Hotel Service"]
        O3["Payment Service"]
        
        O --"1. ReserveFlight command"--> O1
        O1 --"Success"--> O
        O --"2. ReserveHotel command"--> O2
        O2 --"Success"--> O
        O --"3. ChargePayment command"--> O3
        O3 -."Failure".-> O
        O -."4. CancelHotel".-> O2
        O -."5. CancelFlight".-> O1
    end

Choreography distributes workflow logic across services that react to each other’s events—no central coordinator but harder to debug. Orchestration uses a central coordinator that explicitly commands each step—easier to understand but creates a single point of failure.

Variants

1. Pub-Sub (Publish-Subscribe)

The foundational pattern. Producers publish events to topics; consumers subscribe to topics. One event reaches multiple consumers. Message brokers like Kafka, RabbitMQ, and Google Pub/Sub implement this.

When to use: Broadcasting state changes to multiple interested parties. Slack uses pub-sub for workspace events—when a message is posted, it notifies the web client, mobile apps, search indexer, and analytics pipeline simultaneously.

Pros: Simple mental model, easy to add new consumers, natural fan-out.

Cons: No built-in ordering across topics, consumers must handle duplicates, debugging requires tracing events across multiple services.

2. Event Sourcing

Instead of storing current state, store the sequence of events that led to that state. The event log becomes the source of truth. Current state is derived by replaying events.

Stripe uses event sourcing for payment processing. Instead of storing “account balance: $1,000”, they store: “AccountCreated”, “PaymentReceived($500)”, “PaymentReceived($600)”, “RefundIssued($100)”. Balance is computed by replaying events: 0 + 500 + 600 - 100 = $1,000.

When to use: When you need complete audit trails, time-travel debugging, or the ability to rebuild state from scratch. Financial systems, compliance-heavy domains, systems where “how did we get here?” matters as much as “where are we?”

Pros: Perfect audit log, can rebuild state at any point in time, supports temporal queries (“what was the balance on March 1st?”), enables event replay for debugging.

Cons: Storage grows indefinitely, querying current state requires replaying events (slow without snapshots), schema evolution is hard (old events must remain compatible), eventual consistency between event store and projections.

3. CQRS (Command Query Responsibility Segregation)

Separate the write model (commands that change state) from the read model (queries that fetch state). Commands publish events; read models subscribe and build optimized views.

LinkedIn’s profile system uses CQRS. When you update your headline, the write model validates and stores the change, publishing a “ProfileUpdated” event. Multiple read models subscribe: one builds a document for search (Elasticsearch), one builds a graph for recommendations (Neo4j), one builds a cache for profile views (Redis). Each read model is optimized for its query pattern.
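The shape of this can be sketched with plain dicts standing in for the real stores (PostgreSQL, Elasticsearch, Redis) and an in-process callback list standing in for Kafka. All names are illustrative:

```python
write_store = {}     # stand-in for the PostgreSQL write model
search_index = {}    # stand-in for Elasticsearch
profile_cache = {}   # stand-in for Redis
read_models = []     # stand-in for Kafka subscriptions

def subscribe(fn):
    read_models.append(fn)
    return fn

def update_headline(user_id, headline):
    """Command handler (write side): validate, persist, then publish the event."""
    write_store[user_id] = headline
    event = {"eventType": "ProfileUpdated", "userId": user_id, "headline": headline}
    for fn in read_models:   # a real system publishes to the event bus here
        fn(event)

@subscribe
def project_search(e):
    # Each read model reshapes the same event for its own query pattern.
    search_index[e["userId"]] = e["headline"].lower().split()

@subscribe
def project_cache(e):
    profile_cache[e["userId"]] = {"headline": e["headline"]}

update_headline("alice", "Staff Engineer")
```

One command produced two differently shaped projections; queries hit whichever view matches their access pattern, never the write model.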

When to use: When read and write patterns differ dramatically. High read-to-write ratios (1000:1), complex queries that don’t map to the write model, need for multiple specialized views of the same data.

Pros: Read and write models can scale independently, each read model optimized for its queries, can use different databases for reads vs writes.

Cons: Increased complexity (two models instead of one), eventual consistency between write and read models, data duplication across read models, harder to reason about system state.

4. Event Streaming

Treat events as an infinite, append-only log that consumers can replay from any point. Kafka is the canonical example. Unlike traditional message queues (which delete messages after consumption), event streams retain events for days or weeks.

Uber’s surge pricing system uses event streaming. Location events from millions of drivers and riders flow into Kafka. The pricing service consumes the stream in real-time to calculate demand. The analytics team replays the same stream from 6 hours ago to debug a pricing anomaly.

When to use: Real-time analytics, stream processing, when multiple consumers need to process events at different speeds, when you need to replay history.

Pros: Consumers can rewind and replay, supports both real-time and batch processing, natural fit for stream processing frameworks (Kafka Streams, Flink).

Cons: Storage costs for retaining events, complexity of managing consumer offsets, harder to implement exactly-once semantics.

CQRS Architecture with Event-Driven Read Models

graph LR
    Client["Client Application"]
    WriteAPI["Write API<br/><i>Command Handler</i>"]
    WriteDB["Write Model<br/><i>PostgreSQL</i>"]
    EventBus["Event Bus<br/><i>Kafka</i>"]
    
    subgraph Read Models
        SearchRead["Search Read Model<br/><i>Elasticsearch</i>"]
        CacheRead["Cache Read Model<br/><i>Redis</i>"]
        GraphRead["Graph Read Model<br/><i>Neo4j</i>"]
    end
    
    ReadAPI["Read API<br/><i>Query Handler</i>"]
    
    Client --"1. POST /profile (write)"--> WriteAPI
    WriteAPI --"2. Validate & persist"--> WriteDB
    WriteAPI --"3. Publish ProfileUpdated"--> EventBus
    
    EventBus --"4a. Subscribe & update"--> SearchRead
    EventBus --"4b. Subscribe & update"--> CacheRead
    EventBus --"4c. Subscribe & update"--> GraphRead
    
    Client --"5. GET /profile (read)"--> ReadAPI
    ReadAPI --"6. Query optimized view"--> CacheRead
    CacheRead --"7. Return cached data"--> ReadAPI
    ReadAPI --"8. Response"--> Client

CQRS separates write operations (commands) from read operations (queries). LinkedIn’s profile system writes to PostgreSQL and publishes events. Multiple read models subscribe and build specialized views optimized for different query patterns—search, caching, and graph traversal.

Trade-offs

Coupling vs. Complexity

Synchronous RPC: Services are tightly coupled but the system is simple. You can trace a request through the call stack. When something breaks, stack traces show exactly what failed.

Event-Driven: Services are loosely coupled but the system is complex. A single user action might trigger 10 events across 8 services. When something breaks, you need distributed tracing to understand what happened.

Decision criteria: Choose events when you need to add consumers frequently without modifying producers, or when services are owned by different teams. Choose RPC when the workflow is simple and unlikely to change.

Consistency vs. Availability

Synchronous: Strong consistency. When the API returns success, all downstream effects have completed. If any step fails, the entire operation fails atomically.

Event-Driven: Eventual consistency. The API returns success immediately, but downstream effects happen asynchronously. A user might see “Order Placed” before inventory is actually reserved.

Decision criteria: Choose events when you can tolerate seconds of inconsistency and need high availability. Choose synchronous when consistency is critical (financial transactions, inventory reservation).

Performance vs. Debuggability

Synchronous: Slower (must wait for all downstream calls) but easier to debug. Logs and traces are linear.

Event-Driven: Faster (fire-and-forget) but harder to debug. Events flow through multiple services asynchronously. Correlating logs requires event IDs and distributed tracing.

Decision criteria: Choose events when latency matters and you can invest in observability infrastructure. Choose synchronous when debugging simplicity is more important than milliseconds.

Ordering Guarantees vs. Throughput

Strict Ordering: Kafka partitions guarantee order within a partition, but this limits parallelism. If all events for user X must be ordered, they go to the same partition, processed by one consumer.

No Ordering: Events can be processed in parallel across partitions, maximizing throughput, but you might process “ItemRemoved” before “ItemAdded”.

Decision criteria: Partition by entity ID (user ID, order ID) when order matters for that entity. Use timestamps and idempotency when order doesn’t matter. Twitter partitions tweets by user ID—your tweets are ordered, but global timeline order is eventual.
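Partitioning by entity ID is just a deterministic hash of the key. A sketch of the idea (the modulo-hash scheme is illustrative; real Kafka clients use their own partitioner):

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key):
    """Deterministically map an entity key to a partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event keyed by the same user lands in the same partition, so that
# user's events stay ordered; different users spread across partitions
# for parallelism.
p1 = partition_for("user-123")
p2 = partition_for("user-123")
```

Note the caveat this implies: changing `NUM_PARTITIONS` remaps keys to different partitions, which is why resizing a keyed topic can break per-entity ordering during the transition.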

Operational Overhead

Event-driven systems require:

  • Message broker infrastructure (Kafka clusters, monitoring)
  • Schema registry for event definitions
  • Distributed tracing (Jaeger, Zipkin)
  • Dead letter queue monitoring
  • Consumer lag alerting
  • Event replay tools

This is significant operational complexity. Small teams or simple systems may not justify this overhead.

When to Use (and When Not To)

Use Event-Driven Architecture When:

  1. Multiple services need to react to the same state change. If three services need to know about new user signups, events are cleaner than having the signup service call three APIs.

  2. Services are owned by different teams. Events create boundaries. The payments team can add fraud detection by subscribing to “PaymentInitiated” without asking the checkout team to modify their code.

  3. You need audit trails. Events naturally create a log of everything that happened. Financial systems, healthcare, compliance-heavy domains benefit from this.

  4. Workflows span multiple services. Booking a trip involves flights, hotels, payments, notifications. Coordinating this with synchronous calls creates a distributed monolith. Sagas with events keep services autonomous.

  5. You need to scale consumers independently. If notification sending is slow, scale up notification consumers without touching other services.

  6. Real-time data processing. Streaming analytics, fraud detection, recommendation systems that need to react to events as they happen.

Avoid Event-Driven Architecture When:

  1. You need immediate consistency. If a user must see updated inventory the moment they add to cart, synchronous calls are simpler.

  2. The workflow is simple and linear. If Service A always calls Service B, and that’s it, RPC is fine. Don’t add event complexity for a two-step workflow.

  3. You have a small team. Operating Kafka, implementing distributed tracing, and debugging async flows requires expertise and tooling. A three-person startup should use synchronous calls and a monolith.

  4. Latency requirements are strict. Events add latency (message broker round-trip, consumer processing time). If you need sub-10ms responses, synchronous in-process calls are better.

  5. The domain is CRUD-heavy with no complex workflows. A simple admin panel that creates/reads/updates/deletes records doesn’t benefit from events.

Anti-Patterns:

  • Event-driven monolith: Publishing events within a single service instead of using function calls. This adds complexity without decoupling benefits.
  • Request-response over events: Using events like RPC (“GetUserRequest” event, waiting for “GetUserResponse” event). This is synchronous communication disguised as events—use actual RPC.
  • Event chains without orchestration: Choreography sagas with 10+ steps become impossible to understand. Use orchestration for complex workflows.
  • No idempotency: Assuming events are delivered exactly once. They’re not. Every consumer must handle duplicates.

Real-World Examples

Uber: Real-Time Dispatch

Uber’s dispatch system processes 100+ million events per day using Kafka. When a rider requests a ride, the mobile app publishes a “RideRequested” event. This triggers a cascade: the matching service finds nearby drivers and publishes “MatchFound”, the pricing service calculates the fare and publishes “FareCalculated”, the notification service alerts the driver and publishes “DriverNotified”. Each service is autonomous, scaled independently (matching needs more resources during peak hours), and can be deployed without coordinating with others.

The interesting detail: Uber uses event sourcing for trip state. Instead of storing “trip status: in_progress”, they store every event (“TripRequested”, “DriverAssigned”, “RiderPickedUp”, “TripCompleted”). This enables precise analytics (“how long between assignment and pickup?”) and debugging (“replay this trip’s events to see what went wrong”).

LinkedIn: Activity Streams

LinkedIn’s feed system uses Kafka to handle 4+ trillion messages per day. When you post an update, the post service publishes a “PostCreated” event. Dozens of consumers react: the feed service adds it to your followers’ timelines, the search service indexes it, the notification service alerts people you mentioned, the spam service checks for violations, the analytics service tracks engagement, the recommendation service updates your interest graph. The system uses CQRS—the write model (creating posts) is separate from read models (various feed views optimized for different clients: web, mobile, API).

The interesting detail: LinkedIn built Brooklin, a change data capture system that publishes database changes as events, ensuring the event stream and database stay consistent even if the application crashes between writing to the database and publishing the event.

Slack: Workspace Events

Slack’s real-time messaging uses event-driven architecture with WebSockets for delivery. When you send a message, the message service publishes a “MessageSent” event to a Kafka topic partitioned by workspace ID (ensuring all messages in a workspace are ordered). Multiple consumers process this: the storage service persists to the database, the search service indexes for search, the notification service sends push notifications to offline users, the analytics service tracks usage, the bot platform triggers automated responses.

The interesting detail: Slack uses choreography sagas for message delivery. The message service doesn’t know about all consumers—it just publishes the event. New features (like message translation) subscribe to “MessageSent” without modifying the core message service. This enabled Slack to add dozens of integrations (Giphy, Zoom, Google Drive) without touching the message pipeline.


Interview Essentials

Mid-Level

Explain the difference between pub-sub and message queues. (Pub-sub: one message, many consumers. Queue: one message, one consumer.)

How do you ensure events are processed exactly once? (Trick question—you can’t. Implement idempotent consumers that handle duplicates gracefully.)

What’s the difference between a topic and a partition in Kafka? (Topic: logical category. Partition: physical segment for parallelism and ordering.)

How do you handle failed event processing? (Retry with exponential backoff, then move to dead letter queue after N attempts.)

Why use events instead of direct service calls? (Decoupling: producer doesn’t know consumers exist, can add consumers without changing producer, temporal decoupling.)

Senior

Design an event-driven order processing system. Walk through the events, consumers, and failure handling. (Expect: OrderPlaced → InventoryReserved → PaymentCharged → ShipmentCreated, with compensation for failures.)

How do you maintain ordering guarantees in a distributed event system? (Partition by entity ID, single consumer per partition for that entity, use sequence numbers to detect out-of-order delivery.)

Compare choreography vs orchestration for sagas. When would you use each? (Choreography for simple workflows with autonomous services, orchestration for complex workflows needing central visibility.)

How do you handle schema evolution in event-driven systems? (Forward/backward compatibility, schema registry, version events, use optional fields, never remove required fields.)

What are the consistency implications of event-driven architecture? (Eventual consistency: consumers process asynchronously, different services see different states temporarily, need to design UX around this.)

Staff+

You’re seeing 10-second lag in event processing during peak traffic. How do you diagnose and fix this? (Check consumer lag metrics, identify slow consumers, scale consumer groups, optimize processing logic, consider partitioning strategy, check for hot partitions.)

Design a multi-region event-driven system with cross-region replication. How do you handle conflicts? (Active-active with conflict resolution, active-passive with failover, or event sourcing with deterministic replay. Discuss CAP theorem trade-offs.)

How would you migrate a synchronous monolith to event-driven microservices without downtime? (Strangler pattern: publish events from monolith, build new services as consumers, gradually move logic from monolith to services, use feature flags for cutover.)

Explain how you’d implement distributed tracing across an event-driven system with 50+ services. (Propagate trace context in event headers, use correlation IDs, implement OpenTelemetry, visualize with Jaeger, discuss sampling strategies for high-volume systems.)

What are the failure modes of Kafka, and how do you design around them? (Broker failures: replication. Partition leader election: client retries. Consumer failures: consumer group rebalancing. Network partitions: acks configuration. Discuss trade-offs between throughput and durability.)

Common Interview Questions

How do you debug an event-driven system when something goes wrong? (Distributed tracing with correlation IDs, centralized logging, event replay for reproduction, dead letter queue analysis.)

What’s the difference between event sourcing and event-driven architecture? (Event sourcing: events are the source of truth, state is derived. Event-driven: events trigger actions, state is stored separately.)

How do you handle duplicate events? (Idempotent consumers: check if event already processed using event ID, use database constraints, design operations to be naturally idempotent.)

When would you choose Kafka over RabbitMQ? (Kafka: high throughput, event streaming, replay, multiple consumers. RabbitMQ: complex routing, lower latency, traditional message queue patterns.)

Red Flags to Avoid

Claiming events are always better than synchronous calls (shows lack of understanding of trade-offs)

Not mentioning idempotency when discussing event processing (critical for correctness)

Designing choreography sagas with 10+ steps (shows poor judgment—use orchestration)

Ignoring eventual consistency implications (must understand how this affects UX and business logic)

Not considering operational complexity (message broker management, monitoring, debugging tools)


Key Takeaways

Event-driven architecture decouples services through asynchronous events: Producers publish events without knowing consumers exist. This enables independent scaling, deployment, and evolution, but introduces eventual consistency and debugging complexity.

Choose the right pattern for your workflow: Pub-sub for broadcasting, event sourcing for audit trails, CQRS for read-heavy systems, choreography sagas for simple workflows, orchestration sagas for complex ones. Each has distinct trade-offs.

Idempotency is non-negotiable: Message brokers guarantee at-least-once delivery, not exactly-once. Every consumer must handle duplicate events gracefully using event IDs, database constraints, or naturally idempotent operations.

Ordering requires partitioning strategy: Kafka guarantees order within a partition. Partition by entity ID (user, order) when order matters for that entity. Accept eventual ordering when global order isn’t critical.

Operational complexity is significant: Event-driven systems require message broker infrastructure, schema registries, distributed tracing, dead letter queues, and consumer lag monitoring. Don’t adopt this pattern unless you can support the operational overhead.