Returning Results in Distributed Systems: Sync vs Async
After this topic, you will be able to:
- Compare polling, webhooks, long-polling, and WebSockets for async result delivery and select the appropriate approach
- Implement result retrieval patterns for background jobs with different latency and reliability requirements
- Demonstrate how to handle result expiration, retries, and failure scenarios in async processing
TL;DR
Background jobs execute asynchronously, creating a fundamental challenge: how does the client get the result? This topic covers four core patterns—polling, webhooks, long-polling, and WebSockets—each with different trade-offs between latency, complexity, and resource usage. The choice depends on your latency requirements, client capabilities, and scale. Cheat sheet: Polling (simple, wasteful), Webhooks (efficient, requires client endpoint), Long-polling (balanced, connection overhead), WebSockets (real-time, complex infrastructure).
The Problem It Solves
When you submit a background job—whether it’s processing a video, generating a report, or running a machine learning model—the job executes in a separate process or even a different server. The client that initiated the job needs to know when it completes and retrieve the result, but the asynchronous nature creates a coordination problem. The client can’t simply wait for a synchronous response because the job might take seconds, minutes, or hours. You need a mechanism to bridge this gap between fire-and-forget execution and result delivery. Without a solution, clients either block indefinitely (wasting resources), poll aggressively (overwhelming servers), or never learn when jobs complete (poor user experience). The challenge intensifies at scale: GitHub processes millions of CI/CD jobs daily, each needing to notify the correct client when builds finish. Slack handles background message indexing and search operations that must update users’ interfaces when complete. The pattern you choose determines your system’s responsiveness, resource efficiency, and operational complexity.
Solution Overview
The solution space offers four primary patterns, each representing a different communication model between client and server. Polling has the client repeatedly ask “is it done yet?” at intervals, trading simplicity for wasted requests. Webhooks flip the model: the server calls a client-provided URL when the job completes, eliminating waste but requiring the client to expose an endpoint. Long-polling creates a middle ground where the client makes a request that the server holds open until results are ready, reducing request overhead while maintaining client-initiated communication. WebSockets establish a persistent bidirectional channel, enabling instant server-to-client notifications but requiring more complex infrastructure. Each pattern makes different assumptions about client capabilities, acceptable latency, and operational complexity. The key insight is that there’s no universal best choice—the right pattern depends on whether your clients are mobile apps (limited connectivity), web browsers (no stable endpoints), or backend services (full control). Understanding when each pattern fits transforms this from a theoretical exercise into a practical design decision that affects user experience and operational costs.
How It Works
Let’s walk through each pattern with concrete examples. Polling is the simplest: after submitting a job, the client stores the job ID and periodically sends GET requests to check status. A video transcoding service might have clients poll /jobs/{id} every 5 seconds. The server responds with {"status": "processing"} until the job completes, then returns {"status": "complete", "result_url": "https://..."}. The naive approach polls at fixed intervals, but exponential backoff improves efficiency: start at 1 second, then 2, 4, 8, up to a maximum like 60 seconds. This reduces load for long-running jobs while maintaining responsiveness for quick ones. GitHub’s CI/CD status checks use polling with backoff—your browser doesn’t hammer their API while waiting for test results.
Webhooks require the client to provide a callback URL when submitting the job: POST /jobs {"task": "transcode", "webhook_url": "https://client.com/callbacks"}. When the job completes, the server makes an HTTP POST to that URL with the result. Stripe uses webhooks extensively for payment processing—when a charge succeeds, they POST to your configured endpoint. The critical implementation detail is retry logic: if the webhook fails (client down, network error), the server must retry with exponential backoff, typically 3-5 attempts over several hours. Stripe retries failed webhooks for up to 3 days. Security matters: clients should verify webhook authenticity using HMAC signatures to prevent spoofing.
Long-polling works differently: the client makes a GET request to /jobs/{id}/result, but instead of responding immediately with “not ready,” the server holds the connection open. When the job completes (or a timeout like 30 seconds expires), the server responds. If the timeout fires first, the client immediately reconnects. This creates a “hanging GET” pattern that feels synchronous to the client but doesn’t waste requests. The server needs to track which connections are waiting for which jobs—typically using an in-memory pub/sub system like Redis. When a job completes, the server publishes to the job’s channel, waking all waiting connections.
WebSockets establish a full-duplex TCP connection that persists across multiple jobs. After the initial HTTP upgrade handshake, both sides can send messages anytime. The client subscribes to job updates over the socket: {"subscribe": "job:123"}. When the job completes, the server pushes: {"job_id": "123", "status": "complete", "result": {...}}. Slack uses WebSockets for real-time messaging—when someone mentions you, the notification arrives instantly without polling. The complexity comes from connection management: handling reconnects, distributing connections across servers, and routing messages to the right socket when you have millions of concurrent connections.
Polling with Exponential Backoff Flow
sequenceDiagram
participant C as Client
participant S as Server
participant Q as Job Queue
participant W as Worker
C->>S: 1. POST /jobs<br/>{task: "transcode"}
S->>Q: 2. Enqueue job
S-->>C: 3. {job_id: "123", status: "queued"}
Q->>W: 4. Dequeue job
W->>W: 5. Process job
Note over C,S: Poll attempt 1 (1s delay)
C->>S: 6. GET /jobs/123
S-->>C: 7. {status: "processing"}
Note over C,S: Poll attempt 2 (2s delay)
C->>S: 8. GET /jobs/123
S-->>C: 9. {status: "processing"}
Note over C,S: Poll attempt 3 (4s delay)
C->>S: 10. GET /jobs/123
S-->>C: 11. {status: "processing"}
W->>S: 12. Job complete
Note over C,S: Poll attempt 4 (8s delay)
C->>S: 13. GET /jobs/123
S-->>C: 14. {status: "complete",<br/>result_url: "https://..."}
Note over C: Stop polling,<br/>retrieve result
Exponential backoff reduces wasted requests by doubling the interval between polls (1s → 2s → 4s → 8s). The client makes fewer requests as the job runs longer, balancing responsiveness with efficiency.
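The client side of this flow fits in a few lines. A minimal sketch, assuming a `get_status` callable that wraps the GET /jobs/{id} request and returns the JSON body as a dict; the function name and field names mirror the example responses above and are illustrative:

```python
import time

def poll_with_backoff(get_status, base=1.0, cap=60.0, max_wait=3600.0):
    """Poll a job-status callable with exponential backoff until completion.

    `get_status` is any callable returning a dict such as
    {"status": "processing"} or {"status": "complete", ...}.
    """
    attempt = 0
    waited = 0.0
    while True:
        status = get_status()
        if status.get("status") == "complete":
            return status
        if waited >= max_wait:
            raise TimeoutError("job did not complete within max_wait")
        delay = min(base * (2 ** attempt), cap)  # 1s, 2s, 4s, 8s, ... capped
        time.sleep(delay)
        waited += delay
        attempt += 1
```

The interval doubles on every "processing" response and resets naturally for each new job, matching the backoff schedule in the diagram.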
Webhook Delivery with Retry Logic
graph LR
subgraph Client System
CE[Client Endpoint<br/>/webhooks/jobs]
end
subgraph Job Processing System
W[Worker] --> |Job complete| RC[Result Cache<br/>Redis]
RC --> WQ[Webhook Queue<br/>Kafka]
WQ --> WD[Webhook Delivery<br/>Service]
end
subgraph Retry Infrastructure
WD --> |Attempt 1<br/>Immediate| CE
CE -.->|5xx Error| RQ[Retry Queue]
RQ --> |Attempt 2<br/>+1 hour| WD
RQ --> |Attempt 3<br/>+3 hours| WD
RQ --> |Attempt 4<br/>+6 hours| WD
RQ --> |Failed 4x| DLQ[Dead Letter<br/>Queue]
end
subgraph Monitoring
WD --> ML[Metrics &<br/>Logs]
DLQ --> AL[Alert<br/>System]
end
CE -.->|200 OK| WD
WD --> |Success| ML
Webhook retry logic ensures delivery despite temporary failures. After 4 failed attempts with exponential backoff, webhooks move to a dead-letter queue for manual investigation, preventing infinite retry loops.
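The attempt/retry/dead-letter flow above reduces to a small driver. A sketch, assuming `send(attempt)` wraps the HTTP POST and returns True on a 2xx response; the delays mirror the diagram's illustrative schedule, not any specific vendor's policy:

```python
# Retry schedule mirroring the diagram: immediate, then +1h, +3h, +6h.
RETRY_DELAYS_S = (0, 1 * 3600, 3 * 3600, 6 * 3600)

def deliver_webhook(send, delays=RETRY_DELAYS_S):
    """Drive delivery attempts for one webhook.

    Returns ("delivered", attempt) on the first success, or
    ("dead_letter", attempts) after every scheduled attempt fails.
    A production worker would re-enqueue the message with `delay`
    (e.g. via a delayed queue) rather than looping in-process.
    """
    for attempt, delay in enumerate(delays, start=1):
        # `delay` is how long this attempt would wait in the retry queue.
        if send(attempt):
            return ("delivered", attempt)
    return ("dead_letter", len(delays))
```

Keeping the schedule as data makes it easy to tune per endpoint or per event type without touching the delivery loop.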
Long-Polling Connection Lifecycle
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant S1 as Server 1
participant PS as Redis Pub/Sub
participant W as Worker
C->>LB: 1. POST /jobs {task: "process"}
LB->>S1: Forward request
S1-->>C: 2. {job_id: "456"}
Note over C,S1: Long-poll request
C->>LB: 3. GET /jobs/456/result
LB->>S1: Forward request
S1->>PS: 4. SUBSCRIBE job:456
Note over S1: Hold connection open<br/>(30s timeout)
W->>W: 5. Process job
W->>PS: 6. PUBLISH job:456<br/>{status: "complete", result: {...}}
PS->>S1: 7. Message delivered
S1-->>C: 8. {status: "complete",<br/>result: {...}}
Note over C: Connection closed,<br/>result received
rect rgb(240, 240, 240)
Note over C,S1: Alternative: Timeout scenario
C->>S1: GET /jobs/789/result
S1->>PS: SUBSCRIBE job:789
Note over S1: Wait 30 seconds...
S1-->>C: 408 Timeout
Note over C: Immediately reconnect
C->>S1: GET /jobs/789/result
end
Long-polling holds the client request open until the job completes or a timeout expires. Redis pub/sub enables multi-server deployments by routing job completion events to the correct waiting connection.
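The "hold the connection open" step can be sketched for a single server with a per-job event standing in for the Redis subscription; the `long_poll`/`complete_job` names and response shapes are illustrative, not a fixed API:

```python
import threading

# One Event per job stands in for a Redis SUBSCRIBE: the long-poll handler
# waits on it, and the worker sets it when the job finishes.
_events: dict[str, threading.Event] = {}
_results: dict[str, dict] = {}

def long_poll(job_id: str, timeout: float = 30.0) -> dict:
    """Hold the 'request' open until the job completes or the timeout expires."""
    event = _events.setdefault(job_id, threading.Event())
    if event.wait(timeout):
        return {"status": "complete", "result": _results[job_id]}
    return {"status": "timeout"}  # client should immediately re-poll

def complete_job(job_id: str, result: dict) -> None:
    """Worker side: store the result and wake any waiting long-poll handlers."""
    _results[job_id] = result
    _events.setdefault(job_id, threading.Event()).set()
```

In a multi-server deployment the Event is replaced by a pub/sub subscription, since the worker and the server holding the connection are usually different processes.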
Pattern Selection: Latency vs Scale Trade-offs
graph TB
Start["Job Result<br/>Delivery Needed"] --> Latency{"Latency<br/>Requirement?"}
Latency -->|>10s acceptable| Client1{"Client Type?"}
Latency -->|1-10s acceptable| Client2{"Client Type?"}
Latency -->|<1s required| Client3{"Client Type?"}
Client1 -->|Any| Scale1{"Scale?"}
Scale1 -->|<100 jobs/sec| Poll1["✓ Simple Polling<br/>Fixed interval"]
Scale1 -->|>100 jobs/sec| Poll2["✓ Polling with<br/>Exponential Backoff"]
Client2 -->|Backend Service| WH["✓ Webhooks<br/>with Retry Queue"]
Client2 -->|Browser/Mobile| LP["✓ Long-Polling<br/>with Timeout"]
Client3 -->|Backend Service| WH2["✓ Webhooks<br/>(if <1s acceptable)"]
Client3 -->|Browser/Mobile| WS["✓ WebSockets<br/>with Subscription"]
WH2 -.->|Need <100ms| WS2["⚠ Consider WebSockets<br/>for true real-time"]
Decision tree for selecting the right pattern based on latency requirements, client type, and scale. Polling sits at the simple end, webhooks and long-polling in the middle, and WebSockets at the high-complexity end.
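One way to read the tree is as a lookup function. A sketch, assuming two coarse client classes ("backend" and "browser_mobile"); the thresholds are the diagram's illustrative ones, not hard rules:

```python
def choose_pattern(latency_s: float, client: str, jobs_per_sec: float) -> str:
    """Mirror the decision tree: latency tolerance first, then client type/scale."""
    if latency_s >= 10:                     # >10s acceptable: any client type
        return "polling_with_backoff" if jobs_per_sec > 100 else "simple_polling"
    if latency_s >= 1:                      # 1-10s acceptable
        return "webhooks" if client == "backend" else "long_polling"
    # <1s required
    return "webhooks" if client == "backend" else "websockets"
```

Encoding the tree this way also makes the boundaries easy to revisit when real latency or scale numbers come in.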
Webhook Security and Exactly-Once Delivery
graph TB
subgraph Job Completion
W[Worker] --> |Job done| Gen[Generate<br/>Idempotency Key<br/>job_id + attempt]
Gen --> Store[Store Attempt<br/>in DynamoDB<br/>TTL: 7 days]
end
subgraph Webhook Delivery
Store --> Sign[Generate HMAC<br/>Signature<br/>SHA256 + secret]
Sign --> Send["POST to Client<br/>Headers:<br/>X-Signature<br/>X-Idempotency-Key<br/>X-Timestamp"]
end
subgraph Client Validation
Send --> Verify{"Verify<br/>Signature?"}
Verify -->|Invalid| Reject[Return 401<br/>Unauthorized]
Verify -->|Valid| Time{"Timestamp<br/>within 5min?"}
Time -->|No| Reject2[Return 400<br/>Replay Attack]
Time -->|Yes| Dedup{"Idempotency<br/>Key seen?"}
Dedup -->|Yes| Ack[Return 200<br/>Already Processed]
Dedup -->|No| Process[Process Webhook<br/>Store Key]
Process --> Success[Return 200 OK]
end
subgraph Retry Logic
Reject -.->|Log failure| Retry[Retry Queue]
Reject2 -.->|Log failure| Retry
Retry --> |Attempt 2<br/>+1 hour| Send
Retry --> |Attempt 3<br/>+3 hours| Send
Retry --> |Failed 3x| DLQ[Dead Letter<br/>Queue]
end
Success --> Metrics[Update Success<br/>Metrics]
DLQ --> Alert[Alert Ops Team]
Production webhook systems require HMAC signature verification, timestamp validation to prevent replay attacks, and idempotency keys to ensure exactly-once processing despite retries. DynamoDB stores delivery attempts with TTL for automatic cleanup.
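The client-validation column of the diagram maps to a short handler. A sketch using Python's stdlib `hmac`; the signed-payload format (`timestamp.body`), the in-memory key store, and the return values are assumptions for illustration — real providers each define their own scheme:

```python
import hashlib
import hmac
import time

SEEN_KEYS: set[str] = set()  # stands in for the DynamoDB attempt store

def verify_webhook(secret: bytes, body: bytes, signature: str,
                   timestamp: float, idempotency_key: str,
                   max_skew: float = 300.0) -> str:
    """Validate a webhook: signature, then freshness, then deduplication."""
    expected = hmac.new(secret, f"{timestamp}.".encode() + body,
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):  # constant-time compare
        return "invalid_signature"   # respond 401
    if abs(time.time() - timestamp) > max_skew:
        return "stale_timestamp"     # respond 400, possible replay
    if idempotency_key in SEEN_KEYS:
        return "duplicate"           # respond 200, already processed
    SEEN_KEYS.add(idempotency_key)
    return "process"                 # respond 200 after handling
```

Signing the timestamp together with the body is what makes the freshness check meaningful: an attacker cannot re-sign an old payload with a new timestamp without the secret.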
Variants
name: Simple Polling with Fixed Interval
description: Client polls at constant intervals (e.g., every 5 seconds) regardless of job duration or status
when_to_use: Prototypes, internal tools, or systems with predictable job durations and low scale
pros: Trivial to implement, no server-side state, works with any client
cons: Wastes bandwidth and server resources, poor user experience for variable-duration jobs
name: Polling with Exponential Backoff
description: Client increases polling interval exponentially (1s, 2s, 4s, 8s…) up to a maximum, resetting on status changes
when_to_use: Production systems where job duration varies widely (seconds to hours) and you want to balance responsiveness with efficiency
pros: Reduces load for long jobs while staying responsive for quick ones, simple client logic
cons: Still generates unnecessary requests, maximum latency equals backoff ceiling
name: Webhooks with Retry Queue
description: Server attempts webhook delivery, queues failures for retry with exponential backoff, eventually moves to dead-letter queue
when_to_use: When clients can expose stable HTTP endpoints (backend services, not mobile apps) and you need guaranteed delivery
pros: Zero client polling, efficient resource usage, scales to millions of jobs
cons: Requires clients to implement and secure webhook endpoints, debugging delivery failures is complex
name: Long-Polling with Timeout
description: Server holds client request open until result is ready or timeout (typically 30-60s) expires, client immediately reconnects on timeout
when_to_use: Web applications where you want near-real-time updates but can’t use WebSockets (corporate firewalls, older browsers)
pros: Feels synchronous to client, much more efficient than polling, works over HTTP
cons: Holds server connections open, requires pub/sub infrastructure for multi-server deployments
name: WebSockets with Subscription Model
description: Persistent bidirectional connection where clients subscribe to job IDs and receive pushed updates when jobs complete
when_to_use: Real-time applications (chat, collaboration tools, live dashboards) where latency under 100ms matters and you control the client
pros: Instant delivery, bidirectional communication, single connection for multiple jobs
cons: Complex infrastructure (connection routing, load balancing), doesn’t work through all proxies, harder to debug
Pattern Selection Guide
Description
Choose your pattern based on four key dimensions: latency requirements, client type, scale, and reliability needs.
Decision Matrix
dimension: Latency Tolerance
polling: Acceptable for >10s latency (analytics dashboards, batch reports)
webhooks: Good for >1s latency (payment confirmations, CI/CD results)
long_polling: Good for >500ms latency (chat applications, live feeds)
websockets: Required for <100ms latency (collaborative editing, gaming, trading platforms)
dimension: Client Type
polling: Any client (mobile apps, browsers, CLIs, backend services)
webhooks: Backend services with stable endpoints only
long_polling: Browsers and mobile apps with persistent connections
websockets: Browsers and native apps where you control the client stack
dimension: Scale (jobs/second)
polling: <100 jobs/sec (internal tools, small SaaS)
webhooks: 100-10,000 jobs/sec (payment processors, CI/CD platforms)
long_polling: 100-1,000 jobs/sec (chat apps, notification systems)
websockets: 1,000-100,000+ jobs/sec (real-time collaboration, live sports scores)
dimension: Reliability Requirements
polling: Client responsible for retries, simple failure model
webhooks: Server guarantees delivery with retry queue, complex failure handling
long_polling: Client detects timeouts and reconnects, moderate complexity
websockets: Requires sophisticated reconnection logic and message deduplication
Hybrid Approaches
Many production systems combine patterns. GitHub uses webhooks for CI/CD notifications but falls back to polling for clients that can’t receive webhooks. Slack uses WebSockets for active users but sends push notifications (a webhook variant) to mobile devices. The key is matching the pattern to the client’s capabilities and the job’s latency requirements.
Trade-offs
dimension: Implementation Complexity
option_a: Polling: Client-side only, 20 lines of code, no server changes
option_b: WebSockets: Requires connection manager, message router, reconnection logic, 1000+ lines
decision_criteria: Choose polling for MVPs and internal tools. Choose WebSockets when latency justifies the engineering investment.
dimension: Resource Efficiency
option_a: Polling: Wastes bandwidth and CPU on empty responses, scales linearly with clients
option_b: Webhooks: Zero wasted requests, server only sends when results exist
decision_criteria: At 1000 jobs/day, polling costs are negligible. At 1M jobs/day, polling becomes expensive. Calculate: wasted requests ≈ jobs/day × (avg_job_duration ÷ poll_interval).
dimension: Latency
option_a: Polling: Average latency = poll_interval / 2 (5s interval = 2.5s average delay)
option_b: WebSockets: <100ms latency, limited by network RTT
decision_criteria: If users tolerate 5-10s delays (batch reports), polling is fine. If users expect instant feedback (chat), use WebSockets.
dimension: Client Requirements
option_a: Polling/Long-polling: Works everywhere, even behind corporate proxies
option_b: Webhooks: Requires client to expose public endpoint, handle retries, verify signatures
decision_criteria: Mobile apps and browsers can’t receive webhooks. Backend services should prefer webhooks for efficiency.
dimension: Debugging and Observability
option_a: Polling: Easy to debug with HTTP logs, clear request/response pairs
option_b: WebSockets: Requires specialized tools to inspect binary frames, harder to replay failures
decision_criteria: For systems where debugging production issues is frequent, polling’s simplicity has real value.
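The resource-efficiency criterion comes down to one ratio. A sketch of the overhead estimate, treating every poll that returns "still processing" as wasted; the function name and the simplification (fixed interval, no backoff) are illustrative:

```python
def wasted_polls(jobs_per_day: int, avg_duration_s: float,
                 poll_interval_s: float) -> int:
    """Approximate daily requests that return "still processing".

    Each job is polled roughly duration / interval times before the
    final successful check, so waste scales with that ratio.
    """
    return int(jobs_per_day * (avg_duration_s // poll_interval_s))
```

At 1000 five-minute jobs per day with 5-second polling that is about 60,000 mostly-empty requests; at 1M jobs per day the same workload produces around 60 million, which is where webhooks start paying for themselves.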
When to Use (and When Not To)
Use Polling When
Building an MVP or internal tool where engineering time is more expensive than server costs
Clients are mobile apps or browsers that can’t receive webhooks
Job completion times are predictable (use exponential backoff for variable durations)
Scale is low (<1000 jobs/day) and latency tolerance is high (>10 seconds acceptable)
Use Webhooks When
Clients are backend services that can expose stable HTTP endpoints
You need guaranteed delivery with retry semantics (payment confirmations, order processing)
Scale is high (>10,000 jobs/day) and you want to minimize server load
Jobs have variable durations (seconds to hours) and you want efficient resource usage
Use Long Polling When
You need near-real-time updates (<5s latency) but can’t use WebSockets (firewall restrictions)
Clients are web browsers and you want a simpler alternative to WebSockets
You have moderate scale (100-1000 jobs/sec) and can afford connection overhead
You want the feel of real-time without the complexity of WebSocket infrastructure
Use WebSockets When
Latency requirements are strict (<100ms) for real-time collaboration or live updates
You need bidirectional communication (server pushes updates, client sends commands)
Scale justifies the infrastructure investment (>1000 concurrent connections)
You control both client and server and can implement sophisticated reconnection logic
Anti-Patterns
Using WebSockets for infrequent updates (job runs once per hour) wastes persistent connections
Using polling with 1-second intervals for long-running jobs (hours) generates millions of wasted requests
Using webhooks for mobile apps (they can’t receive HTTP callbacks reliably)
Using long-polling without timeout limits (connections leak, servers run out of file descriptors)
Real-World Examples
company: GitHub
system: CI/CD Status Checks
how_they_use_it: GitHub Actions uses a hybrid approach: webhooks for repository owners who configure them, polling with exponential backoff for web UI clients. When a workflow completes, GitHub POSTs to configured webhook URLs with retry logic (3 attempts over 1 hour). The web interface polls the status API starting at 1-second intervals, backing off to 60 seconds for long-running workflows. This combination serves both backend integrations (webhooks) and human users (polling) efficiently.
interesting_detail: GitHub’s webhook retry logic includes a dead-letter queue: after 3 failed attempts, webhooks move to a manual retry interface where developers can inspect payloads and trigger redelivery. This prevents infinite retry loops while maintaining delivery guarantees.
company: Stripe
system: Payment Processing Webhooks
how_they_use_it: Stripe sends webhooks for every payment event (charge succeeded, refund processed, dispute created) to merchant-configured endpoints. They retry failed webhooks with exponential backoff: immediately, 1 hour, 3 hours, 6 hours, 12 hours, then daily for up to 3 days. Each webhook includes an HMAC signature in the Stripe-Signature header that merchants verify using their secret key. Stripe also provides a webhook testing CLI tool that forwards events to localhost for development.
interesting_detail: Stripe discovered that 30% of webhook failures were due to merchants’ servers being temporarily down during deployments. They added a webhook dashboard showing delivery history and a manual retry button, reducing support tickets by 40%.
company: Slack
system: Real-Time Messaging
how_they_use_it: Slack uses WebSockets for active desktop and web clients, maintaining persistent connections that receive message events, typing indicators, and presence updates with <100ms latency. For mobile apps, they use push notifications (a webhook variant where Apple/Google servers call the app) combined with polling when the app is foregrounded. When a user sends a message, Slack’s backend publishes to a Redis pub/sub channel, which routes to all WebSocket connections for users in that channel.
interesting_detail: Slack’s WebSocket infrastructure handles 10M+ concurrent connections by sharding connections across servers based on user ID. When a message arrives, they use consistent hashing to route it to the correct server, then push to all connected clients for that user. This architecture scales horizontally as they add servers.
Interview Essentials
Mid-Level
What You Should Know
Explain the four core patterns (polling, webhooks, long-polling, WebSockets) and when each fits
Implement exponential backoff for polling: start_interval × 2^attempt, capped at max_interval
Describe webhook retry logic: why you need it, typical retry schedules (immediate, 1h, 3h, 6h)
Calculate polling overhead: if 1000 jobs run for 5 minutes each with 5-second polling, how many wasted requests?
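The backoff formula from the list above can be written directly, with the optional "full jitter" variant (drawing uniformly from [0, interval]) often used to avoid thundering herds; the function name is illustrative:

```python
import random

def backoff_interval(attempt: int, base: float = 1.0, cap: float = 60.0,
                     jitter: bool = False) -> float:
    """start_interval x 2^attempt, capped at max_interval.

    With jitter=True, apply "full jitter": draw uniformly from
    [0, capped_interval] so many clients retrying the same job
    don't wake up in lockstep.
    """
    interval = min(base * (2 ** attempt), cap)
    return random.uniform(0.0, interval) if jitter else interval
```

Without jitter the schedule is the familiar 1, 2, 4, 8, ... capped at 60; with jitter each client lands somewhere inside that window.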
Example Question
You’re building a video transcoding service. Users upload videos and want to know when transcoding completes. How do you return results?
Strong Answer
I’d use polling with exponential backoff for the web UI and webhooks for API clients. Here’s why: transcoding takes 1-10 minutes, so polling every 5 seconds wastes requests. Start at 2 seconds, double to 4, 8, 16, capping at 60 seconds. For API clients (backend services), offer optional webhooks—they provide a callback URL, we POST the result when done. Implement retry logic: 3 attempts with exponential backoff, then dead-letter queue. Store results in S3 with a 24-hour TTL, return a signed URL in the response. This balances simplicity (polling for browsers) with efficiency (webhooks for services).
Senior
What You Should Know
Design webhook retry infrastructure: retry queue, exponential backoff, dead-letter queue, observability
Handle webhook security: HMAC signatures, replay attack prevention, IP allowlisting
Implement long-polling at scale: connection pooling, pub/sub for multi-server routing, timeout management
Compare resource costs: calculate server load for polling vs long-polling vs WebSockets at 10,000 jobs/sec
Example Question
Your webhook delivery success rate is 85%. How do you improve it to 99%?
Strong Answer
First, analyze failure modes: are webhooks timing out, returning 5xx errors, or unreachable? Add detailed logging with failure reasons. Implement tiered retries: immediate retry for 5xx (server error), 1-hour retry for timeouts, no retry for 4xx (client error). Add a webhook testing endpoint that merchants can use to verify their setup before going live. Provide a dashboard showing delivery history, failure reasons, and manual retry buttons. Implement circuit breakers: if a webhook fails 10 times in a row, pause delivery and alert the merchant. Finally, add webhook signature verification examples in every language to prevent merchants from accidentally blocking valid requests. At Stripe, these changes improved delivery from 85% to 98%.
Staff+
What You Should Know
Design multi-region webhook delivery with exactly-once semantics despite retries and network partitions
Architect WebSocket infrastructure for 10M+ concurrent connections: connection sharding, message routing, graceful degradation
Optimize long-polling for cost: calculate connection holding costs vs polling costs, determine break-even point
Handle result expiration at scale: TTL strategies, storage tiering (hot in Redis, warm in S3), garbage collection
Example Question
Design a result delivery system for 1M jobs/day with 99.99% delivery SLA and <1s latency for 95% of jobs.
Strong Answer
This requires a hybrid approach. For backend clients, use webhooks with exactly-once delivery: generate idempotency keys (job_id + attempt_number), store delivery attempts in DynamoDB with TTL, retry with exponential backoff up to 3 days. For web/mobile clients, use long-polling with 30-second timeout backed by Redis pub/sub. When a job completes, publish to Redis channel job:{id}. Long-polling requests subscribe to that channel; when the message arrives, respond immediately. If 30s timeout fires, client reconnects. Store results in Redis with 1-hour TTL, then move to S3 with 24-hour TTL for cost efficiency. For the 99.99% SLA, implement multi-region active-active: webhook delivery attempts from both regions, deduplicated by idempotency key. Monitor delivery latency: if p95 exceeds 1s, auto-scale webhook workers. This architecture handles 1M jobs/day (11 jobs/sec average, 100 jobs/sec peak) with <$500/month infrastructure costs.
Common Interview Questions
How do you handle webhook failures when the client is down for hours?
What’s the difference between long-polling and WebSockets? When would you choose each?
How do you prevent duplicate webhook deliveries during retries?
How would you implement exponential backoff with jitter to prevent thundering herd?
How do you secure webhooks against replay attacks and spoofing?
Red Flags to Avoid
Suggesting polling with fixed 1-second intervals for long-running jobs (shows no understanding of resource efficiency)
Recommending WebSockets for infrequent updates (over-engineering, wastes persistent connections)
Not mentioning retry logic for webhooks (shows lack of production experience with distributed systems)
Ignoring webhook security (HMAC signatures, replay prevention) in a payment processing context
Claiming WebSockets are always better than polling (shows lack of nuance about trade-offs and client capabilities)
Key Takeaways
The pattern you choose depends on four dimensions: latency requirements, client type, scale, and reliability needs. Polling works for any client but wastes resources. Webhooks are efficient but require stable endpoints. Long-polling balances efficiency and simplicity. WebSockets enable real-time but add complexity.
Exponential backoff is critical for polling efficiency: start at 1-2 seconds, double each attempt, cap at 60 seconds. This reduces load for long jobs while staying responsive for quick ones. Calculate polling overhead before choosing this pattern: wasted requests ≈ jobs/day × (avg_duration ÷ poll_interval).
Webhook retry logic is non-negotiable in production: retry failed deliveries with exponential backoff (immediate, 1h, 3h, 6h), implement dead-letter queues for permanent failures, and provide dashboards for debugging. Stripe retries for 3 days; GitHub retries 3 times over 1 hour. Security requires HMAC signatures to prevent spoofing.
Long-polling creates a middle ground: clients make requests that servers hold open until results arrive or timeout (typically 30-60s). This feels synchronous to clients while being much more efficient than polling. Requires pub/sub infrastructure (Redis) for multi-server deployments and careful timeout management to prevent connection leaks.
Real-world systems use hybrid approaches: GitHub combines webhooks (for integrations) with polling (for web UI). Slack uses WebSockets (for desktop) with push notifications (for mobile). Match the pattern to the client’s capabilities and the job’s latency requirements rather than forcing a single solution.