Claim Check Pattern: Large Message Handling

intermediate 10 min read Updated 2026-02-11

After this topic, you will be able to:

  • Implement claim check pattern to handle large message payloads
  • Design reference-based messaging systems with external storage
  • Calculate cost-benefit trade-offs for claim check vs direct messaging

TL;DR

The Claim Check pattern separates large message payloads from message metadata by storing the payload in external storage (S3, blob storage) and passing only a lightweight reference token through the message broker. This prevents message size limit violations, reduces broker memory pressure, and lowers messaging costs while maintaining loose coupling between services.

Cheat Sheet:

  • Use when: Messages exceed 1MB or broker limits (Kafka: 1MB default, SQS: 256KB)
  • Core trade-off: Added latency (storage round-trip) vs. broker throughput
  • Key metric: Storage cost ($0.023/GB S3) vs. broker cost ($0.40/million requests SQS)

The Problem It Solves

Message brokers impose strict size limits to maintain performance and prevent memory exhaustion. Kafka defaults to 1MB per message, AWS SQS caps at 256KB, and RabbitMQ recommends staying under 128KB. When services need to exchange large payloads—video files, ML model weights, batch analytics results, or document archives—these limits become blocking constraints.

Directly increasing broker message limits creates cascading problems. Larger messages consume more broker memory, reducing the number of concurrent messages the broker can handle. Network bandwidth spikes during message transfer, causing latency for all messages sharing the same broker partition. Consumers must allocate larger buffers, increasing their memory footprint. At LinkedIn, engineers discovered that allowing 10MB messages in Kafka reduced overall throughput by 60% because the broker spent more time on network I/O than message routing.

The alternative—splitting large payloads into multiple smaller messages—introduces ordering complexity and partial failure scenarios. If message 3 of 10 fails, should the consumer discard messages 1-2? Reassembly logic becomes error-prone, and the message broker now handles 10x the message volume for a single logical operation. The Claim Check pattern solves this by treating the message broker as a coordination layer, not a data transport layer, offloading payload storage to systems designed for large objects.

Solution Overview

The Claim Check pattern splits each message into two components: a lightweight reference token (the “claim check”) that flows through the message broker, and the actual payload stored in external object storage like S3, Azure Blob Storage, or Google Cloud Storage. The producer writes the payload to storage, generates a unique reference token (typically a storage URL or UUID), and publishes only the token through the message broker. The consumer receives the token, retrieves the payload from storage, processes it, and optionally deletes the stored object.

This approach leverages the strengths of each system. Message brokers excel at ordered delivery, pub-sub routing, and guaranteed delivery semantics—but not at storing gigabytes of data. Object storage systems provide cheap, durable storage for large blobs with high throughput for parallel reads—but lack message ordering or delivery guarantees. By combining both, you get reliable message delivery for coordination plus efficient storage for data.

The pattern’s name comes from airline baggage handling: you check your luggage (payload) at the counter, receive a claim ticket (reference token), and later retrieve your luggage using that ticket. The ticket is lightweight and easy to carry, while the airline handles the heavy lifting of transporting your bags through their storage and routing system.

Reference Token Structure Options

graph TB
    subgraph Direct URL Approach
        URL["Direct Storage URL<br/><i>Simple, exposes topology</i>"]
        URLEx["s3://bucket/videos/abc123.mp4<br/>blob.core.windows.net/container/file"]
        URL --> URLEx
    end
    
    subgraph Pre-signed URL Approach
        PreSigned["Pre-signed URL<br/><i>Secure, time-limited</i>"]
        PreSignedEx["https://bucket.s3.amazonaws.com/video?<br/>AWSAccessKeyId=...&<br/>Expires=1705320000&<br/>Signature=..."]
        PreSigned --> PreSignedEx
    end
    
    subgraph UUID with Registry
        UUID["UUID + Registry Lookup<br/><i>Secure, adds indirection</i>"]
        UUIDEx["claim_check: 550e8400-e29b-41d4-a716"]
        Registry[("Registry<br/><i>Redis/DynamoDB</i>")]
        UUIDEx -."Lookup".-> Registry
        Registry -."Returns storage path".-> UUIDEx
    end
    
    subgraph Composite Key Approach
        Composite["Composite Key<br/><i>Structured, flexible</i>"]
        CompositeEx["{<br/>  storage: 's3',<br/>  region: 'us-east-1',<br/>  bucket: 'payloads',<br/>  key: 'abc123'<br/>}"]
        Composite --> CompositeEx
    end

Four common reference token formats, each with different trade-offs. Direct URLs are simplest but expose storage topology. Pre-signed URLs add time-limited security. UUIDs with registry lookup provide maximum security but require an additional lookup service. Composite keys offer structured flexibility for multi-region or multi-storage deployments.

How It Works

Step 1: Producer stores the payload. When the producer has a message exceeding the size threshold (typically 256KB-1MB), it writes the payload to object storage. For a video processing service, this might be a 50MB video file uploaded to S3 with key videos/2024/01/15/abc123.mp4. The storage operation returns a reference—either the full S3 URL (s3://bucket/videos/2024/01/15/abc123.mp4) or a UUID that maps to the storage location in a separate registry.

Step 2: Producer publishes the claim check. The producer constructs a lightweight message containing the reference token plus essential metadata: {"claim_check": "s3://bucket/videos/abc123.mp4", "content_type": "video/mp4", "size_bytes": 52428800, "uploaded_at": "2024-01-15T10:30:00Z"}. This message is typically 200-500 bytes and flows through the message broker using standard pub-sub semantics. The broker handles ordering, retries, and delivery guarantees as usual.
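Steps 1-2 can be sketched in Python (a minimal sketch, not a definitive implementation: the bucket name, queue URL, and key prefix are placeholders, and the S3/SQS clients are injected so you can pass `boto3.client("s3")` and `boto3.client("sqs")` in real use):

```python
import json
import uuid
from datetime import datetime, timezone

def build_claim_check(bucket: str, key: str, payload: bytes, content_type: str) -> dict:
    """Construct the lightweight claim check message (typically a few hundred bytes)."""
    return {
        "claim_check": f"s3://{bucket}/{key}",
        "content_type": content_type,
        "size_bytes": len(payload),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }

def publish_with_claim_check(s3, sqs, queue_url: str, bucket: str,
                             payload: bytes, content_type: str) -> dict:
    """Step 1: write the payload to object storage; step 2: publish only the token."""
    key = f"payloads/{uuid.uuid4()}"  # globally unique storage key
    s3.put_object(Bucket=bucket, Key=key, Body=payload, ContentType=content_type)
    message = build_claim_check(bucket, key, payload, content_type)
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
    return message
```

Whatever the payload size, the message that reaches the broker stays a few hundred bytes.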

Step 3: Consumer receives the claim check. The consumer polls the message broker and receives the claim check message. It validates the reference token format and checks metadata (content type, size) to ensure it can process this payload type. If the consumer is a video transcoding service, it verifies the content type is video before proceeding.

Step 4: Consumer retrieves the payload. Using the reference token, the consumer fetches the payload from object storage. For S3, this is a standard GetObject API call. The consumer streams the data directly into its processing pipeline—for example, reading the video file in chunks and transcoding each chunk without loading the entire 50MB into memory. This streaming approach is crucial for handling payloads larger than available RAM.
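Steps 3-4 might look like this on the consumer side (a sketch: the S3 client is injected and `process_chunk` stands in for your processing pipeline; with boto3, `get_object` returns a streaming body whose `iter_chunks()` yields data incrementally, so a 50MB file never has to fit in memory at once):

```python
import json

def handle_claim_check(s3, raw_message: str, process_chunk,
                       chunk_size: int = 1024 * 1024) -> int:
    """Step 3: validate the claim check; step 4: stream the payload from storage."""
    msg = json.loads(raw_message)
    ref = msg["claim_check"]
    if not ref.startswith("s3://"):
        raise ValueError(f"unsupported reference token: {ref}")
    bucket, _, key = ref[len("s3://"):].partition("/")
    # Stream the object in fixed-size chunks instead of loading it all at once.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    total = 0
    for chunk in body.iter_chunks(chunk_size):
        process_chunk(chunk)
        total += len(chunk)
    return total
```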

Step 5: Consumer processes and optionally cleans up. After processing completes, the consumer decides whether to delete the stored payload. If the payload is single-use (like a temporary upload), the consumer deletes it to avoid storage costs. If multiple consumers need the same payload (fan-out pattern), the producer sets an S3 lifecycle policy to auto-delete after 7 days, or the last consumer deletes it after coordination.
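The lifecycle-policy safety net mentioned above is set once per bucket; a sketch using an injected S3 client (the bucket name, prefix, and rule ID are placeholder assumptions):

```python
def set_payload_ttl(s3, bucket: str, prefix: str = "payloads/", days: int = 7) -> dict:
    """Auto-delete claim check payloads after `days` days, as a backstop for
    consumers that crash before cleaning up."""
    rule = {
        "ID": "claim-check-ttl",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": days},
    }
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": [rule]}
    )
    return rule
```

A lifecycle policy like this costs nothing to evaluate and catches every orphaned payload, whereas consumer-driven deletes alone miss payloads whose consumer died mid-processing.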

Reference token generation: Tokens must be globally unique and contain enough information for retrieval. Common formats include: (1) Direct storage URLs with embedded credentials via pre-signed URLs (AWS S3 supports time-limited signed URLs), (2) UUIDs that map to storage locations in a separate key-value store like Redis, or (3) Composite keys encoding bucket, region, and object path: {"storage": "s3", "region": "us-east-1", "bucket": "payloads", "key": "abc123"}. Pre-signed URLs are simplest but expose storage topology; UUIDs add indirection for security but require a registry lookup.
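A composite-key token (option 3) is just structured data plus validation; a minimal sketch:

```python
import json

REQUIRED_FIELDS = {"storage", "region", "bucket", "key"}

def make_composite_token(storage: str, region: str, bucket: str, key: str) -> str:
    """Encode a composite reference token as JSON."""
    return json.dumps({"storage": storage, "region": region,
                       "bucket": bucket, "key": key})

def parse_composite_token(token: str) -> dict:
    """Decode and validate a composite token before attempting retrieval."""
    fields = json.loads(token)
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"token missing fields: {sorted(missing)}")
    return fields
```

For the pre-signed variant, boto3's `generate_presigned_url("get_object", Params={"Bucket": ..., "Key": ...}, ExpiresIn=3600)` produces the time-limited URL.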

Claim Check Pattern: Complete Message Flow

graph LR
    Producer["Producer Service"]
    Storage[("Object Storage<br/><i>S3/Blob Storage</i>")]
    Broker["Message Broker<br/><i>Kafka/SQS/RabbitMQ</i>"]
    Consumer["Consumer Service"]
    
    Producer --"1. Upload payload<br/>(50MB video file)"--> Storage
    Storage --"2. Return reference<br/>(s3://bucket/video.mp4)"--> Producer
    Producer --"3. Publish claim check<br/>(500 bytes message)"--> Broker
    Broker --"4. Deliver claim check<br/>(reference + metadata)"--> Consumer
    Consumer --"5. Fetch payload<br/>(using reference token)"--> Storage
    Storage --"6. Stream payload<br/>(50MB video file)"--> Consumer
    Consumer -."7. Optional: Delete payload<br/>(after processing)".-> Storage

The claim check pattern separates payload storage from message delivery. The producer uploads the large payload to object storage (step 1-2), then publishes only a lightweight reference token through the message broker (step 3-4). The consumer retrieves the token, fetches the actual payload from storage (step 5-6), and optionally cleans up after processing (step 7).

Variants

Inline Small Messages Variant: Instead of using claim check for all messages, implement a size threshold check in the producer. Messages under 256KB flow directly through the broker with the payload embedded; messages over 256KB trigger claim check logic. This hybrid approach optimizes for the common case (small messages) while handling edge cases (large payloads). The trade-off is added producer complexity—you need conditional logic and two code paths. Use this when 90%+ of messages are small but you must support occasional large payloads.
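The size check itself is a few lines; a sketch where `store` is any callable that uploads the payload and returns a reference token (the 256KB threshold follows the example above):

```python
import base64

INLINE_LIMIT = 256 * 1024  # bytes; tune to your broker's limit

def build_message(payload: bytes, store) -> dict:
    """Embed small payloads directly; divert large ones through `store`,
    which uploads the payload and returns a reference token."""
    if len(payload) <= INLINE_LIMIT:
        return {"type": "inline",
                "payload": base64.b64encode(payload).decode()}
    return {"type": "claim_check",
            "claim_check": store(payload),
            "size_bytes": len(payload)}
```

Consumers branch on the `type` field, which is the second code path this variant's trade-off refers to.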

Claim Check with Metadata Enrichment: Store not just the payload but also derived metadata in the claim check message. For a document processing system, the claim check might include: {"claim_check": "s3://docs/abc123.pdf", "page_count": 47, "file_hash": "sha256:...", "detected_language": "en"}. The consumer can make routing decisions based on metadata without fetching the full payload. For example, a language detection service might skip documents already marked as English. This variant reduces unnecessary storage retrievals but requires the producer to compute metadata upfront, adding latency to message publishing.

Claim Check with Streaming: For extremely large payloads (10GB+ ML datasets), use streaming protocols instead of batch upload/download. The producer writes to object storage using multipart upload, publishing the claim check before the upload completes. The consumer begins streaming the payload as soon as the first chunks are available, processing data in parallel with the upload. This variant minimizes end-to-end latency for large files but requires careful coordination—the consumer must handle incomplete uploads and retry logic if the producer fails mid-stream. Netflix uses this approach for distributing large video encoding jobs across regions.
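The producer half of this variant maps onto S3's multipart-upload API; a sketch with an injected client (note that with plain S3 the object only becomes readable after `complete_multipart_upload`, so true read-while-write needs extra coordination, such as publishing per-part references):

```python
PART_SIZE = 5 * 1024 * 1024  # S3 parts must be >= 5 MB, except the last

def multipart_upload(s3, bucket: str, key: str, chunks) -> None:
    """Stream `chunks` (an iterable of bytes) to S3 via multipart upload."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        for number, chunk in enumerate(chunks, start=1):
            part = s3.upload_part(Bucket=bucket, Key=key,
                                  UploadId=upload["UploadId"],
                                  PartNumber=number, Body=chunk)
            parts.append({"PartNumber": number, "ETag": part["ETag"]})
        s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                     UploadId=upload["UploadId"],
                                     MultipartUpload={"Parts": parts})
    except Exception:
        # Abort so S3 does not keep billing for orphaned parts.
        s3.abort_multipart_upload(Bucket=bucket, Key=key,
                                  UploadId=upload["UploadId"])
        raise
```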

Hybrid Pattern: Size-Based Message Routing

flowchart TB
    Start(["Producer has message"])
    Check{"Payload size<br/>> 256KB?"}
    Direct["Direct Messaging<br/><i>Embed payload in message</i>"]
    ClaimCheck["Claim Check Pattern<br/><i>Store in S3, send reference</i>"]
    Broker["Message Broker"]
    Storage[("Object Storage")]
    Consumer(["Consumer"])
    
    Start --> Check
    Check --"No<br/>(90% of messages)"--> Direct
    Check --"Yes<br/>(10% of messages)"--> ClaimCheck
    Direct --> Broker
    ClaimCheck --> Storage
    Storage -."Reference token".-> Broker
    Broker --> Consumer
    Consumer -."Fetch if needed".-> Storage

The inline small messages variant optimizes for the common case by routing small messages (<256KB) directly through the broker while using claim check only for large payloads. This reduces latency for 90% of messages while handling edge cases that exceed broker limits.

Trade-offs

Latency vs. Throughput: Direct messaging has lower latency (single network hop from producer to consumer) but lower throughput for large messages (broker becomes bottleneck). Claim check adds two storage round-trips (write + read), increasing latency by 50-200ms depending on storage region, but enables higher broker throughput since the broker only handles small reference messages. Decision criteria: Use claim check when broker throughput matters more than per-message latency, or when message size exceeds broker limits.

Cost vs. Simplicity: Claim check reduces broker costs (fewer bytes transferred, less memory used) but adds storage costs and operational complexity (storage lifecycle management, access control). For AWS, SQS charges $0.40 per million requests; storing 1GB in S3 costs $0.023/month. If you send 1 million 1MB messages monthly, direct delivery would cost about $90.40 ($0.40 in SQS requests plus ~$90 in data transfer: 1TB at $0.09/GB after the free tier), while claim check costs $0.40 (SQS) + $23 (S3 storage) + $5 (S3 PUT requests) + $0.40 (S3 GET requests) ≈ $28.80. (SQS itself caps messages at 256KB, so treat the direct-delivery figure as illustrative.) Decision criteria: Use claim check when message volume is high and payloads are large (>1MB). For small messages (<100KB) or low volume (<10K messages/day), direct messaging is simpler and cheaper.
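This back-of-envelope arithmetic can be parameterized; a sketch using illustrative AWS list prices (prices change, so check current pricing before relying on the defaults):

```python
def messaging_cost(messages: int, message_mb: float,
                   sqs_per_million: float = 0.40,
                   transfer_per_gb: float = 0.09,
                   s3_storage_per_gb_month: float = 0.023,
                   s3_put_per_1k: float = 0.005,
                   s3_get_per_1k: float = 0.0004) -> dict:
    """Monthly back-of-envelope: direct messaging vs. claim check.
    Uses decimal GB (1 GB = 1000 MB) for simplicity."""
    gb = messages * message_mb / 1000
    direct = messages / 1e6 * sqs_per_million + gb * transfer_per_gb
    claim = (messages / 1e6 * sqs_per_million      # reference tokens through SQS
             + gb * s3_storage_per_gb_month        # payloads retained for a month
             + messages / 1e3 * s3_put_per_1k      # one PUT per payload
             + messages / 1e3 * s3_get_per_1k)     # one GET per payload
    return {"direct": round(direct, 2), "claim_check": round(claim, 2)}
```

Rerun it with your own volume and size distribution before choosing a pattern; the break-even point shifts quickly as average payload size shrinks.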

Consistency vs. Availability: Claim check introduces eventual consistency—the payload might not be available in storage when the consumer receives the claim check, especially if storage replication is slow. Direct messaging guarantees the payload arrives with the message. Decision criteria: Use claim check when you can tolerate retry logic in consumers (“payload not found, retry in 5 seconds”) or when storage replication is fast enough (S3 provides read-after-write consistency). Avoid claim check for real-time systems requiring strict consistency (financial transactions, live bidding).
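The consumer-side retry logic mentioned above is a small wrapper; a sketch (here `KeyError` stands in for your storage client's not-found error, e.g. boto3's `s3.exceptions.NoSuchKey`):

```python
import time

def fetch_with_retry(fetch, attempts: int = 5, base_delay: float = 1.0):
    """Retry 'payload not found' with exponential backoff, to ride out the gap
    between the claim check arriving and the payload landing in storage."""
    for attempt in range(attempts):
        try:
            return fetch()
        except KeyError:  # stand-in for the storage client's NotFound error
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```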

Cost Comparison: Direct Messaging vs Claim Check

graph TB
    subgraph Direct Messaging via SQS
        D1["1M messages × 1MB each"]
        D2["SQS: $0.40<br/><i>per million requests</i>"]
        D3["Data Transfer: $90<br/><i>1TB at $0.09/GB after free tier</i>"]
        D4["Total: $90.40"]
        D1 --> D2
        D1 --> D3
        D2 --> D4
        D3 --> D4
    end
    
    subgraph Claim Check Pattern
        C1["1M messages × 1MB each"]
        C2["SQS: $0.40<br/><i>reference tokens only</i>"]
        C3["S3 Storage: $23<br/><i>1TB at $0.023/GB/month</i>"]
        C4["S3 GET: $0.40<br/><i>1M requests at $0.0004/1K</i>"]
        C5["S3 PUT: $5<br/><i>1M requests at $0.005/1K</i>"]
        C6["Total: $28.80"]
        C1 --> C2
        C1 --> C3
        C1 --> C4
        C1 --> C5
        C2 --> C6
        C3 --> C6
        C4 --> C6
        C5 --> C6
    end
    
    Comparison["Claim Check saves<br/>~68% on costs<br/>($90 → $29)<br/><br/>Trade-off:<br/>+50-200ms latency"]

Cost analysis for 1 million 1MB messages per month shows claim check reduces costs from roughly $90 to $29 by avoiding expensive broker data transfer fees. The pattern is most cost-effective for large payloads (>1MB) and high message volumes, though it adds 50-200ms latency for storage round-trips.

When to Use (and When Not To)

Use Claim Check when:

  • Message payloads regularly exceed broker limits (>256KB for SQS, >1MB for Kafka)
  • You’re hitting broker memory or throughput limits due to large messages
  • Storage costs are significantly lower than broker data transfer costs (typically true for payloads >1MB)
  • Consumers need to process payloads asynchronously or in parallel (multiple consumers fetching the same S3 object)
  • Payloads have different access patterns than messages (e.g., messages are consumed once, but payloads might be accessed multiple times for debugging)

Avoid Claim Check when:

  • Most messages are small (<100KB) and fit comfortably within broker limits—the added complexity isn’t worth it
  • You need strict transactional consistency between message delivery and payload availability (use direct messaging or distributed transactions)
  • Latency is critical and you can’t afford the storage round-trip overhead (real-time trading systems, live video streaming)
  • Your storage system is less reliable than your message broker (rare, but possible with self-hosted storage)
  • Consumers are stateless and can’t handle “payload not found” errors gracefully

Anti-patterns: Don’t use claim check as a workaround for poor message design. If your messages are large because you’re embedding entire database records, refactor to send only IDs and let consumers query the database. Don’t use claim check for sensitive data without encryption—storage systems have different access control models than message brokers, and you might accidentally expose payloads. Don’t forget to implement storage cleanup—orphaned payloads accumulate costs and clutter storage namespaces.

Real-World Examples

LinkedIn (Kafka + HDFS): LinkedIn’s data pipeline processes billions of events daily, including large payloads like member profile snapshots (100KB-2MB) and analytics reports (10MB+). They use claim check to keep Kafka messages under 1MB while storing payloads in HDFS. The claim check message includes the HDFS path and a checksum for validation. This design reduced Kafka broker memory usage by 70% and improved throughput from 50K to 200K messages/second per broker. Interesting detail: LinkedIn’s claim check implementation includes a “payload TTL” field—consumers must fetch payloads within 24 hours, after which HDFS auto-deletes them to control storage costs.

Netflix (SQS + S3): Netflix’s video encoding pipeline uses claim check to distribute encoding jobs across regions. When a new video is uploaded, the ingestion service stores the raw video in S3 (often 10GB+ for 4K content) and publishes a claim check message to SQS with the S3 URL and encoding parameters. Encoding workers in multiple AWS regions poll SQS, retrieve videos from S3 using cross-region replication, and produce encoded outputs. This architecture allows Netflix to scale encoding capacity independently from storage—they can spin up 1000 encoding workers without overwhelming the message broker. Interesting detail: Netflix uses S3 pre-signed URLs with 1-hour expiration in claim checks to prevent unauthorized access if SQS messages leak.

Stripe (RabbitMQ + PostgreSQL): Stripe’s payment processing system uses a variant of claim check for handling large webhook payloads. When a webhook event includes extensive metadata (invoice line items, customer history), Stripe stores the full payload in PostgreSQL and publishes a claim check message to RabbitMQ with the payload ID. Webhook delivery workers fetch payloads from PostgreSQL, retry failed deliveries, and mark payloads as delivered. This design keeps RabbitMQ messages under 10KB and allows Stripe to implement sophisticated retry logic (exponential backoff, circuit breakers) without re-transmitting large payloads. Interesting detail: Stripe’s claim check messages include a “payload_version” field to handle schema evolution—if the payload format changes, old workers can still process messages by fetching the appropriate version from PostgreSQL.

Netflix Video Encoding: Multi-Region Claim Check Architecture

graph TB
    subgraph Upload Region: us-east-1
        Ingestion["Ingestion Service"]
        S3_Primary[("S3 Primary<br/><i>Raw video 10GB+</i>")]
        SQS["SQS Queue<br/><i>Encoding jobs</i>"]
    end
    
    subgraph Encoding Region: us-west-2
        S3_Replica[("S3 Replica<br/><i>Cross-region replication</i>")]
        Worker1["Encoding Worker 1"]
        Worker2["Encoding Worker 2"]
        WorkerN["Encoding Worker N"]
    end
    
    subgraph Encoding Region: eu-west-1
        S3_Replica2[("S3 Replica<br/><i>Cross-region replication</i>")]
        Worker3["Encoding Worker 3"]
        Worker4["Encoding Worker 4"]
    end
    
    Ingestion --"1. Upload video"--> S3_Primary
    Ingestion --"2. Publish claim check<br/>(pre-signed URL, 1hr expiry)"--> SQS
    S3_Primary -."Replicate".-> S3_Replica
    S3_Primary -."Replicate".-> S3_Replica2
    
    SQS --"3. Poll for jobs"--> Worker1
    SQS --"3. Poll for jobs"--> Worker2
    SQS --"3. Poll for jobs"--> WorkerN
    SQS --"3. Poll for jobs"--> Worker3
    SQS --"3. Poll for jobs"--> Worker4
    
    Worker1 & Worker2 & WorkerN --"4. Fetch video"--> S3_Replica
    Worker3 & Worker4 --"4. Fetch video"--> S3_Replica2

Netflix’s video encoding pipeline uses claim check to distribute 10GB+ video files across multiple AWS regions. The ingestion service uploads videos to S3 in us-east-1 and publishes claim checks with pre-signed URLs to SQS. Encoding workers in us-west-2 and eu-west-1 poll SQS and fetch videos from their regional S3 replicas, enabling massive parallel encoding capacity without overwhelming the message broker.


Interview Essentials

Mid-Level

Explain the basic claim check flow: producer stores payload in S3, publishes reference token to SQS, consumer fetches payload using token. Discuss why this pattern exists (broker message size limits). Calculate a simple cost comparison: 1 million 1MB messages via direct SQS (~$90, mostly data transfer) vs. claim check (~$29 for SQS + S3). Describe how you’d implement error handling when the payload isn’t found in storage (retry with exponential backoff, dead letter queue after 3 attempts). Know the typical size thresholds: SQS 256KB, Kafka 1MB default.

Senior

Design a claim check system with security considerations: pre-signed URLs with time limits, encryption at rest in S3, IAM roles for consumers. Discuss trade-offs between different reference token formats (direct URLs vs. UUIDs with registry lookup). Explain how to handle payload lifecycle: who deletes stored payloads, what happens if a consumer crashes mid-processing (idempotency, cleanup on retry). Describe monitoring: track storage retrieval latency, payload not found errors, orphaned payloads. Compare claim check to alternatives: message chunking (complex reassembly), increasing broker limits (reduced throughput), direct service-to-service transfer (tight coupling).

Staff+

Architect a multi-region claim check system with cross-region replication: how do you ensure consumers in us-west-2 can access payloads stored in us-east-1? Discuss consistency models: S3 read-after-write consistency, eventual consistency for cross-region replication, how to handle “payload not yet replicated” errors. Design a hybrid system that automatically chooses between direct messaging and claim check based on payload size, with dynamic threshold adjustment based on broker load. Explain how claim check interacts with event sourcing: should events reference external payloads or embed them? Discuss cost optimization strategies: S3 Intelligent-Tiering for infrequently accessed payloads, lifecycle policies to auto-delete after N days, compression before storage. Address failure scenarios: what if S3 is down when a consumer tries to fetch a payload? How do you implement circuit breakers and fallback strategies?

Common Interview Questions

Why not just increase the message broker’s size limit? (Answer: Larger messages reduce broker throughput, increase memory usage, and affect all messages sharing the broker, not just large ones. Claim check isolates the impact.)

How do you prevent orphaned payloads when consumers crash? (Answer: Implement payload TTL with storage lifecycle policies, track payload access in a registry, or use consumer heartbeats to detect crashes and trigger cleanup.)

What happens if the storage system is slower than the message broker? (Answer: Consumers experience higher latency, but broker throughput remains high. You can mitigate with caching, parallel fetches, or pre-fetching payloads based on message metadata.)

How do you handle schema evolution for stored payloads? (Answer: Include a version field in the claim check message, store payloads with version-specific keys in storage, or use self-describing formats like Avro with schema registry.)

Red Flags to Avoid

Not discussing storage cleanup strategy—orphaned payloads are a common production issue that wastes money and clutters storage

Ignoring security implications—claim check messages often contain storage URLs that could leak if the message broker is compromised

Assuming synchronous storage availability—S3 has 99.99% availability, meaning 52 minutes of downtime per year; you need retry logic

Not considering cost at scale—asserting claim check is “always better” without calculating actual costs for your message volume and size distribution

Forgetting about monitoring—you need metrics for storage retrieval failures, latency percentiles, and payload size distribution to detect issues


Key Takeaways

Claim Check solves message size limit problems by storing large payloads in object storage (S3, Azure Blob) and passing only lightweight reference tokens through the message broker, enabling brokers to maintain high throughput.

The pattern adds 50-200ms latency per message (storage round-trip) but reduces broker costs and memory pressure. Use it when payloads exceed 1MB or when broker throughput is constrained by large messages.

Reference tokens can be direct storage URLs (simple, but exposes topology), UUIDs with registry lookup (secure, but adds indirection), or pre-signed URLs (secure with time limits). Choose based on security requirements and operational complexity.

Always implement storage cleanup strategies—orphaned payloads accumulate costs. Use lifecycle policies (auto-delete after N days), consumer-driven cleanup (delete after processing), or registry-based tracking to prevent leaks.

Monitor three key metrics: storage retrieval latency (p99 should be <200ms), payload not found errors (indicates cleanup issues or replication lag), and orphaned payload count (should trend toward zero). These metrics predict production issues before they impact users.