Stream Processing: Kafka Streams, Flink & Spark
After completing this topic, you will be able to:
- Differentiate between stream processing and batch processing for real-time data pipelines
- Explain windowing strategies (tumbling, sliding, session) and their use cases
- Evaluate stateful vs stateless stream processing trade-offs for aggregations and joins
- Design stream processing architectures handling event time vs processing time skew
TL;DR
Stream processing analyzes continuous data flows in real-time, enabling sub-second insights from unbounded event streams. Unlike batch processing that operates on static datasets, stream processing handles infinite data with windowing strategies, stateful computations, and sophisticated time semantics to power systems like Netflix’s real-time recommendations and Twitter’s trending topics. The key challenge is managing state, handling late arrivals, and providing exactly-once guarantees while maintaining low latency at scale.
Background
Stream processing emerged from the need to react to data as it arrives rather than waiting for batch jobs to complete. Traditional batch systems like Hadoop MapReduce process finite datasets on schedules (hourly, daily), introducing latency measured in hours. This works for historical analysis but fails for use cases requiring immediate action: fraud detection must flag suspicious transactions in milliseconds, not tomorrow morning. The shift from batch to streaming reflects a fundamental change in how businesses operate—from retrospective analysis to real-time decision-making.
The evolution started with simple event processing systems in the 1990s, but modern stream processing began with Twitter’s Storm (2011), which introduced the topology model of spouts (sources) and bolts (processors). LinkedIn’s Kafka (2011) provided the distributed log foundation for reliable event streaming. Apache Flink (2014) advanced the field with true event-time processing and exactly-once state consistency. Today’s stream processors handle millions of events per second while maintaining stateful computations across distributed clusters.
The core insight is treating data as unbounded streams rather than bounded batches. A stream is an infinite sequence of events ordered by time. Each event represents something that happened: a user clicked a button, a sensor reported temperature, a payment was processed. Stream processing applies continuous queries to these streams, producing results incrementally as new data arrives. This paradigm shift requires rethinking fundamental concepts like “complete dataset” (streams never complete), “sort” (you can’t sort infinity), and “join” (when do you know all matching records have arrived?).
Architecture
A stream processing system consists of three primary layers: ingestion, processing, and storage/output. The ingestion layer captures events from sources (application logs, database change streams, IoT sensors, user interactions) and feeds them into a distributed messaging system like Kafka. This decouples producers from consumers and provides durability through replicated logs. Events are partitioned by key (user ID, device ID) to enable parallel processing while maintaining ordering guarantees within each partition.
The processing layer runs the actual stream processing jobs. Each job is a directed acyclic graph (DAG) of operators: sources read from message queues, transformations apply business logic (filter, map, aggregate), and sinks write results to databases, caches, or downstream systems. Operators are distributed across worker nodes for parallelism. The framework handles task scheduling, data shuffling between operators, and fault recovery. State stores maintain intermediate results for stateful operations like aggregations and joins—this state is typically backed by RocksDB or similar embedded databases and checkpointed to distributed storage for fault tolerance.
The output layer persists results and makes them available to applications. For low-latency serving, results flow to in-memory caches (Redis) or fast databases (Cassandra, DynamoDB). For analytics, results land in data warehouses (Snowflake, BigQuery). The architecture often includes a lambda or kappa pattern: lambda maintains both batch and stream paths for accuracy vs speed trade-offs, while kappa uses only streaming with reprocessing capabilities. Modern systems favor kappa as stream processors have matured to handle both real-time and historical data.
Critical to the architecture is the separation of compute and storage. Stream processors are stateless at the infrastructure level—they can be killed and restarted without data loss because state is externalized to checkpoints. This enables elastic scaling: add workers during peak load, remove them during quiet periods. The messaging layer (Kafka) retains events for configurable retention periods, allowing replay for debugging, reprocessing with new logic, or recovering from failures.
Stream Processing Architecture: Three-Layer Design
graph LR
subgraph Ingestion Layer
App["Application<br/><i>Event Source</i>"]
IoT["IoT Sensors<br/><i>Device Data</i>"]
DB["Database CDC<br/><i>Change Stream</i>"]
Kafka[("Kafka<br/><i>Distributed Log</i>")]
end
subgraph Processing Layer
Source["Source Operator<br/><i>Read from Kafka</i>"]
Transform["Transform Operators<br/><i>Filter/Map/Aggregate</i>"]
StateStore[("State Store<br/><i>RocksDB</i>")]
Checkpoint[("Checkpoints<br/><i>S3/HDFS</i>")]
end
subgraph Output Layer
Cache[("Redis<br/><i>Low-latency Cache</i>")]
OLTP[("Cassandra<br/><i>Operational DB</i>")]
Warehouse[("Snowflake<br/><i>Analytics</i>")]
end
App --"1. Emit events"--> Kafka
IoT --"2. Stream data"--> Kafka
DB --"3. CDC events"--> Kafka
Kafka --"4. Consume partitions"--> Source
Source --"5. Process records"--> Transform
Transform <--"6. Read/Write state"--> StateStore
StateStore --"7. Periodic snapshot"--> Checkpoint
Transform --"8. Real-time results"--> Cache
Transform --"9. Persist data"--> OLTP
Transform --"10. Analytics sink"--> Warehouse
Stream processing systems separate ingestion (durable message queues), processing (stateful operators with checkpointing), and output (multiple sinks for different latency requirements). This decoupling enables independent scaling and fault tolerance at each layer.
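As a toy end-to-end illustration of this layering, chained Python generators can stand in for the Kafka consumer (ingestion), the operators (processing), and the sink (output). The event shapes and field names here are invented for illustration; real systems would use Kafka client libraries and a stream-processing framework.

```python
from typing import Iterator

def source(events: list) -> Iterator[dict]:
    """Ingestion-layer stand-in: yields events as they 'arrive'."""
    yield from events

def transform(stream: Iterator[dict]) -> Iterator[dict]:
    """Processing layer: stateless filter and map, one record at a time."""
    for event in stream:
        if event["type"] == "click":          # filter
            yield {**event, "weight": 1.0}    # map (enrich)

def sink(stream: Iterator[dict]) -> list:
    """Output-layer stand-in: collect results for a downstream store."""
    return list(stream)

events = [
    {"type": "click", "user": "a"},
    {"type": "view", "user": "b"},
    {"type": "click", "user": "c"},
]
results = sink(transform(source(events)))
```

Because each stage only consumes an iterator, the stages are decoupled exactly as the layers in the diagram are: each one can be swapped or scaled without the others knowing.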
Internals
Under the hood, stream processors use a combination of data structures and algorithms optimized for continuous computation. The fundamental abstraction is the operator, which processes records one at a time or in micro-batches. Operators maintain local state in memory-mapped data structures (hash tables for aggregations, sorted maps for windowed joins) backed by RocksDB for durability. State access must be fast—every record may trigger state lookups—so processors use techniques like bloom filters to avoid disk reads and incremental checkpointing to minimize serialization overhead.
Checkpointing is the mechanism for fault tolerance. Periodically (every few seconds), the framework snapshots all operator state to distributed storage (HDFS, S3). Flink uses a variant of the Chandy-Lamport algorithm, called asynchronous barrier snapshotting, for distributed snapshots: it injects barrier markers into the data stream, and when an operator receives barriers from all input channels, it checkpoints its state. This creates a consistent global snapshot without stopping the pipeline. If a worker fails, the job restarts from the last completed checkpoint, replaying events from the message queue. This provides exactly-once state consistency even with failures.
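A minimal sketch of barrier-triggered snapshotting for a single-input operator. Real Flink aligns barriers across all input channels and writes snapshots asynchronously to durable storage; the per-key counting state here is only for illustration.

```python
BARRIER = object()  # sentinel the source injects into the stream

def run_with_checkpoints(records):
    """Count events per key; snapshot state whenever a barrier arrives.

    Single-input simplification: a real operator first waits for
    (aligns) barriers from all of its input channels, and snapshots
    go to S3/HDFS rather than an in-memory list.
    """
    state = {}
    checkpoints = []   # stand-in for durable checkpoint storage
    offset = 0         # source position recorded with each snapshot
    for item in records:
        if item is BARRIER:
            checkpoints.append({"offset": offset, "state": dict(state)})
        else:
            offset += 1
            state[item] = state.get(item, 0) + 1
    return state, checkpoints

stream = ["a", "b", "a", BARRIER, "b", "c"]
final_state, checkpoints = run_with_checkpoints(stream)
```

On recovery, a new worker would load `checkpoints[-1]["state"]` and ask Kafka to replay from `checkpoints[-1]["offset"]`, which is exactly the flow in the recovery diagram below.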
Watermarks are special records that flow through the stream indicating progress in event time. A watermark with timestamp T means “no more events with timestamp < T will arrive.” Operators use watermarks to trigger window computations and garbage collect old state. Watermark generation is heuristic-based: a source might emit a watermark at the maximum event timestamp seen so far minus a fixed out-of-orderness bound. If events arrive after their watermark (late data), they can be handled by allowed lateness policies: drop them, emit them to a side output, or update already-emitted results.
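One common heuristic, bounded out-of-orderness, makes each watermark trail the maximum event timestamp seen so far by a fixed bound. A sketch with invented timestamps:

```python
def watermarks(timestamps, max_out_of_orderness):
    """After each event, emit (event_ts, watermark): the watermark is
    the maximum event timestamp seen so far minus a fixed bound."""
    max_ts = float("-inf")
    for ts in timestamps:
        max_ts = max(max_ts, ts)
        yield ts, max_ts - max_out_of_orderness

events = [100, 105, 103, 112]   # event-time seconds, slightly out of order
wm = list(watermarks(events, max_out_of_orderness=5))
```

Note that the out-of-order event at 103 does not move the watermark backwards: watermarks are monotonic, and any event arriving with a timestamp below the current watermark is late data.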
For stateful operations like joins, stream processors use temporal joins that match events within time windows. A stream-stream join buffers events from both sides in state stores, indexed by join key and timestamp. When an event arrives, the processor probes the other side’s buffer for matches within the window, emits joined records, and stores the event for future matches. State grows with window size and cardinality, so processors implement eviction policies based on watermarks. For large state, processors partition by key and distribute across workers, using consistent hashing to route records to the correct state partition.
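A simplified sketch of a stream-stream interval join that buffers both sides in memory. Watermark-driven eviction, which bounds this state in production, is omitted for brevity, and the event shapes are invented.

```python
from collections import defaultdict

def windowed_join(stream, window):
    """Interval join: match left/right events sharing a key whose
    timestamps lie within `window` of each other.

    Each arriving event probes the opposite side's buffer for matches,
    then is buffered itself for future arrivals.
    """
    buffers = {"L": defaultdict(list), "R": defaultdict(list)}
    joined = []
    for side, key, ts in stream:          # interleaved left/right events
        other = "R" if side == "L" else "L"
        for other_ts in buffers[other][key]:
            if abs(ts - other_ts) <= window:
                pair = (ts, other_ts) if side == "L" else (other_ts, ts)
                joined.append((key, *pair))
        buffers[side][key].append(ts)
    return joined

stream = [("L", "u1", 10), ("R", "u1", 12), ("R", "u1", 50), ("L", "u2", 15)]
matches = windowed_join(stream, window=5)
```

Only the (u1, 10) and (u1, 12) events join; the event at t=50 is outside the window, and u2 never sees a right-side match. Without eviction, the buffers grow with window size times key cardinality, which is exactly the state-growth problem described above.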
Back pressure propagates from slow operators to fast producers. See Back Pressure for detailed flow control mechanisms. When an operator’s input buffer fills, it stops reading from upstream, which cascades back to the source. This prevents out-of-memory errors but can cause the pipeline to fall behind. Tuning involves balancing parallelism, buffer sizes, and checkpoint intervals to maximize throughput while maintaining latency SLAs.
Stateful Processing with Checkpointing and Recovery
graph LR
subgraph Normal Operation
Kafka1[("Kafka<br/>Partition 0<br/>Offset: 1000")]
Op1["Operator Instance<br/><i>Worker 1</i>"]
State1[("Local State<br/>RocksDB<br/>User Aggregates")]
Kafka1 --"1. Read events"--> Op1
Op1 <--"2. Update state"--> State1
end
subgraph Checkpoint Process
Barrier["Checkpoint Barrier<br/><i>Marker in stream</i>"]
Snapshot[("State Snapshot<br/>S3/HDFS<br/>Checkpoint #42")]
Meta["Checkpoint Metadata<br/>Offset: 1000<br/>State: snapshot-42"]
Kafka1 -."3. Inject barrier".-> Barrier
Barrier -."4. Trigger snapshot".-> Op1
State1 --"5. Serialize state"--> Snapshot
Op1 --"6. Record offset"--> Meta
end
subgraph Failure Recovery
Kafka2[("Kafka<br/>Partition 0<br/>Replay from 1000")]
Op2["New Operator Instance<br/><i>Worker 2</i>"]
State2[("Restored State<br/>RocksDB<br/>From snapshot-42")]
Crash["Worker 1 Crashes<br/><i>Lost in-memory state</i>"]
Meta -."7. Read last checkpoint".-> Op2
Snapshot --"8. Restore state"--> State2
Kafka2 --"9. Replay events"--> Op2
Op2 <--"10. Continue processing"--> State2
end
Op1 -."Failure".-> Crash
Crash -."Restart".-> Op2
Stream processors maintain stateful computations in local stores (RocksDB) and periodically checkpoint state to distributed storage. Checkpoint barriers flow through the stream to create consistent snapshots. On failure, operators restore state from the last checkpoint and replay events from the message queue, providing exactly-once state consistency.
Windowing Strategies
Windowing divides infinite streams into finite chunks for aggregation. Without windows, you can’t compute meaningful aggregates like “count of events”—the stream never ends. Windows define boundaries in time or count, allowing the processor to emit results when a window completes.
Tumbling windows are fixed-size, non-overlapping intervals: 0-5min, 5-10min, 10-15min. Each event belongs to exactly one window. Use tumbling windows for periodic reports (“sales per hour”) or rate limiting (“requests per minute per user”). Implementation is straightforward: hash the event timestamp to a window ID, update that window’s aggregate in state, and emit when the watermark passes the window end. Memory usage is predictable—one aggregate per active window per key.
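The tumbling-window assignment just described can be sketched in a few lines (timestamps and keys are invented):

```python
from collections import defaultdict

def tumbling_counts(events, size):
    """Count events per (key, window): each timestamp maps to exactly
    one window, whose start is the timestamp rounded down to `size`."""
    counts = defaultdict(int)
    for key, ts in events:
        window_start = (ts // size) * size
        counts[(key, window_start)] += 1
    return dict(counts)

# Timestamps in seconds; 300s = 5-minute tumbling windows
events = [("u1", 61), ("u1", 130), ("u1", 380), ("u2", 320)]
result = tumbling_counts(events, size=300)
```

A real processor would hold these counts in its state store and emit each `(key, window_start)` aggregate once the watermark passes `window_start + size`.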
Sliding windows overlap, creating a moving average effect. A 10-minute window sliding every 1 minute produces windows [0-10min], [1-11min], [2-12min]. Each event belongs to multiple windows. Use sliding windows for smoothing metrics (“average latency over the last 5 minutes”) or detecting trends. Implementation requires storing events in state for the window duration, then recomputing aggregates for each slide. This is expensive—a 60-minute window sliding every minute requires 60x the computation of a tumbling window. Optimizations include incremental aggregation (add new events, subtract expired events) for associative functions.
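Assigning one event to all of its overlapping sliding windows, under the assumption that window starts are aligned to multiples of the slide:

```python
def sliding_windows(ts, size, slide):
    """Start times of every sliding window containing `ts`, assuming
    window starts fall on multiples of `slide`."""
    first = ((ts - size) // slide + 1) * slide  # earliest containing window
    return list(range(first, ts + 1, slide))

# A 5-minute window sliding every 2 minutes (times in minutes):
# the event at minute 6 belongs to windows starting at 2, 4, and 6.
```

Each event lands in roughly `size / slide` windows, which is where the extra memory and computation come from; the incremental-aggregation optimization mentioned above avoids recomputing each window from scratch by adding arriving events and subtracting expiring ones.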
Session windows group events by activity bursts, closing after a gap of inactivity. A 30-minute session window for user clicks creates a new window when a user starts clicking and closes it 30 minutes after their last click. Use session windows for user sessions, anomaly detection (“alert if no heartbeat for 5 minutes”), or batching related events. Implementation is complex: each event may extend an existing session or start a new one, and sessions can merge if a late event fills a gap. Processors maintain session state per key and use timers to trigger window closure after the gap timeout.
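A sketch of gap-based sessionization for a single key. Sorting first sidesteps the harder online problem the paragraph mentions, where a late event can extend a session or merge two sessions it falls between.

```python
def sessionize(timestamps, gap):
    """Group one key's event times into sessions: a new session opens
    whenever the quiet period since the previous event exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: extend session
        else:
            sessions.append([ts])     # gap exceeded: start a new one
    return sessions

clicks = [0, 2, 4, 6, 8, 12]          # minutes of activity, 3-minute gap
result = sessionize(clicks, gap=3)
```

The first five clicks are each within three minutes of the previous one and form a single session; the click at minute 12 follows a four-minute gap and opens a new session.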
Global windows never close—they accumulate all events for a key. Use global windows with custom triggers (“emit every 1000 events” or “emit when a specific event arrives”). This is rare in practice but useful for stateful computations that don’t fit windowing semantics, like maintaining a running total or detecting patterns across arbitrary time spans.
Memory trade-offs are critical. Tumbling windows require O(keys) state. Sliding windows require O(keys × events_per_window). Session windows require O(keys × active_sessions). For high-cardinality keys (millions of users) and large windows (hours), state can exceed memory. Solutions include using approximate algorithms (HyperLogLog for distinct counts), spilling to disk (RocksDB), or pre-aggregating before windowing.
Window Types: Tumbling, Sliding, and Session Comparison
graph TB
subgraph Event Stream
E1["Event 1<br/>10:00:00"]
E2["Event 2<br/>10:02:00"]
E3["Event 3<br/>10:04:00"]
E4["Event 4<br/>10:06:00"]
E5["Event 5<br/>10:08:00"]
E6["Event 6<br/>10:12:00"]
end
subgraph Tumbling Windows - 5 min
T1["Window 1<br/>10:00-10:05<br/>Events: 1,2,3"]
T2["Window 2<br/>10:05-10:10<br/>Events: 4,5"]
T3["Window 3<br/>10:10-10:15<br/>Events: 6"]
end
subgraph Sliding Windows - 5 min, slide 2 min
S1["Window A<br/>10:00-10:05<br/>Events: 1,2,3"]
S2["Window B<br/>10:02-10:07<br/>Events: 2,3,4"]
S3["Window C<br/>10:04-10:09<br/>Events: 3,4,5"]
S4["Window D<br/>10:06-10:11<br/>Events: 4,5"]
end
subgraph Session Windows - 3 min gap
SS1["Session 1<br/>10:00-10:08<br/>Events: 1,2,3,4,5<br/><i>Gap < 3min</i>"]
SS2["Session 2<br/>10:12-10:12<br/>Events: 6<br/><i>Gap > 3min</i>"]
end
E1 & E2 & E3 -.-> T1
E4 & E5 -.-> T2
E6 -.-> T3
E1 & E2 & E3 -.-> S1
E2 & E3 & E4 -.-> S2
E3 & E4 & E5 -.-> S3
E4 & E5 -.-> S4
E1 & E2 & E3 & E4 & E5 -.-> SS1
E6 -.-> SS2
Tumbling windows partition time into non-overlapping intervals (each event in one window). Sliding windows overlap for moving averages (events in multiple windows, higher memory cost). Session windows group activity bursts separated by inactivity gaps (dynamic window boundaries).
Event Time vs Processing Time
Time in stream processing is subtle because events have multiple timestamps. Event time is when the event occurred in the real world (“user clicked at 10:00:00”). Processing time is when the stream processor sees the event (“processed at 10:00:05”). Ingestion time is when the event entered the messaging system (“arrived at Kafka at 10:00:02”). These diverge due to network delays, system outages, and clock skew.
Event time is the gold standard for correctness. If you’re counting hourly sales, you want to count each sale in the hour it occurred, not the hour it was processed. Event time requires events to carry timestamps and the processor to handle out-of-order arrivals. A sale at 10:00:00 might arrive after a sale at 10:00:05 due to network delays. The processor must buffer events and reorder them, which introduces latency and complexity.
Processing time is simpler—just use the system clock—but produces incorrect results when events are delayed. If a network partition delays events by 10 minutes, processing-time aggregates will show a drop during the partition and a spike when events arrive. Processing time is acceptable for monitoring (“current system load”) where approximate real-time is sufficient, but not for business logic.
Watermarks bridge event time and processing time. A watermark is the processor’s estimate of event time progress: “I’ve seen all events up to time T.” Watermarks allow the processor to close windows and emit results without waiting forever for late events. Watermark generation is heuristic: “emit a watermark at the maximum event timestamp seen so far minus 30 seconds.” The 30-second bound accommodates typical delays. If events arrive more than 30 seconds out of order, they’re considered late data.
Late data handling requires policy decisions. The strictest policy drops late events—simple but loses data. A lenient policy updates already-emitted results, sending corrections downstream. This requires downstream systems to handle updates, which complicates application logic. See Idempotent Operations for handling duplicate or corrected results. A middle ground routes late events to a side output for manual inspection or batch reprocessing.
Allowed lateness configures how long after a watermark the processor keeps window state. Setting allowed lateness to 1 hour means events up to 1 hour late can update results; after that, state is garbage collected and late events are dropped. This trades memory (keeping state longer) for accuracy (handling more late events). Tuning requires understanding your data’s delay distribution: if 99% of events arrive within 1 minute but 1% take 10 minutes, you must decide whether that 1% is worth the extra state.
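The resulting routing decision for an arriving event can be sketched as a three-way classification against the current watermark (the thresholds and return labels are illustrative):

```python
def route(event_ts, watermark, allowed_lateness):
    """Classify an event relative to the current watermark."""
    if event_ts >= watermark:
        return "on-time"        # its window has not fired yet
    if event_ts >= watermark - allowed_lateness:
        return "late-update"    # window state kept: emit a correction
    return "side-output"        # state garbage-collected: drop or inspect

# With the watermark at t=100 and 60 seconds of allowed lateness:
decisions = [route(ts, 100, 60) for ts in (105, 95, 30)]
```

Raising `allowed_lateness` moves events from the third bucket into the second at the cost of keeping window state alive longer.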
In practice, most systems use event time with watermarks and a small allowed lateness (minutes). This balances correctness, latency, and resource usage. Processing time is reserved for low-stakes monitoring where simplicity trumps accuracy.
Event Time vs Processing Time with Watermarks
graph TB
subgraph Event Timeline - Event Time
E1["Event A<br/>Event Time: 10:00:00<br/>Value: 100"]
E2["Event B<br/>Event Time: 10:00:30<br/>Value: 150"]
E3["Event C<br/>Event Time: 10:00:15<br/>Value: 200<br/><i>Out of order!</i>"]
E4["Event D<br/>Event Time: 10:00:05<br/>Value: 50<br/><i>Very late!</i>"]
end
subgraph Processing Timeline - Processing Time
P1["Processed: 10:01:00<br/>Event A arrives"]
P2["Processed: 10:01:02<br/>Event B arrives"]
P3["Processed: 10:01:03<br/>Event C arrives<br/><i>15 sec late</i>"]
W1["Watermark: 10:00:25<br/><i>No events < 10:00:25</i>"]
P4["Processed: 10:01:30<br/>Event D arrives<br/><i>85 sec late!</i>"]
W2["Watermark: 10:00:55<br/><i>Window closes</i>"]
end
subgraph Window State
Win["Window: 10:00:00-10:01:00<br/>Sum: 450<br/><i>A+B+C included</i>"]
Late["Late Data Handler<br/>Event D: 50<br/><i>After watermark</i>"]
end
E1 --"1. Arrives on time"--> P1
E2 --"2. Arrives on time"--> P2
E3 --"3. Out of order"--> P3
P1 & P2 & P3 --"4. Before watermark"--> Win
P3 -."5. Emit watermark".-> W1
E4 --"6. Very late"--> P4
P4 -."7. After watermark".-> Late
W1 -."8. Trigger window".-> Win
P4 -."9. Final watermark".-> W2
Event time reflects when events actually occurred, while processing time is when they’re processed. Watermarks estimate event time progress, allowing windows to close despite out-of-order arrivals. Events arriving after watermarks (late data) require special handling—either dropped, routed to side outputs, or used to update already-emitted results.
Performance Characteristics
Stream processing latency ranges from single-digit milliseconds to seconds, depending on complexity and guarantees. Simple stateless transformations (filter, map) add <1ms per record. Stateful operations (windowed aggregations, joins) add 10-100ms due to state lookups and checkpointing overhead. End-to-end latency from event ingestion to result output typically ranges from 100ms to 1 second for most production systems. LinkedIn’s activity streams process events with p99 latency under 500ms. Netflix’s real-time recommendations achieve sub-second updates for 200+ million users.
Throughput scales linearly with parallelism up to bottlenecks. A single Flink task can process 100K-1M records/second for simple operations. Stateful operations drop to 10K-100K records/second due to state access overhead. Kafka Streams achieves similar numbers. Spark Streaming, using micro-batching, trades latency for throughput: 1-2 second latency but can process millions of records per second per cluster. Scaling horizontally by adding workers increases throughput proportionally until you hit message queue limits (Kafka partition count) or state size limits (memory, disk I/O).
State size is the primary scalability constraint. Windowed aggregations with high-cardinality keys (millions of users) and large windows (hours) can produce terabytes of state. RocksDB-backed state stores handle this by spilling to disk, but disk I/O becomes the bottleneck. Flink’s incremental checkpointing reduces checkpoint overhead from minutes to seconds for large state by only snapshotting changed data. State can be partitioned across workers, but rebalancing during scaling or failures requires moving gigabytes of data, causing temporary slowdowns.
Checkpoint intervals trade latency for fault tolerance. Frequent checkpoints (every 10 seconds) minimize data loss on failure but add overhead: serializing state, writing to distributed storage, and coordinating barriers. Infrequent checkpoints (every 5 minutes) reduce overhead but increase recovery time and duplicate processing. Most systems use 30-60 second intervals as a sweet spot. Asynchronous checkpointing allows processing to continue during checkpoints, hiding most of the overhead.
Exactly-once processing adds 10-30% overhead compared to at-least-once due to transactional coordination. Flink’s two-phase commit protocol for sinks ensures atomicity but requires additional round trips. For many use cases, at-least-once with idempotent operations (see Idempotent Operations) provides equivalent guarantees with better performance.
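The at-least-once-plus-idempotence alternative can be illustrated with keyed upserts, using a toy in-memory dict standing in for a real database:

```python
def upsert_sink(store, records):
    """Idempotent sink: keyed upserts make replays harmless.

    Replaying the same (key, value) after a failure overwrites with
    the identical result instead of double-counting, so at-least-once
    delivery yields effectively-once output.
    """
    for key, value in records:
        store[key] = value        # overwrite, never accumulate
    return store

store = {}
batch = [("u1", 5), ("u2", 3)]
upsert_sink(store, batch)
upsert_sink(store, batch)         # replay after a simulated failure
```

The replayed batch leaves the store unchanged. This works because the processor emits full aggregate values rather than increments; an incrementing sink (`store[key] += value`) would double-count on replay and would need transactional coordination instead.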
Trade-offs
Stream processing excels at real-time insights and immediate action. It enables sub-second response to events, powering fraud detection, real-time personalization, and operational monitoring. The continuous query model is natural for event-driven architectures where data flows through systems as it’s created. Stream processors handle unbounded data elegantly, processing today’s events with the same code that processed yesterday’s, without batch job scheduling complexity.
The cost is complexity. Stateful stream processing requires understanding windowing, watermarks, and time semantics—concepts that don’t exist in batch processing. Debugging is harder because you can’t easily replay production traffic (though Kafka retention helps). Exactly-once guarantees require careful coordination between the processor and external systems. State management adds operational burden: monitoring state size, tuning checkpoints, handling state migration during upgrades.
Stream processing trades accuracy for speed in some scenarios. Watermarks are estimates, so late data handling is always approximate. Aggregations over large windows may produce partial results before the window closes. For use cases requiring perfect accuracy (financial reconciliation, regulatory reporting), batch processing with complete datasets is safer. The lambda architecture pattern runs both stream and batch pipelines, using streaming for fast approximate results and batch for accurate corrections.
Resource usage is higher than batch for the same throughput. Stream processors run continuously, consuming CPU and memory even during idle periods. Batch jobs spin up, process data, and shut down, paying only for active computation. For workloads with bursty traffic or infrequent updates, batch may be more cost-effective. Stream processing shines when data arrives continuously and results are needed immediately.
Stream processors struggle with iterative algorithms (PageRank, machine learning training) that require multiple passes over data. Batch systems like Spark handle iterations naturally by caching datasets in memory. Stream processors would need to buffer entire datasets, negating the streaming advantage. For such workloads, batch or hybrid approaches (stream for feature extraction, batch for model training) work better.
The choice between stream and batch isn’t binary. Modern systems like Flink and Spark support both, allowing you to use the same API for real-time and historical processing. This unified approach (kappa architecture) simplifies operations by eliminating duplicate pipelines, though it requires stream processors that are mature enough to handle batch-scale data efficiently.
Stream vs Batch Processing: Architecture Comparison
graph TB
subgraph Stream Processing
S_Source["Continuous Sources<br/><i>Kafka, Kinesis</i>"]
S_Process["Always-On Processors<br/><i>Flink, Kafka Streams</i>"]
S_State[("Stateful Computation<br/><i>Windows, Joins</i>")]
S_Output["Real-time Output<br/><i>Sub-second latency</i>"]
S_Source --"Unbounded stream"--> S_Process
S_Process <--> S_State
S_Process --"Incremental results"--> S_Output
S_Pros["✓ Sub-second latency<br/>✓ Continuous processing<br/>✓ Event-driven<br/>✗ Complex state mgmt<br/>✗ Higher resource cost<br/>✗ Approximate results"]
end
subgraph Batch Processing
B_Source["Bounded Datasets<br/><i>HDFS, S3 files</i>"]
B_Process["Scheduled Jobs<br/><i>Spark, MapReduce</i>"]
B_Compute["Complete Dataset<br/><i>Sort, Join, Iterate</i>"]
B_Output["Batch Output<br/><i>Hours latency</i>"]
B_Source --"Fixed-size batch"--> B_Process
B_Process --> B_Compute
B_Compute --"Final results"--> B_Output
B_Pros["✓ Perfect accuracy<br/>✓ Simpler model<br/>✓ Cost-effective<br/>✗ Hours of latency<br/>✗ Scheduling complexity<br/>✗ Not event-driven"]
end
subgraph Use Case Decision
UC1["Fraud Detection<br/>Real-time Dashboards<br/>Live Recommendations"]
UC2["Financial Reporting<br/>ML Training<br/>Data Warehousing"]
UC3["Hybrid: Kappa Architecture<br/><i>Stream with reprocessing</i>"]
UC1 -."Choose".-> S_Process
UC2 -."Choose".-> B_Process
UC3 -."Unified".-> S_Process
UC3 -."Unified".-> B_Process
end
S_Pros -.-> S_Process
B_Pros -.-> B_Process
Stream processing excels at low-latency, continuous computation on unbounded data but requires complex state management and provides approximate results. Batch processing offers perfect accuracy and simpler operations on complete datasets but introduces hours of latency. Modern kappa architectures use stream processors for both real-time and historical data, eliminating duplicate pipelines.
When to Use (and When Not To)
Choose stream processing when latency matters more than throughput. If your SLA requires results within seconds of an event occurring—fraud detection, real-time bidding, live dashboards—stream processing is the only option. Batch jobs with hourly or daily schedules can’t meet these requirements. Twitter’s trending topics must update within minutes to be relevant; batch processing would show yesterday’s trends.
Use stream processing for continuous data sources that never stop. IoT sensor data, application logs, user activity streams, and financial market data arrive 24/7. Stream processors handle this naturally, while batch systems require complex scheduling and incremental processing logic. If your data source is a message queue (see Message Queues), stream processing is the idiomatic consumer.
Stream processing is ideal for event-driven architectures where actions trigger reactions. When a user signs up, send a welcome email, update recommendations, and log analytics—all in real-time. Stream processors orchestrate these workflows by consuming events and producing new events for downstream systems. The alternative—polling databases for changes—is inefficient and introduces latency.
Avoid stream processing for workloads requiring multiple passes over data. Training machine learning models, graph algorithms, and complex joins across large datasets are better suited to batch processing. Stream processors can’t efficiently iterate over unbounded data. Also avoid streaming for infrequent updates: if your data changes once per day, a daily batch job is simpler and cheaper than a continuously running stream processor.
Consider batch processing when accuracy is paramount and latency is flexible. Financial reconciliation, regulatory reporting, and data warehousing benefit from batch’s ability to process complete, immutable datasets. Batch systems can sort, join, and aggregate without worrying about late arrivals or partial results. If your SLA allows hours or days, batch is often the right choice.
For hybrid requirements, use the kappa architecture: stream processing with the ability to reprocess historical data. Flink and Spark Streaming support bounded sources (files, database snapshots) alongside unbounded sources (Kafka), allowing you to use the same code for both real-time and batch workloads. This eliminates the operational overhead of maintaining separate lambda pipelines.
Technology selection depends on your team’s expertise and existing infrastructure. Kafka Streams is the easiest entry point if you’re already using Kafka—it’s a library, not a separate cluster. Flink offers the most sophisticated features (event time, exactly-once, savepoints) but requires running a distributed cluster. Spark Streaming integrates with the Spark ecosystem (MLlib, GraphX) but uses micro-batching, adding latency. For simple use cases, consider managed services like AWS Kinesis or Google Dataflow to avoid operational complexity.
Real-World Examples
Netflix: Real-Time Recommendations
Netflix processes 500+ billion events per day through stream processing pipelines to update recommendations in real-time. When a user watches a show, streams of events flow through Kafka to Flink jobs that update user profiles, compute similarity scores, and refresh recommendation lists within seconds. The system uses session windows to group viewing activity and stateful joins to correlate viewing patterns with content metadata.
Interesting detail: Netflix’s stream processors maintain petabytes of state across thousands of Flink tasks, using incremental checkpointing to RocksDB backed by S3. They handle late arrivals from mobile devices that come online after hours offline by using 24-hour allowed lateness windows, ensuring recommendations eventually reflect all viewing activity even with extreme delays.
Impact: Real-time recommendations increase engagement by 10-15% compared to batch-updated recommendations, as users see relevant content immediately after expressing preferences.
LinkedIn: Activity Streams
LinkedIn’s activity streams process member actions (profile views, connection requests, job applications) in real-time to power notifications, feed updates, and analytics. The system uses Kafka Streams to process 1+ trillion events per day with sub-second latency. Tumbling windows aggregate metrics like “profile views per hour” while sliding windows compute moving averages for anomaly detection. Stateful processors maintain session state for each member to detect patterns like rapid-fire actions (potential bots) or engagement spikes. The architecture uses event time with 5-minute watermarks to handle mobile app events that arrive delayed.
Interesting detail: LinkedIn’s stream processors use local state stores with changelog topics in Kafka for fault tolerance, avoiding external databases for low-latency state access. This design achieves 99.9% availability with automatic recovery from failures in under 30 seconds.
Impact: Real-time activity streams enable LinkedIn’s notification system to deliver updates within 100ms of actions, driving 40% of daily active usage through timely engagement prompts.
Twitter: Trending Topics
Twitter’s trending topics system uses stream processing to identify emerging hashtags and topics in real-time across 500+ million tweets per day. The pipeline uses Apache Heron (Twitter’s successor to Storm) to process tweet streams with sliding windows that count hashtag occurrences over 5-minute intervals, updated every 30 seconds. Stateful operators track historical trends to distinguish genuine spikes from baseline noise. The system uses event time based on tweet creation timestamps to ensure trends reflect when topics actually emerged, not when tweets were processed. Watermarks with 2-minute allowed lateness handle delayed tweets from mobile clients.
Interesting detail: Twitter’s stream processors use custom state backends optimized for high-cardinality aggregations (millions of hashtags), employing probabilistic data structures like Count-Min Sketch to reduce memory usage from gigabytes to megabytes per window while maintaining 99%+ accuracy.
Impact: Real-time trending topics drive 30% of Twitter’s engagement by surfacing relevant conversations within minutes of emergence, compared to hourly batch updates that miss fast-moving events.
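Twitter’s internal implementation isn’t public, but the Count-Min Sketch idea itself is simple enough to sketch. The width, depth, and hashing choices below are illustrative defaults, not Twitter’s actual parameters:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter; estimates only ever overcount."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, derived from a keyed SHA-256
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum across
        # rows is the tightest (over-)estimate of the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for tag in ["#ai"] * 3 + ["#data"] * 2:
    cms.add(tag)
```

Memory is fixed at `width × depth` counters regardless of how many distinct hashtags flow through, which is exactly the gigabytes-to-megabytes reduction described above; widening the table tightens the overcount, deepening it shrinks the probability of a bad estimate.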
Interview Essentials
Mid-Level
Explain the difference between stream and batch processing with concrete examples. Interviewers expect you to articulate that streams are unbounded and processed continuously while batches are bounded and processed on schedules. Mention latency (seconds vs hours) and use cases (fraud detection vs reporting).
Describe tumbling, sliding, and session windows. Draw timelines showing how events map to windows. Explain when to use each: tumbling for periodic aggregates, sliding for moving averages, session for activity bursts. Mention memory implications: each event in a sliding window belongs to size/slide overlapping windows, so small slide intervals multiply state.
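The timeline drawing reduces to simple arithmetic. A framework-agnostic sketch of the window-assignment math (timestamps in milliseconds):

```python
def tumbling_window(ts_ms, size_ms):
    """Each timestamp falls in exactly one window [start, start + size)."""
    start = ts_ms - (ts_ms % size_ms)
    return (start, start + size_ms)

def sliding_windows(ts_ms, size_ms, slide_ms):
    """Each timestamp falls in size/slide overlapping windows,
    which is exactly why sliding windows cost more memory."""
    last_start = ts_ms - (ts_ms % slide_ms)
    first_start = last_start - size_ms + slide_ms
    return [(s, s + size_ms)
            for s in range(first_start, last_start + slide_ms, slide_ms)
            if s >= 0]

# An event at t=65s with 60s tumbling windows: exactly one window.
print(tumbling_window(65_000, 60_000))          # (60000, 120000)
# The same event with 60s windows sliding every 30s: two windows.
print(sliding_windows(65_000, 60_000, 30_000))  # [(30000, 90000), (60000, 120000)]
```

Session windows have no closed-form assignment: they are built by merging events whose gaps are below a threshold, which is why they need stateful merging rather than pure arithmetic.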
Explain event time vs processing time. Give an example where they diverge (mobile app events delayed by hours) and why event time is more correct for business logic. Mention watermarks as the mechanism for handling out-of-order events.
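A tiny illustration of the divergence, using hypothetical timestamps for an event created offline on a phone and uploaded hours later:

```python
from datetime import datetime, timezone

def hour_bucket(ts):
    """Truncate a timestamp to its containing hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Event created at 09:59 UTC while the phone was offline,
# uploaded and processed by the pipeline at 12:30 UTC.
event_time = datetime(2024, 1, 15, 9, 59, tzinfo=timezone.utc)
processing_time = datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc)

print(hour_bucket(event_time).hour)       # 9  -- the hour the user actually acted
print(hour_bucket(processing_time).hour)  # 12 -- the hour the pipeline saw it
```

Aggregating by processing time would credit this action to the wrong hour entirely, which is the correctness argument for event time.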
Describe how stateful stream processing works. Explain that state is maintained in local stores (RocksDB), checkpointed for fault tolerance, and partitioned by key for parallelism. Mention that state size is the primary scalability constraint.
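The mechanics can be sketched with a dict standing in for RocksDB and a deep copy standing in for a durable checkpoint:

```python
import copy

class KeyedCounter:
    """Toy stand-in for a keyed state store: state lives locally,
    partitioned by key, and is snapshotted for recovery (the roles
    RocksDB and checkpoints play in a real stream processor)."""
    def __init__(self):
        self.state = {}          # key -> running count
        self.checkpoint = {}

    def process(self, key, amount):
        self.state[key] = self.state.get(key, 0) + amount
        return self.state[key]

    def snapshot(self):
        self.checkpoint = copy.deepcopy(self.state)

    def restore(self):
        self.state = copy.deepcopy(self.checkpoint)

op = KeyedCounter()
op.process("user:1", 3)
op.snapshot()                   # durable point-in-time copy
op.process("user:1", 4)         # state = 7, not yet checkpointed
op.restore()                    # simulate crash + recovery
print(op.process("user:1", 1))  # 4: resumes from the checkpoint (3), not 7
```

The recovery result also shows why checkpointing alone is only at-least-once: the event that took the count to 7 must be replayed from the source after restore.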
Walk through exactly-once processing. Explain that it requires checkpointing operator state and transactional sinks. Mention that at-least-once with idempotent operations is often sufficient and more performant.
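The at-least-once-plus-idempotency pattern can be sketched as deduplication on a unique event id (hypothetical ids; in production the dedup set would be a unique key constraint in the sink database):

```python
class IdempotentSink:
    """At-least-once delivery plus an idempotent write yields
    effectively-once results: replayed events are deduplicated."""
    def __init__(self):
        self.applied = set()   # event ids already written
        self.total = 0

    def write(self, event_id, amount):
        if event_id in self.applied:
            return             # replay after a failure: safe no-op
        self.applied.add(event_id)
        self.total += amount

sink = IdempotentSink()
for eid, amt in [("e1", 10), ("e2", 5), ("e1", 10)]:  # "e1" redelivered
    sink.write(eid, amt)
print(sink.total)  # 15, not 25
```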
Senior
Design a real-time analytics system (e.g., Twitter trending topics). Start with requirements (latency, scale, accuracy). Propose Kafka for ingestion, Flink for processing with sliding windows and event time. Discuss state management (how much state per window?), watermark tuning (how late can events arrive?), and output (where do results go?). Address failure recovery and scaling.
Explain watermarks in depth. Describe how they’re generated (heuristics based on observed timestamps), propagated through the pipeline, and used to trigger window computations. Discuss the trade-off between latency (aggressive watermarks) and completeness (conservative watermarks). Mention allowed lateness for handling stragglers.
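The common generation heuristic, bounded out-of-orderness, reduces to trailing the maximum observed timestamp by a fixed bound (a simplified sketch of the idea, not any framework’s API):

```python
class BoundedOutOfOrdernessWatermark:
    """Heuristic watermark: trail the max observed event timestamp by
    a fixed bound. A smaller bound means lower latency (windows fire
    sooner) but more events classified as late; a larger bound means
    better completeness at the cost of latency."""
    def __init__(self, max_out_of_orderness_ms):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, ts_ms):
        self.max_ts = max(self.max_ts, ts_ms)

    def current(self):
        # Claim: no event with timestamp <= this value is still expected.
        return self.max_ts - self.bound

wm = BoundedOutOfOrdernessWatermark(2_000)   # tolerate 2s of disorder
for ts in [1_000, 3_500, 2_800, 6_000]:      # an out-of-order stream
    wm.on_event(ts)
print(wm.current())   # 4000: windows ending at or before t=4000 may fire
```

In a multi-input operator the effective watermark is the minimum across inputs, which is how a single slow partition can stall window firing pipeline-wide.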
Compare Kafka Streams, Flink, and Spark Streaming. Kafka Streams is a library (easy deployment) but limited to Kafka sources. Flink offers true streaming with event time and exactly-once but requires a cluster. Spark Streaming uses micro-batching (higher latency) but integrates with Spark’s ecosystem. Recommend based on requirements.
Design for exactly-once processing with external sinks. Explain two-phase commit: checkpoint operator state, write to sink transactionally, commit on checkpoint completion. Discuss that sinks must support transactions (Kafka, databases) or be idempotent. Mention performance overhead (10-30%) and when at-least-once suffices.
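The protocol can be sketched with in-memory stand-ins for the real transaction API, showing how visibility is deferred until the checkpoint completes:

```python
class TwoPhaseCommitSink:
    """Sketch of a checkpoint-aligned transactional sink: writes
    buffer in an open transaction, are staged (pre-committed) when
    the checkpoint barrier arrives, and become visible only after
    the whole checkpoint is durable."""
    def __init__(self):
        self.open_txn = []       # writes since the last barrier
        self.pending = None      # pre-committed, awaiting checkpoint ack
        self.committed = []      # visible to downstream readers

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self):        # phase 1: checkpoint barrier arrives
        self.pending, self.open_txn = self.open_txn, []

    def commit(self):            # phase 2: checkpoint confirmed durable
        self.committed.extend(self.pending)
        self.pending = None

sink = TwoPhaseCommitSink()
sink.write("a"); sink.write("b")
sink.pre_commit()                # "a","b" staged, not yet visible
sink.write("c")                  # belongs to the next transaction
sink.commit()                    # checkpoint complete: "a","b" visible
print(sink.committed, sink.open_txn)  # ['a', 'b'] ['c']
```

If a failure hits between pre_commit and commit, recovery must either roll the pending transaction forward or back, which is why the sink must support durable, resumable transactions.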
Discuss late data handling strategies. Explain dropping (simple, loses data), updating results (requires downstream to handle corrections), and side outputs (manual inspection). Mention allowed lateness configuration and the trade-off between state size and accuracy. Give a concrete example with numbers (“99% arrive within 1 minute, 1% take 10 minutes”).
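The three strategies reduce to a routing decision against the current watermark. A simplified sketch (real systems track lateness per window, not per event):

```python
def route(event_ts_ms, watermark_ms, allowed_lateness_ms):
    """Classify an event: on-time -> normal windowing; late but within
    allowed lateness -> update the already-fired window (downstream must
    handle the correction); too late -> side output for inspection."""
    if event_ts_ms >= watermark_ms:
        return "on_time"
    if event_ts_ms >= watermark_ms - allowed_lateness_ms:
        return "update_fired_window"
    return "side_output"

wm = 60_000                       # watermark currently at t=60s
print(route(61_000, wm, 10_000))  # on_time
print(route(55_000, wm, 10_000))  # update_fired_window (5s late, within 10s)
print(route(40_000, wm, 10_000))  # side_output (20s late, state already freed)
```

The allowed-lateness parameter is exactly the state-size trade-off: window state must be retained for that extra interval after firing so corrections remain possible.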
Staff+
Architect a multi-region stream processing system for global scale (e.g., Netflix recommendations). Discuss data locality (process events near where they’re generated), cross-region aggregation (how to combine regional results?), and consistency (eventual consistency across regions). Address failure scenarios (region outage) and cost optimization (minimize cross-region traffic).
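Cross-region aggregation is tractable when regional results are mergeable summaries (commutative and associative), so merge order, retries, and region lag don’t affect the final answer. A toy illustration with hypothetical counts:

```python
from collections import Counter

# Partial aggregates computed near the data, in each region...
us = Counter({"video:42": 900, "video:7": 120})
eu = Counter({"video:42": 300, "video:99": 80})

# ...shipped as compact summaries and merged globally. Only the small
# summaries cross regions, not the raw event streams.
global_counts = us + eu
print(global_counts["video:42"])  # 1200
```

The same mergeability argument is why sketches like HyperLogLog and Count-Min are popular in multi-region designs: they merge losslessly while keeping cross-region traffic tiny.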
Optimize stream processing for large state (terabytes). Discuss incremental checkpointing (only snapshot changed data), state TTL (expire old keys), and pre-aggregation (reduce cardinality before windowing). Mention RocksDB tuning (block cache, compaction) and when to use external state stores (Redis, Cassandra) despite latency overhead.
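State TTL can be sketched as lazy expiry on access, a toy stand-in for what state backends do to bound growth under high key cardinality:

```python
class TTLState:
    """Keyed state with time-to-live: keys untouched for ttl_ms are
    dropped lazily on the next read, bounding state growth when the
    key space is unbounded (users, sessions, hashtags)."""
    def __init__(self, ttl_ms):
        self.ttl = ttl_ms
        self.entries = {}   # key -> (value, last_access_ms)

    def put(self, key, value, now_ms):
        self.entries[key] = (value, now_ms)

    def get(self, key, now_ms):
        if key not in self.entries:
            return None
        value, ts = self.entries[key]
        if now_ms - ts > self.ttl:
            del self.entries[key]            # expired: reclaim state
            return None
        self.entries[key] = (value, now_ms)  # refresh TTL on access
        return value

s = TTLState(ttl_ms=60_000)
s.put("user:1", 5, now_ms=0)
print(s.get("user:1", now_ms=30_000))   # 5 (live, TTL refreshed)
print(s.get("user:1", now_ms=200_000))  # None (expired and evicted)
```

Real backends also expire in the background (e.g., during RocksDB compaction) so dead keys don’t have to be read to be reclaimed.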
Design a lambda architecture and argue for/against it. Lambda runs both stream (fast, approximate) and batch (slow, accurate) pipelines, merging results at query time. Argue for: batch corrects stream errors, proven pattern. Argue against: operational complexity, duplicate code, kappa architecture (stream-only with reprocessing) is simpler. Discuss when each makes sense.
Explain how to migrate a batch system to streaming without downtime. Discuss dual-writing (write to both batch and stream), shadow mode (run stream in parallel, compare results), and gradual cutover (route increasing traffic to stream). Address data consistency during migration and rollback plans. Mention that Flink/Spark support bounded sources, allowing code reuse.
Discuss stream processing for machine learning. Explain feature extraction (real-time features from event streams), online learning (update models incrementally), and model serving (low-latency inference). Address challenges: stateful feature stores, model versioning, A/B testing in streams. Give examples like fraud detection or recommendation systems.
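Online learning reduces to one gradient step per event. A self-contained sketch with synthetic labeled events (toy logistic regression, not a production fraud model):

```python
import math

class OnlineLogisticRegression:
    """Incremental model update per event (online SGD): the pattern
    behind continuously updating a model from a labeled event stream
    instead of retraining in batch."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))   # probability of positive class

    def update(self, x, label):
        # One SGD step on the logistic loss for a single event.
        err = self.predict(x) - label
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

model = OnlineLogisticRegression(n_features=2)
# Synthetic stream of (features, fraud_label): feature[0] high => fraud.
for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200:
    model.update(x, y)
print(model.predict([1.0, 0.0]) > 0.9)   # True: learned one event at a time
```

The operational challenges the question asks about sit around this loop: the feature vector x must come from a stateful feature store, and each weight update is state that needs checkpointing and versioning like any other.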
Common Interview Questions
How do you handle late-arriving events? (Watermarks, allowed lateness, side outputs)
What’s the difference between event time and processing time? (When event occurred vs when processed; event time is correct for business logic)
Explain exactly-once processing. (Checkpointing + transactional sinks; at-least-once with idempotency is often sufficient)
How do you scale stream processing? (Increase parallelism, partition by key, add workers; state size is the constraint)
When would you choose batch over streaming? (Iterative algorithms, infrequent updates, perfect accuracy requirements, cost constraints)
Red Flags to Avoid
Confusing stream processing with message queues. Queues are the transport layer; stream processing is the computation layer that consumes from queues.
Not understanding time semantics. Saying “just use the current time” for aggregations shows lack of depth. Event time is critical for correctness.
Ignoring state management. Claiming stream processing is stateless or not discussing how state is checkpointed and recovered indicates surface-level knowledge.
Proposing stream processing for batch workloads. Suggesting streaming for daily reports or iterative algorithms shows poor judgment about tool selection.
Not considering failure scenarios. Failing to discuss checkpointing, replay, and exactly-once guarantees suggests lack of production experience.
Key Takeaways
Stream processing analyzes unbounded data flows in real-time, enabling sub-second insights that batch processing cannot achieve. Use it when latency matters more than throughput and data arrives continuously.
Windowing (tumbling, sliding, session) divides infinite streams into finite chunks for aggregation. Time semantics (event time vs processing time) and watermarks handle out-of-order arrivals, with allowed lateness balancing accuracy and resource usage.
Stateful stream processing maintains intermediate results in local stores (RocksDB), checkpointed for fault tolerance. State size is the primary scalability constraint—optimize with pre-aggregation, TTL, and incremental checkpointing.
Exactly-once processing requires checkpointing operator state and transactional sinks, adding 10-30% overhead. At-least-once with idempotent operations often provides equivalent guarantees with better performance.
Technology selection depends on requirements: Kafka Streams for simplicity, Flink for sophisticated features (event time, exactly-once), Spark Streaming for ecosystem integration. Modern systems support both streaming and batch (kappa architecture) to eliminate duplicate pipelines.