Batch vs stream processing

Batch processing runs a job over a finite, complete dataset (yesterday's orders, last hour's logs) on a schedule, optimizing for throughput and producing results only once the whole input has been chewed through. Stream processing runs continuously over an unbounded sequence of events as they arrive, optimizing for latency so a result appears seconds (or milliseconds) after the event that caused it. The goal is the same, turning raw data into something useful, but the two make opposite tradeoffs on when you get the answer, and that one choice drives almost everything else about the system.

Bounded vs unbounded data

The cleanest way to tell them apart is the shape of the input. A batch job reads bounded data: a dataset with a known beginning and end. The job starts, processes every record, and finishes. You can re-run it, count its rows, and reason about it as a single immutable thing. Tools like Apache Spark¹ were built for exactly this.

🔗 Learn more — ¹ What is Apache Spark?

A stream is unbounded: events keep coming and there is no "end" to wait for. A stream processor never finishes; it maintains state and emits results incrementally, forever. Apache Flink² (the processing engine) and Apache Kafka³ (the durable event log that usually feeds it) are the usual backbone here. The deep insight behind modern systems is that a batch is just a stream you decided to stop reading: bounded data is a special case of unbounded data, which is why some engines run both modes on one runtime.

🔗 Learn more — ² What is Apache Flink?

🔗 Learn more — ³ What is Apache Kafka?
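To make the bounded/unbounded distinction concrete, here is a minimal sketch in plain Python rather than Spark or Flink. The function names `batch_total` and `streaming_totals` are made up for illustration. The batch version needs the whole dataset before it answers; the streaming version keeps running state and emits an updated answer after every event, so it also works on an input that never ends.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def batch_total(orders: list[float]) -> float:
    """Batch: the whole bounded dataset is available, so compute one final answer."""
    return sum(orders)


def streaming_totals(orders: Iterable[float]) -> Iterator[float]:
    """Stream: keep running state and emit an updated result after each event."""
    total = 0.0
    for amount in orders:
        total += amount
        yield total


orders = [10.0, 5.0, 20.0]

# One result, available only after every record has been read.
print(batch_total(orders))

# One result per event; the last one matches the batch answer.
print(list(streaming_totals(orders)))

# The same streaming code runs over an endless source (1, 2, 3, ...);
# we simply decide to stop reading after three results.
print(list(islice(streaming_totals(count(1)), 3)))
```

The last call is the "batch is a stream you stopped reading" idea in miniature: the streaming function never changes, only the decision about when to stop consuming it.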

flowchart TD
    SRC["Raw data source"] --> BATCH["Batch: bounded dataset, runs on a schedule"]
    SRC --> STREAM["Stream: unbounded events, runs continuously"]
    BATCH --> BOUT["Result after the whole batch finishes — high latency, high throughput"]
    STREAM --> SOUT["Result per event, seconds after arrival — low latency"]

    %% color = which processing model: amber batch, green stream, grey shared
    classDef batch stroke:#ebcb8b,stroke-width:2.5px
    classDef stream stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class BATCH,BOUT batch
    class STREAM,SOUT stream
    class SRC plain

The latency–throughput tradeoff

Batch wins on throughput: by waiting until it has a large chunk, it reads sequentially, amortizes startup cost across millions of rows, and lets the engine plan an efficient pass over everything. The price is latency — the answer is at best as fresh as the last run, so a daily job means results can be nearly a day stale.

Stream wins on latency: each event is handled as it lands, so freshness is measured in seconds. The price is operational and computational overhead per event, and generally lower peak throughput than a tightly optimized batch pass over the same volume. You are paying continuously to never wait.
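The amortization argument is simple arithmetic, sketched below as a toy cost model. Everything here is an assumption for illustration: the function name `avg_ms_per_record`, the idea of a single fixed overhead per invocation (a job launch, a commit, a network round trip), and the numbers themselves, which are not measurements of any real system.

```python
def avg_ms_per_record(overhead_ms: float, per_record_ms: float, records_per_call: int) -> float:
    """Average cost per record when one fixed overhead is shared by a whole call."""
    return (overhead_ms + per_record_ms * records_per_call) / records_per_call


# Hypothetical numbers: 50 ms of fixed overhead per call, 1 ms of work per record.
one_at_a_time = avg_ms_per_record(50, 1, 1)       # overhead paid on every record
batched = avg_ms_per_record(50, 1, 1000)          # overhead shared by 1000 records

print(one_at_a_time)
print(batched)
```

Under these made-up numbers the per-record cost falls from about 51 ms to about 1.05 ms as the call grows from one record to a thousand. The flip side is that the first record in a call waits for the other 999 before its result appears.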

Micro-batching is the pragmatic middle ground. Instead of one record at a time or one giant nightly job, the engine collects events into tiny batches — say every few seconds — and processes each as a mini batch. Spark Structured Streaming is the canonical example. You trade a little latency for much simpler exactly-once semantics and the throughput benefits of batching, which is why micro-batching is often "good enough streaming" for analytics-flavored work.
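The grouping step can be sketched in a few lines of plain Python. This is not how Spark Structured Streaming is implemented; it only illustrates the idea. The function name `micro_batches` is hypothetical, arrival times are simulated as numbers instead of read from a real clock, and events are assumed to be in arrival order.

```python
from typing import Iterable, Iterator


def micro_batches(
    events: Iterable[tuple[float, str]], interval_s: float
) -> Iterator[list[str]]:
    """Group (arrival_time_s, payload) events into batches of interval_s seconds.

    Assumes events are sorted by arrival time. Empty intervals emit nothing.
    """
    batch: list[str] = []
    batch_end = None  # arrival time at which the current batch closes
    for arrival_s, payload in events:
        if batch_end is None:
            batch_end = arrival_s + interval_s
        if arrival_s >= batch_end:
            # The current interval is over: hand the batch downstream.
            if batch:
                yield batch
                batch = []
            # Skip ahead past any intervals in which nothing arrived.
            while arrival_s >= batch_end:
                batch_end += interval_s
        batch.append(payload)
    if batch:
        yield batch  # flush whatever is left when the input ends


events = [(0.0, "a"), (1.0, "b"), (2.5, "c"), (7.0, "d")]
for batch in micro_batches(events, interval_s=2.0):
    print(batch)
```

Each yielded list can then be processed with ordinary batch code. The added latency is bounded by the interval, since an event waits at most `interval_s` seconds before its batch closes.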

Processing time vs event time

Streaming forces a question batch mostly ignores: which clock counts? Processing time is when your system saw the event; event time is when it actually happened. They diverge because events arrive late — a phone goes offline, a network hiccups, and an event timestamped 12:00 shows up at 12:05. A batch job over a full day rarely cares; by the time it runs, the stragglers have landed. A stream that closes its "noon" window at noon will miss them.

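Real stream processors handle this with windows keyed on event time plus watermarks, a heuristic for "we've probably seen everything up to time T". A window stays open until the watermark passes its end, so events that arrive out of order but ahead of the watermark still land in the right window; anything later than that has to be covered by an explicit lateness policy or it is dropped. This machinery, formalized in Google's Dataflow model, is most of what makes streaming genuinely harder than it looks.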
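A toy version of event-time windowing with a watermark might look like the following. It is a sketch under stated assumptions, not the API of Flink or any other engine: the function name `event_time_counts` is hypothetical, windows are fixed-size (tumbling), and the watermark is assumed to be "largest event time seen so far minus a fixed `max_delay_s`", which is one common heuristic among several. Events whose window has already closed are simply collected as dropped.

```python
from collections import defaultdict


def event_time_counts(events, window_s, max_delay_s):
    """Count events per event-time window, closing windows as the watermark advances.

    events: iterable of (event_time_s, payload) in ARRIVAL order.
    Returns (closed_windows, still_open_windows, dropped_payloads).
    """
    open_windows = defaultdict(int)  # window start -> count so far
    closed = {}                      # window start -> final count
    dropped = []                     # payloads that arrived after their window closed
    watermark = float("-inf")        # "we have probably seen everything up to here"

    for event_time, payload in events:
        start = (event_time // window_s) * window_s
        if start + window_s <= watermark:
            dropped.append(payload)  # too late: its window was already emitted
        else:
            open_windows[start] += 1

        # Assumed heuristic: events are at most max_delay_s seconds out of order.
        watermark = max(watermark, event_time - max_delay_s)

        # Close every window that ends at or before the watermark.
        for s in sorted(w for w in open_windows if w + window_s <= watermark):
            closed[s] = open_windows.pop(s)

    return closed, dict(open_windows), dropped


# Event times in seconds, listed in the order they arrive.
arrivals = [(10, "a"), (70, "b"), (50, "c"), (100, "d"), (20, "e"), (130, "f")]
closed, still_open, dropped = event_time_counts(arrivals, window_s=60, max_delay_s=30)
print(closed)      # {0: 2}
print(still_open)  # {60: 2, 120: 1}
print(dropped)     # ['e']
```

Event "c" arrives after "b" but still lands in the first window, because the watermark has not yet passed second 60. Event "e" arrives after that window closed and is dropped. Raising `max_delay_s` would rescue it at the cost of holding every window open longer, which is the tradeoff the watermark encodes.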

Which tool for which job

Pick batch when the work is periodic and correctness-over-freshness: nightly analytics, reporting, OLAP⁴ rollups, ML training sets, backfills, anything feeding a data pipeline⁵ where "yesterday's numbers" is fine. It is cheaper, simpler to test, and trivially re-runnable.

🔗 Learn more — ⁴ OLTP vs OLAP: two opposite jobs

🔗 Learn more — ⁵ What is a data pipeline?

Pick stream when freshness is the feature: fraud and anomaly alerting, real-time dashboards and metrics, live recommendation and personalization features, and event-driven integrations between services. If a five-minute-old answer is worthless, you need a stream.

The honest part: most organizations need both. A classic split, often called the Lambda architecture, is streaming for the live, low-latency path and batch for the authoritative, reprocessable one: the same data handled two ways. And streaming is the operationally heavier choice: stateful jobs run 24/7, late data and out-of-order events are your problem, exactly-once processing takes real care, and there is no quiet window to redeploy. A batch job tends to fail loudly and you re-run it; a stream can fail subtly at 3am and quietly drop events. Reach for streaming when the latency requirement actually demands it, not because it sounds more modern than a nightly job that would have done the same work for a fraction of the effort.