What is Apache Beam?

Apache Beam is a unified programming model for both batch and streaming data processing. You write one pipeline, built from PCollections and transforms, and run it on a pluggable runner such as Dataflow, Apache Flink¹, or Apache Spark². The pitch is write-once, run-anywhere: the same code that crunches a finite dataset can process an unbounded event stream, and the same code can move between execution engines without a rewrite. Beam grew out of the model Google described in its Dataflow paper;¹ Google later donated the SDK built on that model to the Apache Software Foundation, where it became Beam.

🔗 Learn more — ¹ What is Apache Flink?

🔗 Learn more — ² What is Apache Spark?

🔗 Learn more — ¹ The Dataflow Model (Google Research)

One model for batch and stream

Everything in Beam is a pipeline that reads input, applies transforms, and writes output. The data lives in a PCollection, an immutable distributed dataset, and you reshape it with transforms (map, filter, group-by-key, and so on). The key idea is that a PCollection can be bounded (a fixed dataset with a known end, the world of batch processing) or unbounded (an endless event stream, the world of stream processing³). Crucially, the transforms you write are the same either way; the main extra step for unbounded data is windowing it before you group or aggregate. To Beam, a batch is just a stream that happens to end.

🔗 Learn more — ³ Batch vs stream processing

That unification is the genuinely elegant part. Most engines historically gave you two separate APIs and two mental models; Beam collapses them into one, so a team does not have to maintain a batch pipeline and a streaming pipeline that compute the same thing by different means.
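To make the "same transforms, bounded or unbounded" idea concrete, here is a minimal plain-Python sketch. It deliberately does not use the Beam SDK: the function `large_purchases`, the record format, and the threshold are all hypothetical, chosen only to show one piece of logic running unchanged over a finite list and an endless generator.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Tuple


def parse(line: str) -> Tuple[str, int]:
    """Split a 'user,amount' record into a (user, amount) pair."""
    user, amount = line.split(",")
    return user, int(amount)


def large_purchases(lines: Iterable[str], threshold: int = 100) -> Iterator[Tuple[str, int]]:
    """The 'pipeline': parse each record, keep amounts above the threshold.

    It is written once against Iterable, so it does not care whether
    the input ever ends.
    """
    for line in lines:
        user, amount = parse(line)
        if amount > threshold:
            yield user, amount


# Bounded input: a fixed dataset with a known end (batch).
batch = ["ann,250", "bob,40", "cy,120"]
batch_result = list(large_purchases(batch))


# Unbounded input: a generator that never stops (stream).
def endless_events() -> Iterator[str]:
    for i in count():
        yield f"user{i},{i * 60}"


# An unbounded source can only ever be observed as a prefix,
# so take the first three matching results.
stream_result = list(islice(large_purchases(endless_events()), 3))

print(batch_result)   # [('ann', 250), ('cy', 120)]
print(stream_result)  # [('user2', 120), ('user3', 180), ('user4', 240)]
```

The sketch only covers element-wise steps. Grouping or aggregating an endless input needs some notion of "when is a group finished", which is what windowing provides.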

flowchart TD
    PIPE["Beam pipeline: PCollections + transforms"] --> RUN["Runner (execution backend)"]
    RUN --> DF["Dataflow"]
    RUN --> FL["Apache Flink"]
    RUN --> SP["Apache Spark"]

    %% color = green: the portable code you write, amber: pluggable backends
    classDef code stroke:#a3be8c,stroke-width:2.5px
    classDef backend stroke:#ebcb8b,stroke-width:2.5px
    class PIPE code
    class RUN,DF,FL,SP backend

Event time: windowing, watermarks, triggers

The hard part of streaming is that events arrive late and out of order — an event stamped 12:00 might land at 12:05 — so you cannot just bucket data by when you happened to see it. Beam's model, formalized in that Dataflow paper, is built around event time: when something actually happened, not when your system processed it.

You group unbounded data into windows keyed on event time (fixed, sliding, or session windows). A watermark is the system's estimate of how far event time has progressed: its best guess that everything up to some point has arrived, which tells it when a window is probably complete. Triggers decide when to actually emit a window's result (early, on time, or late), letting you choose between getting a fast provisional answer and waiting for stragglers. This trio addresses most of what makes correct streaming harder than it looks, and getting it into one coherent API is Beam's real contribution.
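The interplay of the three pieces can be sketched in plain Python. This is a toy under stated assumptions, not Beam's API or its actual watermark logic: the window size, the `SKEW` heuristic (watermark = latest event time seen minus a fixed allowance), and the function names are all hypothetical. It implements fixed windows with an on-time firing and late corrections, and omits early firings.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

WINDOW = 60  # fixed window size in seconds (assumed)
SKEW = 30    # assumed out-of-orderness allowance for the toy watermark


def window_start(event_time: int) -> int:
    """Assign an event to its fixed window by event time."""
    return event_time - event_time % WINDOW


def run(events: List[Tuple[int, int]]) -> List[Tuple[int, int, str]]:
    """Process (event_time, value) pairs in arrival order.

    Returns emitted panes as (window_start, running_sum, label).
    """
    sums: Dict[int, int] = defaultdict(int)
    closed: Set[int] = set()  # windows whose on-time pane already fired
    panes: List[Tuple[int, int, str]] = []
    max_seen = 0
    watermark = 0             # event times are assumed non-negative
    for event_time, value in events:
        w = window_start(event_time)
        sums[w] += value
        if w + WINDOW <= watermark:
            # Late trigger: the watermark already passed this window,
            # so emit a corrected (accumulated) result.
            closed.add(w)
            panes.append((w, sums[w], "late"))
        max_seen = max(max_seen, event_time)
        watermark = max(watermark, max_seen - SKEW)
        # On-time trigger: fire every open window the watermark has passed.
        for start in sorted(sums):
            if start not in closed and start + WINDOW <= watermark:
                closed.add(start)
                panes.append((start, sums[start], "on-time"))
    return panes


# Arrival order; the event stamped 55 shows up after the one stamped 95.
events = [(10, 1), (50, 2), (70, 5), (95, 1), (55, 4), (160, 3)]
panes = run(events)
for pane in panes:
    print(pane)
# (0, 3, 'on-time')
# (0, 7, 'late')
# (60, 6, 'on-time')
```

The window starting at 0 fires with a sum of 3 once the watermark passes 60, then fires again with 7 when the straggler arrives. The window starting at 120 never fires because the toy watermark does not get past its end.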

Runners are the engine, not Beam

Beam itself does not execute anything. You pick a runner — the execution backend — and it translates your pipeline into work for a specific engine: Google Cloud Dataflow, a Flink cluster, a Spark cluster, or a local direct runner for testing. In principle this is portability: swap the runner flag, keep the code.
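The separation between describing a pipeline and executing it can be illustrated with a small plain-Python sketch. Nothing here is Beam's runner interface: `eager_runner` and `lazy_runner` are hypothetical stand-ins for two engines with different execution strategies, and the "pipeline" is just a list of element-wise steps.

```python
from typing import Callable, Iterable, Iterator, List

# A "pipeline" here is only a description: an ordered list of steps,
# each mapping one input element to zero or more output elements.
Step = Callable[[int], Iterable[int]]

pipeline: List[Step] = [
    lambda x: [x * 2],               # map: double each element
    lambda x: [x] if x % 3 else [],  # filter: drop multiples of 3
]


def eager_runner(steps: List[Step], data: Iterable[int]) -> List[int]:
    """Hypothetical runner A: materialise a full list after every step."""
    for step in steps:
        data = [out for item in data for out in step(item)]
    return list(data)


def lazy_runner(steps: List[Step], data: Iterable[int]) -> List[int]:
    """Hypothetical runner B: chain generators, one element at a time."""
    def apply(step: Step, items: Iterable[int]) -> Iterator[int]:
        for item in items:
            yield from step(item)

    for step in steps:
        data = apply(step, data)
    return list(data)


source = [1, 2, 3, 4, 5, 6]
eager_result = eager_runner(pipeline, source)
lazy_result = lazy_runner(pipeline, source)
print(eager_result)  # [2, 4, 8, 10]
print(lazy_result)   # [2, 4, 8, 10]
```

Both runners accept the same pipeline description and produce the same result while doing the work differently, which is the portability idea in miniature.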

In practice the abstraction has a cost, and it is worth being honest about it. Beam adds a layer between your code and the engine, which can mean less direct control and occasionally worse performance than writing for an engine natively. Runner support varies: features land first and most completely on Dataflow (Google's hosted service), and other runners can lag or implement a subset, so "run anywhere" is more aspirational than guaranteed for every feature. And many teams simply do not need it — if you have committed to Flink or Spark, using that engine's own API directly is often simpler than routing through Beam, and the portability you paid for goes unused.

When Beam earns its keep

Beam shines when portability is a real requirement: you want to avoid locking into one engine, you run the same logic across batch and streaming, or you are on Dataflow and want a managed service without writing Dataflow-specific code. Its event-time model is also a clean, well-thought-out way to express windowed streaming regardless of where it runs.

But it is a means, not a default. If you are not switching engines and not straddling batch and stream, the extra abstraction buys you little. The fair summary: the unified model and the portability story are elegant and were genuinely influential — the watermark-and-trigger machinery shaped the whole field — yet plenty of solid teams reach for Flink or Spark directly and never miss it.