What is Apache Flink?

Apache Flink is a distributed engine for stateful stream processing¹. It treats an unbounded sequence of events as the primary thing it operates on, handling each record as it arrives rather than collecting events into small batches first. Around that core it adds the two features that make real streaming usable in production: event-time processing with watermarks, so results are correct even when data shows up late, and large managed state with periodic checkpoints, so a job can keep terabytes of running state and still recover to a consistent point after a crash. The combination is what people mean when they call Flink "true streaming with exactly-once."

🔗 Learn more — ¹ Batch vs stream processing

How it differs from Spark

The clearest way to place Flink is against Apache Spark². Spark was born batch-first: its native model is a job over a finite dataset, and its streaming layer, Structured Streaming, runs as micro-batch — it slices the incoming events into tiny batches and runs a small batch job on each. That is a pragmatic design and "good enough streaming" for a lot of analytics work, but the latency floor and the mental model are batch-shaped.

🔗 Learn more — ² What is Apache Spark?

Flink is streaming-first. Its native model is record-at-a-time over an unbounded stream, and batch processing is treated as the special case where the stream happens to be bounded — the opposite framing from Spark. In practice that means Flink reaches lower per-event latency and expresses continuous, stateful logic (sessionization, running aggregations, joins across streams) more directly. If your workload is genuinely event-driven and latency-sensitive, the streaming-first model fits; if it is periodic analytics, Spark's batch heritage is often the simpler match.

Event time, processing time, and watermarks

Streaming forces a question batch can usually ignore: which clock counts? Processing time is when Flink saw the event; event time is when the event actually happened, carried as a timestamp in the record. They diverge because events arrive out of order and late — a device goes offline, a network stalls, and an event stamped 12:00 lands at 12:05.

Flink computes results in event time and uses watermarks to make progress. A watermark is a marker flowing through the stream that asserts "we have probably seen all events up to time T," which lets Flink decide when a time window is complete enough to emit, while still giving late records a bounded chance to be folded in. This machinery — windows keyed on event time plus watermarks — is the part of streaming that is genuinely hard, and it was formalized in Google's Dataflow model (building on the earlier MillWheel work). Getting watermarks and allowed-lateness right is most of the real work in a correct streaming job.

Checkpointing, state, and exactly-once

Flink jobs are stateful: an operator might hold counters, window contents, or join buffers that grow with the data. Flink stores this in a pluggable state backend — typically an embedded RocksDB instance that spills to local disk so state can far exceed memory — and periodically writes a consistent snapshot, a checkpoint, to durable storage.

flowchart TD
    SRC["Event source (Apache Kafka)"] --> OP["Stateful operator keeps local state"]
    OP --> SINK["Sink"]
    OP --> CKPT["Periodic checkpoint of state"]
    CKPT --> STORE["Durable store (object storage)"]
    STORE --> REC["On failure: restore state, replay from offset"]
    REC --> OP

    %% color = green: recovery path that gives exactly-once, grey: normal flow
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class SRC,OP,SINK plain
    class CKPT,STORE,REC key

Checkpoints are what give Flink exactly-once semantics for state: after a crash it restores the last good snapshot and rewinds the source (an offset in Apache Kafka³, for instance) to match, so each event affects the state exactly once. Note this is exactly-once state, not magic end-to-end delivery — exactly-once to an external sink additionally requires that sink to support transactions or idempotent⁴ writes.

🔗 Learn more — ³ What is Apache Kafka?

🔗 Learn more — ⁴ What is idempotency (in data pipelines)?

The SQL API and the honest tradeoff

Not everything needs the low-level DataStream API. Flink SQL lets you express streaming logic as ordinary SQL — windowed aggregations, stream-to-stream joins, continuous queries — over both streaming and batch sources, which is the most approachable way in for analysts and SQL-fluent teams.

The honest part: Flink is powerful and genuinely hard to run well. It is a JVM cluster with a noticeable memory footprint, it runs 24/7 with no quiet window to redeploy, and tuning state backends, checkpoint intervals, and watermark strategies is real, ongoing operational work. The exactly-once stateful streaming model is worth it when the latency requirement actually demands it. But many "streaming" needs are satisfied by simpler tools — micro-batch in an existing engine, or a lighter stream processor — and reaching for a full Flink deployment because it sounds more modern is a common, expensive mistake. Use it when continuous, correct, large-state streaming is the requirement, not the default.