What is idempotency (in data pipelines)?

Idempotency is the property of an operation that produces the same result whether you run it once or many times. Running it twice leaves the system in exactly the state it would be in after running it once; the extra runs are no-ops in effect. In a data pipeline¹ that single property is the difference between a job you can safely retry and a job that silently corrupts its own output every time it restarts.

🔗 Learn more — ¹ What is a data pipeline?

Why pipelines need it

Pipelines fail. A node dies mid-run, a network blip kills a connection, a scheduler retries a task it thinks timed out, or someone re-runs yesterday's job to backfill a fix. Every one of those is a re-execution of a step that may have already partially or fully completed. If a step is not idempotent, re-execution does damage: a job that does INSERT INTO sales SELECT ... FROM staging will append the same rows a second time, and now your revenue total is doubled. A counter that does += 1 runs twice and counts one event as two.
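
The double-append failure described above is easy to reproduce. The sketch below uses an in-memory SQLite database; the column names (order_id, amount) and the helper names append_step and revenue are assumptions made for illustration, not part of any particular pipeline.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(1, 100.0), (2, 50.0)])


def append_step(conn):
    """Non-idempotent step: every execution appends the staging rows again."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO sales SELECT order_id, amount FROM staging")


def revenue(conn):
    """Total revenue currently recorded in the sales table."""
    return conn.execute("SELECT COALESCE(SUM(amount), 0) FROM sales").fetchone()[0]


append_step(conn)
revenue_after_one_run = revenue(conn)

append_step(conn)  # simulated scheduler retry of the same step
revenue_after_retry = revenue(conn)

print(revenue_after_one_run, revenue_after_retry)  # 150.0 300.0
```

Nothing raised an error and no row is individually wrong, which is why this kind of corruption tends to go unnoticed until a total looks off.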

The realistic assumption for any distributed system is at-least-once delivery: a message or task will arrive one or more times, never guaranteed exactly once. You cannot make the infrastructure stop retrying — retries are how it survives failure. So the burden moves to your steps: make them safe to repeat, and at-least-once stops being a correctness problem. Idempotency is how you get an exactly-once result out of at-least-once execution.

How to make a step idempotent

The techniques all share one idea — make the write depend on a key or a target, not on "whatever ran before me":

Overwrite a partition instead of appending. Rather than appending today's rows, replace the whole dt=2026-06-18 partition with the freshly computed set. Provided the computation is deterministic for a given input, re-running the job for that day overwrites the same partition with the same data, so the result is identical every time. This is the most common idempotency pattern in batch pipelines and pairs naturally with data partitioning².
Upsert / MERGE on a key. Instead of a blind insert, MERGE (or INSERT ... ON CONFLICT) keyed on a primary key: update the row if the key exists, insert it if it doesn't. Re-applying the same record converges to the same row rather than creating a duplicate. This is how CDC³ consumers stay correct when the change stream replays.
Deterministic IDs plus deduplication. Give each record a stable, content-derived ID (a hash of its natural key, not a random UUID, which would differ on every arrival). A second arrival of the same event carries the same ID, so a deduplication step or a unique constraint drops it.
Delete-then-insert by partition. A blunt, reliable variant: in one transaction, delete the target slice and insert the recomputed rows. The net effect is the same whether it ran zero or three times before.
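
Two of the techniques above, delete-then-insert by partition and upsert on a key, can be sketched with SQLite. The table names (daily_sales, customers), their columns, and the helper functions are hypothetical, and the upsert relies on SQLite's INSERT ... ON CONFLICT syntax (available in SQLite 3.24 and later); other engines spell this differently, for example with MERGE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative tables: one partitioned by a dt column, one keyed by primary key.
conn.execute("CREATE TABLE daily_sales (dt TEXT, order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")


def load_partition(conn, dt, rows):
    """Delete-then-insert: replace the whole dt slice in one transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM daily_sales WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO daily_sales (dt, order_id, amount) VALUES (?, ?, ?)",
            [(dt, order_id, amount) for order_id, amount in rows],
        )


def upsert_customer(conn, customer_id, email):
    """Upsert on the primary key: insert if absent, update if present."""
    with conn:
        conn.execute(
            "INSERT INTO customers (customer_id, email) VALUES (?, ?) "
            "ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email",
            (customer_id, email),
        )


recomputed_rows = [(1, 100.0), (2, 50.0)]
for _ in range(3):  # the original run plus two retries
    load_partition(conn, "2026-06-18", recomputed_rows)
    upsert_customer(conn, 42, "ada@example.com")

partition_rows = conn.execute(
    "SELECT COUNT(*) FROM daily_sales WHERE dt = ?", ("2026-06-18",)
).fetchone()[0]
customer_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(partition_rows, customer_rows)  # 2 1
```

In both helpers the write is addressed to a target (a partition value or a primary key), so the state after three executions matches the state after one.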

🔗 Learn more — ² What is data partitioning?

🔗 Learn more — ³ What is Change Data Capture (CDC)?

The anti-pattern to recognize: appends, in-place increments, and "insert only the new rows since last time" logic that relies on remembering exactly where it stopped. Those break the moment a step runs twice.
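
The contrast between an in-place increment and a keyed write can be shown with a redelivered event. This is a minimal sketch: the event fields (source, order_id), the event_id helper, and both counter classes are assumptions for illustration, and the in-memory set stands in for what would need to be a durable dedupe table or unique constraint in a real pipeline.

```python
import hashlib


def event_id(source, natural_key):
    """Stable, content-derived ID: the same event always hashes to the same value."""
    return hashlib.sha256(f"{source}:{natural_key}".encode("utf-8")).hexdigest()


class NaiveCounter:
    """Anti-pattern: in-place increment, so a redelivery counts twice."""

    def __init__(self):
        self.count = 0

    def handle(self, event):
        self.count += 1


class DedupingCounter:
    """Idempotent: the count is derived from the set of distinct event IDs."""

    def __init__(self):
        self.seen = set()

    def handle(self, event):
        # Adding an ID that is already present leaves the set unchanged.
        self.seen.add(event_id(event["source"], event["order_id"]))

    @property
    def count(self):
        return len(self.seen)


first = {"source": "checkout", "order_id": 1001}
second = {"source": "checkout", "order_id": 1002}
deliveries = [first, second, first]  # at-least-once: the first event arrives twice

naive, deduping = NaiveCounter(), DedupingCounter()
for event in deliveries:
    naive.handle(event)
    deduping.handle(event)

print(naive.count, deduping.count)  # 3 2
```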

The HTTP analogy

If this feels familiar from the web, it should. HTTP defines some request methods as idempotent: PUT and DELETE, along with safe methods such as GET and HEAD, are specified so that issuing the same request several times has the same intended effect as issuing it once, while POST is not. PUT /users/42 with a body sets that user to a known state: send it five times and the user ends up the same. POST /users typically creates a new user on each call. A partition-overwrite pipeline step is the PUT of data engineering; a blind-append step is the POST.
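
The PUT/POST distinction can be modeled without any network at all. The UserStore class below is a hypothetical in-memory stand-in for a users resource, not a real HTTP framework; it only illustrates why addressing a write by key makes repeats harmless.

```python
import itertools


class UserStore:
    """Toy model of a /users resource (illustrative only)."""

    def __init__(self):
        self.users = {}
        self._next_id = itertools.count(1)

    def put(self, user_id, body):
        # PUT-like: the caller names the target, so repeats overwrite the same entry.
        self.users[user_id] = dict(body)

    def post(self, body):
        # POST-like: the server allocates a new ID on every call.
        new_id = next(self._next_id)
        self.users[new_id] = dict(body)
        return new_id


put_store, post_store = UserStore(), UserStore()
for _ in range(5):
    put_store.put(42, {"name": "Ada"})
    post_store.post({"name": "Ada"})

print(len(put_store.users), len(post_store.users))  # 1 5
```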

flowchart TD
    START["Step runs, then crashes; scheduler retries it"] --> APPEND["Append-style step: INSERT new rows"]
    START --> UPSERT["Idempotent step: overwrite partition / MERGE on key"]
    APPEND --> DUP["Output has duplicate rows — totals double"]
    UPSERT --> SAME["Output identical to a single run — correct"]

    %% color = red: corrupted result, green: correct result, grey: shared
    classDef bad stroke:#bf616a,stroke-width:2.5px
    classDef good stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class APPEND,DUP bad
    class UPSERT,SAME good
    class START plain

The honest caveat

True exactly-once delivery — guaranteeing a message is processed once and only once across an unreliable network — is famously hard, and most systems that advertise it achieve it by combining at-least-once delivery with idempotent processing under the hood. So idempotency is not a nice-to-have you bolt on later; it is the practical mechanism by which a re-running, crash-prone pipeline produces correct results at all. Design every step in a DAG⁴ so that running it again changes nothing, and a retry becomes boring instead of dangerous.

🔗 Learn more — ⁴ What is a DAG (and why orchestrators use them)?