What is a DAG (and why orchestrators use them)?

A directed acyclic graph is the data structure underneath almost every modern orchestrator. Break the name into its three parts and you have the whole definition. Graph: a set of nodes connected by edges. Directed: each edge points one way, from one node to another. Acyclic: you can never follow those edges and arrive back where you started — there are no cycles.
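To make the "acyclic" part concrete, here is a minimal sketch of a cycle check using depth-first search. The `Graph` alias, the `has_cycle` helper, and both sample graphs are hypothetical names invented for illustration; they do not come from any orchestrator.

```python
from typing import Dict, List

# Hypothetical representation: each key is a node, each value lists the
# nodes its outgoing edges point to.
Graph = Dict[str, List[str]]


def has_cycle(graph: Graph) -> bool:
    """Return True if following edges can lead back to a node already on the path."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        state[node] = GREY
        for nxt in graph.get(node, []):
            if state.get(nxt, WHITE) == GREY:
                return True  # edge back into the current path: a cycle
            if state.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in graph)


acyclic = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(has_cycle(acyclic))  # False
print(has_cycle(cyclic))   # True
```

The second graph fails the test because the edge from `c` back to `a` lets a walk return to its starting node.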

In orchestration the mapping is concrete: nodes are tasks (extract this table, run that transform, load the result) and edges are dependencies ("load" runs only after "transform" succeeds). The acyclic constraint is the load-bearing part. Because no task can depend on itself, directly or transitively, there is always at least one valid order in which to execute everything. Computing such an order is called a topological sort, and it is in effect what an orchestrator's scheduler does: it releases a task only once everything upstream of it has finished.
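As a rough illustration of a topological sort, the sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The three task names mirror the extract/transform/load example in the text; the `dependencies` dictionary is a made-up stand-in for a real pipeline definition, and real orchestrators use their own schedulers rather than this module.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its predecessors).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields the tasks so that every task appears after its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['extract', 'transform', 'load']
```

Only the dependencies are declared; the sequence is derived from them.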

What the structure buys you

flowchart TD
    EXT["extract: pull source tables"] --> CLEAN["clean: dedupe + cast types"]
    EXT --> ENRICH["enrich: join reference data"]
    CLEAN --> LOAD["load: write to warehouse"]
    ENRICH --> LOAD
    LOAD --> REPORT["report: refresh dashboards"]

    %% green marks load, the join point where both branches must finish
    classDef grey stroke:#7b88a1,stroke-width:2.5px
    classDef green stroke:#a3be8c,stroke-width:2.5px
    class EXT,CLEAN,ENRICH,REPORT grey
    class LOAD green

Read the edges literally: clean and enrich both depend on extract, and load depends on both of them. Two things fall straight out of that.

First, ordering. The topological sort guarantees a task never starts before its inputs are ready. You declare dependencies; you do not hand-write a sequence.

Second, parallelism. Neither clean nor enrich depends on the other, directly or through any chain of edges, so an orchestrator is free to run them at the same time. The DAG encodes not just what must come first but also what is independent, and independent work can run concurrently.
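One way to see both properties at once is to group the diagram's tasks into "waves" that could run concurrently. This is a sketch only: it uses the standard-library `graphlib` module (Python 3.9+), and `execution_waves` is a hypothetical helper, not how any particular orchestrator schedules work.

```python
from graphlib import TopologicalSorter

# The pipeline from the diagram: task -> tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
    "report": {"load"},
}


def execution_waves(deps):
    """Group tasks into batches whose members have all their inputs finished."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()   # every task whose predecessors are done
        waves.append(sorted(ready))  # sorted only to make the output stable
        sorter.done(*ready)          # pretend the whole batch succeeded
    return waves


waves = execution_waves(dependencies)
print(waves)  # [['extract'], ['clean', 'enrich'], ['load'], ['report']]
```

`clean` and `enrich` land in the same wave because nothing orders them relative to each other, while `load` has to wait for both.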

Why orchestrators model pipelines this way

Airflow¹, Dagster², and Prefect³ all represent a data pipeline⁴ as a DAG for the same set of reasons:

🔗 Learn more — ¹ What is Apache Airflow?

🔗 Learn more — ² What is Dagster?

🔗 Learn more — ³ What is Prefect?

🔗 Learn more — ⁴ What is a data pipeline?

Dependency ordering. The graph is the single source of truth for run order. Change a dependency and the run order is recomputed from the graph; there is no hand-written sequence to update.
Parallelism. Sibling branches run together automatically, bounded by your worker pool and any concurrency limits you configure.
Targeted retries and restarts. When enrich fails, the orchestrator knows precisely which downstream tasks are blocked (load, report) and which are unaffected (clean already succeeded). It retries only the failed task and then resumes the blocked subgraph, rather than rerunning the entire pipeline.
Idempotent re-runs. Each run is a fresh traversal of the same graph over a defined data slice, so a task always knows exactly which slice it is responsible for. The graph does not make a task repeatable by itself: the task also has to be written so that re-executing it on the same slice produces the same result. That idempotency⁵ is what makes retries and backfills safe rather than destructive.
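The targeted-retry reasoning can be sketched as a graph walk: starting from the failed task, follow the edges forward to collect everything that is blocked. The `downstream` mapping and the `blocked_by` helper are hypothetical and only illustrate the idea; orchestrators track task state in their own ways.

```python
from collections import deque

# The diagram's edges, stored forward: task -> tasks that depend on it.
downstream = {
    "extract": ["clean", "enrich"],
    "clean": ["load"],
    "enrich": ["load"],
    "load": ["report"],
    "report": [],
}


def blocked_by(failed, edges):
    """Return every task reachable from the failed task (breadth-first)."""
    blocked = set()
    queue = deque(edges.get(failed, []))
    while queue:
        task = queue.popleft()
        if task not in blocked:
            blocked.add(task)
            queue.extend(edges.get(task, []))
    return blocked


blocked = blocked_by("enrich", downstream)
print(sorted(blocked))  # ['load', 'report']
```

`extract` and `clean` never appear in the result, so their completed work can be kept while `enrich` is retried.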

🔗 Learn more — ⁵ What is idempotency (in data pipelines)?
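A small sketch of what an idempotent task can look like, assuming a load step that owns one daily partition. The in-memory `warehouse` dictionary and the `load_partition` function are hypothetical stand-ins for a real table and a real load task.

```python
from datetime import date

# Hypothetical in-memory "warehouse": partition key -> rows.
warehouse = {}


def load_partition(day, rows):
    """Replace the day's partition instead of appending to it."""
    warehouse[day] = list(rows)


rows = [{"id": 1}, {"id": 2}]
load_partition(date(2024, 1, 1), rows)
load_partition(date(2024, 1, 1), rows)  # a retry leaves the same final state

print(len(warehouse[date(2024, 1, 1)]))  # 2
```

Had the function appended rows rather than replacing the partition, the retry would have left four rows behind, which is the kind of destructive re-run the text warns about.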

The DAG also drives scheduling. Each run is the graph executed for one time window; a backfill is just the same graph replayed across many historical windows. This is the leap past plain cron⁶: cron fires a command at a scheduled time, but it has no model of what depends on what, so a mid-pipeline failure leaves you with a silently half-finished job and no record of which step to resume from.

🔗 Learn more — ⁶ What is cron?
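The idea of a backfill as "the same graph, replayed per window" can be sketched as two nested loops. Everything here is illustrative: `daily_windows`, `backfill`, and the `run_task` callback are hypothetical names, daily windows are an assumption, and a real orchestrator would persist state and run windows concurrently rather than in a plain loop.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# The diagram's pipeline: task -> tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
    "report": {"load"},
}


def daily_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) one day at a time."""
    day = start
    while day < end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def backfill(start, end, deps, run_task):
    """Run every task, in dependency order, once per historical window."""
    order = list(TopologicalSorter(deps).static_order())
    for window in daily_windows(start, end):
        for task in order:
            run_task(task, window)


calls = []
backfill(date(2024, 1, 1), date(2024, 1, 4), dependencies,
         lambda task, window: calls.append((task, window[0])))

print(len(calls))  # 15: five tasks for each of three daily windows
```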

Where the model strains

A DAG is honest about exactly one thing: order. It says nothing about whether your data is correct. A pipeline can be perfectly acyclic, run green end to end, and still load garbage — wrong joins, dropped rows, stale source. Topological validity is not data validity, and conflating the two is a common and expensive mistake.

The structure also resists dynamic, data-dependent branching. The shape of a classic DAG is fixed before the run starts, but real workloads sometimes want loops ("retry until converged") or branches chosen from values computed mid-run ("process only the partitions that actually changed"). You can approximate this with dynamic task mapping, conditional branches, or sensors that wait on a condition, but each workaround pushes against the premise of a fixed, acyclic shape. When a workflow genuinely needs cycles or runtime-decided control flow, a DAG is the wrong abstraction, and you reach for an engine that models state machines or general control flow instead.
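To illustrate why a loop cannot simply be drawn into the graph, the sketch below hands a two-task "retry until converged" loop to the standard-library `graphlib` sorter (Python 3.9+). The task names `train` and `check_convergence` are hypothetical, and real orchestrators report cycles through their own validation errors.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical loop: each task is declared as depending on the other.
looping = {
    "train": {"check_convergence"},
    "check_convergence": {"train"},
}

try:
    list(TopologicalSorter(looping).static_order())
    rejected = False
except CycleError:
    # No valid run order exists, so the sorter refuses the graph outright.
    rejected = True

print(rejected)  # True
```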

For batch data pipelines, though, the trade is overwhelmingly worth it: a structure that is trivial to reason about, parallelize, retry, and schedule — bought with one rule, no cycles.