What is Apache Airflow?

Apache Airflow is a workflow orchestrator: you describe a pipeline as a graph of tasks with dependencies, and Airflow figures out what to run, when, in what order, retries the failures, and shows you the whole thing on a dashboard. The key distinction — the one people get wrong — is that Airflow orchestrates work; it does not do the heavy work. It is the conductor, not the orchestra.

Pipelines as DAGs of tasks

An Airflow pipeline is a DAG (a directed acyclic graph¹) written in Python. Nodes are tasks; edges are dependencies ("run transform only after extract succeeds"). "Acyclic" matters: with no loops, a valid execution order always exists, and no task can end up waiting, directly or indirectly, on itself.

🔗 Learn more — ¹ What is a DAG (and why orchestrators use them)?
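To make the "acyclic" point concrete, here is a minimal sketch using Python's standard-library graphlib module (Python 3.9+), not Airflow itself. The task names mirror the example pipeline in this article, and the helper execution_order is a hypothetical name chosen for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Each key is a task; its value is the set of tasks it depends on.
# Task names are illustrative, taken from the example pipeline in this article.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "dbt_and_notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order, or raise CycleError if the graph has a loop."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(pipeline)
print(order)

# Making the first task depend on the last one creates a cycle,
# and then no valid order exists at all.
cyclic = dict(pipeline, extract={"dbt_and_notify"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

An orchestrator can reject the cyclic version up front, which is the practical payoff of insisting on a DAG.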

flowchart TD
    SCHED["Scheduler — reads DAGs, decides what is runnable now"] --> META["Metadata DB — task states, schedules, history"]
    SCHED --> EX["Executor — hands runnable tasks to workers"]
    EX --> W1["Worker: extract"]
    W1 --> W2["Worker: transform (Spark job)"]
    W2 --> W3["Worker: load to warehouse"]
    W3 --> W4["Worker: run dbt + notify"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class SCHED key
    class META,EX,W1,W2,W3,W4 plain

The pieces:

Scheduler. The brain. It continuously parses the DAG files, checks each task's dependencies and schedule, and decides which tasks are runnable right now. This is the load-bearing component — everything else is plumbing around its decisions.
Metadata database. Every task's state (queued, running, success, failed, retrying), every past run, and every schedule live in a Postgres/MySQL database. The Airflow processes keep no durable state of their own; the DB is the source of truth.
Executor + workers. The executor takes runnable tasks and dispatches them to workers (local processes, Celery workers, or Kubernetes pods). Workers run the actual task code.
Operators. A task is an instance of an operator — a reusable template for a kind of work. BashOperator runs a command, PythonOperator runs a function, and provider operators trigger external systems (SparkSubmitOperator, S3..., database operators). Most real tasks just trigger something else and wait.
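How these pieces fit together can be sketched as a toy model in plain Python. This is not Airflow's API or implementation: EchoOperator, runnable, and run_dag are hypothetical names, a dictionary stands in for the metadata database, and a simple loop stands in for the scheduler and executor.

```python
# Toy model of the pieces above. All names are hypothetical, not Airflow's API.

class EchoOperator:
    """A reusable template: each instance is one task that 'runs' a message."""

    def __init__(self, task_id, message):
        self.task_id = task_id
        self.message = message

    def execute(self):
        return f"{self.task_id}: {self.message}"


tasks = {
    "extract": EchoOperator("extract", "pull raw files"),
    "transform": EchoOperator("transform", "submit Spark job"),
    "load": EchoOperator("load", "copy into warehouse"),
}
upstream = {"extract": [], "transform": ["extract"], "load": ["transform"]}

# Stand-in for the metadata database: the single source of truth for task state.
state = {task_id: "none" for task_id in tasks}


def runnable(state, upstream):
    """Scheduler decision: not-yet-run tasks whose upstream tasks all succeeded."""
    return [
        task_id
        for task_id, status in state.items()
        if status == "none"
        and all(state[up] == "success" for up in upstream[task_id])
    ]


def run_dag():
    """Scheduler picks runnable tasks, the 'executor' runs them, state is recorded."""
    log = []
    while True:
        ready = runnable(state, upstream)
        if not ready:
            break
        for task_id in ready:
            state[task_id] = "running"
            log.append(tasks[task_id].execute())  # the "worker" runs the task code
            state[task_id] = "success"
    return log


log = run_dag()
print(log)
```

Note that the scheduling decision reads only the state table, never the tasks themselves. That separation is what lets the real components run as separate processes that coordinate through the database.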

Scheduled and dependency-driven

Airflow earns its keep on two axes at once:

Time. A DAG has a schedule: a preset such as @daily, or a cron expression using the same five fields as standard cron. Each scheduled interval creates a DAG run.
Dependencies. Within a run, tasks execute in topological order. If transform fails, load never starts; Airflow retries transform per its retry policy, and only the failed branch is affected.
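The dependency behaviour described above (retry the failing task, never start what depends on it, leave other branches alone) can be sketched with the standard library. This is a toy model under stated assumptions, not Airflow's scheduler: run_once and the audit task are hypothetical, and the state name upstream_failed is borrowed from Airflow's vocabulary only as a label.

```python
from graphlib import TopologicalSorter


def run_once(deps, callables, retries=1):
    """Run one toy DAG run; return the final state of every task."""
    state = {}
    for task in TopologicalSorter(deps).static_order():
        # A task never starts if any of its dependencies did not succeed.
        if any(state[dep] != "success" for dep in deps[task]):
            state[task] = "upstream_failed"
            continue
        for _attempt in range(retries + 1):
            try:
                callables[task]()
                state[task] = "success"
                break
            except Exception:
                state[task] = "failed"  # stays failed once retries run out
    return state


# extract feeds two branches: transform -> load, and an independent audit task.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "audit": {"extract"},
}

attempts = {"transform": 0}


def always_fails():
    attempts["transform"] += 1
    raise RuntimeError("transform broke")


callables = {
    "extract": lambda: None,
    "transform": always_fails,
    "load": lambda: None,
    "audit": lambda: None,
}

result = run_once(deps, callables, retries=2)
print(result, attempts)
```

With retries=2 the failing task is attempted three times in total, load is never started, and audit still succeeds because it sits on a different branch.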

Together these give you the thing a bare cron job cannot: when a step fails at 3am, Airflow retries it, alerts you, shows exactly which task in which run broke, and lets you re-run just that task once it is fixed — instead of leaving a silently half-finished pipeline.

Is Airflow too heavy for you?

Airflow is a scheduler, a metadata database, an executor, and a pool of workers — a lot of moving parts for what is often "run these five tasks in order, once a day." It is not JVM bloat (Airflow is Python), but it is operational bloat: real to run, real to upgrade, and the managed escapes — MWAA, Cloud Composer, Astronomer — are another pay-more-on-the-cloud line item for a tool that is itself free.

Lighter, open-source orchestrators are worth reaching for first on smaller setups:

Mage.ai²: Python-native pipelines with a notebook-style editor and far less ceremony than Airflow's DAG-file model.
Dagster³ and Prefect⁴: Python-first orchestrators with better local development, typing, and testing than Airflow.
For genuinely simple pipelines, plain cron plus a Makefile beats standing up Airflow at all.

🔗 Learn more — ² What is Mage?

🔗 Learn more — ³ What is Dagster?

🔗 Learn more — ⁴ What is Prefect?

Airflow earns its complexity on large teams with hundreds of interdependent DAGs. Below that, it is usually more infrastructure than the problem deserves.

What Airflow is not

It is not a compute engine. An Airflow task that "processes a terabyte" is almost always a thin operator that submits a Spark⁵ job or kicks off dbt⁶ — the cluster does the work, Airflow just waits and tracks the result. Putting the actual data processing inside a Python task on an Airflow worker is the classic anti-pattern; the worker becomes the bottleneck and you have turned an orchestrator into a (bad) compute engine.

🔗 Learn more — ⁵ What is Apache Spark?

🔗 Learn more — ⁶ What is dbt?
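The "thin operator" shape can be sketched as submit, then poll. Everything below is a hypothetical stand-in: FakeCluster imitates an external engine so the example runs anywhere, and submit_and_wait is an illustrative name rather than any real operator's method. The point is only that no data flows through the task itself.

```python
import time


class FakeCluster:
    """Hypothetical stand-in for an external engine such as a Spark cluster."""

    def __init__(self, polls_until_done=3):
        self._remaining = {}
        self._polls_until_done = polls_until_done
        self._next_id = 0

    def submit(self, job_name):
        self._next_id += 1
        job_id = f"{job_name}-{self._next_id}"
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id):
        # Each status check moves the fake job one step closer to completion.
        self._remaining[job_id] -= 1
        return "SUCCEEDED" if self._remaining[job_id] <= 0 else "RUNNING"


def submit_and_wait(cluster, job_name, poll_seconds=0.0, max_polls=100):
    """Thin task body: submit the job, then poll; no data passes through here."""
    job_id = cluster.submit(job_name)
    for polls in range(1, max_polls + 1):
        if cluster.status(job_id) == "SUCCEEDED":
            return job_id, polls
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


job_id, polls = submit_and_wait(FakeCluster(polls_until_done=3), "transform")
print(job_id, polls)
```

The worker running this task needs almost no memory or CPU whether the job touches a megabyte or a terabyte, which is the property the anti-pattern throws away.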

It is also not a streaming tool. Airflow thinks in scheduled batches; for continuous event flow you want Kafka⁷ and a streaming engine, not a DAG that runs every minute.

🔗 Learn more — ⁷ What is Apache Kafka?

The short version: Airflow is a scheduler + metadata DB + executor that runs DAGs of tasks in dependency order, retrying and reporting along the way — orchestrating the pipeline while delegating the heavy lifting to the systems each task triggers.