What is data lineage?

Data lineage is the recorded provenance of data: for any table or column, where it came from, every transformation it passed through, and everything downstream that now depends on it. It is the map of how data flows through your systems — and the difference between confidently answering "what breaks if I change this column?" and finding out in production.

Table-level vs column-level

Lineage comes at two resolutions.

flowchart TD
    SRC["Source: orders table (OLTP)"] --> STG["Staging: cleaned orders"]
    STG --> FCT["Model: daily_revenue"]
    FCT --> DASH["Dashboard: revenue report"]
    FCT --> ML["Feature: churn model input"]

    %% yellow = the asset in question; grey = upstream sources; green = downstream consumers
    classDef focus stroke:#ebcb8b,stroke-width:2.5px
    classDef up stroke:#7b88a1,stroke-width:2.5px
    classDef down stroke:#a3be8c,stroke-width:2.5px
    class FCT focus
    class SRC,STG up
    class DASH,ML down

Table-level lineage tracks dependencies between datasets: daily_revenue is built from cleaned orders, which comes from the source orders table. Column-level lineage goes finer: it traces that the revenue field specifically is derived from price × quantity, two named columns in a particular upstream table. Column-level lineage is much harder to capture but far more useful: it tells you exactly which downstream fields a schema change touches, not just which tables.
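To make the two resolutions concrete, here is a minimal sketch that stores lineage as column-level edges and derives the table-level view from them. The table and column names (orders, cleaned_orders, created_at, order_date, day) are illustrative assumptions loosely based on the diagram above, not a real schema.

```python
from collections import defaultdict

# Column-level edges: (upstream table, column) -> (downstream table, column).
# All names here are hypothetical, chosen to mirror the diagram above.
COLUMN_EDGES = [
    (("orders", "price"), ("cleaned_orders", "price")),
    (("orders", "quantity"), ("cleaned_orders", "quantity")),
    (("orders", "created_at"), ("cleaned_orders", "order_date")),
    (("cleaned_orders", "price"), ("daily_revenue", "revenue")),
    (("cleaned_orders", "quantity"), ("daily_revenue", "revenue")),
    (("cleaned_orders", "order_date"), ("daily_revenue", "day")),
]


def table_edges(column_edges):
    """Collapse column-level edges into the coarser table-level edges."""
    return sorted({(src[0], dst[0]) for src, dst in column_edges})


def affected_columns(column_edges, table, column):
    """Return every downstream (table, column) reachable from one column."""
    children = defaultdict(set)
    for src, dst in column_edges:
        children[src].add(dst)
    seen, stack = set(), [(table, column)]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen)


# Table-level view: which datasets depend on which.
print(table_edges(COLUMN_EDGES))
# Column-level view: renaming orders.created_at touches these fields only.
print(affected_columns(COLUMN_EDGES, "orders", "created_at"))
```

Table-level lineage would only say that changing orders affects cleaned_orders and daily_revenue. The column-level query narrows that to the two fields that actually depend on created_at and leaves revenue out.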

What it's for

Lineage pays off in four recurring situations:

Impact analysis. Before changing or dropping a column, see everything downstream that consumes it — so a rename doesn't silently break a dashboard three hops away.
Debugging. When a number is wrong, walk the lineage upstream to find which transformation or source introduced the error, instead of guessing.
Compliance and PII. Trace where personal data flows and lands, which is most of what a privacy audit actually asks for.
Trust. A consumer who can see where a metric came from is far more likely to rely on it.
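Impact analysis and debugging are the same graph walk run in opposite directions. The sketch below assumes lineage is available as a simple list of (parent, child) table edges; the node names are illustrative stand-ins for the assets in the diagram above, and `reachable` is a hypothetical helper, not part of any catalog's API.

```python
from collections import defaultdict, deque

# Table-level edges (parent, child); names are hypothetical.
EDGES = [
    ("orders", "cleaned_orders"),
    ("cleaned_orders", "daily_revenue"),
    ("daily_revenue", "revenue_report"),
    ("daily_revenue", "churn_model_input"),
]


def reachable(edges, start, direction="down"):
    """Breadth-first walk; returns nodes ordered by distance from start.

    direction="down" follows parent -> child (impact analysis);
    direction="up" follows child -> parent (debugging).
    """
    neighbours = defaultdict(list)
    for parent, child in edges:
        if direction == "down":
            neighbours[parent].append(child)
        else:
            neighbours[child].append(parent)
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Impact analysis: what consumes cleaned_orders, directly or indirectly?
print(reachable(EDGES, "cleaned_orders", "down"))
# Debugging: where could a wrong number in revenue_report have come from?
print(reachable(EDGES, "revenue_report", "up"))
```

The same downward walk, started from a source flagged as containing personal data, gives a first approximation of where that data lands for a privacy audit.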

How it's captured — and why coverage is everything

Lineage is built by parsing the SQL of transformations, reading metadata from orchestrators and pipelines, and ingesting the results into a data catalog that assembles the graph across systems. Tools like dbt¹ produce lineage as a byproduct of how models reference each other; open standards like OpenLineage exist to carry it between tools.

🔗 Learn more — ¹ What is dbt?
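As a rough illustration of lineage falling out of how models reference each other, the sketch below pulls dbt-style `ref()` and `source()` calls out of model SQL with regular expressions. This is a deliberate simplification: dbt itself renders the Jinja templates rather than pattern-matching them, and the regexes here only handle the one-argument `ref('model')` and two-argument `source('source', 'table')` forms with literal strings. The model names and SQL are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
# Matches {{ source('source_name', 'table_name') }}.
SOURCE_PATTERN = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)

# Hypothetical model files, keyed by model name.
MODELS = {
    "cleaned_orders": (
        "select * from {{ source('shop', 'orders') }} where status != 'test'"
    ),
    "daily_revenue": """
        select order_date, sum(price * quantity) as revenue
        from {{ ref('cleaned_orders') }}
        group by order_date
    """,
}


def extract_edges(models):
    """Return sorted (upstream, downstream) edges found in the model SQL."""
    edges = []
    for name, sql in models.items():
        for parent in REF_PATTERN.findall(sql):
            edges.append((parent, name))
        for source_name, table_name in SOURCE_PATTERN.findall(sql):
            edges.append((f"{source_name}.{table_name}", name))
    return sorted(edges)


print(extract_edges(MODELS))
```

The point of the sketch is that nobody drew these edges: they are a side effect of the code that defines the models, so they stay in step with what actually runs.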

The honest caveat is that lineage is only as good as its coverage. The graph is trustworthy only if every hop is captured. One hand-run script, one transformation in a tool the catalog doesn't read, one export to a spreadsheet — and the chain silently breaks, leaving lineage that looks complete but lies. Manually maintained lineage rots within weeks, because nobody updates a diagram when they ship. The only lineage worth trusting is the kind derived automatically from the code and pipelines that actually run — parsed, not drawn.
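One cheap check for broken chains is to look for assets the catalog knows about that have no recorded upstream and are not declared as sources. The sketch below assumes three inputs that a catalog could plausibly export: the set of known assets, the captured lineage edges, and the set of declared sources. The function and all names are hypothetical.

```python
def coverage_gaps(assets, edges, declared_sources):
    """Return assets with no recorded parent that are not declared sources.

    Each result is a likely break in the chain: something produced it,
    but the lineage graph does not know what.
    """
    has_parent = {child for _parent, child in edges}
    return sorted(
        asset
        for asset in assets
        if asset not in has_parent and asset not in declared_sources
    )


# Hypothetical catalog contents.
ASSETS = {
    "orders",
    "cleaned_orders",
    "daily_revenue",
    "revenue_report",
    "finance_export",  # written by a hand-run script the catalog never saw
}
EDGES = [
    ("orders", "cleaned_orders"),
    ("cleaned_orders", "daily_revenue"),
    ("daily_revenue", "revenue_report"),
]
DECLARED_SOURCES = {"orders"}

print(coverage_gaps(ASSETS, EDGES, DECLARED_SOURCES))
```

This only catches assets missing every upstream edge. An asset with some parents captured and others missing still looks complete, which is exactly the failure mode described above, so a clean result here is a necessary condition for trust rather than proof of it.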