The other data catalog: governance, lineage, and OpenMetadata

A library card catalog — drawers of index cards describing what the library holds and where

A card catalog never held a single book. It told you what existed, where it lived, and who had touched it — exactly the job that moved up a layer in the lakehouse. Photo by Alicia Fagerving, CC BY-SA 3.0.

Iceberg won the table-format war, and the moment it did, the fight moved up a layer to the catalog. But "catalog" is one of the most overloaded words in data infrastructure, and the one that grabbed the headline is not the one most teams actually wrestle with. There are two things wearing the name, they sit at different layers, and conflating them is why every catalog conversation goes in circles.

Two things are called a catalog

The first is the technical catalog, the one hit at query runtime. It lives in the query path. When an engine wants to read a table, it asks the technical catalog two questions — where is this table's current metadata and am I allowed to touch it — and the catalog also brokers atomic commits when concurrent writers race. This is the catalog of the format war: Unity Catalog, Apache Polaris, AWS Glue¹, all speaking the Iceberg² REST spec. Every query hits it. It is infrastructure.

🔗 Learn more — ¹ What is AWS Glue?

🔗 Learn more — ² How Apache Iceberg actually works
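The runtime job described above (answer "where is the metadata" and "am I allowed", then arbitrate racing writers) can be sketched as a pointer store with a compare-and-swap commit. This is a toy in-memory model, not the Iceberg REST API: the class name, method names, and `s3://example-bucket/...` paths are all assumptions made for illustration.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the table pointer first."""


class ToyTechnicalCatalog:
    """In-memory stand-in for a runtime catalog (hypothetical, not a real API)."""

    def __init__(self):
        self._pointers = {}   # table name -> current metadata file location
        self._grants = {}     # table name -> principals allowed to touch it
        self._lock = threading.Lock()

    def create(self, table, metadata_location, allowed):
        with self._lock:
            self._pointers[table] = metadata_location
            self._grants[table] = set(allowed)

    def _check(self, table, principal):
        if principal not in self._grants.get(table, set()):
            raise PermissionError(f"{principal} may not touch {table}")

    def load(self, table, principal):
        """Answer both runtime questions: am I allowed, and where is the metadata?"""
        with self._lock:
            self._check(table, principal)
            return self._pointers[table]

    def commit(self, table, principal, expected, new):
        """Compare-and-swap: move the pointer only if it still equals `expected`."""
        with self._lock:
            self._check(table, principal)
            if self._pointers[table] != expected:
                raise CommitConflict(f"{table} changed since {expected}")
            self._pointers[table] = new


catalog = ToyTechnicalCatalog()
catalog.create("sales.orders", "s3://example-bucket/orders/v1.metadata.json", {"etl"})

base = catalog.load("sales.orders", "etl")
catalog.commit("sales.orders", "etl", base, "s3://example-bucket/orders/v2.metadata.json")

# A second writer that started from the same base loses the race and must retry.
try:
    catalog.commit("sales.orders", "etl", base, "s3://example-bucket/orders/v2b.metadata.json")
    second_writer_won = True
except CommitConflict:
    second_writer_won = False
```

The losing writer is expected to reload the current pointer, rebase its changes, and try again. That retry loop is why this catalog has to be reachable on every write, not just every read.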

The second is the governance catalog, sometimes called a metadata catalog or data discovery platform. It sits beside the query path, not in it. Not much breaks at 3am if it goes down — though as AI agents start querying it to decide what data they can trust, that is beginning to change. Its job is for humans and, increasingly, for AI agents: what data do we have, where did this column come from, who owns it, is it any good, and is it allowed to leave the building. This is OpenMetadata, DataHub, Amundsen, and the commercial tier — Collibra, Alation, Atlan.

flowchart TD
    QUERY["Every query, at runtime"] --> RUNTIME["Technical catalog"]
    RUNTIME --> DATA["Iceberg tables on object storage"]

    %% green = in the query path; this catalog is infrastructure and fails loud
    classDef path stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class RUNTIME path
    class QUERY,DATA plain

Catalog one, the technical catalog, lives in the query path: every read goes through it to locate the table and settle the commit, so when it is down, queries stop. Unity, Polaris, and Glue sit here. In the diagram, green marks the catalog in the query path; grey is the query and the data it connects.

%%{init: {'flowchart': {'curve': 'linear'}}}%%
flowchart TD
    PROD["Producers"] -- publish --> CONTRACT["Data contract"]
    CONTRACT -- guarantees --> CONS["Consumers"]
    GOV["Governance catalog"] -- registers --> CONTRACT
    PROD -. metadata .-> GOV
    PROC["Processors"] -. metadata .-> GOV
    STEWARD["Data stewards"] -- curate --> GOV

    %% amber = governance layer (catalog, contract, stewards); grey = data-plane it governs (producers, processors, consumers)
    classDef govc stroke:#ebcb8b,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class GOV,CONTRACT,STEWARD govc
    class PROD,PROC,CONS plain

Catalog two, the governance catalog, sits beside the query path, not in it. The data contract is the promise between the two sides: producers publish it, consumers rely on it. The catalog is what makes that promise real — it registers the contract, ingests metadata from producers and processors, and audits conformance so a broken promise shows up. Data stewards curate it. In the diagram, amber is the governance layer — catalog, contract, stewards; grey is the data-plane it governs — producers, processors, consumers.

Role	Concrete examples
Producers	Snowflake, Postgres, Kafka³
Processors	dbt⁴, Airflow⁵, Spark⁶
Consumers	Analysts, BI tools, AI agents
Governance catalog	OpenMetadata, DataHub, Collibra, Atlan
Data stewards	Data owners, the platform team

🔗 Learn more — ³ What is Apache Kafka?

🔗 Learn more — ⁴ What is dbt?

🔗 Learn more — ⁵ What is Apache Airflow?

🔗 Learn more — ⁶ What is Apache Spark?
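To make "registers the contract and audits conformance" concrete, here is a minimal sketch of a contract and an audit against observed metadata. The `DataContract` fields, the `audit` function, and the sample dataset are hypothetical; real catalogs each have their own contract format and checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """What the producer promises (hypothetical, simplified shape)."""
    dataset: str
    owner: str
    columns: dict              # column name -> expected type
    max_staleness_hours: int   # freshness guarantee


def audit(contract, observed_columns, hours_since_update):
    """Return human-readable violations; an empty list means the promise holds."""
    violations = []
    for name, expected_type in contract.columns.items():
        actual = observed_columns.get(name)
        if actual is None:
            violations.append(f"missing column: {name}")
        elif actual != expected_type:
            violations.append(
                f"type drift on {name}: expected {expected_type}, found {actual}"
            )
    if hours_since_update > contract.max_staleness_hours:
        violations.append(
            f"stale: {hours_since_update}h since update, "
            f"contract allows {contract.max_staleness_hours}h"
        )
    return violations


contract = DataContract(
    dataset="mart.revenue",
    owner="finance-data",
    columns={"order_id": "bigint", "revenue": "decimal"},
    max_staleness_hours=24,
)

# Metadata the catalog ingested from the producer: one type drifted, and it is late.
observed = {"order_id": "bigint", "revenue": "varchar"}
violations = audit(contract, observed, hours_since_update=30)
```

Here `violations` holds two entries, one for the type drift and one for staleness. The point of running the check in the catalog rather than in the producer is that consumers see the broken promise without having to trust the side that broke it.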

The technical catalog got the Iceberg-era attention because it is in the hot path and because two giants are loudly fighting over it. But it answers a narrow question — name to location, plus a lock. The governance catalog answers the questions a data platform actually fails on: discovery, trust, and blast radius. That is the one worth understanding, and it is what the rest of this is about.

What a governance catalog actually does

Strip the marketing and a governance catalog is a metadata index over everything you run, kept out of band. Concretely:

Discovery and search across every source at once — find the table, the dashboard, the dbt model, without knowing which system it lives in.
Column-level lineage — not "table A feeds table B" but "this revenue column is derived from these three upstream fields," so you can answer what breaks if I change this before you change it.
Quality and profiling — freshness, null rates, distribution drift, test results, attached to the asset rather than buried in an orchestrator log.
Ownership and stewardship — a name to page when a pipeline lies.
Access and policy — who may read, change, or export an asset: role-based rules on the metadata, and PII classifications that access policy enforces.
Glossary and classification — business terms mapped to physical columns, plus PII and sensitivity tags on the assets.
Change history — every metadata change versioned (who set this description, tag, or owner, and when), with one-click revert. This is the audit trail the data contract leans on.
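The "what breaks if I change this" question from the lineage item is, mechanically, a graph traversal. A rough sketch, assuming lineage is stored as a map from each column to the columns derived from it; the column names and the `blast_radius` helper are invented for illustration.

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> columns derived from it.
DOWNSTREAM = {
    "raw.orders.amount": ["mart.revenue.revenue"],
    "raw.orders.discount": ["mart.revenue.revenue"],
    "raw.fx.rate": ["mart.revenue.revenue"],
    "mart.revenue.revenue": ["dash.finance.monthly_revenue"],
}


def blast_radius(column, downstream=DOWNSTREAM):
    """Breadth-first walk: every column that transitively depends on `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected = blast_radius("raw.fx.rate")
```

Changing `raw.fx.rate` reaches both the revenue column and the dashboard built on it, which is the answer you want before the change ships rather than after. Table-level lineage could only say "something in `mart.revenue` might be affected".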

The mechanism is the important part for a practitioner: a governance catalog ingests metadata through connectors, it does not sit in front of your data. It reaches into the warehouse, the BI tool, the orchestrator, the streaming bus, and the technical catalog itself, and assembles a graph. Because it is out of band, it can span engines that otherwise share nothing — which turns out to be the whole point.
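A small sketch of that connector pattern, assuming each connector only has to yield asset records and the catalog merges them into one graph. `Asset`, `Connector`, `StaticConnector`, and the identifier format are hypothetical stand-ins, not the interface of any particular catalog.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Asset:
    urn: str               # hypothetical identifier, e.g. "warehouse:analytics.orders"
    kind: str              # "table", "dashboard", "pipeline", ...
    upstream: tuple = ()   # urns this asset reads from


class Connector(Protocol):
    def extract(self) -> Iterable[Asset]: ...


class StaticConnector:
    """Stand-in for a real connector: returns canned metadata instead of calling a source."""

    def __init__(self, assets):
        self._assets = list(assets)

    def extract(self):
        return iter(self._assets)


def build_graph(connectors):
    """Merge metadata from every connector into one set of nodes and lineage edges."""
    nodes, edges = {}, set()
    for connector in connectors:
        for asset in connector.extract():
            nodes[asset.urn] = asset
            for parent in asset.upstream:
                edges.add((parent, asset.urn))
    return nodes, edges


warehouse = StaticConnector([Asset("warehouse:analytics.orders", "table")])
bi_tool = StaticConnector([
    Asset("bi:revenue_dashboard", "dashboard", upstream=("warehouse:analytics.orders",)),
])

nodes, edges = build_graph([warehouse, bi_tool])
```

The single edge in this graph joins a warehouse table to a BI dashboard, two systems that know nothing about each other. Neither source was asked to change or to sit behind the catalog; the catalog only read from them.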

The players, and the fork that matters

The open field has three names worth knowing, and the difference between them is architectural, not cosmetic.

OpenMetadata is schema-first. It was open-sourced in 2021 under Apache 2.0 by Suresh Srinivas and Sriharsha Chintalapani — names from Apache Hadoop⁷, Apache Atlas, and Uber's Databook — and its bet is a single set of JSON Schema specifications that every entity conforms to, assembled into one unified metadata graph. You model the metadata, then pull it in through 100+ connectors. The commercial steward is Collate.

🔗 Learn more — ⁷ What is Hadoop (and why MapReduce faded)?

DataHub is event-first. It grew out of LinkedIn's WhereHows, was rewritten and open-sourced in 2020, and is now stewarded by Acryl Data, the company that has since rebranded as DataHub. Its defining choice is a push model on top of pull: producers emit metadata change events onto a Kafka stream, and the catalog reacts. That makes it the natural fit when metadata changes constantly and you want near-real-time propagation and federated, team-owned ingestion.

Amundsen (from Lyft) is the lighter, search-first option when discovery is the whole ask, though its momentum has stalled — sparse releases since Lyft pulled back, and not where you would start a new deployment in 2026. Above them sit the commercial governance suites — Collibra, Alation, Atlan — which trade openness for enterprise policy workflows and a sales team.

Schema-first versus event-first is the choice that separates OpenMetadata from DataHub, and it mirrors an argument that keeps recurring in this space about where metadata should physically live: a modeled, queryable source of truth, or a stream you fold up into one. Neither is wrong. They optimize for different failure modes.
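The two stances can be contrasted in a few lines. This is a deliberately simplified sketch: the schema below is a plain Python dict rather than OpenMetadata's actual JSON Schema definitions, and the event shape is invented rather than DataHub's real event format.

```python
# Schema-first: an entity must match a declared shape before it enters the store.
TABLE_SCHEMA = {"name": str, "owner": str, "tags": list}   # hypothetical, simplified


def validate(entity, schema=TABLE_SCHEMA):
    """Reject any entity whose fields are missing or of the wrong type."""
    for key, expected in schema.items():
        if not isinstance(entity.get(key), expected):
            raise ValueError(f"{key!r} must be {expected.__name__}")
    return entity


# Event-first: current state is whatever you get by folding change events in order.
def fold(events):
    """Apply events oldest-first; a later event for the same aspect overwrites."""
    state = {}
    for event in events:
        entity = state.setdefault(event["urn"], {})
        entity[event["aspect"]] = event["value"]
    return state


stored = validate({"name": "orders", "owner": "alice", "tags": ["pii"]})

events = [
    {"urn": "orders", "aspect": "owner", "value": "alice"},
    {"urn": "orders", "aspect": "tags", "value": ["pii"]},
    {"urn": "orders", "aspect": "owner", "value": "bob"},   # ownership changed later
]
current = fold(events)
```

In the first half a malformed entity fails loudly at the door; in the second, anything can be emitted and the truth is derived afterwards, so ownership ends up as `bob` because that event arrived last. Those are the different failure modes: rejected writes on one side, ordering and replay concerns on the other.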

The collision

Here is where it gets interesting, and where it ties back to the format war. The two catalogs are distinct in function, but the vendors are busy merging them in product, and that merger, not the distinction, is where the risk lives. The technical catalog is climbing into the governance catalog's territory.

Databricks open-sourced Unity Catalog in June 2024 under the Linux Foundation — but read the fine print. What is open is the API and a server compatible with the Iceberg REST spec and the Hive metastore⁸. The parts that make it a governance catalog — managed lineage, the Catalog Explorer UI — remain in the commercial Databricks⁹ product. Snowflake did the symmetric thing: it donated Polaris to the Apache Software Foundation in 2024, and its Horizon governance layer sits above that same open catalog tech. AWS runs the same play with a wider stack: the Glue Data Catalog is the technical layer, Lake Formation bolts fine-grained access governance on top of it, and Amazon DataZone — now surfaced as SageMaker Catalog in SageMaker Unified Studio — adds the discovery, lineage, and glossary side. Same shape, three giants deep.

🔗 Learn more — ⁸ What is the Hive metastore?

🔗 Learn more — ⁹ What is Databricks?

So all three now pitch their technical catalog as also being your governance catalog — lineage, access control, discovery, tags, all bundled in. For a shop that lives entirely inside one of them, that bundling is genuinely convenient. It is also exactly how the governance layer becomes the vendor's.

flowchart TD
    EST{"How many engines in the estate?"} -->|one vendor| BUNDLED["The bundled catalog is enough"]
    EST -->|many engines| NEUTRAL["A neutral catalog earns its place"]
    BUNDLED --> LOCK["Governance becomes the vendor's"]
    NEUTRAL --> CROSS["One view across every engine"]

    %% amber = single-vendor bundled (convenient, conflicted); green = neutral cross-engine
    classDef warm stroke:#ebcb8b,stroke-width:2.5px
    classDef good stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class BUNDLED,LOCK warm
    class NEUTRAL,CROSS good
    class EST plain

The fork the format war reached, one layer up: with a single vendor the bundled catalog (Unity, Snowflake Horizon, AWS SageMaker Catalog) is enough; with many engines only a neutral one (OpenMetadata, DataHub) can see across them all. The convenience of the bundle is also the capture — it is how governance quietly becomes the vendor's. In the diagram, amber is the bundled path (convenient but conflicted); green is the neutral one that sees across engines.

Same lesson, one layer up

The Iceberg post's argument was that in open infrastructure the winning artifact is the one with the least conflicted ownership, not the most features. That law does not stop at the table format. A governance catalog's entire value is a vendor-neutral view across all your engines — and a view across everything cannot be owned by one of the things it is supposed to view neutrally. Unity's lineage is excellent inside Databricks; it is not going to be your honest broker for what is happening in Snowflake and Trino¹⁰ and DuckDB and your streaming bus. Whoever sees the whole estate has to be standing outside all of it.

🔗 Learn more — ¹⁰ What is a query engine (Trino, Presto, and friends)?

That is the structural reason the neutral, open governance catalogs are worth betting on even though the bundled ones are easier to switch on. The threat to them is not technical; it is that the conflicted-but-convenient incumbent absorbs governance so thoroughly that you never reach for the neutral tool — and then the cross-engine view simply does not exist, because no single vendor has any incentive to build it.

When to actually use which

Governance follows heterogeneity. The honest rule of thumb:

One vendor, end to end — if you are all-Databricks or all-Snowflake, the bundled catalog is probably enough, and a standalone governance catalog is overhead you will resent. Don't buy a cross-engine tool to govern one engine.
Many engines — the moment you have a real spread (a warehouse, a lakehouse¹¹, a streaming path, ad-hoc DuckDB, BI on top), the neutral catalog is the only thing that can see all of it at once. That is the case it was built for, and nothing bundled will cover it without quietly assuming you standardize on the bundler. The honest cost: that neutral catalog is itself a platform to run — DataHub's Kafka-and-search stack, OpenMetadata's ingestion fleet — so the cross-engine view is real work, not a free lunch.

🔗 Learn more — ¹¹ What is a data lakehouse?

The trap is picking the bundled catalog while you are single-vendor, then growing into a multi-engine estate the bundle structurally cannot describe — and discovering your "governance" was vendor lock-in with a nicer UI.

The catalog you query, the catalog you govern

Two catalogs, two jobs. One resolves a table name in the hot path; the other tells you what you have, where it came from, and who may touch it. The format war was loud because it was a clean two-vendor fight. The governance layer is quieter and matters more, because it is where trust in a data platform is actually built or lost — and, increasingly, where AI agents will go to find out what is safe to use.

The real prize was never the table format. It is the cross-engine view of the metadata, and by the same logic that handed Iceberg the format war, that view belongs to whoever is least conflicted holding it. Bet accordingly.