Updated 18 Jun 2026 · 7 min read

The other data catalog: governance, lineage, and OpenMetadata

"Catalog" means two different things in the lakehouse: the runtime catalog in your query path (Unity, Polaris) and the governance catalog beside it (OpenMetadata, DataHub). The second is where lineage, ownership, and trust live — and where the next fight is.

AI-assisted post: drafted with help from Claude, edited and fact-checked by Mart.
A library card catalog — drawers of index cards describing what the library holds and where

A card catalog never held a single book. It told you what existed, where it lived, and who had touched it — exactly the job that moved up a layer in the lakehouse. Photo by Alicia Fagerving, CC BY-SA 3.0.

Iceberg won the table-format war, and the moment it did, the fight moved up a layer to the catalog. But "catalog" is one of the most overloaded words in data infrastructure, and the one that grabbed the headline is not the one most teams actually wrestle with. There are two things wearing the name, they sit at different layers, and conflating them is why every catalog conversation goes in circles.

Two things are called a catalog

The first is the runtime catalog. It lives in the query path. When an engine wants to read a table, it asks the runtime catalog two questions: where is this table's current metadata, and am I allowed to touch it? The catalog also brokers atomic commits when concurrent writers race. This is the catalog of the format war: Unity Catalog, Apache Polaris, and AWS Glue, all speaking the Iceberg REST spec. Every query hits it. It is infrastructure.
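To make "name to location, plus a lock" concrete, here is a minimal in-memory sketch of those two questions and the commit race. It is a toy, not the Iceberg REST protocol: the `RuntimeCatalog` class, its method names, and the `s3://` paths are all assumptions made up for illustration.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class RuntimeCatalog:
    """Toy runtime catalog: maps a table name to its current metadata location."""

    def __init__(self):
        self._pointers = {}            # table name -> current metadata location
        self._grants = set()           # (principal, table) pairs that are allowed in
        self._lock = threading.Lock()  # makes the compare-and-swap atomic

    def grant(self, principal, table):
        self._grants.add((principal, table))

    def create_table(self, table, metadata_location):
        with self._lock:
            if table in self._pointers:
                raise ValueError(f"{table} already exists")
            self._pointers[table] = metadata_location

    def _check(self, principal, table):
        if (principal, table) not in self._grants:
            raise PermissionError(f"{principal} may not access {table}")

    def load_table(self, principal, table):
        """Answer both questions: is the caller allowed, and where is the metadata?"""
        self._check(principal, table)
        return self._pointers[table]

    def commit(self, principal, table, expected, new):
        """Swap the pointer only if nobody has committed since `expected` was read."""
        self._check(principal, table)
        with self._lock:
            if self._pointers[table] != expected:
                raise CommitConflict(f"{table} moved past {expected}")
            self._pointers[table] = new


# Two writers start from the same snapshot; only the first commit wins.
catalog = RuntimeCatalog()
catalog.create_table("sales.orders", "s3://bucket/orders/metadata/v1.json")
catalog.grant("writer_a", "sales.orders")
catalog.grant("writer_b", "sales.orders")

base = catalog.load_table("writer_a", "sales.orders")
catalog.commit("writer_a", "sales.orders", base, "s3://bucket/orders/metadata/v2a.json")

conflict_seen = False
try:
    catalog.commit("writer_b", "sales.orders", base, "s3://bucket/orders/metadata/v2b.json")
except CommitConflict:
    conflict_seen = True  # writer_b has to reload the table and retry

final = catalog.load_table("writer_b", "sales.orders")
```

The losing writer is not corrupted, just told to reload and try again, which is the whole job of the lock.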


The second is the governance catalog, sometimes called a metadata catalog or data discovery platform. It sits beside the query path, not in it. Nothing breaks at 3am if it goes down. Its job is for humans and, increasingly, for AI agents: what data do we have, where did this column come from, who owns it, is it any good, and is it allowed to leave the building. This is OpenMetadata, DataHub, Amundsen, and the commercial tier — Collibra, Alation, Atlan.

flowchart TD
    QUERY["Every query, at runtime"] --> RUNTIME["Runtime catalog: resolves a table, gates atomic commits"]
    RUNTIME --> DATA["Iceberg tables on object storage"]
    SOURCES["Warehouses · BI · dbt · orchestrators · streams"] -. metadata ingested .-> GOV["Governance catalog: search · lineage · quality · ownership · policy"]
    HUMANS["Engineers · analysts · stewards · AI agents"] --> GOV

    %% green = in the query path (Unity, Polaris, Glue); amber = governance layer beside it (OpenMetadata, DataHub)
    classDef path stroke:#a3be8c,stroke-width:2.5px
    classDef govc stroke:#ebcb8b,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class QUERY,RUNTIME,DATA path
    class GOV govc
    class SOURCES,HUMANS plain

The runtime catalog got the Iceberg-era attention because it is in the hot path and because two giants are fighting over it. But it answers a narrow question — name to location, plus a lock. The governance catalog answers the questions a data platform actually fails on: discovery, trust, and blast radius. That is the one worth understanding, and it is what the rest of this is about.

What a governance catalog actually does

Strip the marketing and a governance catalog is a metadata index over everything you run, kept out of band. Concretely:

  • Discovery and search across every source at once: find the table, the dashboard, the dbt model, without knowing which system it lives in.
  • Column-level lineage — not "table A feeds table B" but "this revenue column is derived from these three upstream fields," so you can answer what breaks if I change this before you change it.
  • Quality and profiling — freshness, null rates, distribution drift, test results, attached to the asset rather than buried in an orchestrator log.
  • Ownership and stewardship — a name to page when a pipeline lies.
  • Glossary and classification — business terms mapped to physical columns, and PII tagging that policy can act on.
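The column-level lineage question ("what breaks if I change this?") is a graph traversal. The sketch below assumes a hand-written lineage map with hypothetical column names; a real catalog would derive these edges from SQL parsing and connector metadata rather than a literal dict.

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> columns derived from it.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.net_amount"],
    "raw.orders.discount": ["staging.orders.net_amount"],
    "raw.fx.rate": ["marts.finance.revenue"],
    "staging.orders.net_amount": ["marts.finance.revenue"],
    "marts.finance.revenue": ["dashboards.exec.revenue_tile"],
}


def downstream(column, lineage):
    """Return every column reachable from `column`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Blast radius of changing the discount column, before anyone changes it.
impacted = downstream("raw.orders.discount", LINEAGE)
print(impacted)
```

For this toy graph the traversal walks from the raw column through the staging column and the revenue mart to the executive dashboard tile, which is the list you want before a schema change ships.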

The mechanism is the important part for a practitioner: a governance catalog ingests metadata through connectors, it does not sit in front of your data. It reaches into the warehouse, the BI tool, the orchestrator, the streaming bus, and the runtime catalog itself, and assembles a graph. Because it is out of band, it can span engines that otherwise share nothing — which turns out to be the whole point.
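A rough sketch of that out-of-band mechanism: each connector reads metadata from one system and writes into a shared graph, and none of them touches the data itself. The `MetadataGraph` class, the connector functions, and the asset names are invented for illustration and do not mirror any particular catalog's connector API.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    assets: dict = field(default_factory=dict)  # fully qualified name -> attributes
    edges: set = field(default_factory=set)     # (upstream name, downstream name)

    def upsert(self, name, **attrs):
        self.assets.setdefault(name, {}).update(attrs)

    def link(self, upstream, downstream):
        self.edges.add((upstream, downstream))


def warehouse_connector(graph):
    # Stand-in for reading a warehouse's information schema.
    graph.upsert("warehouse.sales.orders", kind="table", owner="sales-eng")


def dbt_connector(graph):
    # Stand-in for reading a dbt manifest: a model and what it selects from.
    graph.upsert("dbt.model.revenue", kind="model")
    graph.link("warehouse.sales.orders", "dbt.model.revenue")


def bi_connector(graph):
    # Stand-in for reading dashboard definitions from a BI tool.
    graph.upsert("bi.dashboard.exec_revenue", kind="dashboard")
    graph.link("dbt.model.revenue", "bi.dashboard.exec_revenue")


def ingest(connectors):
    """Run every connector against one shared graph."""
    graph = MetadataGraph()
    for connector in connectors:
        connector(graph)
    return graph


graph = ingest([warehouse_connector, dbt_connector, bi_connector])
```

The three systems never talk to each other; the lineage from table to dashboard only exists once their metadata lands in the same graph.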

The players, and the fork that matters

The open field has three names worth knowing, and the difference between them is architectural, not cosmetic.

OpenMetadata is schema-first. It was open-sourced in 2021 under Apache 2.0 by Suresh Srinivas and Sriharsha Chintalapani (names from Apache Hadoop, Apache Atlas, and Uber's Databook), and its bet is a single set of JSON Schema specifications that every entity conforms to, assembled into one unified metadata graph. You model the metadata, then pull it in through 130-odd connectors. The commercial steward is Collate.
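As a loose illustration of what "schema-first" buys you, the sketch below rejects an entity that does not conform before it ever reaches the graph. The schema here is a hypothetical, heavily simplified stand-in written as a plain dict; it is not OpenMetadata's actual JSON Schema, and a real deployment would use a proper JSON Schema validator.

```python
# Hypothetical, heavily simplified entity schema; not OpenMetadata's real specification.
TABLE_SCHEMA = {
    "required": ["name", "owner", "columns"],
    "types": {"name": str, "owner": str, "columns": list, "description": str},
}


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    for key in schema["required"]:
        if key not in entity:
            problems.append(f"missing required field: {key}")
    for key, value in entity.items():
        expected = schema["types"].get(key)
        if expected is None:
            problems.append(f"unknown field: {key}")
        elif not isinstance(value, expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


good = {"name": "orders", "owner": "sales-eng", "columns": ["id", "amount"]}
bad = {"name": "orders", "columns": "id,amount", "tier": "gold"}

good_problems = validate(good, TABLE_SCHEMA)
bad_problems = validate(bad, TABLE_SCHEMA)
```

The point of the design is that every connector has to produce the same shape, so a table from one warehouse and a table from another are directly comparable once ingested.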


DataHub is event-first. It grew out of LinkedIn's WhereHows, was rewritten as DataHub and open-sourced in early 2020, and is now stewarded by Acryl Data. Its defining choice is a push model on top of pull: producers emit metadata change events onto a Kafka stream, and the catalog reacts. That makes it the natural fit when metadata changes constantly and you want near-real-time propagation and federated, team-owned ingestion.
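The event-first model can be sketched as a fold over a stream of change events. The example below uses an in-memory `deque` as a stand-in for a Kafka topic, and the event fields (`urn`, `aspect`, `change`, `value`) are assumptions chosen for illustration, not DataHub's actual event format.

```python
from collections import deque


def apply_event(state, event):
    """Fold one metadata change event into the catalog's current view."""
    entity = state.setdefault(event["urn"], {})
    if event["change"] == "upsert":
        entity[event["aspect"]] = event["value"]
    elif event["change"] == "delete":
        entity.pop(event["aspect"], None)
    return state


# In-memory stand-in for a Kafka topic; producers would append to it independently.
stream = deque([
    {"urn": "table:orders", "aspect": "owner", "change": "upsert", "value": "sales-eng"},
    {"urn": "table:orders", "aspect": "tags", "change": "upsert", "value": ["pii"]},
    {"urn": "table:orders", "aspect": "owner", "change": "upsert", "value": "finance-eng"},
    {"urn": "table:orders", "aspect": "tags", "change": "delete", "value": None},
])

catalog_state = {}
while stream:
    apply_event(catalog_state, stream.popleft())
```

After the four events are consumed, the catalog's view of the table reflects only the latest owner and no tags; the stream is the history, and the state is whatever you get by replaying it.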


Amundsen (from Lyft) is the lighter, search-first option when discovery is the whole ask. Above them sit the commercial governance suites — Collibra, Alation, Atlan — which trade openness for enterprise policy workflows and a sales team.

Schema-first versus event-first is the real decision, and it mirrors an argument that keeps recurring in this space about where metadata should physically live: a modeled, queryable source of truth, or a stream you fold up into one. Neither is wrong. They optimize for different failure modes.

The collision

Here is where it gets interesting, and where it ties back to the format war. The runtime catalog is climbing into the governance catalog's territory.

Databricks open-sourced Unity Catalog in June 2024 under the Linux Foundation, but read the fine print. What is open is the API and a server compatible with the Iceberg REST spec and the Hive metastore. The parts that make it a governance catalog (managed lineage, the Catalog Explorer UI) remain in the commercial Databricks product. Snowflake did the symmetric thing: it donated Polaris to the Apache Software Foundation in 2024, and its Horizon governance layer runs on top of that same Polaris engine.


So both giants now pitch their runtime catalog as also being your governance catalog — lineage, access control, discovery, tags, all bundled in. For a shop that lives entirely inside one of them, that bundling is genuinely convenient. It is also exactly how the governance layer becomes the vendor's.

flowchart TD
    EST{"How many engines in the estate?"} -->|one vendor| BUNDLED["Bundled catalog covers it: Unity · Snowflake Horizon"]
    EST -->|many engines| NEUTRAL["Neutral catalog earns its place: OpenMetadata · DataHub"]
    BUNDLED --> LOCK["Governance becomes the vendor's; lock-in deepens"]
    NEUTRAL --> CROSS["One governance view across every engine"]

    %% amber = single-vendor bundled (convenient, conflicted); green = neutral cross-engine
    classDef warm stroke:#ebcb8b,stroke-width:2.5px
    classDef good stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class BUNDLED,LOCK warm
    class NEUTRAL,CROSS good
    class EST plain

Same lesson, one layer up

The Iceberg post's argument was that in open infrastructure the winning artifact is the one with the least conflicted ownership, not the most features. That law does not stop at the table format. A governance catalog's entire value is a vendor-neutral view across all your engines, and a view across everything cannot be owned by one of the things it is supposed to view neutrally. Unity's lineage is excellent inside Databricks; it is not going to be your honest broker for what is happening in Snowflake and Trino and DuckDB and your streaming bus. Whoever sees the whole estate has to be standing outside all of it.


That is the structural reason the neutral, open governance catalogs are worth betting on even though the bundled ones are easier to switch on. The threat to them is not technical; it is that the conflicted-but-convenient incumbent absorbs governance so thoroughly that you never reach for the neutral tool — and then the cross-engine view simply does not exist, because no single vendor has any incentive to build it.

When to actually use which

Governance follows heterogeneity. The honest rule of thumb:

  • One vendor, end to end: if you are all-Databricks or all-Snowflake, the bundled catalog is probably enough, and a standalone governance catalog is overhead you will resent. Don't buy a cross-engine tool to govern one engine.
  • Many engines: the moment you have a real spread (a warehouse, a lakehouse, a streaming path, ad-hoc DuckDB, BI on top), the neutral catalog is the only thing that can see all of it at once. That is the case it was built for, and nothing bundled will cover it without quietly assuming you standardize on the bundler.

The trap is picking the bundled catalog while you are single-vendor, then growing into a multi-engine estate the bundle structurally cannot describe — and discovering your "governance" was vendor lock-in with a nicer UI.

The catalog you query, the catalog you govern

Two catalogs, two jobs. One resolves a table name in the hot path; the other tells you what you have, where it came from, and who may touch it. The format war was loud because it was a clean two-vendor fight. The governance layer is quieter and matters more, because it is where trust in a data platform is actually built or lost — and, increasingly, where AI agents will go to find out what is safe to use.

The real prize was never the table format. It is the cross-engine view of the metadata, and by the same logic that handed Iceberg the format war, that view belongs to whoever is least conflicted holding it. Bet accordingly.
