The other data catalog: governance, lineage, and OpenMetadata
"Catalog" means two different things in the lakehouse: the runtime catalog in your query path (Unity, Polaris) and the governance catalog beside it (OpenMetadata, DataHub). The second is where lineage, ownership, and trust live — and where the next fight is.
AI-assisted post. Drafted with help from Claude, edited and fact-checked by Mart.
A card catalog never held a single book. It told you what existed, where it lived, and who had touched it — exactly the job that moved up a layer in the lakehouse. Photo by Alicia Fagerving, CC BY-SA 3.0.
Iceberg won the table-format war, and the moment it did, the fight moved up a layer to the catalog. But "catalog" is one of the most overloaded words in data infrastructure, and the one that grabbed the headline is not the one most teams actually wrestle with. There are two things wearing the name, they sit at different layers, and conflating them is why every catalog conversation goes in circles.
Two things are called a catalog
The first is the runtime catalog. It lives in the query path. When an engine wants to read a table, it asks the runtime catalog two questions: "where is this table's current metadata?" and "am I allowed to touch it?" The catalog also brokers atomic commits when concurrent writers race. This is the catalog of the format war: Unity Catalog, Apache Polaris, AWS Glue [1], all speaking the Iceberg [2] REST spec. Every query hits it. It is infrastructure.
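To make those two questions and the commit step concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `RuntimeCatalog` class, the `grants` access model, and the `s3://` paths are hypothetical and do not reflect the Iceberg REST spec or any vendor's API. A real catalog performs the pointer swap atomically in a backing store; this toy only shows the compare-and-swap shape.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a writer's expected metadata pointer is stale."""


@dataclass
class RuntimeCatalog:
    # table name -> location of the current metadata file (hypothetical paths)
    pointers: dict[str, str] = field(default_factory=dict)
    # principal -> tables it may touch (toy access model, not a real policy engine)
    grants: dict[str, set[str]] = field(default_factory=dict)

    def load_table(self, principal: str, table: str) -> str:
        """Answer both questions: is access allowed, and where is the metadata?"""
        if table not in self.grants.get(principal, set()):
            raise PermissionError(f"{principal} may not access {table}")
        return self.pointers[table]

    def commit(self, principal: str, table: str, expected: str, new: str) -> None:
        """Swap the pointer only if nobody else committed first (compare-and-swap)."""
        current = self.load_table(principal, table)
        if current != expected:
            raise CommitConflict(f"{table} already moved to {current}")
        self.pointers[table] = new


catalog = RuntimeCatalog(
    pointers={"sales.orders": "s3://lake/orders/metadata/v1.json"},
    grants={"etl": {"sales.orders"}},
)
base = catalog.load_table("etl", "sales.orders")
catalog.commit("etl", "sales.orders", base, "s3://lake/orders/metadata/v2.json")
# A second writer still holding `base` would now get CommitConflict and must retry.
```

The losing writer re-reads the pointer, rebases its changes, and tries again, which is why the catalog has to sit in the hot path.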
🔗 Learn more — 1 What is AWS Glue?
🔗 Learn more — 2 How Apache Iceberg actually works
The second is the governance catalog, sometimes called a metadata catalog or data discovery platform. It sits beside the query path, not in it. Nothing breaks at 3am if it goes down. Its job is for humans and, increasingly, for AI agents: what data do we have, where did this column come from, who owns it, is it any good, and is it allowed to leave the building. This is OpenMetadata, DataHub, Amundsen, and the commercial tier — Collibra, Alation, Atlan.
flowchart TD
QUERY["Every query, at runtime"] --> RUNTIME["Runtime catalog: resolves a table, gates atomic commits"]
RUNTIME --> DATA["Iceberg tables on object storage"]
SOURCES["Warehouses · BI · dbt · orchestrators · streams"] -. metadata ingested .-> GOV["Governance catalog: search · lineage · quality · ownership · policy"]
HUMANS["Engineers · analysts · stewards · AI agents"] --> GOV
%% green = in the query path (Unity, Polaris, Glue); amber = governance layer beside it (OpenMetadata, DataHub)
classDef path stroke:#a3be8c,stroke-width:2.5px
classDef govc stroke:#ebcb8b,stroke-width:2.5px
classDef plain stroke:#7b88a1,stroke-width:2.5px
class QUERY,RUNTIME,DATA path
class GOV govc
class SOURCES,HUMANS plain
The runtime catalog got the Iceberg-era attention because it is in the hot path and because two giants are fighting over it. But it answers a narrow question — name to location, plus a lock. The governance catalog answers the questions a data platform actually fails on: discovery, trust, and blast radius. That is the one worth understanding, and it is what the rest of this is about.
What a governance catalog actually does
Strip the marketing and a governance catalog is a metadata index over everything you run, kept out of band. Concretely:
- Discovery and search across every source at once: find the table, the dashboard, the dbt [3] model, without knowing which system it lives in.
- Column-level lineage: not "table A feeds table B" but "this revenue column is derived from these three upstream fields," so you can answer "what breaks if I change this?" before you change it.
- Quality and profiling: freshness, null rates, distribution drift, test results, attached to the asset rather than buried in an orchestrator log.
- Ownership and stewardship: a name to page when a pipeline lies.
- Glossary and classification: business terms mapped to physical columns, and PII tagging that policy can act on.
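The lineage bullet is the one that benefits most from a worked example. Below is a rough sketch of a blast-radius query over column-level lineage, assuming the lineage has already been collected into a plain adjacency map. The column names and the `blast_radius` helper are invented for illustration and are not any catalog's API.

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> columns derived from it.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.net_amount"],
    "raw.orders.discount": ["staging.orders.net_amount"],
    "raw.fx.rate": ["mart.finance.revenue"],
    "staging.orders.net_amount": ["mart.finance.revenue"],
    "mart.finance.revenue": ["dashboard.exec.revenue_tile"],
}


def blast_radius(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every downstream column reachable from `column`, breadth-first."""
    seen: set[str] = set()
    queue = deque([column])
    order: list[str] = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(blast_radius("raw.orders.discount", LINEAGE))
# ['staging.orders.net_amount', 'mart.finance.revenue', 'dashboard.exec.revenue_tile']
```

Changing the discount column touches a staging column, the revenue mart, and an executive dashboard. That list is the answer you want before the change ships, not after.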
🔗 Learn more — 3 What is dbt?
The mechanism is the important part for a practitioner: a governance catalog ingests metadata through connectors, it does not sit in front of your data. It reaches into the warehouse, the BI tool, the orchestrator, the streaming bus, and the runtime catalog itself, and assembles a graph. Because it is out of band, it can span engines that otherwise share nothing — which turns out to be the whole point.
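As a sketch of that connector pattern, assume each connector exposes a single `extract()` method that yields upstream-to-downstream edges. The `Connector` protocol, both connector classes, and the asset names are hypothetical stand-ins. Real connectors in OpenMetadata or DataHub are considerably richer.

```python
from typing import Iterable, Protocol

Edge = tuple[str, str]  # (upstream asset, downstream asset)


class Connector(Protocol):
    def extract(self) -> Iterable[Edge]: ...


class WarehouseConnector:
    """Stand-in for a connector that reads a warehouse's own metadata tables."""

    def extract(self) -> Iterable[Edge]:
        return [("warehouse.raw.orders", "warehouse.mart.revenue")]


class BIConnector:
    """Stand-in for a connector that reads dashboard definitions from a BI tool."""

    def extract(self) -> Iterable[Edge]:
        return [("warehouse.mart.revenue", "bi.dashboard.exec_revenue")]


def ingest(connectors: Iterable[Connector]) -> dict[str, set[str]]:
    """Merge edges from every source into one graph keyed by upstream asset."""
    graph: dict[str, set[str]] = {}
    for connector in connectors:
        for upstream, downstream in connector.extract():
            graph.setdefault(upstream, set()).add(downstream)
            graph.setdefault(downstream, set())  # leaf assets still get a node
    return graph


graph = ingest([WarehouseConnector(), BIConnector()])
```

Neither source knows about the other, yet the merged graph links the raw table to the dashboard. Nothing here touches the data itself, only descriptions of it.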
The players, and the fork that matters
The open field has three names worth knowing, and the difference between them is architectural, not cosmetic.
OpenMetadata is schema-first. It was open-sourced in 2021 under Apache 2.0 by Suresh Srinivas and Sriharsha Chintalapani (names from Apache Hadoop [4], Apache Atlas, and Uber's Databook), and its bet is a single set of JSON Schema specifications that every entity conforms to, assembled into one unified metadata graph. You model the metadata, then pull it in through 130-odd connectors. The commercial steward is Collate.
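To give a feel for what schema-first means in practice, here is a deliberately tiny sketch: one schema, and a check that every entity conforms before it enters the graph. The `TABLE_SCHEMA` dictionary and the `validate` helper are assumptions made for illustration. They only gesture at JSON Schema and are not OpenMetadata's actual specification or validator.

```python
# Toy schema in the spirit of JSON Schema; NOT OpenMetadata's real specification.
TABLE_SCHEMA = {
    "required": ["name", "owner", "columns"],
    "properties": {"name": str, "owner": str, "columns": list, "tags": list},
}


def validate(entity: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the entity conforms."""
    errors = [
        f"missing field: {key}" for key in schema["required"] if key not in entity
    ]
    for key, value in entity.items():
        expected = schema["properties"].get(key)
        if expected is None:
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, expected):
            errors.append(f"{key} should be {expected.__name__}")
    return errors


good = {"name": "orders", "owner": "data-eng", "columns": ["id", "amount"]}
bad = {"name": "orders", "columns": "id"}

print(validate(good, TABLE_SCHEMA))  # []
print(validate(bad, TABLE_SCHEMA))   # ['missing field: owner', 'columns should be list']
```

The point of the design is that a malformed entity is rejected at the door, so everything downstream can rely on the shape of the graph.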
🔗 Learn more — 4 What is Hadoop (and why MapReduce faded)?
DataHub is event-first. It grew out of LinkedIn's WhereHows, was rewritten and open-sourced in 2019, and is now stewarded by Acryl Data. Its defining choice is a push model on top of pull: producers emit metadata change events onto a Kafka [5] stream, and the catalog reacts. That makes it the natural fit when metadata changes constantly and you want near-real-time propagation and federated, team-owned ingestion.
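The event-first idea can be sketched without any infrastructure: treat the catalog's state as a fold over a stream of change events. In the sketch below an in-memory list stands in for the Kafka topic, and the `MetadataChange` shape and `fold` function are hypothetical. Real DataHub events carry far more structure than this.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MetadataChange:
    """Hypothetical change event emitted by a producer."""
    asset: str
    key: str
    value: Optional[str]  # None means the key was removed


def fold(events: Iterable[MetadataChange]) -> dict[str, dict[str, str]]:
    """Replay events in order; the latest event for each key wins."""
    state: dict[str, dict[str, str]] = {}
    for event in events:
        fields = state.setdefault(event.asset, {})
        if event.value is None:
            fields.pop(event.key, None)
        else:
            fields[event.key] = event.value
    return state


events = [
    MetadataChange("mart.revenue", "owner", "finance-eng"),
    MetadataChange("mart.revenue", "tier", "gold"),
    MetadataChange("mart.revenue", "owner", "platform"),
    MetadataChange("mart.revenue", "tier", None),
]
print(fold(events))  # {'mart.revenue': {'owner': 'platform'}}
```

Because the current state is derived rather than stored, any team can publish events on its own schedule, and the catalog catches up as they arrive.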
🔗 Learn more — 5 What is Apache Kafka?
Amundsen (from Lyft) is the lighter, search-first option when discovery is the whole ask. Above them sit the commercial governance suites — Collibra, Alation, Atlan — which trade openness for enterprise policy workflows and a sales team.
Schema-first versus event-first is the real decision, and it mirrors an argument that keeps recurring in this space about where metadata should physically live: a modeled, queryable source of truth, or a stream you fold up into one. Neither is wrong. They optimize for different failure modes.
The collision
Here is where it gets interesting, and where it ties back to the format war. The runtime catalog is climbing into the governance catalog's territory.
Databricks open-sourced Unity Catalog in June 2024 under the Linux Foundation, but read the fine print. What is open is the API and a server compatible with the Iceberg REST spec and the Hive metastore [6]. The parts that make it a governance catalog (managed lineage, the Catalog Explorer UI) remain in the commercial Databricks [7] product. Snowflake did the symmetric thing: it donated Polaris to the Apache Software Foundation in 2024, and its Horizon governance layer runs on top of that same Polaris engine.
🔗 Learn more — 6 What is the Hive metastore?
🔗 Learn more — 7 What is Databricks?
So both giants now pitch their runtime catalog as also being your governance catalog — lineage, access control, discovery, tags, all bundled in. For a shop that lives entirely inside one of them, that bundling is genuinely convenient. It is also exactly how the governance layer becomes the vendor's.
flowchart TD
EST{"How many engines in the estate?"} -->|one vendor| BUNDLED["Bundled catalog covers it: Unity · Snowflake Horizon"]
EST -->|many engines| NEUTRAL["Neutral catalog earns its place: OpenMetadata · DataHub"]
BUNDLED --> LOCK["Governance becomes the vendor's; lock-in deepens"]
NEUTRAL --> CROSS["One governance view across every engine"]
%% amber = single-vendor bundled (convenient, conflicted); green = neutral cross-engine
classDef warm stroke:#ebcb8b,stroke-width:2.5px
classDef good stroke:#a3be8c,stroke-width:2.5px
classDef plain stroke:#7b88a1,stroke-width:2.5px
class BUNDLED,LOCK warm
class NEUTRAL,CROSS good
class EST plain
Same lesson, one layer up
The Iceberg post's argument was that in open infrastructure the winning artifact is the one with the least conflicted ownership, not the most features. That law does not stop at the table format. A governance catalog's entire value is a vendor-neutral view across all your engines, and a view across everything cannot be owned by one of the things it is supposed to view neutrally. Unity's lineage is excellent inside Databricks; it is not going to be your honest broker for what is happening in Snowflake and Trino [8] and DuckDB and your streaming bus. Whoever sees the whole estate has to be standing outside all of it.
🔗 Learn more — 8 What is a query engine (Trino, Presto, and friends)?
That is the structural reason the neutral, open governance catalogs are worth betting on even though the bundled ones are easier to switch on. The threat to them is not technical; it is that the conflicted-but-convenient incumbent absorbs governance so thoroughly that you never reach for the neutral tool — and then the cross-engine view simply does not exist, because no single vendor has any incentive to build it.
When to actually use which
Governance follows heterogeneity. The honest rule of thumb:
- One vendor, end to end: if you are all-Databricks or all-Snowflake, the bundled catalog is probably enough, and a standalone governance catalog is overhead you will resent. Don't buy a cross-engine tool to govern one engine.
- Many engines: the moment you have a real spread (a warehouse, a lakehouse [9], a streaming path, ad-hoc DuckDB, BI on top), the neutral catalog is the only thing that can see all of it at once. That is the case it was built for, and nothing bundled will cover it without quietly assuming you standardize on the bundler.
🔗 Learn more — 9 What is a data lakehouse?
The trap is picking the bundled catalog while you are single-vendor, then growing into a multi-engine estate the bundle structurally cannot describe — and discovering your "governance" was vendor lock-in with a nicer UI.
The catalog you query, the catalog you govern
Two catalogs, two jobs. One resolves a table name in the hot path; the other tells you what you have, where it came from, and who may touch it. The format war was loud because it was a clean two-vendor fight. The governance layer is quieter and matters more, because it is where trust in a data platform is actually built or lost — and, increasingly, where AI agents will go to find out what is safe to use.
The real prize was never the table format. It is the cross-engine view of the metadata, and by the same logic that handed Iceberg the format war, that view belongs to whoever is least conflicted holding it. Bet accordingly.