What is the Hive metastore?
The Hive metastore maps table names to schemas, partitions, and file locations so engines can treat directories of files as SQL tables.
The Hive metastore is the original table catalog of the Hadoop¹ era: a relational database² (plus a service exposing a Thrift API) that maps table names to their schema, partitions, and file locations in storage, so that a query engine³ can treat a directory of raw files as a proper SQL table. Without it, a folder of Parquet⁴ files is just bytes on a filesystem. With it, that folder becomes sales.orders: a table with named, typed columns, partitions, and a known storage path that any engine can read.
🔗 Learn more — 1 What is Hadoop (and why MapReduce faded)?
🔗 Learn more — 2 What is a database?
🔗 Learn more — 3 What is a query engine (Trino, Presto, and friends)?
🔗 Learn more — 4 How Parquet works: columnar storage explained
What it stores
The metastore holds metadata about data, not the data itself. The actual rows live in storage as files; the metastore records what those files mean. Concretely it keeps:
- Databases (also called schemas or namespaces) — the top-level grouping for tables.
- Tables — the logical objects, each with a type (managed or external), an owner, and properties.
- Columns — names and data types, so engines know how to parse the underlying files.
- Partitions — the values (for example date=2026-06-18) that map to specific subdirectories, letting an engine skip files it does not need.
- Storage locations — the URI of the directory or bucket holding the files, plus the file format (Parquet, ORC, Avro⁵, text) and serialization details.
🔗 Learn more — 5 What is Apache Avro (and how is it different from Parquet)?
That mapping is the whole point. A query engine asks the metastore "where does sales.orders live, and what are its columns?", gets back a path and a schema, and reads the files directly. The metastore never touches the data on the read path.
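To make the lookup concrete, here is a minimal in-memory sketch of the kind of metadata described above and of how partition values narrow a read to specific subdirectories. Everything in it is an assumption for illustration: `ToyMetastore`, `Table`, `Column`, the `s3://example-bucket` path, and the column names are hypothetical and are not the real metastore's data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    type: str


@dataclass
class Table:
    database: str
    name: str
    columns: list[Column]
    partition_keys: list[str]
    location: str      # root directory of the table
    file_format: str
    # partition values -> subdirectory, e.g. ("2026-06-18",) -> ".../date=2026-06-18"
    partitions: dict[tuple[str, ...], str] = field(default_factory=dict)


class ToyMetastore:
    """In-memory stand-in for the catalog; it holds metadata only, never rows."""

    def __init__(self) -> None:
        self._tables: dict[str, Table] = {}

    def register(self, table: Table) -> None:
        self._tables[f"{table.database}.{table.name}"] = table

    def get_table(self, qualified_name: str) -> Table:
        return self._tables[qualified_name]

    def partition_locations(self, qualified_name: str, **wanted: str) -> list[str]:
        """Return only the subdirectories whose partition values match."""
        table = self.get_table(qualified_name)
        matches = []
        for values, location in table.partitions.items():
            row = dict(zip(table.partition_keys, values))
            if all(row.get(key) == value for key, value in wanted.items()):
                matches.append(location)
        return sorted(matches)


# Hypothetical table: two daily partitions under one root directory.
orders = Table(
    database="sales",
    name="orders",
    columns=[Column("order_id", "bigint"), Column("amount", "decimal(10,2)")],
    partition_keys=["date"],
    location="s3://example-bucket/sales/orders",
    file_format="parquet",
)
for day in ("2026-06-17", "2026-06-18"):
    orders.partitions[(day,)] = f"{orders.location}/date={day}"

metastore = ToyMetastore()
metastore.register(orders)

table = metastore.get_table("sales.orders")
print(table.location, [c.name for c in table.columns])
print(metastore.partition_locations("sales.orders", date="2026-06-18"))
```

The sketch returns paths and a schema and nothing else; reading the Parquet files under those paths would be the engine's job.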
The service and its backing database
There are two pieces. The backing RDBMS — usually MySQL, PostgreSQL, or another relational database — actually stores the metadata in a fixed set of tables. In front of it sits the Hive Metastore Service (HMS), a server process that exposes a Thrift API. Clients talk to HMS over Thrift rather than connecting to the database directly, which decouples engines from the storage schema and lets the metastore be shared, secured, and scaled independently.
flowchart TD
ENG["Query engines: Hive, Spark, Trino, Presto"]
HMS["Hive Metastore Service (Thrift API)"]
DB[("Backing RDBMS: schema, partitions, locations")]
FILES[("Files in storage: Parquet / ORC")]
ENG -->|ask: schema + location| HMS
HMS --> DB
ENG -->|read rows directly| FILES
classDef catalog stroke:#a3be8c,stroke-width:2.5px
classDef plain stroke:#7b88a1,stroke-width:2.5px
%% color = green: the metadata catalog path; grey: engines and raw files
class HMS,DB catalog
class ENG,FILES plain
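The split between a service and its backing database can be sketched in a few lines. This is a toy under stated assumptions: it uses an in-memory SQLite database in place of MySQL or PostgreSQL, plain Python method calls in place of Thrift, and the `toy_tables` / `toy_columns` schema and the `ToyMetastoreService` class are hypothetical simplifications, not the real metastore schema or API.

```python
import sqlite3


def create_backing_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema; the real metastore schema is far larger."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE toy_tables (
            db_name TEXT, tbl_name TEXT, location TEXT, file_format TEXT,
            PRIMARY KEY (db_name, tbl_name)
        );
        CREATE TABLE toy_columns (
            db_name TEXT, tbl_name TEXT, position INTEGER,
            col_name TEXT, col_type TEXT
        );
        """
    )
    return db


class ToyMetastoreService:
    """Stand-in for the service layer: clients call methods, never SQL."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self._db = db

    def create_table(self, db_name, tbl_name, location, file_format, columns):
        with self._db:  # one transaction for the table row and its columns
            self._db.execute(
                "INSERT INTO toy_tables VALUES (?, ?, ?, ?)",
                (db_name, tbl_name, location, file_format),
            )
            self._db.executemany(
                "INSERT INTO toy_columns VALUES (?, ?, ?, ?, ?)",
                [(db_name, tbl_name, i, n, t) for i, (n, t) in enumerate(columns)],
            )

    def get_table(self, db_name, tbl_name):
        row = self._db.execute(
            "SELECT location, file_format FROM toy_tables "
            "WHERE db_name = ? AND tbl_name = ?",
            (db_name, tbl_name),
        ).fetchone()
        if row is None:
            raise KeyError(f"{db_name}.{tbl_name}")
        cols = self._db.execute(
            "SELECT col_name, col_type FROM toy_columns "
            "WHERE db_name = ? AND tbl_name = ? ORDER BY position",
            (db_name, tbl_name),
        ).fetchall()
        return {"location": row[0], "format": row[1], "columns": cols}


service = ToyMetastoreService(create_backing_db())
service.create_table(
    "sales", "orders", "s3://example-bucket/sales/orders", "parquet",
    [("order_id", "bigint"), ("amount", "double")],
)
info = service.get_table("sales", "orders")
print(info)
```

Because callers only see `create_table` and `get_table`, the storage schema behind the service could change without any client noticing, which is the decoupling the service layer is there to provide.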
Because this contract was simple and stable, nearly every engine learned to speak it. Apache Spark⁶, Trino, Presto, and Hive itself can all point at the same metastore and see the same tables. That shared catalog is exactly what made a single set of files usable from many engines at once, and it is why the metastore became the de facto standard catalog across the Hadoop and early data lake⁷ world.
🔗 Learn more — 6 What is Apache Spark?
🔗 Learn more — 7 What is a data lake?
Why it is being superseded
The metastore was built for a Hadoop world of relatively static, partition-organized tables, and at modern scale its design shows its age. Two limits stand out. First, partition listing: to plan a query, an engine often has to fetch the matching partitions from the metastore and then list the files inside each partition directory, and on a table with hundreds of thousands of partitions that enumeration becomes a real bottleneck. Second, there are no atomic commits, across tables or even within one: the metastore tracks where a table's directories are, but it has no native notion of an atomic, versioned snapshot of a table, so a write that touches several files or partitions is not published as one unit, and readers running alongside a writer can see partial or inconsistent state.
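A rough way to see the listing cost is to count directory listings during planning. The sketch below is an illustration only: `CountingStorage` and `plan_scan` are hypothetical helpers, the storage is a dictionary rather than a real object store, and the partition layout is made up.

```python
class CountingStorage:
    """Fake object store that counts how many directory listings were issued."""

    def __init__(self, files_by_dir: dict[str, list[str]]) -> None:
        self._files_by_dir = files_by_dir
        self.list_calls = 0

    def list_dir(self, path: str) -> list[str]:
        self.list_calls += 1
        return self._files_by_dir.get(path, [])


def plan_scan(partition_dirs: list[str], storage: CountingStorage) -> list[str]:
    """Collect the files to read by listing each selected partition directory."""
    files: list[str] = []
    for directory in partition_dirs:
        files.extend(f"{directory}/{name}" for name in storage.list_dir(directory))
    return files


# Hypothetical table: 100,000 partition directories with two files each.
root = "s3://example-bucket/sales/orders"
partition_dirs = [f"{root}/day={n:06d}" for n in range(100_000)]
storage = CountingStorage(
    {d: ["part-0.parquet", "part-1.parquet"] for d in partition_dirs}
)

files = plan_scan(partition_dirs, storage)
print(len(files), "files found with", storage.list_calls, "listing calls")
```

In this model the number of listing calls grows linearly with the number of partitions a query touches. Against a remote object store each call is a network round trip, which is where the planning time goes.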
Modern table formats answer both. Apache Iceberg⁸ and its peers move the file-tracking metadata into the table itself (manifest and snapshot files), giving atomic commits, snapshot isolation, and fast planning without scanning a partition database. The catalog layer is moving with them: the Iceberg REST catalog, Unity Catalog, and Apache Polaris are taking over the role HMS used to play, often with finer-grained governance and credential vending the metastore never had.
🔗 Learn more — 8 How Apache Iceberg actually works
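The core idea behind snapshot-based commits can be sketched as a single pointer swap. This is a toy model under stated assumptions, not Iceberg's actual implementation: `ToySnapshotTable` and `Snapshot` are hypothetical, the file list lives in memory rather than in manifest files, and an in-process lock stands in for the catalog's atomic update.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple[str, ...]  # complete file list for this version of the table


class ToySnapshotTable:
    """Toy table whose current state is one pointer to an immutable snapshot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = Snapshot(snapshot_id=0, files=())

    def current(self) -> Snapshot:
        return self._current

    def commit(self, base: Snapshot, added_files: list[str]) -> Snapshot:
        """Publish a new snapshot only if nobody committed since `base` was read."""
        new = Snapshot(base.snapshot_id + 1, base.files + tuple(added_files))
        with self._lock:  # compare-and-swap on the current pointer
            if self._current is not base:
                raise RuntimeError("conflict: table changed, re-read and retry")
            self._current = new
        return new


table = ToySnapshotTable()
before = table.current()

# A writer adds two files in one commit; a reader holding `before` is unaffected.
after = table.commit(
    before, ["date=2026-06-18/a.parquet", "date=2026-06-18/b.parquet"]
)
print(before.files, after.files)

# A second writer still holding the stale snapshot is rejected, not merged blindly.
try:
    table.commit(before, ["date=2026-06-18/c.parquet"])
except RuntimeError as err:
    print(err)
```

A reader sees either the old snapshot or the new one, never a half-applied write, and it gets the full file list from the snapshot without listing any directories.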
None of this means the metastore was a mistake. It was foundational and genuinely ubiquitous: the thing that let a query engine treat files as tables at all, and a piece of infrastructure many data teams still run somewhere. But it is increasingly a legacy layer being replaced by table-format catalogs, rather than the thing new architectures are built on. If you work with an older data lake you will still meet it daily; if you are building fresh, you will more likely reach for a table format and its catalog instead.