Updated 18 Jun 2026

What is Project Nessie?

Project Nessie is an open-source transactional catalog that brings git-like branches, tags, and atomic commits to data lake tables.

#data
#catalog
#lakehouse
#ai-assisted

Project Nessie is an open-source transactional catalog for data lakes that brings git-like semantics (branches, tags, commits, and merges) to tables, especially Apache Iceberg [1] tables. Instead of a catalog that just maps table names to the latest set of files, Nessie keeps a version history of your entire data lake [2]: every change is a commit, and you can branch off it, tag it, and merge it back. It is, in the most literal sense, "git for the data lake."

🔗 Learn more [1]: How Apache Iceberg actually works
🔗 Learn more [2]: What is a data lake?

If you have used Apache Iceberg on its own, you already get per-table snapshots and time travel. Nessie's contribution is moving that idea up a level — to the catalog spanning all your tables — so versioning and isolation apply across many tables at once rather than one table in isolation.

Branches and tags: isolation and reproducibility

The two ideas you reach for first are branches and tags, and they behave just like their git counterparts.

A branch is a named, isolated line of changes. You create etl_dev off main, run a messy experimental pipeline against it, inspect the results, and nobody querying main ever sees the half-finished state. If the experiment works, you merge it into main and it becomes visible atomically. If it does not, you delete the branch and main is left exactly as it was. This is genuinely useful for ETL [3] that writes to many tables: you stage the whole run on a branch and promote it only when it is correct and complete.

🔗 Learn more [3]: What is ETL (and how is ELT different)?
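
In Spark, a stage-then-promote run like this is typically driven through Nessie's SQL extensions. The sketch below only builds the statement strings, so it runs without a cluster. The statement forms follow the Nessie Spark SQL extensions as commonly documented, while the catalog name nessie and the helper branch_workflow_sql are illustrative assumptions, so check the syntax against the Nessie version you run.

```python
def branch_workflow_sql(branch, base="main", catalog="nessie"):
    """Return the SQL for one stage-then-promote run, keyed by step.

    `catalog` is whatever name the Nessie catalog has in your Spark
    session; "nessie" is an assumed name, not a requirement.
    """
    return {
        # Fork an isolated branch from the base reference.
        "create": f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {base}",
        # Point the session at the branch so writes land there.
        "use": f"USE REFERENCE {branch} IN {catalog}",
        # Promote the finished run to the base branch.
        "merge": f"MERGE BRANCH {branch} INTO {base} IN {catalog}",
        # Discard the branch, after a merge or a failed experiment.
        "drop": f"DROP BRANCH IF EXISTS {branch} IN {catalog}",
    }


steps = branch_workflow_sql("etl_dev")
for name, statement in steps.items():
    print(f"{name:>6}: {statement}")
```

In a real session each string would be passed to spark.sql(...), with the pipeline's table writes running between the "use" and "merge" steps.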

A tag is an immutable, named pointer to a specific commit — a reproducible snapshot of the lake at a moment in time. Tag the state you trained a model on, or the version that fed last quarter's report, and you can query exactly that data months later regardless of how much the lake has changed since. "Re-run this analysis against the data as it stood on the release date" stops being a forensic exercise.
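
To make the two ideas concrete, here is a toy in-memory model of branches and tags as named pointers into a chain of commits. It is a sketch of the semantics only: ToyCatalog, Commit, and the tag name release-2026-q1 are invented for illustration and are not Nessie's API, and the merge here only handles the simple fast-forward case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot of the whole catalog: table name -> state."""
    tables: dict
    message: str
    parent: Optional["Commit"] = None


class ToyCatalog:
    def __init__(self):
        self.branches = {"main": Commit({}, "init")}
        self.tags = {}

    def create_branch(self, name, source="main"):
        # A new branch is just another pointer to the same commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes, message):
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.tables, **changes}, message, head)

    def tag(self, name, branch="main"):
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists and cannot move")
        self.tags[name] = self.branches[branch]

    def merge(self, source, target="main"):
        src, dst = self.branches[source], self.branches[target]
        node = src
        while node is not None and node is not dst:
            node = node.parent
        if node is None:
            raise RuntimeError("target moved; this toy only fast-forwards")
        self.branches[target] = src  # one pointer update publishes the run

    def read(self, ref, table):
        commit = self.branches[ref] if ref in self.branches else self.tags[ref]
        return commit.tables.get(table)


lake = ToyCatalog()
lake.commit("main", {"orders": "v1"}, "load orders")
lake.tag("release-2026-q1")
lake.create_branch("etl_dev")
lake.commit("etl_dev", {"orders": "v2-experimental"}, "experiment")

before_merge = lake.read("main", "orders")        # main is unaffected
lake.merge("etl_dev")
after_merge = lake.read("main", "orders")         # now sees the branch's work
tagged = lake.read("release-2026-q1", "orders")   # still the tagged state
print(before_merge, after_merge, tagged)
```

The point of the model is that readers of main see nothing until the merge moves one pointer, and the tag keeps resolving to the commit it was created on no matter what happens afterwards.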

Cross-table atomic commits

The feature that is hard to get any other way is the multi-table atomic commit. In a plain Iceberg-plus-object-store setup, each table commits independently. A pipeline that updates orders, inventory, and shipments together can fail partway and leave the three tables mutually inconsistent, with consumers reading a torn state.

Nessie treats a commit as spanning the whole catalog. You write to all three tables on a branch and merge once; downstream readers see either none of the changes or all of them, never a partial mix. That all-or-nothing guarantee across tables is the closest a data lake gets to the transactional behavior people expect from a database.
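
The difference is easy to see in a small simulation. The sketch below is a toy, with plain dictionaries standing in for table versions and a hypothetical run_pipeline helper that can crash partway; it does not call Nessie, it only contrasts per-table publishing with staging on a copy and publishing in one step.

```python
def run_pipeline(write, fail_after=None):
    """Write three tables in order; optionally crash before the Nth write."""
    updates = [("orders", 2), ("inventory", 2), ("shipments", 2)]
    for i, (table, version) in enumerate(updates):
        if fail_after is not None and i == fail_after:
            raise RuntimeError(f"pipeline crashed before writing {table}")
        write(table, version)


# Per-table commits: every write is visible to readers immediately.
per_table = {"orders": 1, "inventory": 1, "shipments": 1}
try:
    run_pipeline(per_table.__setitem__, fail_after=2)
except RuntimeError:
    pass  # readers are now left with a torn state

# Branch and merge: stage on a copy, publish with a single reference swap.
main_head = {"orders": 1, "inventory": 1, "shipments": 1}
branch = dict(main_head)
try:
    run_pipeline(branch.__setitem__, fail_after=2)
    main_head = branch  # only reached if every write succeeded
except RuntimeError:
    pass  # the branch is discarded; main_head was never touched
main_after_failure = dict(main_head)

# A clean run on a fresh branch publishes all three tables together.
branch = dict(main_head)
run_pipeline(branch.__setitem__)
main_head = branch
main_after_success = dict(main_head)

print("per-table after crash:", per_table)
print("main after failed run:", main_after_failure)
print("main after clean run: ", main_after_success)
```

After the crash, the per-table version shows orders and inventory at version 2 and shipments still at version 1, while the staged version leaves main entirely at version 1 until a run completes.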

%% color = green: the atomic merge that publishes all tables at once
flowchart TD
    MAIN["main (consumers read here)"] --> DEV["branch: etl_run_2026"]
    DEV --> T1["write orders"]
    DEV --> T2["write inventory"]
    DEV --> T3["write shipments"]
    T1 --> MERGE["merge: atomic"]
    T2 --> MERGE
    T3 --> MERGE
    MERGE --> MAIN

    classDef grey stroke:#7b88a1,stroke-width:2.5px
    classDef green stroke:#a3be8c,stroke-width:2.5px
    class MERGE green
    class MAIN,DEV,T1,T2,T3 grey

How it fits an Iceberg lakehouse

Nessie speaks the Iceberg REST catalog protocol, so engines that already talk to an Iceberg REST catalog (Spark [4], Trino [5], Flink [6], Dremio, and others) can point at Nessie with mostly configuration changes rather than code rewrites. It sits where a data catalog normally sits in a data lakehouse [7]: between the query engines and the table files in your data lake, tracking which files belong to which version of which table.

🔗 Learn more [4]: What is Apache Spark?
🔗 Learn more [5]: What is a query engine (Trino, Presto, and friends)?
🔗 Learn more [6]: What is Apache Flink?
🔗 Learn more [7]: What is a data lakehouse?
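
For Spark, "mostly configuration" looks roughly like the sketch below, which builds the properties Iceberg's Spark integration uses for a REST catalog. The property names are standard Iceberg Spark settings; the catalog name nessie, the helper rest_catalog_conf, and the URL are placeholders assumed for illustration, and a real deployment will usually also need warehouse and credential settings that are omitted here.

```python
def rest_catalog_conf(name, uri):
    """Spark properties that register an Iceberg REST catalog under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        # Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Talk to the catalog over the Iceberg REST protocol.
        f"{prefix}.type": "rest",
        # Placeholder endpoint; substitute your server's REST catalog URL.
        f"{prefix}.uri": uri,
    }


conf = rest_catalog_conf("nessie", "http://localhost:19120/iceberg")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed flags can be passed to spark-submit, or each pair can be set with SparkSession.builder.config(key, value); the job's SQL and DataFrame code stays the same.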

Be honest about where it stands, though. Nessie is one option among catalogs, alongside Unity Catalog, Apache Polaris, and AWS Glue [8], and it is less widely adopted than those. The git-for-data model is the differentiator: if branching, tagging, and cross-table atomic merges map onto how your team actually works, Nessie earns its place. If you mostly need a plain catalog and never branch your data, a simpler catalog will do the job with less to operate. Pick it for the version-control workflow, not by default.

🔗 Learn more [8]: What is AWS Glue?