How Apache Iceberg actually works
Iceberg turns thousands of immutable Parquet files in object storage into something that behaves like a database table — atomic commits, time travel, schema evolution. It does it with a tree of metadata files and one atomic pointer swap. A walk through the actual mechanics.
Parquet gives you fast, compressed, columnar files (see "How Parquet works: columnar storage explained"). It does not give you a table. A real table needs answers to questions a single file cannot have: which files belong to the table right now, how to add files without a half-written table ever being visible, how to see the table as it looked yesterday, and how to rename a column without rewriting petabytes. Apache Iceberg is a table format: a specification for the metadata that answers those questions over a pile of Parquet files sitting in object storage like S3.
The clever part is that object storage has no transactions. S3 cannot do "update these thousand files atomically." Iceberg gets database-grade guarantees out of a store that offers almost none, and it does it with a tree of metadata files plus a single atomic pointer swap.
The metadata tree
An Iceberg table is a layered tree. Each level points down to the next, and the data files at the bottom are ordinary Parquet.
flowchart TD
CAT["Catalog: name → current metadata pointer"] --> META["metadata.json — schema (with field IDs), partition spec, snapshot list"]
META --> SNAP["Snapshot — the table at one point in time"]
SNAP --> ML["Manifest list (Avro) — the manifests in this snapshot, with partition summaries"]
ML --> M1["Manifest file (Avro) — lists data files + per-file stats"]
ML --> M2["Manifest file (Avro)"]
M1 --> D1["data-001.parquet"]
M1 --> D2["data-002.parquet"]
M2 --> D3["data-003.parquet"]
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class CAT,META key
class SNAP,ML,M1,M2,D1,D2,D3 plain
Top to bottom:
- metadata.json: the root. A JSON file holding the table's full schema (with a unique integer ID for every column; remember this, it matters later), the partition spec, the table location, and the list of every snapshot that has not yet been expired. Each change to the table writes a new numbered metadata.json.
- Snapshot: the state of the table at one commit. Not a copy of the data; a definitive pointer to the exact set of files that made up the table at that moment.
- Manifest list: one per snapshot, stored as an Avro file. It lists the manifest files in that snapshot, each annotated with partition-value summaries so a planner can skip whole manifests.
- Manifest file: an immutable Avro file listing actual data files, each with its partition values and per-file statistics (row count, null counts, column min/max).
- Data files: the Parquet files holding the rows.
Notice the same trick Parquet plays with its footer, repeated at every level: statistics are pushed up the tree so a query can prune without opening the layer below. The manifest list lets you skip manifests; manifests let you skip data files; the Parquet footer lets you skip row groups. A selective query can plan a scan over a petabyte-scale table by reading a few small Avro and JSON files and never touching most of the Parquet at all.
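The pruning cascade can be sketched in a few lines of Python. This is a toy model, not Iceberg's planner: the `DataFile` and `ManifestEntry` classes, the single integer column, and the `plan_scan` function are all hypothetical, and the manifest-list summary is simplified to one value range per manifest (in Iceberg those summaries cover partition fields). The file names are the ones from the diagram above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # min of the filtered column in this file (hypothetical stat)
    upper: int  # max of the filtered column in this file


@dataclass(frozen=True)
class ManifestEntry:
    """One row of the manifest list: a manifest path plus a value-range summary."""
    path: str
    lower: int
    upper: int


def overlaps(lower, upper, lo, hi):
    """True if [lower, upper] can contain any value in [lo, hi]."""
    return lower <= hi and upper >= lo


def plan_scan(manifest_list, manifests, lo, hi):
    """Return (data files that may match [lo, hi], manifests that had to be opened)."""
    opened, selected = [], []
    for entry in manifest_list:
        if not overlaps(entry.lower, entry.upper, lo, hi):
            continue  # skip the whole manifest without reading it
        opened.append(entry.path)
        for data_file in manifests[entry.path]:
            if overlaps(data_file.lower, data_file.upper, lo, hi):
                selected.append(data_file.path)
    return selected, opened


# Toy contents: manifest path -> the data files it lists.
manifests = {
    "m1.avro": [DataFile("data-001.parquet", 1, 10), DataFile("data-002.parquet", 11, 20)],
    "m2.avro": [DataFile("data-003.parquet", 21, 30)],
}
manifest_list = [ManifestEntry("m1.avro", 1, 20), ManifestEntry("m2.avro", 21, 30)]

files, opened = plan_scan(manifest_list, manifests, 12, 15)
print(files)   # ['data-002.parquet']
print(opened)  # ['m1.avro']
```

For the filter 12..15, the second manifest is never opened and only one of three data files is scheduled for reading, which is the whole point of pushing statistics up the tree.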
How a commit works (and why it is atomic)
This is the heart of it. To add data to an Iceberg table, a writer:
1. Writes the new rows as new Parquet data files.
2. Writes new manifest files listing those data files (plus the ones carried over).
3. Writes a new manifest list for the new snapshot.
4. Writes a new metadata.json that includes the new snapshot.
5. Asks the catalog to swap the table's current-metadata pointer from the old metadata.json to the new one, in a single atomic compare-and-swap.
Everything in steps 1–4 is just writing new, immutable files; none of it is visible to readers yet, because nothing points at it. The table "changes" only at step 5, the instant the catalog flips one pointer. That swap is atomic, so a reader sees either the entire old table or the entire new one — never a half-written state. That is how Iceberg gets atomic commits out of a storage layer that has no transactions: it reduces every change to flipping a single pointer.
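The five steps can be mimicked with a dictionary standing in for object storage. This is a minimal sketch under loud assumptions: the `Catalog` class, the `store` dict, the file-naming scheme, and the `stage_append` and `read_rows` helpers are all hypothetical, and the "files" are plain Python objects rather than Parquet, Avro, or JSON. It only illustrates that staged files stay invisible until the pointer moves.

```python
import threading


class Catalog:
    """Hypothetical in-memory catalog: table name -> current metadata path."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()  # stands in for the catalog's atomic operation

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


store = {}  # stand-in for object storage: path -> immutable content


def read_rows(catalog, table):
    """Walk catalog -> metadata -> manifest list -> manifests -> data files."""
    meta_path = catalog.current(table)
    if meta_path is None:
        return []
    rows = []
    for manifest_path in store[store[meta_path]["manifest_list"]]:
        for data_path in store[manifest_path]:
            rows.extend(store[data_path])
    return rows


def stage_append(catalog, table, version, new_rows):
    """Steps 1-4: write new immutable files. Nothing points at them yet."""
    base = catalog.current(table)
    carried_over = store[store[base]["manifest_list"]] if base else []
    store[f"data-{version}.parquet"] = list(new_rows)                          # step 1
    store[f"manifest-{version}.avro"] = [f"data-{version}.parquet"]            # step 2
    store[f"snap-{version}.avro"] = carried_over + [f"manifest-{version}.avro"]  # step 3
    store[f"v{version}.metadata.json"] = {"manifest_list": f"snap-{version}.avro"}  # step 4
    return base, f"v{version}.metadata.json"


catalog = Catalog()
base, new = stage_append(catalog, "events", 1, ["a", "b"])
before = read_rows(catalog, "events")                        # files exist, table unchanged
committed = catalog.compare_and_swap("events", base, new)    # step 5
after = read_rows(catalog, "events")
print(before, committed, after)  # [] True ['a', 'b']
```

Between `stage_append` and `compare_and_swap` the store already holds four new objects, yet a reader that starts from the catalog cannot reach any of them.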
The same mechanism gives optimistic concurrency. If two writers commit at once, only one wins the compare-and-swap; the other's swap fails because the pointer is no longer where it expected. The losing writer re-reads the now-current table, checks that its changes do not conflict with what was committed in the meantime, and retries its commit on top. The result is serializable isolation without locks.
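A retry loop of this kind might look like the sketch below. Everything here is illustrative: `TableMetadata`, `Catalog`, and `commit_append` are hypothetical names, the table state is reduced to a version number and a tuple of file paths, and the conflict check is omitted because plain appends never conflict with each other.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    version: int
    files: tuple


class Catalog:
    """Hypothetical single-table catalog holding the current metadata."""

    def __init__(self):
        self._current = TableMetadata(0, ())
        self._lock = threading.Lock()  # stands in for the catalog's atomic operation

    def load(self):
        return self._current

    def compare_and_swap(self, expected_version, new):
        with self._lock:
            if self._current.version != expected_version:
                return False  # someone else committed first
            self._current = new
            return True


def commit_append(catalog, new_file, max_attempts=100):
    """Optimistically append one file; on a lost race, reload and retry."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.load()
        candidate = TableMetadata(base.version + 1, base.files + (new_file,))
        if catalog.compare_and_swap(base.version, candidate):
            return attempt
    raise RuntimeError("commit did not succeed within max_attempts")


catalog = Catalog()
threads = [
    threading.Thread(target=commit_append, args=(catalog, f"data-{i:03d}.parquet"))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = catalog.load()
print(final.version, len(final.files))  # 8 8

# A writer still holding a stale base loses the swap.
stale = catalog.load()
commit_append(catalog, "data-008.parquet")
lost = catalog.compare_and_swap(
    stale.version, TableMetadata(stale.version + 1, stale.files + ("late.parquet",))
)
print(lost)  # False
```

Every failed swap means some other writer succeeded, so with eight writers no single writer can fail more than seven times; no append is lost and none is applied twice.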
What the tree buys you
Every headline Iceberg feature falls out of this structure rather than being bolted on:
- Time travel. Every commit leaves its snapshot in the metadata; old snapshots still point at their data files. Querying the table "as of" an old snapshot or timestamp just means reading an older snapshot's manifest list. (Old snapshots are kept until you expire them.)
- Schema evolution. Because columns are tracked by unique field ID, not by name or position, you can add, drop, rename, and reorder columns safely. A rename changes a name in metadata.json; the field ID stays the same, so existing data files still map correctly. No data rewrite.
- Hidden partitioning. Iceberg derives partition values from a column via a transform, such as day(event_ts) or bucket(16, user_id), and stores the partition spec in metadata. Queries filter on the real column (WHERE event_ts > …) and Iceberg figures out which partitions to read. Users never construct partition paths by hand, and the partition scheme can even evolve without rewriting old data.
- File pruning. The per-file stats in manifests let the planner discard files that cannot match a filter before reading them. It is the manifest-level equivalent of Parquet's predicate pushdown.
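Two of these features are easy to illustrate with toy data. In the sketch below, the schemas, the row layout keyed by field ID, the new column name `occurred_at`, and the helpers `project`, `day_transform`, and `files_for` are all hypothetical. The only detail taken from Iceberg itself is that the day transform yields days since 1970-01-01.

```python
from datetime import date, datetime, timezone

# Schema evolution: the schema maps field ID -> current column name.
schema_v1 = {1: "user_id", 2: "event_ts"}
schema_v2 = {1: "user_id", 2: "occurred_at"}  # rename: same field ID 2, new name

# A row from a data file written under schema_v1, stored by field ID.
old_file_row = {1: 42, 2: "2024-03-05T10:30:00+00:00"}


def project(row_by_id, schema):
    """Resolve stored values to column names using the current schema."""
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


print(project(old_file_row, schema_v2))
# {'user_id': 42, 'occurred_at': '2024-03-05T10:30:00+00:00'}


# Hidden partitioning: the partition value is derived from the column.
def day_transform(ts):
    """Days since 1970-01-01 (UTC) for a timezone-aware timestamp."""
    return (ts.astimezone(timezone.utc).date() - date(1970, 1, 1)).days


def utc(year, month, day, hour=0):
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# (partition value, data file path) pairs, as a manifest might record them.
partitioned_files = [
    (day_transform(utc(2024, 3, 4)), "d1.parquet"),
    (day_transform(utc(2024, 3, 5)), "d2.parquet"),
    (day_transform(utc(2024, 3, 6)), "d3.parquet"),
]


def files_for(ts_from):
    """Files that may hold rows with event_ts >= ts_from."""
    first_day = day_transform(ts_from)
    return [path for day, path in partitioned_files if day >= first_day]


print(files_for(utc(2024, 3, 5, 12)))  # ['d2.parquet', 'd3.parquet']
```

The old file needs no rewrite after the rename because nothing in it refers to the column name. In the partitioning half, the query mentions only the timestamp; d2.parquet is kept because its day partition may still contain rows after noon, so pruning is conservative rather than exact.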
Iceberg v2 adds row-level deletes through separate delete files (position or equality deletes) that a reader merges against the data at query time, so updates and deletes no longer require rewriting whole data files.
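The merge-on-read idea can be sketched as follows. This is a simplification with hypothetical names (`read_with_deletes`, the sample rows, the `id` column): position deletes are modeled as (file path, row position) pairs and equality deletes as column-value matches, and the rules about which delete files apply to which data files are ignored.

```python
# Rows of one immutable data file, in stored order.
data_path = "data-001.parquet"
rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]

# Position deletes: (data file path, row position within that file).
position_deletes = [("data-001.parquet", 1), ("data-002.parquet", 0)]

# Equality deletes: any row matching all of these column values is deleted.
equality_deletes = [{"id": 4}]


def read_with_deletes(path, rows, position_deletes, equality_deletes):
    """Return the live rows of one data file after applying delete files."""
    dead_positions = {pos for p, pos in position_deletes if p == path}
    live = []
    for pos, row in enumerate(rows):
        if pos in dead_positions:
            continue  # removed by a position delete
        if any(all(row.get(k) == v for k, v in d.items()) for d in equality_deletes):
            continue  # removed by an equality delete
        live.append(row)
    return live


live_rows = read_with_deletes(data_path, rows, position_deletes, equality_deletes)
print([r["id"] for r in live_rows])  # [1, 3]
```

The data file itself is never modified; the delete for data-002.parquet is ignored here because it targets a different file.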
The catalog is the load-bearing piece
Notice that step 5 — the atomic swap — is the only part that needs more than dumb object storage. That job belongs to the catalog: it resolves a table name to its current metadata.json and performs the atomic compare-and-swap on commit. The catalog is what makes the whole scheme safe, and it is exactly the layer the industry is now fighting over (Snowflake's Polaris, Databricks' Unity Catalog, AWS Glue, all speaking the Iceberg REST catalog spec). That fight is the sequel to the table-format war.
The honest tradeoff
This design is elegant, and it has a real cost. A table is now potentially thousands of small metadata and data files in object storage, and every query plan and every commit is a flurry of small reads and writes against a system that charges per request and rewards large sequential ones. That request-count economics problem is why Iceberg tables need regular compaction and snapshot expiry to stay fast — and why a newer design like DuckLake argues the metadata belongs in a database rather than in JSON and Avro files on a transactionless store. Iceberg is, in a sense, a transactional catalog reimplemented as files on a system that does not support transactions. That it works as well as it does is the achievement; whether it is the right long-term shape is a genuinely open question.
For the mechanics, though, the summary is simple: Parquet stores the rows, a tree of Avro and JSON metadata describes which rows make up the table, and a single atomic pointer swap at the catalog turns "write some files to S3" into "commit a transaction."