How Apache Iceberg won the table-format war

Image caption: Most of its mass below the surface, like the metadata you never see. Public domain.

The war was over the day Databricks bought the other side's founders

There were three open table formats fighting for the lakehouse¹ in 2022: Apache Iceberg², Delta Lake³, and Apache Hudi⁴. By 2025 there is one that everybody builds against and two that everybody supports for compatibility. The interesting part is not that Iceberg won but how, because Iceberg did not win on the merits a feature comparison would have predicted. Delta Lake had the larger installed base, reaching over 60% of the Fortune 500 through Databricks' customers. Hudi had the more mature write path and carried the harder streaming and CDC⁵ workloads, with Uber, Amazon, Walmart, and Robinhood leaning on it. Iceberg won anyway, and the inflection point was the day in June 2024 when Databricks⁶, the company that made Delta Lake, agreed to acquire Tabular, the startup founded by Iceberg's original creators, for a price reported at between $1 billion and $2 billion. When the incumbent format's owner pays that kind of money for the challenger format's founders, the challenger has won.

🔗 Learn more — ¹ What is a data lakehouse?

🔗 Learn more — ² How Apache Iceberg actually works

🔗 Learn more — ³ What is Delta Lake (and how does it compare to Iceberg)?

🔗 Learn more — ⁴ What is Apache Hudi?

🔗 Learn more — ⁵ What is Change Data Capture (CDC)?

🔗 Learn more — ⁶ What is Databricks?

What a table format actually is

A table format is the least glamorous and most load-bearing layer in the modern data stack, and most arguments about it happen without anyone defining it. It is not a file format — Parquet⁷ is the file format, and all three table formats sit on top of Parquet (or ORC). A table format is a specification for the metadata that turns a pile of immutable Parquet files in object storage into something that behaves like a database table: atomic commits, schema evolution, time travel, hidden partitioning, and a consistent view of "what files make up this table right now."

🔗 Learn more — ⁷ How Parquet works: columnar storage explained

Concretely, Iceberg is a tree of metadata files. A metadata.json points at the current snapshot; the snapshot points at a manifest list; the manifest list points at manifest files; the manifests point at the actual data files. Every commit writes new metadata pointing at a new snapshot, which is how you get atomic writes and time travel out of an object store that has no transactions of its own. I wrote about the request-count economics of all those small files elsewhere, and that small-files-in-object-storage design is exactly the thing DuckLake later argued was a mistake — but for the table-format war, the design was good enough, and what mattered was who controlled the spec.
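The pointer chain described above can be modelled in a few lines. This is a minimal sketch, not Iceberg's real file layout or API: the class names (TableMetadata, Snapshot, Manifest), the commit_append helper, and the file paths are all hypothetical, and real Iceberg stores these structures as JSON and Avro files rather than Python objects. The point it illustrates is that a commit never mutates anything; it produces new metadata that points at a new snapshot, so older snapshots stay readable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    """Lists a batch of immutable data files (Parquet paths here)."""
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """One version of the table: the manifests visible at commit time."""
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]


@dataclass(frozen=True)
class TableMetadata:
    """Stand-in for metadata.json: all snapshots plus the current pointer."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit_append(self, new_files: List[str]) -> "TableMetadata":
        """Return NEW metadata whose current snapshot also covers new_files."""
        if self.current_snapshot_id is None:
            previous: Tuple[Manifest, ...] = ()
        else:
            previous = self.snapshots[self.current_snapshot_id].manifest_list
        new_id = len(self.snapshots) + 1
        snapshot = Snapshot(new_id, previous + (Manifest(tuple(new_files)),))
        # Old snapshots are carried forward untouched, which enables time travel.
        return TableMetadata({**self.snapshots, new_id: snapshot}, new_id)

    def files(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Walk metadata -> snapshot -> manifests -> data files."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        if sid is None:
            return []
        manifests = self.snapshots[sid].manifest_list
        return [path for m in manifests for path in m.data_files]


v1 = TableMetadata().commit_append(["data/a.parquet"])
v2 = v1.commit_append(["data/b.parquet"])
print(v2.files())   # the current snapshot sees both files
print(v2.files(1))  # time travel: snapshot 1 sees only the first file
```

Each appended batch gets its own manifest here, which also hints at why the small-files problem arises: every commit adds metadata objects on top of the data files themselves.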

Open governance was the actual product

Ryan Blue and Daniel Weeks created Iceberg at Netflix and donated it to the Apache Software Foundation, where it became a top-level project. That governance detail is the whole story. Delta Lake's spec, for all that it was eventually open-sourced under the Linux Foundation, was for years effectively steered by Databricks; Hudi was tightly coupled to its Uber origins and a narrower set of streaming-first use cases. Iceberg was the one format that no competing vendor controlled.

That neutrality is why a vendor list reads like a truce between rivals. By 2025, Spark, Trino, Flink, Dremio, Snowflake, BigQuery, Athena, StarRocks, and DuckDB all read Iceberg, and most write it. Snowflake and Databricks — two companies whose entire marketing budgets are spent attacking each other — both committed to the same table format. No competitor adopts a rival's house format as an interchange standard; everybody adopts the neutral one. Iceberg's formal specification meant any engine that followed the spec could read and write tables correctly, which made it the lingua franca by construction rather than by market share.

Format	Origin	Governance	2025 position
Apache Iceberg	Netflix, 2017	Apache top-level project	De-facto interchange standard; every major engine reads/writes it
Delta Lake	Databricks	Linux Foundation, Databricks-steered	Largest installed base via Databricks; converging on Iceberg via UniForm
Apache Hudi	Uber	Apache top-level project	Strong in streaming/CDC; Hudi 1.0 added Iceberg-format output

The Tabular acquisition was a surrender disguised as a purchase

Read Databricks' own framing of the deal and the strategy is undisguised. The stated goal was to bring "the original creators of Apache Iceberg and Linux Foundation Delta Lake" together and to evolve toward "a single, open, and common standard of interoperability". In the short term that means Delta Lake UniForm, the feature that writes Delta tables with Iceberg-readable metadata so a Delta table can be queried as if it were Iceberg. In the long term it means Databricks stops trying to win the format war and starts trying to own the layer above it — the catalog.

Spending somewhere between $1 billion and $2 billion on a roughly forty-person company is not a feature acquisition. It is the price of not being locked out of the format every other engine had standardised on. Databricks looked at a world where Snowflake, AWS, and Google were all converging on Iceberg and decided it was cheaper to buy a seat at the Iceberg table than to keep defending Delta as a competing standard. That is what losing a standards war looks like when you have enough cash to make the loss look like strategy.

The fight just moved up a layer, to the catalog

Winning the table-format war did not end the war. It relocated it. Once everyone agreed that tables are Iceberg, the open question became who resolves a table name to its current metadata — the catalog. A table format tells you how to read a table once you know where its metadata.json lives; the catalog is the thing that knows where it lives, who is allowed to read it, and how to commit a new snapshot atomically across concurrent writers. And the catalog layer is fragmented in exactly the way the format layer used to be.
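The catalog's core job can be sketched as a compare-and-swap on a single pointer per table. The following is an illustration under stated assumptions, not any real catalog's implementation: InMemoryCatalog, CommitConflict, the table name, and the example-bucket paths are all hypothetical, and a production catalog would back the swap with a database or service rather than a Python lock. It also leaves out the access-control part entirely.

```python
import threading
from typing import Dict, Optional


class CommitConflict(Exception):
    """Raised when another writer moved the pointer first."""


class InMemoryCatalog:
    """Maps a table name to the location of its current metadata file."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}
        self._lock = threading.Lock()

    def load(self, table: str) -> Optional[str]:
        """Resolve a table name to its current metadata location."""
        with self._lock:
            return self._current.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> None:
        """Swap the pointer only if it still equals what the writer read."""
        with self._lock:
            if self._current.get(table) != expected:
                raise CommitConflict(f"{table}: pointer moved, reload and retry")
            self._current[table] = new


ROOT = "s3://example-bucket/events/metadata/"  # hypothetical location
catalog = InMemoryCatalog()
catalog.commit("db.events", None, ROOT + "v1.metadata.json")

# Two writers start from the same base metadata.
base = catalog.load("db.events")
catalog.commit("db.events", base, ROOT + "v2.metadata.json")  # writer A wins
try:
    catalog.commit("db.events", base, ROOT + "v2-b.metadata.json")  # writer B is stale
except CommitConflict as exc:
    print("writer B must retry:", exc)
```

The losing writer reloads the pointer, rebases its changes on the new metadata, and tries again. That optimistic retry loop is what lets an object store with no transactions of its own serve concurrent writers safely.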

Snowflake open-sourced Polaris in mid-2024 under Apache 2.0, implementing the Iceberg REST catalog spec, and it is now Apache Polaris. Days later, Databricks open-sourced Unity Catalog under the Linux Foundation, also speaking the Iceberg REST catalog API. AWS has Glue. The REST catalog specification is the shared part — it is the same trick Iceberg pulled at the format layer, a neutral API that any engine can speak — but the implementations are once again vendor-aligned camps. Snowflake's camp and Databricks' camp each shipped an open-source catalog within the same fortnight, each compliant with the same spec, each hoping their implementation becomes the default. We have watched this movie. The format war's resolution tells you how the catalog war probably ends: with a neutral spec everyone implements and a slow convergence on whichever implementation has the least conflicted governance. And the runtime catalog is only half the picture: beside it sits a second catalog — the governance and discovery layer, where lineage, ownership, and trust actually live — which I take apart in a follow-up on governance catalogs.
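What "a neutral API that any engine can speak" buys in practice is that client code differs only in the base URL it is pointed at. Here is a rough sketch of building a load-table request path. The route shape (v1, an optional prefix, namespaces, tables) and the 0x1F separator for multi-level namespaces reflect my reading of the Iceberg REST catalog OpenAPI definition and should be checked against the spec before use. The helper name load_table_url and both example.com hosts are hypothetical, and no network call is made.

```python
from typing import List, Optional
from urllib.parse import quote


def load_table_url(base_url: str, namespace: List[str], table: str,
                   prefix: Optional[str] = None) -> str:
    """Build the URL an engine would GET to load a table's metadata."""
    # Assumption: namespace levels are joined with the 0x1F unit separator,
    # then percent-encoded so the namespace occupies one path segment.
    ns = quote("\x1f".join(namespace), safe="")
    parts = ["v1"]
    if prefix:
        parts.append(prefix)  # optional catalog-specific prefix, used verbatim
    parts += ["namespaces", ns, "tables", quote(table, safe="")]
    return base_url.rstrip("/") + "/" + "/".join(parts)


# The same client code, pointed at two hypothetical catalog deployments.
for base in ("https://catalog-a.example.com/api",
             "https://catalog-b.example.com/iceberg"):
    print(load_table_url(base, ["analytics"], "events"))
```

Nothing in the function knows which vendor runs the server, which is the property that lets competing implementations coexist behind one spec.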

Why "engine-neutral" is the only durable moat

The deeper lesson, and the one I keep coming back to as a practitioner choosing what to build on, is that in open infrastructure the winning artifact is the one with the least conflicted ownership, not the most features. Hudi's write path was arguably the best of the three; it did not matter. Delta had the most production deployments; it did not matter enough. Iceberg won because a Snowflake architect and a Databricks architect could both adopt it without feeding a competitor, and that property — being safe for everyone to depend on — turned out to be worth more than any technical advantage. The same logic is why Parquet itself is uncontroversial and why the catalog REST spec will probably outlast any single catalog product. Neutral standards accrete adoption; house formats accrete lock-in, and lock-in is the thing every buyer in 2026 has finally learned to price.

The contrarian footnote is that none of this means Iceberg's design is the right one. The metadata-files-in-object-storage approach it standardised is, on a hard look, a transactional catalog reimplemented in JSON and Avro⁸ files on a system that does not support transactions — which is precisely the critique DuckLake makes, and part of why single-node engines like DuckDB are eating workloads the lakehouse was supposed to own. Iceberg won the format war on governance. Whether the format it standardised is the one we will still be using in 2030 is a separate question, and the answer is starting to look like "maybe not."

🔗 Learn more — ⁸ What is Apache Avro (and how is it different from Parquet)?

A short close

Iceberg won the table-format war the way standards usually get won: not by being the best, but by being the one nobody had a reason to distrust. It had a Netflix origin, Apache governance, a clean engine-neutral spec, and a vendor field where every major player needed a format they could all share. The Tabular acquisition was the formal surrender: Databricks paid a reported $1 billion to $2 billion to stop fighting and start interoperating. The war is not over; it just moved up to the catalog, where Polaris and Unity are now replaying the same standoff one layer higher. If you are choosing a table format in 2026, the choice is already made for you, and the reason it is made is governance, not throughput.