Updated 18 Jun 2026 · 5 min read

What is a data lakehouse?

A lakehouse puts a database-style table layer on top of cheap object storage — warehouse guarantees (ACID, schema, time travel) over data-lake files. How the layers stack, and why the architecture exists at all.

Data & lakehouse
How data is stored · Part 7 of 8
#data
#lakehouse
#architecture
#ai-assisted

A data lakehouse is an architecture that puts a database-style table layer on top of cheap object storage. It aims to give you the things a data warehouse [1] has (ACID transactions, schemas, fast SQL, time travel) while keeping your data as open files in a bucket like S3, the way a data lake [2] does. The name is the pitch: the reliability of a warehouse, the cost and openness of a lake. Whether merging the two is actually a good idea, or whether you are better off keeping them separate and using each for what it is best at, is genuinely contested. I lean skeptical, and I get to that below.

[1] Learn more: What is a data warehouse?
[2] Learn more: What is a data lake?

To see why anyone wanted this, you have to look at the two things it is trying to merge.

The warehouse and the lake

A data warehouse is a database built for analytics: columnar, fast, SQL, schema-enforced, the thing analysts query. The catch: your data lives inside an (often proprietary) system you pay to store and read.

A data lake is the other side: your data as open files in object storage — cheap, open, no vendor lock-in, and (done right) deliberately organized and governed, not a dump. What a raw lake lacks is table semantics: a folder of Parquet files has no atomic commits, no enforced schema, no "the table as it was yesterday."

The lakehouse is the claim that you do not have to choose. Keep the files in the lake; add the missing table semantics as a thin layer on top.

How the layers stack

A lakehouse is not one product — it is a stack of independent layers, each replaceable.

flowchart TD
    ENG["Query engines: Spark, Trino, DuckDB, Snowflake"] --> CAT["Catalog — resolves table name to current metadata, commits atomically"]
    CAT --> TF["Table format: Apache Iceberg / Delta Lake / Hudi (metadata: schema, snapshots, file lists)"]
    TF --> FILE["Open columnar files: Apache Parquet"]
    FILE --> OBJ["Object storage: S3 / GCS / Azure Blob / MinIO"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class TF,CAT key
    class ENG,FILE,OBJ plain

From the bottom up:

  • Object storage holds the bytes. Cheap, durable, effectively unlimited in capacity, and transactionless across objects. It cannot do "update these thousand files atomically."
  • Open file format. The data itself is written as Apache Parquet — columnar, compressed, and readable by everything. This is what makes the data open: no vendor owns it.
  • Table format. The new layer. Apache Iceberg, Delta Lake, and Hudi are specifications for metadata — a tree of files that records which Parquet files make up a table right now, the table's schema, and every past snapshot. This is what turns a pile of files into a table.
  • Catalog. The one piece that needs more than dumb storage: it maps a table name to its current metadata and performs the atomic pointer-swap that makes a commit a commit.
  • Engines. Spark [3], Trino, DuckDB, even Snowflake read the same tables through the table-format spec. Compute is decoupled from storage: spin up whatever engine fits the query.
[3] Learn more: What is Apache Spark?

The two highlighted layers — the table format and the catalog — are the whole reason "lakehouse" is a word. Everything else already existed.
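
To make those two layers concrete, here is a toy sketch in Python. Nothing in it is Iceberg's, Delta Lake's, or Hudi's actual API: `Snapshot`, `ToyCatalog`, `CommitConflict`, and `append_files` are hypothetical names, the "catalog" is an in-memory dict, and the file paths are made up. It only illustrates the shape of the idea: the table format is immutable metadata listing the files in one version of a table, and the catalog is a single pointer that moves with a compare-and-swap.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Table-format layer: immutable metadata describing one version of a table."""
    snapshot_id: int
    schema: Tuple[str, ...]       # column names only, for brevity
    data_files: Tuple[str, ...]   # paths of the Parquet files in this version
    parent_id: Optional[int] = None


class CommitConflict(Exception):
    """Another writer moved the table pointer first."""


class ToyCatalog:
    """Catalog layer: table name -> current snapshot, swapped atomically."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current: Dict[str, Snapshot] = {}

    def current(self, table: str) -> Optional[Snapshot]:
        return self._current.get(table)

    def commit(self, table: str, expected: Optional[Snapshot], new: Snapshot) -> None:
        # Compare-and-swap: the pointer moves only if it still points at the
        # snapshot this writer started from. Otherwise the writer must retry.
        with self._lock:
            if self._current.get(table) is not expected:
                raise CommitConflict(table)
            self._current[table] = new


def append_files(catalog: ToyCatalog, table: str, files: Iterable[str],
                 schema: Tuple[str, ...] = ("id", "value")) -> Snapshot:
    """Write a new snapshot that adds `files`, then try to commit it."""
    base = catalog.current(table)
    new = Snapshot(
        snapshot_id=base.snapshot_id + 1 if base else 1,
        schema=base.schema if base else schema,
        data_files=(base.data_files if base else ()) + tuple(files),
        parent_id=base.snapshot_id if base else None,
    )
    catalog.commit(table, base, new)
    return new


catalog = ToyCatalog()
v1 = append_files(catalog, "events", ["part-0001.parquet"])
v2 = append_files(catalog, "events", ["part-0002.parquet"])
```

A writer that still holds `v1` and tries to commit on top of it gets a `CommitConflict`, because the pointer has already moved to `v2`. That rejected swap is the whole transaction mechanism in this sketch: the data files themselves are never modified, only the pointer to the metadata that lists them.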

What the architecture buys you

Because the table layer is just metadata over open files:

  • One copy of the data, many engines. A streaming job, a batch ETL, and an analyst's BI tool all read the same Iceberg table. No copying data into a warehouse first.
  • Warehouse guarantees on lake storage. Atomic commits, schema evolution, and time travel all come from the table format, not from a proprietary engine. (Where those guarantees come from mechanically is covered separately, in how Iceberg actually works.)
  • No lock-in at the storage layer. The files are Parquet in your bucket. Switch engines, or run five at once, without migrating data.
  • Cheap storage, separate compute. Pay object-storage prices to keep data; pay for compute only when you query.
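
Time travel is the easiest of these to picture as "just metadata". The sketch below is a simplification under stated assumptions, not any table format's real API: `SnapshotEntry` and `files_as_of` are hypothetical, the timestamps and file names are invented, and the snapshot log is a plain sorted list. Reading the table "as of" a moment means picking the last snapshot committed at or before it and reading the files that snapshot lists.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SnapshotEntry:
    committed_at: int             # epoch seconds (made-up values below)
    data_files: Tuple[str, ...]   # files that made up the table at that commit


def files_as_of(log: List[SnapshotEntry], timestamp: int) -> Tuple[str, ...]:
    """File list of the latest snapshot committed at or before `timestamp`.

    Assumes `log` is sorted by commit time, oldest first.
    """
    commit_times = [entry.committed_at for entry in log]
    index = bisect_right(commit_times, timestamp)
    if index == 0:
        raise LookupError("no snapshot exists at or before that time")
    return log[index - 1].data_files


log = [
    SnapshotEntry(1_000, ("a.parquet",)),
    SnapshotEntry(2_000, ("a.parquet", "b.parquet")),
    # A later rewrite replaced both files with one. The old files are still
    # referenced by the earlier snapshots, so those versions stay readable.
    SnapshotEntry(3_000, ("ab-compacted.parquet",)),
]

earlier = files_as_of(log, 2_500)
latest = files_as_of(log, 9_999)
```

Any engine that can read the snapshot log and the Parquet files gets the same answer, which is why the guarantee belongs to the table layer rather than to one engine. The flip side is that old files cannot be deleted while a retained snapshot still points at them.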

The case for keeping them separate

The lakehouse bet is that one system serving every workload beats two specialized ones. I am not convinced, and in practice I lean the other way. The objection is not the table format — running Iceberg over your lake to get atomic commits, schema evolution, and time travel is good, and you should do it. An organized, Iceberg-managed lake makes an excellent system of record. The objection is making object storage your query-serving layer — reading your tables out of blob on every query. You can have every Iceberg benefit on the lake and still not query from there on the hot path: keep the managed lake as the source of truth, then load or materialize into a fast store for the reads that have to be fast. Lake for storage, warehouse for serving, source for realtime — three roles, not one system straining to do all three.

  • Batch analytics is exactly where querying the lake directly is fine. Batch is throughput-bound, not latency-bound, and a bucket is good at throughput. Run Iceberg here for its table semantics — atomic commits, schema evolution, time travel — and read straight from the lake; no complaint. This is the one job the lakehouse genuinely fits.
  • Interactive analytics is where reading from blob on every query hurts. Object storage is high-latency and priced per request; a planner firing thousands of small reads at it is fighting the medium. Lakehouses paper over this with caching layers — but a cache is an admission that the cold path is too slow. A warehouse with data on local NVMe and a real buffer pool is simply better at the sub-second, high-concurrency queries that "interactive" means.
  • Realtime should not go through the lake at all. Force low-latency data through an Iceberg table and you get tiny files, constant commits, and compaction chasing its own tail: the small-file problem in its purest form. Realtime belongs at the source: straight from the OLTP database, off a Kafka [4] stream, or out of a purpose-built serving store (Druid, Pinot, ClickHouse). Asking one batch table format to also be a realtime serving layer is the OLTP-vs-OLAP split all over again, one level up: one tool trying to be good at two opposite jobs.
[4] Learn more: What is Apache Kafka?
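
A little arithmetic shows the shape of the last two points. Every number here is an assumption picked for illustration, not a measurement of any real object store or engine: the commit intervals, the four files per commit, the 30 ms per-request latency, and the 32 requests in flight are all made up, and the function names are hypothetical.

```python
import math


def files_per_day(commit_interval_s: float, files_per_commit: int) -> int:
    """Data files written per day if every commit adds `files_per_commit` files."""
    commits = math.floor(86_400 / commit_interval_s)
    return commits * files_per_commit


def read_wall_time_ms(requests: int, latency_ms: float, parallelism: int) -> float:
    """Lower bound on wall time when each request pays `latency_ms` and at most
    `parallelism` requests are in flight at once (transfer time is ignored)."""
    rounds = math.ceil(requests / parallelism)
    return rounds * latency_ms


# Illustrative inputs only, not measurements of any real system.
streaming_files = files_per_day(commit_interval_s=5, files_per_commit=4)
batch_files = files_per_day(commit_interval_s=3_600, files_per_commit=4)
interactive_ms = read_wall_time_ms(requests=2_000, latency_ms=30, parallelism=32)
```

With these inputs, the writer committing every five seconds produces 69,120 files a day against 96 for the hourly one, and 2,000 small reads cost about 1.9 seconds of latency before a single byte of transfer is counted. The real figures depend entirely on your store and your engine; the point is only that both costs scale with the number of files and requests, not with the amount of data.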

The lakehouse's honest win is real: it avoids a second copy of the data and a second governance surface to keep in sync. My take is that the duplication is usually worth paying. Keep the lake doing cheap durable batch storage, the warehouse doing fast serving, the source doing realtime — and let each be optimal, rather than merging them into one system that is adequate at all three and excellent at none.

The honest tradeoff

Even on its own terms, a lakehouse is a database disassembled into layers and reassembled on top of a store that was never meant to behave like a database. That is elegant, and it has a real cost: a table is now thousands of small metadata and data files, and a storage layer that charges per request and rewards large sequential reads is a poor fit for the flurry of tiny reads a query plan generates. Lakehouse tables need ongoing compaction and metadata cleanup to stay fast, and the catalog layer is now the part of the stack the industry is fighting over.
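
Compaction, at its simplest, is a packing problem: find the small files and group them into rewrites close to a target size. The planner below is a toy under stated assumptions. `plan_compaction`, its thresholds, and the file names are hypothetical, and real table formats ship their own maintenance procedures with more rules than this (partitions, delete files, sort order).

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], small_below: int,
                    target_size: int) -> List[List[str]]:
    """Group files smaller than `small_below` into batches of at most
    `target_size` bytes. Each batch would be rewritten as one larger file."""
    small = sorted((size, name) for name, size in file_sizes.items()
                   if size < small_below)

    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for size, name in small:
        if current and current_size + size > target_size:
            if len(current) > 1:      # rewriting a single file gains nothing
                groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups


MB = 1024 * 1024
sizes = {
    "f1.parquet": 2 * MB,
    "f2.parquet": 3 * MB,
    "f3.parquet": 4 * MB,
    "f4.parquet": 120 * MB,
    "big.parquet": 512 * MB,
}
plan = plan_compaction(sizes, small_below=128 * MB, target_size=128 * MB)
```

Here the three tiny files end up in one rewrite group, while the 120 MB file and the 512 MB file are left alone. The plan is the cheap part. Executing it means reading and rewriting data, then committing a new snapshot, which is the ongoing cost the paragraph above describes.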

The short version: a lakehouse is object storage + open files + a table-format metadata layer + a catalog + pluggable engines. Get those layers right, on data that actually needs them, and you have a warehouse you do not have to pour your data into. Apply them where a plain lake or a real warehouse would have done the job, and you inherit the downsides of both instead of the benefits. Knowing which case you are in is the whole skill.