How data is stored: databases, warehouses, lakes, and lakehouses

Data storage is a ladder of choices, and most of the confusion in data engineering comes from using the wrong rung for the job. At one end is a single table in a database serving your app; at the other is petabytes of files in object storage. Each layer in between exists because the previous one hit a wall. This is the map — read it first, then follow the parts in order.

The progression

Every storage architecture is a response to the limits of the one before it:

A database stores your application's data and answers questions about it safely and fast. It is the foundation. But one database tuned for an app (OLTP) is bad at company-wide analytics, and running heavy reports on it starves the app.
A data warehouse is a second database, built the opposite way (OLAP) — columnar, scan-optimized — to serve analytics without touching the app's database. Fast, but your data lives inside a (often proprietary) system you pay to store and read.
A data lake keeps your data as open files in cheap object storage instead, readable by any engine, with no vendor lock-in. Done right it is organized and governed — but a raw lake has no table semantics (atomic commits, schema, time travel).
A data lakehouse adds those table semantics back on top of the lake's files with a table format like Iceberg — trying to get warehouse guarantees on lake storage.

The landscape at a glance

flowchart TD
    APP["Application"] --> DB["Database (OLTP) — row store, your source of truth"]
    DB --> PIPE["ETL / ELT / CDC — move + reshape"]
    PIPE --> WH["Warehouse (OLAP) — columnar, fast serving"]
    PIPE --> LAKE["Data lake — open files in object storage"]
    LAKE --> TF["Table format (Iceberg) → lakehouse"]
    WH --> BI["Dashboards, reports, ML"]
    TF --> BI

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class DB key
    class APP,PIPE,WH,LAKE,TF,BI plain

System	What it is	Workload	Storage	Reach for it when
Database (OLTP)	App's source of truth	Many small reads/writes	Row-oriented	You are running an application
Data warehouse	Analytics database	Big scans, aggregations	Columnar	You need fast interactive analytics
Data lake	Open files in a bucket	Cheap durable storage	Object storage	You want cheap, open, engine-agnostic storage
Lakehouse	Table format on the lake	Batch analytics on open files	Object storage + metadata	You want table guarantees without a warehouse

My bias, stated up front

I do not think one system should try to do all of this. The strongest setups I have seen keep the layers distinct and use each for what it is best at: the database as the source of truth, a lake (organized, often Iceberg-managed) as cheap durable storage and system of record for analytics, a warehouse (or a fast engine like ClickHouse¹/DuckDB) as the serving layer for interactive queries, and the source — not the lake — for realtime. The fashionable move is to collapse these into one lakehouse and query everything off object storage; I think that trades two specialized systems for one that is adequate at everything and excellent at nothing. The posts below make that case in detail, and also give the honest version of when the lakehouse genuinely is the right call.

🔗 Learn more — ¹ What is ClickHouse?

The reading path

This series builds from the ground up:

What is a database? — the foundation
OLTP vs OLAP — the two opposite workloads
CSV, TSV, and tabular data formats — how rows and columns become files
How Parquet works — the columnar format under modern analytics
What is a data warehouse?
What is a data lake?
What is a data lakehouse?
How Apache Iceberg works — the table format that powers lakes and lakehouses