How data is stored: databases, warehouses, lakes, and lakehouses
A map of where data actually lives — from one table in a database to data warehouses, lakes, and lakehouses — and which layer fits which job. The overview that ties the whole storage stack together.
In this series
How data is storedData storage is a ladder of choices, and most of the confusion in data engineering comes from using the wrong rung for the job. At one end is a single table in a database serving your app; at the other is petabytes of files in object storage. Each layer in between exists because the previous one hit a wall. This is the map — read it first, then follow the parts in order.
The progression
Every storage architecture is a response to the limits of the one before it:
- A database stores your application's data and answers questions about it safely and fast. It is the foundation. But one database tuned for an app (OLTP) is bad at company-wide analytics, and running heavy reports on it starves the app.
- A data warehouse is a second database, built the opposite way (OLAP) — columnar, scan-optimized — to serve analytics without touching the app's database. Fast, but your data lives inside a (often proprietary) system you pay to store and read.
- A data lake keeps your data as open files in cheap object storage instead, readable by any engine, with no vendor lock-in. Done right it is organized and governed — but a raw lake has no table semantics (atomic commits, schema, time travel).
- A data lakehouse adds those table semantics back on top of the lake's files with a table format like Iceberg — trying to get warehouse guarantees on lake storage.
The landscape at a glance
flowchart TD
APP["Application"] --> DB["Database (OLTP) — row store, your source of truth"]
DB --> PIPE["ETL / ELT / CDC — move + reshape"]
PIPE --> WH["Warehouse (OLAP) — columnar, fast serving"]
PIPE --> LAKE["Data lake — open files in object storage"]
LAKE --> TF["Table format (Iceberg) → lakehouse"]
WH --> BI["Dashboards, reports, ML"]
TF --> BI
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class DB key
class APP,PIPE,WH,LAKE,TF,BI plain
| System | What it is | Workload | Storage | Reach for it when |
|---|---|---|---|---|
| Database (OLTP) | App's source of truth | Many small reads/writes | Row-oriented | You are running an application |
| Data warehouse | Analytics database | Big scans, aggregations | Columnar | You need fast interactive analytics |
| Data lake | Open files in a bucket | Cheap durable storage | Object storage | You want cheap, open, engine-agnostic storage |
| Lakehouse | Table format on the lake | Batch analytics on open files | Object storage + metadata | You want table guarantees without a warehouse |
My bias, stated up front
I do not think one system should try to do all of this. The strongest setups I have seen keep the layers distinct and use each for what it is best at: the database as the source of truth, a lake (organized, often Iceberg-managed) as cheap durable storage and system of record for analytics, a warehouse (or a fast engine like ClickHouse1/DuckDB) as the serving layer for interactive queries, and the source — not the lake — for realtime. The fashionable move is to collapse these into one lakehouse and query everything off object storage; I think that trades two specialized systems for one that is adequate at everything and excellent at nothing. The posts below make that case in detail, and also give the honest version of when the lakehouse genuinely is the right call.
🔗 Learn more — 1 What is ClickHouse?
The reading path
This series builds from the ground up:
- What is a database? — the foundation
- OLTP vs OLAP — the two opposite workloads
- CSV, TSV, and tabular data formats — how rows and columns become files
- How Parquet works — the columnar format under modern analytics
- What is a data warehouse?
- What is a data lake?
- What is a data lakehouse?
- How Apache Iceberg works — the table format that powers lakes and lakehouses