What is a data lake?
A data lake is your data as open files in object storage — cheap, open, readable by any engine. Done right it is organized and governed, not a swamp. How it differs from a warehouse, and where a table format fits in.
A data lake is a simple idea: keep your data as open files in object storage (S3, GCS, Azure Blob, or self-hosted MinIO) instead of inside a database engine. The files are usually Parquet [1] (and some JSON, CSV [2], images, logs), the storage is cheap and effectively unlimited in capacity, and any engine can read it. That openness is the whole point: your data is not trapped inside one vendor's system, and you can point Spark [3], DuckDB, Trino, or a warehouse at the same files.
🔗 Learn more — 1 How Parquet works: columnar storage explained
🔗 Learn more — 2 CSV, TSV, and tabular data formats explained
🔗 Learn more — 3 What is Apache Spark?
Organized, not a dump
The lazy description of a lake — "just throw every file in a bucket and sort it out later" — is wrong, and believing it is how you end up with the thing everyone warns about. A real data lake is deliberately organized:
- Zones. Data flows through layers — raw (exactly as ingested), cleaned (validated, typed, deduplicated), curated (modelled for consumption). Each zone has different guarantees and different audiences.
- Partitioning. Files are laid out by meaningful keys (/year=2026/month=06/) so a query for June reads only June's files instead of scanning the whole dataset.
- Governance. Access control, cataloguing, and naming conventions, so people can find data and are allowed to see only what they should.
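The partitioning idea can be sketched with nothing but the standard library. This is a minimal illustration, not a real lake: a temporary local directory stands in for the bucket, CSV stands in for Parquet, and the `orders` dataset, its columns, and every function name are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def partition_dir(root: Path, year: int, month: int) -> Path:
    """Build a Hive-style partition directory such as year=2026/month=06."""
    return root / f"year={year:04d}" / f"month={month:02d}"


def write_rows(root: Path, year: int, month: int, rows: list) -> Path:
    """Write rows into one data file inside the matching partition."""
    target = partition_dir(root, year, month)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "part-0000.csv"
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def files_for(root: Path, year: int, month: int) -> list:
    """Partition pruning: list only the files under the requested partition."""
    return sorted(partition_dir(root, year, month).glob("*.csv"))


def total_amount(files: list) -> float:
    """Sum the amount column across the given files."""
    total = 0.0
    for path in files:
        with path.open(newline="") as handle:
            total += sum(float(row["amount"]) for row in csv.DictReader(handle))
    return total


def build_demo_lake() -> Path:
    """Create a hypothetical curated/orders dataset with three monthly partitions."""
    root = Path(tempfile.mkdtemp()) / "curated" / "orders"
    write_rows(root, 2026, 5, [{"order_id": 1, "amount": 10.0}])
    write_rows(root, 2026, 6, [{"order_id": 2, "amount": 20.0},
                               {"order_id": 3, "amount": 5.5}])
    write_rows(root, 2026, 7, [{"order_id": 4, "amount": 99.0}])
    return root


if __name__ == "__main__":
    lake = build_demo_lake()
    june_files = files_for(lake, 2026, 6)
    # Only June's file is opened; May and July are never touched.
    print(len(june_files), total_amount(june_files))
```

The June query never lists or opens the other months' directories, which is the whole benefit: the path itself carries the filter. Real engines apply the same pruning from the `key=value` directory names.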
A "data swamp" — unfindable, untrustworthy, undocumented data — is not what a lake is; it is what an unmanaged lake decays into. The discipline is the difference.
What a raw lake lacks: table semantics
Even a well-organized lake has one real gap. A folder of Parquet files is not a table. It has no atomic commits (a half-finished multi-file write is visible to readers), no enforced schema (one file can drift from the others), no "the table as it was yesterday," and no safe way to update or delete specific rows. Object storage has no concept of a transaction that spans multiple objects.
This is the gap a table format fills.
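One part of that gap, schema drift, is easy to reproduce. The sketch below is an illustration under stated assumptions: JSON Lines files in a temporary folder stand in for Parquet files in a bucket, the schema is inferred from the first record only, and the file names and fields are made up.

```python
import json
import tempfile
from pathlib import Path


def file_schema(path: Path) -> dict:
    """Infer {column: type name} from the first record of a JSON Lines file."""
    with path.open() as handle:
        first = json.loads(handle.readline())
    return {name: type(value).__name__ for name, value in first.items()}


def find_drift(folder: Path) -> dict:
    """Return {file name: schema} for files that differ from the first file."""
    files = sorted(folder.glob("*.jsonl"))
    if not files:
        return {}
    expected = file_schema(files[0])
    drifted = {}
    for path in files[1:]:
        schema = file_schema(path)
        if schema != expected:
            drifted[path.name] = schema
    return drifted


def build_demo_folder() -> Path:
    """Create three hypothetical data files; the third one has drifted."""
    folder = Path(tempfile.mkdtemp())
    records = {
        "part-0000.jsonl": {"id": 1, "amount": 9.5},
        "part-0001.jsonl": {"id": 2, "amount": 3.0},
        # A later writer changed id to a string and added a column.
        "part-0002.jsonl": {"id": "3", "amount": 4.0, "currency": "EUR"},
    }
    for name, record in records.items():
        (folder / name).write_text(json.dumps(record) + "\n")
    return folder


if __name__ == "__main__":
    print(find_drift(build_demo_folder()))
```

Nothing in the folder prevented the third file from being written. The drift is only discovered afterwards, by a check like this one or by a query that fails.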
flowchart TD
SRC["Sources: app DBs, streams, exports"] --> ZONES["Object storage, organized into zones + partitions"]
ZONES --> RAW["raw/"]
ZONES --> CUR["curated/ (Parquet)"]
CUR --> TF["Table format (Iceberg) — schema, snapshots, atomic commits"]
TF --> ENG["Any engine: Spark, Trino, DuckDB, a warehouse"]
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class TF key
class SRC,ZONES,RAW,CUR,ENG plain
Layer Apache Iceberg [4] (or Delta Lake, or Hudi) over the curated files and the pile of Parquet starts behaving like a database table, with atomic commits, schema evolution, and time travel, while staying open files in your bucket. A lake with a table format is the foundation of a lakehouse.
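The core trick behind those guarantees can be shown in miniature. The following is a toy sketch of the general idea (immutable data files plus versioned snapshot metadata behind a single pointer), not Iceberg's actual metadata layout or API. `TinyTable` and all its methods are hypothetical, and the atomic pointer swap uses `os.replace` on a local filesystem; on a real object store that step is typically handled by a catalog or another atomic mechanism.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class TinyTable:
    """Toy table format: data files plus snapshot metadata behind one pointer."""

    def __init__(self, root: Path) -> None:
        self.meta = root / "metadata"
        self.data = root / "data"
        self.meta.mkdir(parents=True, exist_ok=True)
        self.data.mkdir(parents=True, exist_ok=True)
        self.pointer = self.meta / "current.json"

    def current_version(self) -> int:
        """Version 0 means the table has no committed snapshot yet."""
        if not self.pointer.exists():
            return 0
        return json.loads(self.pointer.read_text())["version"]

    def snapshot_files(self, version: Optional[int] = None) -> list:
        """List the data files that belong to a snapshot (default: current)."""
        version = self.current_version() if version is None else version
        if version == 0:
            return []
        snapshot = self.meta / f"snapshot-{version}.json"
        return json.loads(snapshot.read_text())["files"]

    def stage(self, name: str, rows: list) -> str:
        """Write a data file. Readers cannot see it until it is committed."""
        (self.data / name).write_text("\n".join(json.dumps(r) for r in rows))
        return name

    def commit(self, new_files: list) -> int:
        """Publish a new snapshot, then swap the pointer in one atomic step."""
        version = self.current_version() + 1
        files = self.snapshot_files() + new_files
        snapshot = self.meta / f"snapshot-{version}.json"
        snapshot.write_text(json.dumps({"files": files}))
        tmp = self.meta / "current.json.tmp"
        tmp.write_text(json.dumps({"version": version}))
        os.replace(tmp, self.pointer)  # readers see the old or new version, never a mix
        return version

    def read(self, version: Optional[int] = None) -> list:
        """Read the rows of a snapshot; pass an older version to time travel."""
        rows = []
        for name in self.snapshot_files(version):
            text = (self.data / name).read_text()
            rows.extend(json.loads(line) for line in text.splitlines() if line)
        return rows


if __name__ == "__main__":
    table = TinyTable(Path(tempfile.mkdtemp()) / "orders")
    v1 = table.commit([table.stage("a.jsonl", [{"id": 1}, {"id": 2}])])
    staged = table.stage("b.jsonl", [{"id": 3}])
    print(len(table.read()))            # the staged file is not visible yet
    table.commit([staged])
    print(len(table.read()), len(table.read(version=v1)))
```

Readers only ever follow the pointer, so a half-written batch sitting in `data/` is invisible until its commit lands, and old snapshots stay readable because data files are never modified in place. This sketch ignores concurrent writers, which real table formats handle with conflict detection at commit time.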
🔗 Learn more — 4 How Apache Iceberg actually works
Where a lake belongs
A lake done right is excellent at exactly two things: being cheap, durable storage at any scale, and being the open system of record for analytical data — one copy, many engines, no lock-in. Run Iceberg on it and you also get table guarantees. For batch analytics, querying the lake directly is fine: batch wants throughput, and a bucket delivers throughput.
A lake is not a fast interactive query engine. Object storage is high-latency and priced per request, so serving sub-second, high-concurrency dashboards by reading files out of the bucket on every query fights the medium. The clean pattern is to treat the lake as storage and the system of record, then load or materialize the slices people query into a fast warehouse (or ClickHouse/DuckDB) for serving, rather than making the lake pull double duty as the query layer.
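The "materialize, then serve" pattern can be sketched as a small batch job. This is an illustration only: an in-memory SQLite database stands in for the serving warehouse, CSV files in a temporary directory stand in for Parquet in a bucket, and the `daily_revenue` table, the `sales` dataset, and all function names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def build_sales_lake() -> Path:
    """Create a hypothetical partitioned sales dataset on local disk."""
    root = Path(tempfile.mkdtemp()) / "curated" / "sales"
    partitions = {
        "year=2026/month=06": [("2026-06-01", 10.0), ("2026-06-01", 2.5),
                               ("2026-06-02", 4.0)],
        "year=2026/month=07": [("2026-07-01", 1.0)],
    }
    for partition, rows in partitions.items():
        target = root / partition
        target.mkdir(parents=True)
        with (target / "part-0000.csv").open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["day", "amount"])
            writer.writerows(rows)
    return root


def materialize_daily_revenue(lake_root: Path, conn: sqlite3.Connection) -> int:
    """Batch job: scan the lake once, aggregate, and reload the serving table."""
    totals = {}
    for path in sorted(lake_root.rglob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                totals[row["day"]] = totals.get(row["day"], 0.0) + float(row["amount"])
    with conn:  # one transaction: dashboards never see a half-loaded table
        conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT PRIMARY KEY, revenue REAL)"
        )
        conn.execute("DELETE FROM daily_revenue")
        conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", sorted(totals.items()))
    return len(totals)


def dashboard_query(conn: sqlite3.Connection, day: str) -> float:
    """Serving path: an indexed lookup that never touches the lake files."""
    row = conn.execute(
        "SELECT revenue FROM daily_revenue WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else 0.0


if __name__ == "__main__":
    serving = sqlite3.connect(":memory:")
    materialize_daily_revenue(build_sales_lake(), serving)
    print(dashboard_query(serving, "2026-06-01"))
```

The lake is read once per batch run, where throughput matters and latency does not. Every dashboard request after that hits the serving table, so request volume on the bucket does not grow with the number of viewers.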
The short version: a data lake is open files in organized object storage — cheap, engine-agnostic, and (with a table format) transactional. Keep it as your durable system of record, not as the thing answering your dashboards directly.