What is a data lake?

A data lake is a simple idea: keep your data as open files in object storage (S3, GCS, Azure Blob, or self-hosted MinIO) instead of inside a database engine. The files are usually Parquet¹ (plus some JSON, CSV², images, and logs), the storage is cheap and effectively unlimited, and any engine that understands the formats can read them. That openness is the whole point: your data is not trapped inside one vendor's system, and you can point Spark³, DuckDB, Trino⁴, or a warehouse at the same files.

🔗 Learn more — ¹ How Parquet works: columnar storage explained

🔗 Learn more — ² CSV, TSV, and tabular data formats explained

🔗 Learn more — ³ What is Apache Spark?

🔗 Learn more — ⁴ What is a query engine (Trino, Presto, and friends)?

Organized, not a dump

The lazy description of a lake — "just throw every file in a bucket and sort it out later" — is wrong, and believing it is how you end up with the thing everyone warns about. A real data lake is deliberately organized:

Zones. Data flows through layers — raw (exactly as ingested), cleaned (validated, typed, deduplicated), curated (modelled for consumption). Each zone has different guarantees and different audiences.
Partitioning. Files are laid out by meaningful keys (/year=2026/month=06/) so a query for June reads only June's files instead of scanning the whole dataset.
Governance. Access control, cataloguing, and naming conventions, so people can find data and are allowed to see only what they should.
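
To make the partitioning point concrete, here is a minimal sketch of partition pruning over Hive-style key=value paths. The object keys and the `partition_values` and `prune` helpers are hypothetical and exist only for illustration; real engines apply the same idea against the bucket listing or a catalog.

```python
from pathlib import PurePosixPath

# Hypothetical object keys laid out with Hive-style key=value partitions.
OBJECT_KEYS = [
    "curated/events/year=2026/month=05/part-0000.parquet",
    "curated/events/year=2026/month=06/part-0000.parquet",
    "curated/events/year=2026/month=06/part-0001.parquet",
    "curated/events/year=2025/month=06/part-0000.parquet",
]


def partition_values(key: str) -> dict[str, str]:
    """Extract {partition_key: value} from the key=value path segments."""
    values = {}
    for segment in PurePosixPath(key).parts:
        name, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain '='
            values[name] = value
    return values


def prune(keys: list[str], **wanted: str) -> list[str]:
    """Keep only the files whose partition values match every filter."""
    return [
        key for key in keys
        if all(partition_values(key).get(k) == v for k, v in wanted.items())
    ]


# A query for June 2026 only needs to open two of the four files.
june_2026 = prune(OBJECT_KEYS, year="2026", month="06")
print(june_2026)
```

The filter is decided from the path alone, before any file is opened, which is why a sensible partition key saves both time and per-request cost.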

A "data swamp" — unfindable, untrustworthy, undocumented data — is not what a lake is; it is what an unmanaged lake decays into. The discipline is the difference.

What a raw lake lacks: table semantics

Even a well-organized lake has one real gap. A folder of Parquet files is not a table. It has no atomic commits (a half-finished multi-file write is visible to readers), no enforced schema (one file can drift from the others), no "the table as it was yesterday," and no safe way to update or delete specific rows. Object storage can write a single object atomically, but it has no concept of a transaction that spans several objects.

This is the gap a table format fills.

flowchart TD
    SRC["Sources: app DBs, streams, exports"] --> ZONES["Object storage, organized into zones + partitions"]
    ZONES --> RAW["raw/"]
    ZONES --> CUR["curated/ (Parquet)"]
    CUR --> TF["Table format (Iceberg) — schema, snapshots, atomic commits"]
    TF --> ENG["Any engine: Spark, Trino, DuckDB, a warehouse"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class TF key
    class SRC,ZONES,RAW,CUR,ENG plain

Layer Apache Iceberg⁵ (or Delta Lake⁶, or Hudi⁷) over the curated files and the pile of Parquet starts behaving like a database table — atomic commits, schema evolution, time travel — while staying open files in your bucket. A lake with a table format is the foundation of a lakehouse.
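
The core trick can be sketched in a few lines: data files are written first, and a commit is a single atomic swap of a pointer to a new snapshot that lists them. This is a simplified illustration on a local directory, not Iceberg's actual metadata layout. The `TinyTable` class, the file names, and the JSON structure are all hypothetical, and `os.replace` stands in for whatever atomic swap a real catalog provides.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class TinyTable:
    """Toy table format: immutable data files plus one 'current' pointer."""

    def __init__(self, root: Path):
        self.root = root
        self.pointer = root / "current.json"
        root.mkdir(parents=True, exist_ok=True)
        if not self.pointer.exists():
            self._publish({"snapshot_id": 0, "files": [], "history": []})

    def _publish(self, snapshot: dict) -> None:
        # Write to a temp file, then rename: readers see old or new, never half.
        tmp = self.pointer.with_suffix(".tmp")
        tmp.write_text(json.dumps(snapshot))
        os.replace(tmp, self.pointer)

    def current(self) -> dict:
        return json.loads(self.pointer.read_text())

    def write_file(self, name: str, rows: list[dict]) -> str:
        # Uncommitted data: present in storage, invisible to table readers.
        (self.root / name).write_text(json.dumps(rows))
        return name

    def commit(self, new_files: list[str]) -> int:
        old = self.current()
        new = {
            "snapshot_id": old["snapshot_id"] + 1,
            "files": old["files"] + new_files,
            # Keep earlier file lists so old snapshots stay readable.
            "history": old["history"]
            + [{"snapshot_id": old["snapshot_id"], "files": old["files"]}],
        }
        self._publish(new)
        return new["snapshot_id"]

    def scan(self, snapshot_id: Optional[int] = None) -> list[dict]:
        snap = self.current()
        files = snap["files"]
        if snapshot_id is not None and snapshot_id != snap["snapshot_id"]:
            # "Time travel": look up the file list of an earlier snapshot.
            files = next(
                h["files"] for h in snap["history"]
                if h["snapshot_id"] == snapshot_id
            )
        rows = []
        for name in files:
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


table = TinyTable(Path(tempfile.mkdtemp()) / "orders")
f1 = table.write_file("data-1.json", [{"id": 1}, {"id": 2}])
visible_before_commit = table.scan()  # the write exists but is not committed
first = table.commit([f1])
f2 = table.write_file("data-2.json", [{"id": 3}])
second = table.commit([f2])
print(len(visible_before_commit), len(table.scan()), len(table.scan(first)))
```

This toy ignores concurrent writers. Real table formats rely on a catalog or a storage-level conditional write so that two commits cannot silently overwrite each other.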

🔗 Learn more — ⁵ How Apache Iceberg actually works

🔗 Learn more — ⁶ What is Delta Lake (and how does it compare to Iceberg)?

🔗 Learn more — ⁷ What is Apache Hudi?

Where a lake belongs

A lake done right is excellent at exactly two things: being cheap, durable storage at any scale, and being the open system of record for analytical data — one copy, many engines, no lock-in. Run Iceberg on it and you also get table guarantees. For batch analytics, querying the lake directly is fine: batch wants throughput, and a bucket delivers throughput.

A lake is not, however, a fast interactive query engine. Object storage is high-latency and priced per request, so serving sub-second, high-concurrency dashboards by reading files out of the bucket on every query fights the medium. The clean pattern is to treat the lake as storage and the system of record, then load or materialize the slices people query into a fast warehouse (or ClickHouse⁸/DuckDB) for serving, rather than making the lake pull double duty as the query layer.
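
As a rough sketch of that pattern, the snippet below reads one partition of lake files, aggregates it, and loads the result into a separate serving store that a dashboard would query. To stay runnable with only the standard library, it uses CSV files in a temp directory as a stand-in for Parquet in a bucket and SQLite as a stand-in for the warehouse. The paths, column names, `materialize_month` function, and `revenue_by_region` table are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path

# Stand-in "lake": partitioned files on local disk (assumed layout).
lake = Path(tempfile.mkdtemp()) / "curated" / "orders"
sample = {
    "year=2026/month=05": [("eu", 10.0), ("us", 20.0)],
    "year=2026/month=06": [("eu", 5.0), ("eu", 7.5), ("us", 30.0)],
}
for partition, rows in sample.items():
    part_dir = lake / partition
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["region", "amount"])
        writer.writerows(rows)


def materialize_month(lake_root: Path, year: str, month: str,
                      serving: sqlite3.Connection) -> None:
    """Batch job: scan one partition, aggregate, refresh the serving table."""
    totals = defaultdict(float)
    partition_dir = lake_root / f"year={year}" / f"month={month}"
    for path in sorted(partition_dir.glob("*.csv")):
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                totals[row["region"]] += float(row["amount"])
    with serving:  # delete + insert commit together, so no partial load is visible
        serving.execute(
            "CREATE TABLE IF NOT EXISTS revenue_by_region "
            "(year TEXT, month TEXT, region TEXT, revenue REAL)"
        )
        serving.execute(
            "DELETE FROM revenue_by_region WHERE year = ? AND month = ?",
            (year, month),
        )
        serving.executemany(
            "INSERT INTO revenue_by_region VALUES (?, ?, ?, ?)",
            [(year, month, region, total) for region, total in totals.items()],
        )


serving_db = sqlite3.connect(":memory:")
materialize_month(lake, "2026", "06", serving_db)

# Dashboard query: hits the small serving table, not the lake files.
june = serving_db.execute(
    "SELECT region, revenue FROM revenue_by_region "
    "WHERE year = '2026' AND month = '06' ORDER BY region"
).fetchall()
print(june)
```

The batch job pays the object-storage latency once per refresh, while every dashboard query reads the small precomputed table. Re-running the job for the same month replaces its rows instead of duplicating them.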

🔗 Learn more — ⁸ What is ClickHouse?

The short version: a data lake is open files in organized object storage — cheap, engine-agnostic, and (with a table format) transactional. Keep it as your durable system of record, not as the thing answering your dashboards directly.