How Parquet works: columnar storage explained
Parquet is the file format under almost every modern data lake. It stores data by column instead of by row, which is why analytics queries on it are fast. A look at the actual file anatomy — row groups, column chunks, pages, and the footer that makes predicate pushdown possible.
Almost every modern data lake stores its actual bytes in Apache Parquet files. Table formats like Apache Iceberg, Delta Lake, and Hudi sit on top of Parquet; query engines from Spark to DuckDB read it natively; and "just write it as Parquet" is the default answer for analytical data at rest. Understanding what is inside a Parquet file explains most of why analytical data systems are shaped the way they are.
Row storage vs columnar storage
Start with the core idea, because everything else follows from it. Consider a table:
id | name   | country | amount
1  | Anu    | EE      | 42
2  | Mehmet | TR      | 17
3  | Priya  | IN      | 88
A row-oriented format (a CSV file, a traditional OLTP database) stores it the way you read it, one row at a time:
1,Anu,EE,42 | 2,Mehmet,TR,17 | 3,Priya,IN,88
A columnar format stores each column contiguously instead:
1,2,3 | Anu,Mehmet,Priya | EE,TR,IN | 42,17,88
Same data, transposed. That single decision is the whole game, because analytical and transactional workloads read data in opposite shapes:
- A transaction wants one whole row: "give me everything about order 2." Row storage hands it over in one contiguous read.
- An analytical query wants part of many rows: "the average amount across ten million orders." It needs one column and ignores the other three.
In a columnar file, that query reads only the amount column off disk and never touches name or country. On a wide table with hundreds of columns, of which a query touches five, that is the difference between reading a few percent of the file and reading all of it. This is called column pruning, and it is the first reason Parquet is fast.
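As a minimal sketch of the idea, the three-row table above can be held in both layouts in plain Python. This is a toy model, not Parquet's byte layout, and the names `rows`, `columns`, and the two `average_amount_*` functions are invented for illustration.

```python
# Toy model of the two layouts; real formats store encoded bytes, not lists.
rows = [
    (1, "Anu", "EE", 42),
    (2, "Mehmet", "TR", 17),
    (3, "Priya", "IN", 88),
]
names = ("id", "name", "country", "amount")

# Row-major to column-major is a transpose: one list per column.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}


def average_amount_row_major(rows):
    # Every whole row is visited even though only the last field is used.
    return sum(row[3] for row in rows) / len(rows)


def average_amount_columnar(columns):
    # Only the "amount" list is touched; the other columns are never read.
    amounts = columns["amount"]
    return sum(amounts) / len(amounts)


print(average_amount_row_major(rows))    # 49.0
print(average_amount_columnar(columns))  # 49.0
```

Both functions return the same answer. The difference is how much data each one has to walk through to get there, which on disk translates into bytes read.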
Where Parquet came from
Parquet was released in July 2013 by engineers at Twitter and Cloudera, built on the ideas in Google's Dremel paper — specifically its "record shredding and assembly" algorithm, which is how Parquet flattens nested structures (a struct of arrays of structs) into flat columns and reconstructs them on read. It became an Apache top-level project and, over the following decade, the de-facto on-disk format for analytics.
Inside the file
A Parquet file is not just transposed columns dumped to disk. It has a deliberate nested structure, and the layout is what makes it queryable without reading the whole thing.
Parquet file (magic bytes: PAR1 ... PAR1)
├── Row group 1 (~128 MB of rows)
│   ├── Column chunk: id
│   └── Column chunk: amount
│       ├── Page (encoded + compressed)
│       └── Page
├── Row group 2
└── Footer: schema, row-group locations, per-column stats
Reading top to bottom:
- Row groups. The file is split into horizontal slices of rows, typically around 128 MB each. A row group is the unit of parallelism — different workers can read different row groups at once.
- Column chunks. Within a row group, each column's data is stored together as a column chunk. This is where the "columnar" actually happens: contiguous values of one column.
- Pages. Each column chunk is divided into pages — the smallest unit that gets encoded and compressed.
- The footer. At the very end of the file sits the metadata footer: the schema, the location of every row group and column chunk, and, critically, per-column statistics: min value, max value, and null count for each column in each row group. The file begins and ends with the magic bytes PAR1.
The footer-at-the-end design is deliberate: a reader fetches the small footer first, learns the exact byte ranges of the columns and row groups it cares about, and then issues precise reads for only those. Nothing is read speculatively.
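The footer-first read can be sketched with the standard library alone. In the Parquet format the footer is followed by a 4-byte little-endian footer length and the trailing PAR1 magic, so a reader can find the footer from the last 8 bytes of the file. The sketch below builds a toy file with that framing; the footer bytes are a placeholder (a real footer is Thrift-encoded metadata), and `build_toy_file` and `read_footer` are hypothetical helper names.

```python
import io
import struct

MAGIC = b"PAR1"


def build_toy_file(body: bytes, footer: bytes) -> bytes:
    # Same framing as a Parquet file: magic, data, footer,
    # 4-byte little-endian footer length, magic.
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(f) -> bytes:
    # Step 1: read only the last 8 bytes (footer length + trailing magic).
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != MAGIC:
        raise ValueError("trailing magic bytes missing")
    # Step 2: seek back exactly footer_len bytes and read just the footer.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)


toy = build_toy_file(body=b"\x00" * 1000, footer=b"schema+offsets+stats")
footer = read_footer(io.BytesIO(toy))
print(footer)  # b'schema+offsets+stats'
```

The 1000 bytes of body are never read. Against object storage, the same two steps would become two small ranged reads instead of a full download.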
Why it is fast: pushdown, encoding, compression
Three mechanisms, all enabled by the structure above:
Predicate pushdown. Because the footer carries min/max stats per row group, a query with a filter such as WHERE amount > 1000 can check the stats and skip every row group whose maximum amount is 1000 or less, without decoding a single value in it. The filter is "pushed down" into the scan. (Later 2.x releases of the Parquet format extended this with page-level indexes and Bloom filters for even finer skipping.)
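Row-group skipping can be sketched in a few lines. The `RowGroupStats` class and `scan_amount_greater_than` function are hypothetical stand-ins: the min/max fields play the role of the footer statistics, and the `values` list plays the role of the pages, which would have to be decoded only when a row group cannot be ruled out.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    # Hypothetical stand-in for one row group's footer statistics.
    min_amount: int
    max_amount: int
    values: list  # stands in for encoded pages; read only if not skipped


def scan_amount_greater_than(row_groups, threshold):
    matches, decoded_groups = [], 0
    for rg in row_groups:
        # If even the largest value fails the filter, no row can match.
        if rg.max_amount <= threshold:
            continue
        decoded_groups += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, decoded_groups


row_groups = [
    RowGroupStats(3, 950, [3, 400, 950]),
    RowGroupStats(10, 1000, [10, 1000]),
    RowGroupStats(800, 4200, [800, 1500, 4200]),
]
matches, decoded_groups = scan_amount_greater_than(row_groups, 1000)
print(matches, decoded_groups)  # [1500, 4200] 1
```

Two of the three row groups are rejected from their statistics alone, including the one whose maximum is exactly 1000. Only the third is decoded, and the filter still has to run on its individual values because min/max only proves that a match is possible.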
Encoding. Storing one column together means the values are the same type and often highly repetitive, which unlocks cheap encodings: dictionary encoding (store each distinct value once, then reference it by a small integer), run-length encoding (EE,EE,EE,EE becomes "EE ×4"), and bit-packing. A country column with 50 distinct values across a million rows needs at most 6 bits per value once dictionary-encoded, and far less wherever equal values sit in runs.
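A simplified sketch of dictionary encoding followed by run-length encoding is below. Parquet's real encoding is a hybrid of run-length encoding and bit-packing over dictionary indices; this version keeps only the core idea, and the function names are invented for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    # Each distinct value is stored once; the column becomes small integers.
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(indices):
    # Collapse each run of equal indices into an (index, run_length) pair.
    return [(idx, len(list(run))) for idx, run in groupby(indices)]


def decode(dictionary, runs):
    # Expand the runs and look each index up in the dictionary.
    return [dictionary[idx] for idx, count in runs for _ in range(count)]


country = ["EE", "EE", "EE", "EE", "TR", "TR", "IN", "EE"]
dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)

print(dictionary)  # ['EE', 'TR', 'IN']
print(indices)     # [0, 0, 0, 0, 1, 1, 2, 0]
print(runs)        # [(0, 4), (1, 2), (2, 1), (0, 1)]
```

Eight strings become a three-entry dictionary plus four short pairs, and `decode(dictionary, runs)` reconstructs the original column exactly.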
Compression. On top of encoding, each page is compressed (Snappy, zstd, gzip). Compression works far better on columnar data than row data for the same reason encoding does — adjacent values are similar, so there is more redundancy to squeeze. Columnar layout and good compression ratios are the same phenomenon.
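The claim that similar adjacent values leave more redundancy to squeeze can be checked with the standard library. The sketch below is not Parquet's page compression: it uses `zlib` (DEFLATE, the algorithm behind gzip) on newline-joined text, and the five country codes and the fixed seed are arbitrary choices made for the example. It compresses the same 100,000 values twice, once scattered and once with equal values adjacent.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable
codes = ["EE", "TR", "IN", "DE", "BR"]
values = [random.choice(codes) for _ in range(100_000)]


def compressed_size(column):
    # zlib implements DEFLATE, the algorithm behind gzip.
    return len(zlib.compress("\n".join(column).encode("utf-8")))


raw_size = len("\n".join(values).encode("utf-8"))
shuffled_size = compressed_size(values)           # equal values scattered
clustered_size = compressed_size(sorted(values))  # equal values adjacent

print(raw_size, shuffled_size, clustered_size)
```

The clustered version comes out far smaller than the shuffled one, even though both hold exactly the same values. Exact byte counts depend on the zlib build, so none are quoted here.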
What Parquet is not
Parquet is an immutable, write-once file. There is no in-place update: to "change a row" you rewrite a file. That makes it superb for analytical data at rest and useless as a transactional store — you would not run a bank ledger on raw Parquet.
It is also just a file. A single Parquet file knows its own schema and stats, but it knows nothing about other files. A real table is usually thousands of Parquet files in object storage, and questions like "which files make up this table right now," "commit these new files atomically," "show me the table as it was yesterday," and "rename a column safely" are entirely outside Parquet's scope. Answering those is the job of a table format: the layer that turns a pile of Parquet files into something that behaves like a database table. Apache Iceberg is one such format, and the request-count economics of all those small files is a story of its own.