What is data partitioning?

Data partitioning is splitting a large dataset into smaller physical chunks by the value of some key — usually a date, region, or category — so that a query touching only part of the data reads only the relevant chunks. It is one of the highest-leverage performance techniques in data engineering, and one of the easiest to get wrong.

How it speeds up queries

The mechanism is partition pruning. If a table is partitioned by event_date, the data for each day lives in its own directory or file group. A query filtered to WHERE event_date = '2026-06-18' reads one partition and skips the rest entirely — it never opens the other 364 days. On a large table the difference is reading 0.3% of the data instead of 100%.

This is why partitioning by the columns you actually filter on matters so much: pruning only helps when the query's filter lines up with the partition key. Partition by date and filter by date, and you win big. Partition by date but filter by user_id, and pruning does nothing — every partition still has to be scanned.

Horizontal vs vertical

The partitioning above is horizontal: splitting rows into groups. Sharding is horizontal partitioning spread across multiple machines, used to scale a database beyond one node. Vertical partitioning is the other axis — splitting columns into separate stores, which is essentially what columnar formats like Parquet¹ do within a file so a query reads only the columns it needs.

🔗 Learn more — ¹ How Parquet works: columnar storage explained

In a lakehouse², partitioning shows up at the table-format layer. Apache Iceberg³'s "hidden partitioning" is a refinement worth knowing: you partition by a transform of a column (day(event_ts)), queries filter on the real column, and Iceberg figures out which partitions to read — so users never hand-build partition paths and the scheme can even evolve without rewriting old data.

🔗 Learn more — ² What is a data lakehouse?

🔗 Learn more — ³ How Apache Iceberg actually works

The trap: too many partitions

Partitioning has a failure mode that bites everyone eventually: over-partitioning. Split too finely — by hour, or by a high-cardinality key like user_id — and you get the small-files problem. Each partition becomes a handful of tiny files, and now the engine spends more time opening and listing thousands of files than it would have spent scanning. On object storage, where every file is a separate request with real latency, this is especially punishing.

The rule of thumb: partition by a low-cardinality column that matches your most common filter, and aim for partitions large enough to be worth a scan (tens to hundreds of megabytes, not kilobytes). Partitioning is a knife — the right key cuts your scan to a sliver; the wrong key shreds your table into noise.