Storage & table formats

7 articles in this category.

·5 min read
How Apache Iceberg actually works
Iceberg turns thousands of immutable Parquet files in object storage into something that behaves like a database table, with atomic commits, time travel, and schema evolution. It does this with a tree of metadata files and one atomic pointer swap. A walk through the actual mechanics.
#data
#lakehouse
#iceberg
#parquet
#ai-assisted
·3 min read
CSV, TSV, and tabular data formats explained
Tabular data is rows and columns — but how you serialize it to a file matters more than people think. CSV and TSV and their traps, JSON and JSONL, and why columnar binary formats like Parquet exist at all.
#data
#csv
#file-formats
#parquet
#ai-assisted
·4 min read
How Parquet works: columnar storage explained
Parquet is the file format under almost every modern data lake. It stores data by column instead of by row, which is why analytics queries on it are fast. A look at the actual file anatomy — row groups, column chunks, pages, and the footer that makes predicate pushdown possible.
#data
#parquet
#columnar
#storage
#ai-assisted
·3 min read
What is Apache Hudi?
Apache Hudi is an open table format built at Uber for fast upserts and incremental processing on a data lake. It has a strong write path, but it lost the format war on neutrality.
#data
#lakehouse
#table-formats
#ai-assisted
·2 min read
What is Apache Avro (and how is it different from Parquet)?
Avro is a row-oriented binary format with a schema attached, built for moving and evolving records. Parquet is a columnar format built for analytics on data at rest. They solve opposite problems.
#data
#formats
#serialization
#ai-assisted
·2 min read
What is data partitioning?
Partitioning splits a dataset into chunks by a key so queries can skip the parts they don't need. Done right, it's the biggest scan win; done wrong, it produces a pile of tiny files.
#data
#performance
#storage
#ai-assisted
·3 min read
What is Delta Lake (and how does it compare to Iceberg)?
Delta Lake is an open table format that adds ACID transactions, time travel, and schema enforcement to a data lake. A look at how it compares to Apache Iceberg.
#data
#lakehouse
#table-formats
#ai-assisted