What is Apache Avro (and how is it different from Parquet)?

Apache Avro is a row-oriented binary serialization format that carries its schema with it. Every Avro datum is written according to a schema (defined in JSON), and the schema travels with the data or lives in a registry, so any reader can decode the bytes correctly. It is the format you reach for when data is moving — between services, through a message bus, into a pipeline — and the format will change over time.

Row-oriented, schema-first

Avro stores a record at a time: all the fields of row one, then all the fields of row two. That layout is cheap to write and cheap to read whole records from, which is exactly what streaming and messaging want — you produce and consume one event at a time. Because the schema is explicit and separate from the data, Avro encodes values compactly (no field names repeated per record) while still being self-describing through the attached schema.

Its standout strength is schema evolution. Avro defines clear rules for reading data written with an older or newer schema: add a field with a default, remove an optional field, and old and new producers and consumers keep interoperating. This is why Avro became the default payload format in Kafka¹ ecosystems, usually paired with a schema registry that enforces compatible evolution.

🔗 Learn more — ¹ What is Apache Kafka?

How it differs from Parquet

Parquet² is columnar and built for analytics at rest: it stores each column together so a query can read only the columns it needs and skip the rest, with compression and predicate pushdown layered on top. That makes Parquet superb for scanning millions of rows over a few columns — and a poor fit for writing or reading one whole record at a time.

🔗 Learn more — ² How Parquet works: columnar storage explained

Avro is the mirror image. Row-oriented suits write-heavy, record-at-a-time movement; columnar suits read-heavy, column-at-a-time analytics. The rule of thumb:

Avro for data in motion — messages, events, inter-service payloads, ingestion — where you write whole records and the schema evolves.
Parquet for data at rest — analytical tables you scan and aggregate.

ORC is a third format in the same family as Parquet: also columnar, also for analytics, with a similar footer-and-stats design. In practice the columnar choice today is usually Parquet; the row-vs-column decision matters far more than Parquet-vs-ORC.

Where they meet

The two are not rivals so much as different jobs, and real systems use both. Apache Iceberg³ is a good example: the actual table data is Parquet (columnar, for queries), but Iceberg's own metadata — its manifest files — are written in Avro, because that metadata is a stream of evolving records, not an analytical scan. Pick the format by the shape of the access: moving and evolving records, Avro; scanning columns, Parquet.

🔗 Learn more — ³ How Apache Iceberg actually works