What is Apache Kafka?
Kafka is a distributed, append-only log you can publish events to and replay later. It is neither a queue nor a database but a durable commit log that decouples the systems producing data from the ones consuming it. This article covers how the log, partitions, and consumer groups fit together.
Apache Kafka is a distributed, durable, append-only log that systems publish events to and other systems read from — at their own pace, possibly much later. People call it a "message queue" or an "event streaming platform," but the mental model that explains everything is simpler: it is a commit log, replicated across machines, that many readers can scan independently.
That one idea — a shared, replayable log instead of point-to-point messages — is what makes Kafka the backbone of so many data platforms.
The log, not the queue
In a traditional message queue, a message is delivered to a consumer and then deleted. The queue's job is to hand work out and forget it.
Kafka does not delete on read. A topic is an ordered, append-only log. Producers append records to the end; the records stay for a configured retention period (hours, days, or forever) regardless of who has read them. Each consumer tracks its own position — an offset — into the log. Two consumers can read the same topic completely independently, and a consumer can rewind its offset to replay history.
This is why Kafka decouples systems. The producer does not know or care who consumes; it just appends. A new consumer can be added next year and read the whole history from offset zero.
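The read-without-delete behaviour described above can be sketched with a toy in-memory log. This is only an illustration of the mental model, not Kafka's client API: `ToyLog`, `ToyConsumer`, `poll`, and `seek` are hypothetical names chosen for this example, and a real topic lives on brokers rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """An in-memory stand-in for a single-partition topic (not a Kafka API)."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record to the end and return the offset it was written at."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; reading deletes nothing."""
        return self.records[offset:offset + max_records]


@dataclass
class ToyConsumer:
    """Each consumer owns its own position (offset) in the log."""
    log: ToyLog
    offset: int = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only this consumer's position
        return batch

    def seek(self, offset):
        """Move to an arbitrary offset, for example to replay history."""
        self.offset = offset


log = ToyLog()
for event in ["order placed", "payment failed", "order shipped"]:
    log.append(event)

billing = ToyConsumer(log)
first_pass = billing.poll()   # billing reads all three records

audit = ToyConsumer(log)      # a consumer added later starts at offset zero
audit_pass = audit.poll()     # and still sees the full history

billing.seek(1)               # rewind billing and replay from offset 1
replayed = billing.poll()
```

The producer side never references a consumer, and the two consumers never touch each other's offsets, which is the decoupling the section describes.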
Topics, partitions, and brokers
A single log on a single machine would not scale, so a topic is split into partitions. Each partition is an independent ordered log, and partitions are spread across brokers (the Kafka server processes). The partition is both the unit of parallelism and the unit of ordering: records are ordered within a partition, never across partitions.
flowchart TD
P1["Producer A"] --> T
P2["Producer B"] --> T
T["Topic 'orders' — append-only log"] --> PA["Partition 0 (broker 1, replicated to 2,3)"]
T --> PB["Partition 1 (broker 2)"]
T --> PC["Partition 2 (broker 3)"]
PA --> CG["Consumer group 'billing' — offsets tracked per partition"]
PB --> CG
PC --> CG
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class T,PA,PB,PC key
class P1,P2,CG plain
- Partition key. A producer can supply a key (e.g. user_id); Kafka hashes it to pick a partition, so all events for one user land in the same partition and stay ordered relative to each other.
- Replication. Each partition has a leader broker and follower replicas. If the leader fails, an in-sync follower takes over, which is how Kafka survives machine failure without losing the log.
- Consumer groups. Consumers join a named group, and Kafka assigns each partition to exactly one consumer in the group. Add consumers (up to the partition count) and throughput scales horizontally. Two different groups each get the full stream — that is the fan-out.
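As a rough sketch of the partition-key and consumer-group rules in the list above: the snippet below routes keyed events to partitions and then hands partitions to group members. It makes two loud assumptions. CRC32 stands in for the hash Kafka's clients actually use, and the round-robin `assign` function is a simplification of Kafka's real assignment strategies. The names `pick_partition` and `assign` are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in, not Kafka's own hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Round-robin partitions over consumers so each partition has one owner."""
    owners = defaultdict(list)
    for i, partition in enumerate(partitions):
        owners[consumers[i % len(consumers)]].append(partition)
    return dict(owners)


# Route keyed events into per-partition logs.
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add to cart"),
    ("user-3", "login"),
    ("user-1", "checkout"),
]
partitions = defaultdict(list)
for key, value in events:
    partitions[pick_partition(key)].append((key, value))

# The same key always hashes to the same partition, so user-1's events
# stay in the order they were produced.
user1_partition = pick_partition("user-1")
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]

# Two consumers share three partitions.
two = assign(range(NUM_PARTITIONS), ["c1", "c2"])

# With four consumers and three partitions, the fourth gets nothing.
four = assign(range(NUM_PARTITIONS), ["c1", "c2", "c3", "c4"])
```

The `four` case is why adding consumers beyond the partition count does not add throughput: there is no partition left for the extra member to own.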
Where it fits
Kafka sits between the systems that produce data and the systems that use it, so neither has to know about the other:
- Event backbone. Services emit events ("order placed", "payment failed") to Kafka; any number of downstream services react. The systems are decoupled in time and in knowledge of each other.
- Ingestion into analytics. Kafka is a front door to a lakehouse [1] or warehouse. A streaming job, often Apache Spark [2] Structured Streaming, reads topics and writes them as Parquet [3] into Iceberg [4] tables. (Getting data in cheaply and fast is the data industry's real unsolved problem: SaaS loaders like Fivetran bill per row, Airbyte is open source but heavy, and lean OSS loaders such as dlt in Python and Sling in Go are only now starting to fill the gap. A genuinely fast, free, low-overhead loader is still missing.)
- Change data capture [5]. Database changes are streamed into Kafka as an event log, so downstream systems stay in sync without polling the source database.
🔗 Learn more — 1 What is a data lakehouse?
🔗 Learn more — 2 What is Apache Spark?
🔗 Learn more — 3 How Parquet works: columnar storage explained
🔗 Learn more — 4 How Apache Iceberg actually works
🔗 Learn more — 5 What is Change Data Capture (CDC)?
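To make the change-data-capture use case concrete, here is a minimal sketch of a downstream system rebuilding a table by applying change events in log order. The event shape (`op`, `id`, `row`) is invented for this example and is not the format of any particular CDC tool; in practice the events would be read from a Kafka topic rather than a Python list.

```python
def apply_change(table: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["id"]
    if op in ("insert", "update"):
        table[key] = event["row"]
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


# Hypothetical change events, in the order the source database committed them.
change_log = [
    {"op": "insert", "id": 1, "row": {"status": "placed"}},
    {"op": "insert", "id": 2, "row": {"status": "placed"}},
    {"op": "update", "id": 1, "row": {"status": "paid"}},
    {"op": "delete", "id": 2},
]

# Replaying the log from the start reproduces the source table's current state.
replica = {}
for event in change_log:
    apply_change(replica, event)
```

Because the replica is derived purely from the log, a new downstream system can be built later by replaying the same events, without ever querying the source database.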
The JVM tax and lighter alternatives
Kafka is a JVM system. It is memory-hungry, historically dragged a ZooKeeper cluster along (now replaced by KRaft), and is genuinely heavy to self-host well. The easy escape — managed Kafka from Confluent, MSK, or Aiven — is exactly where the cloud markup lives: you pay a standing premium over raw compute for someone to babysit brokers you could run yourself.
If you want Kafka's model without the footprint, the open-source alternatives are real:
- Redpanda [6]: a C++ reimplementation of the Kafka API in a single binary, no JVM and no ZooKeeper, with lower tail latency and a fraction of the operational weight. Existing Kafka clients connect unchanged.
- NATS JetStream: a lightweight Go-based streaming and messaging system for workloads that never needed Kafka's full ecosystem in the first place.
🔗 Learn more — 6 What is Redpanda?
Kafka still wins on ecosystem, connector breadth, and battle-tested durability at large scale — that is why it is the default. But "default" is not "always right," and for most workloads a JVM cluster plus a managed-service bill is more machine, and more money, than the problem calls for.
What Kafka is not
It is not a database: you read it by replaying the log from an offset, not with SELECT ... WHERE. It is not a classic queue: it does not delete on consume or track per-message acknowledgements the way RabbitMQ [7] does. And ordering is only guaranteed within a partition, which trips up everyone at least once: if global order matters, you need a single partition, and you give up parallelism.
🔗 Learn more — 7 What is RabbitMQ (and how is it different from Kafka)?
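The ordering caveat is easy to see in a toy model. The sketch below numbers six events globally, spreads them over two partitions, and then reads the partitions in one possible interleaving. Everything here is illustrative: the lists are plain Python, and the read order is just one of many a real consumer could observe.

```python
NUM_PARTITIONS = 2

# Six events carrying a global sequence number, spread round-robin (no key).
partitions = [[] for _ in range(NUM_PARTITIONS)]
for seq in range(6):
    partitions[seq % NUM_PARTITIONS].append(seq)

# A consumer may drain partitions in any interleaving.
# Here it happens to read all of partition 1 before partition 0.
seen = partitions[1] + partitions[0]


def is_sorted(xs):
    """True if every element is smaller than the one after it."""
    return all(a < b for a, b in zip(xs, xs[1:]))


per_partition_ordered = all(is_sorted(p) for p in partitions)  # True
globally_ordered = is_sorted(seen)                             # False
```

Each partition is internally ordered, yet the combined stream is not. With `NUM_PARTITIONS = 1` the same code yields a globally ordered stream, at the cost of having only one partition to parallelise over.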
The short version: Kafka is a replicated, replayable, append-only log, sharded into partitions. Producers append, consumers read at their own offset, and the log — not a delivery handshake — is the source of truth.