What is Apache Pulsar?

Apache Pulsar is a distributed messaging and streaming platform that separates the serving layer (brokers) from the storage layer (Apache BookKeeper), with native multi-tenancy, geo-replication, and tiered storage built in from the start. If you have used Apache Kafka¹, the mental model is familiar — producers write to topics, consumers read from them — but Pulsar makes one architectural choice that ripples through everything else: the machines that serve traffic do not store the data.

🔗 Learn more — ¹ What is Apache Kafka?

The broker/BookKeeper split

In a stateful streaming system, a broker normally does two jobs at once: it accepts and serves client traffic, and it holds the data on its own local disks. Pulsar splits those jobs in two. Brokers handle producers, consumers, and topic routing but keep no permanent data of their own — they are stateless. The actual messages live in Apache BookKeeper, a separate distributed log store whose nodes are called bookies. ZooKeeper (or, in newer versions, an alternative metadata store) tracks coordination and metadata for the cluster.

This separation is the whole point, and it buys two concrete things. First, you scale serving and storage independently: if the bottleneck is client connections or message throughput, add brokers; if it is capacity for retained data, add bookies. You are not forced to over-provision one to get more of the other. Second, rebalancing is fast. Because a broker owns no data, moving a topic from a busy broker to an idle one is just a reassignment of ownership. There is no terabyte-scale copy to wait on, as there is in systems where a node that stores its own partitions joins or leaves the cluster.
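The rebalancing argument can be made concrete with a toy model. The sketch below is not Pulsar code: `ToyCluster`, `move_topic`, and `coupled_move_cost` are hypothetical names, and the byte counts are made up. It only illustrates that when ownership and storage are separate records, moving a topic changes the first and leaves the second alone.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model: brokers hold ownership metadata, bookies hold the bytes."""

    # topic -> broker currently serving it (small metadata record)
    owner: dict = field(default_factory=dict)
    # topic -> bytes retained in the storage layer (never touched by a move)
    stored_bytes: dict = field(default_factory=dict)

    def move_topic(self, topic: str, new_broker: str) -> int:
        """Reassign a topic and return the number of data bytes copied."""
        if topic not in self.owner:
            raise KeyError(topic)
        self.owner[topic] = new_broker
        return 0  # only the ownership record changed


def coupled_move_cost(stored_bytes: int) -> int:
    """For contrast: a node that stores its own partitions must copy them."""
    return stored_bytes


cluster = ToyCluster(
    owner={"orders": "broker-1", "clicks": "broker-1"},
    stored_bytes={"orders": 2 * 10**12, "clicks": 5 * 10**11},
)
copied = cluster.move_topic("orders", "broker-2")
print(cluster.owner["orders"], copied)
print(coupled_move_cost(cluster.stored_bytes["orders"]))
```

In a real cluster the handover also involves closing and reopening the topic on the new broker, so it is fast rather than free; the model only captures the absence of a data copy.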

```mermaid
flowchart TD
    P["Producers / consumers"] --> B["Brokers: stateless serving layer"]
    B --> BK["Apache BookKeeper: distributed storage (bookies)"]
    B --> ZK["ZooKeeper / metadata store: coordination"]
    BK --> TS["Tiered storage: offload to object storage (S3/GCS)"]

    classDef serve stroke:#a3be8c,stroke-width:2.5px
    classDef store stroke:#88c0d0,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    %% color = role: green serving, blue storage, grey clients/metadata
    class B serve
    class BK,TS store
    class P,ZK plain
```

Topics, subscriptions, queue and stream in one

Pulsar's other distinctive trait is that a single topic can behave like a message queue or like a streaming log, depending on the subscription type a consumer chooses. With an exclusive or failover subscription, messages are delivered in order to one active consumer: log/streaming semantics, like reading a partition from start to end. With a shared subscription, messages are distributed round-robin across many consumers, which is classic competing-consumer queue behavior, where you scale workers to drain a backlog faster. A key-shared subscription also spreads load across consumers but routes by message key, so per-key order is preserved.
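The difference between these subscription types is easiest to see by dispatching the same messages three ways. The following is a simplified simulation, not the Pulsar client or broker dispatcher: the `dispatch` function and its mode labels are hypothetical, and the CRC32-modulo routing for key-shared is a stand-in assumption for Pulsar's own key-hashing scheme.

```python
from collections import defaultdict
from itertools import cycle
from zlib import crc32


def dispatch(messages, consumers, mode):
    """Return {consumer: [payloads]} for a toy subscription `mode`.

    messages  -- list of (key, payload) tuples in publish order
    consumers -- list of consumer names attached to the subscription
    mode      -- "exclusive", "shared", or "key_shared" (toy labels)
    """
    delivered = defaultdict(list)
    if mode == "exclusive":
        # One active consumer receives everything, in publish order.
        for _, payload in messages:
            delivered[consumers[0]].append(payload)
    elif mode == "shared":
        # Round-robin: load is spread, ordering across consumers is lost.
        for (_, payload), consumer in zip(messages, cycle(consumers)):
            delivered[consumer].append(payload)
    elif mode == "key_shared":
        # A stable hash of the key picks the consumer, so each key's
        # messages stay on one consumer and keep their relative order.
        for key, payload in messages:
            idx = crc32(key.encode()) % len(consumers)
            delivered[consumers[idx]].append(payload)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dict(delivered)


msgs = [("a", "a1"), ("a", "a2"), ("b", "b1"), ("a", "a3")]
for mode in ("exclusive", "shared", "key_shared"):
    print(mode, dispatch(msgs, ["c1", "c2"], mode))
```

Under `shared`, the messages for key `a` end up split across both consumers; under `key_shared`, they all land on one consumer in publish order, whichever consumer the hash selects.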

This matters because most shops otherwise run two systems: a message queue for work distribution and a separate stream processing² pipeline for ordered event logs. Pulsar offers to be both, on the same topics, which is a genuine simplification when your workload spans both shapes. The catch worth naming up front: doing two jobs in one platform means the operational surface is correspondingly larger.

🔗 Learn more — ² Batch vs stream processing

Tiered storage and the operational reality

Because storage is its own layer, Pulsar can offload older log segments from BookKeeper to cheaper object storage — S3, GCS, or compatible stores — while topics stay readable as if the data were local. This is tiered storage, and it lets you keep effectively unbounded retention without paying for hot disk on every bookie. For event-sourcing or long-replay use cases, it is a real advantage over keeping everything on local SSDs.
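As a rough illustration of the offload idea, the sketch below moves the oldest closed segments to an object-store tier once the hot tier exceeds a size threshold, while reads keep addressing segments by id. It is a toy model under stated assumptions: `Segment`, `offload_oldest`, `read_segment`, and the byte values are all hypothetical and do not mirror Pulsar's internal classes or configuration names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    size_bytes: int
    tier: str = "bookkeeper"  # becomes "object-store" once offloaded


def offload_oldest(segments, hot_threshold_bytes):
    """Offload oldest segments until hot data fits under the threshold.

    `segments` is ordered oldest-first. The last segment is treated as
    still open for writes and is never offloaded. Returns the number of
    bytes left on the hot tier.
    """
    hot = sum(s.size_bytes for s in segments if s.tier == "bookkeeper")
    for seg in segments[:-1]:
        if hot <= hot_threshold_bytes:
            break
        if seg.tier == "bookkeeper":
            seg.tier = "object-store"
            hot -= seg.size_bytes
    return hot


def read_segment(segments, segment_id):
    """Readers address a segment by id; the tier only decides the source."""
    seg = next(s for s in segments if s.segment_id == segment_id)
    return f"segment {seg.segment_id} from {seg.tier}"


log = [Segment(i, 100) for i in range(4)]
remaining_hot = offload_oldest(log, hot_threshold_bytes=250)
print(remaining_hot, [s.tier for s in log])
print(read_segment(log, 0))
```

The point of the model is that the reader's call does not change after offload; only the tier backing the segment does.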

Now the honest part. That same architecture means more moving parts to operate. A minimal serious Pulsar deployment means running and tuning brokers, bookies, and a metadata store, three distributed systems that each have their own failure modes, versus a single-binary alternative like Redpanda³, which deliberately collapses the stack. BookKeeper in particular has its own tuning surface (ensemble size, write and ack quorums) that a Kafka operator does not deal with in the same form. The flexibility of independent scaling is exactly the thing that makes the cluster harder to reason about.

🔗 Learn more — ³ What is Redpanda?
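To give a feel for the BookKeeper knobs mentioned above, here is a small sketch of how ensemble size, write quorum, and ack quorum relate. It assumes round-robin striping of entries across the ensemble, which is how BookKeeper's default placement is usually described; the function names and the bookie names are hypothetical, and this is an illustration rather than BookKeeper's implementation.

```python
def check_quorums(ensemble_size, write_quorum, ack_quorum):
    """Validate the usual ordering: ensemble >= write quorum >= ack quorum."""
    if not 1 <= ack_quorum <= write_quorum <= ensemble_size:
        raise ValueError(
            "need 1 <= ack_quorum <= write_quorum <= ensemble_size"
        )


def write_set(entry_id, ensemble, write_quorum):
    """Bookies that receive `entry_id` under round-robin striping."""
    size = len(ensemble)
    if not 1 <= write_quorum <= size:
        raise ValueError("need 1 <= write_quorum <= ensemble size")
    start = entry_id % size
    return [ensemble[(start + i) % size] for i in range(write_quorum)]


def tolerated_bookie_losses(ack_quorum):
    """An acknowledged entry sits on at least `ack_quorum` bookies,
    so it survives the loss of up to `ack_quorum - 1` of them."""
    return ack_quorum - 1


bookies = ["bookie-1", "bookie-2", "bookie-3"]
check_quorums(ensemble_size=3, write_quorum=2, ack_quorum=2)
for entry in range(3):
    print(entry, write_set(entry, bookies, write_quorum=2))
print(tolerated_bookie_losses(ack_quorum=2))
```

With an ensemble larger than the write quorum, consecutive entries land on different bookie subsets, which spreads write load; raising the ack quorum trades write latency for durability. Choosing these three numbers per workload is part of the tuning surface described above.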

The other practical gap is ecosystem. Apache Kafka has been the default long enough that connectors, client libraries, managed offerings, monitoring integrations, and hiring pools are all deeper around it. Pulsar's tooling is capable and its Kafka-protocol compatibility layer narrows the gap, but if you choose Pulsar you should expect to do more integration work yourself and find fewer copy-paste answers.

Where it fits

Pulsar is a strong fit when you genuinely need its differentiators: hard multi-tenancy across many teams on one cluster, queue and stream semantics in the same system, geo-replication across regions, or long retention via tiered storage. If your needs are a straightforward ordered event log with a big ecosystem behind it, Kafka is the lower-friction default; if you want streaming with the fewest moving parts, Redpanda is worth a look. Pulsar earns its complexity precisely when the features that justify the extra layers are the ones you actually use.
