What is Apache Pulsar?

Apache Pulsar is an open-source pub-sub messaging system. It was originally developed by Yahoo! and contributed to the Apache Software Foundation (ASF) in 2016. It is highly scalable, so it can handle demanding data movement use cases. In these days of exploding data, Apache Pulsar is emerging as a go-to platform for businesses that need to move their data efficiently.

Pulsar combines a growing feature set into a single platform that meets a wide range of use cases, all supported by the ASF.

Pub-sub, streaming, and queuing

Typically a messaging solution is good at either streaming, where you handle a high volume of messages in real time with simple pub-sub messaging patterns, or queuing, where you handle a variety of complex message exchange patterns, such as competing consumers.

Pulsar is adept at handling high-volume pub-sub messaging as well as the more complex messaging patterns typical in a message queuing system. And these complex messaging patterns are handled by Pulsar, not left to the software developer to code around using a complex application built on top of a simple client.

Retention and message replay

In a traditional messaging system, the system keeps track of whether a particular message has been consumed. Once the consuming client is done with the message, it acknowledges the message, which tells the messaging system that the message is no longer needed. A traditional messaging system will then delete the message from its persistent storage. After all, the message is no longer needed.

In a perfect world, that may be true. But in the real world, things go wrong: applications crash, availability zones go down, and being able to get that message back may be important for rebuilding your application state. That’s why message retention is important. If something goes wrong, Pulsar can replay the messages that have been published to a topic, even if they have already been consumed, because you never know when you might need a message again.

The ability to retain messages also enables event-driven application architectures such as event sourcing, where it is important to record each change of state as an event in the order it occurred.
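As a rough illustration of the difference between deleting on acknowledgment and retaining messages, here is a minimal toy model in Python. It is not Pulsar's API: the `RetainedTopic` class, its methods, and the sample events are all hypothetical names made up for this sketch. The only point is that acknowledging a message moves a cursor instead of deleting data, so the log can be read again later.

```python
class RetainedTopic:
    """Toy append-only log: acknowledging moves a cursor, it never deletes."""

    def __init__(self):
        self.log = []      # every published message, in publish order
        self.cursor = 0    # index of the next unacknowledged message

    def publish(self, message):
        self.log.append(message)

    def receive(self):
        """Return the next unacknowledged message, or None if caught up."""
        return self.log[self.cursor] if self.cursor < len(self.log) else None

    def acknowledge(self):
        self.cursor += 1

    def replay_from(self, position=0):
        """Rewind the cursor so already-consumed messages are delivered again."""
        self.cursor = position


topic = RetainedTopic()
for event in ["deposit:100", "withdraw:30", "deposit:5"]:
    topic.publish(event)

# First pass: consume and acknowledge everything.
consumed = []
while (msg := topic.receive()) is not None:
    consumed.append(msg)
    topic.acknowledge()

# After a crash, rewind and rebuild state from the retained events.
topic.replay_from(0)
balance = 0
while (msg := topic.receive()) is not None:
    kind, amount = msg.split(":")
    balance += int(amount) if kind == "deposit" else -int(amount)
    topic.acknowledge()
```

Rebuilding `balance` from the ordered events is the essence of event sourcing: the current state is derived from the log rather than stored separately.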

Designed for low latency, high throughput

From the beginning, Pulsar was designed to provide low, consistent latency at high throughput. It does this by separating two concerns: serving messages between producers and consumers, and storing the messages for persistence. Pulsar uses a multi-tier architecture where messages are served by brokers and stored by Apache BookKeeper. Instead of building its own storage layer, Pulsar leverages the performance and durability of BookKeeper.

BookKeeper is a distributed log that is designed to durably store messages with IO isolation between writing and reading. This means that it can provide consistent, low latency even while large amounts of data are being written or read. Unlike traditional storage systems, performance doesn’t break down under high write pressure or under high read (consumer catch-up) pressure. BookKeeper is a distributed system and it is able to scale horizontally without needing to rebalance storage assignments.

Cloud-native architecture

Because Pulsar uses a multi-layer approach, separating compute (brokers) from storage (BookKeeper), it fits very well into cloud infrastructures, which also separate these two concerns. Brokers are essentially stateless, and BookKeeper can easily be managed as a StatefulSet in container orchestration environments like Kubernetes, which is the de facto standard for cloud-native orchestration.

In fact, Apache Pulsar works naturally in Kubernetes, supporting rolling upgrades, rollbacks, and horizontal scaling. When coupled with persistent volumes backed by cloud storage with configurable performance dimensions, Pulsar is a highly durable and highly flexible messaging system that can scale from small test deployments to large production deployments with ease.

Infinite retention with tiered storage

Another advantage of Pulsar’s multi-layer architecture is that new layers can be added. For high performance, any persistent messaging system needs to use high-performance disks, since messages ultimately must be written to disk and may have to be retrieved from disk (if they aren’t consumed immediately). But what happens if you need to keep old messages around in case you want to replay them or you are doing event sourcing? And what if you want to keep those messages forever? Storing those old messages on high-performance disks can get expensive.

To solve this problem, Pulsar supports tiered storage, allowing older messages to be offloaded to cheaper storage options, like S3 buckets. When a consumer needs an older message, Pulsar automatically retrieves it from the S3 bucket and delivers it to the consumer. Yes, the performance will be lower, but when dealing with messages that are months or even years old, performance is rarely the priority. You just want those messages to be available when or if you need them without breaking the bank.

Multi-tenancy and namespaces

Once you have a high-performance, scalable messaging system in place, you will want to share it between different teams and groups within your organization. It doesn’t make sense to have to replicate the high-performance system to make sure different teams don’t impact each other or build a complex overlay system to simulate multi-tenancy.

Pulsar was designed from the beginning to be a multi-tenant system, so different teams can safely share the messaging system. Each tenant has its own authentication, authorization, and policies. Tenants can be further divided into namespaces, which makes it easy to support different environments, such as development, staging, and production, within a single tenant.
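The tenant and namespace hierarchy shows up directly in topic names, which in Pulsar take the form persistent://tenant/namespace/topic (or non-persistent://... for non-persistent topics). The sketch below splits such a name into its parts. The `parse_topic_name` helper and the `payments` tenant with its namespaces are hypothetical, chosen only for illustration; the helper is not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    persistence: str   # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(full_name: str) -> TopicName:
    """Split 'persistent://tenant/namespace/topic' into its parts."""
    persistence, sep, rest = full_name.partition("://")
    if not sep or persistence not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic scheme in {full_name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(persistence, *parts)


# Same tenant and topic name, isolated by namespace (made-up names).
dev = parse_topic_name("persistent://payments/development/orders")
prod = parse_topic_name("persistent://payments/production/orders")
```

Here one tenant owns both topics, while the namespace keeps the development and production streams apart.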

Built-in schema registry

One of the biggest challenges of any messaging system is making sure producers and consumers are talking the same language. Because producers and consumers are decoupled, it is easy for one or both to change the format of the messages they are sending or expecting to receive, and applications end up broken.

The solution to this is a schema registry that requires producers and consumers to use messages with a compatible schema. Pulsar includes a schema registry out of the box. You just need to register the schema with a Pulsar topic, and Pulsar takes care of enforcing the schema rules.
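To make "compatible schema" a little more concrete, here is a deliberately simplified compatibility check in Python. It is an assumption-laden sketch, loosely modeled on the common backward-compatibility idea (a new reader must still be able to decode old data), and it is not how Pulsar's registry is implemented. The schema representation, the `is_backward_compatible` function, and the field names are all hypothetical.

```python
def is_backward_compatible(old_fields, new_fields):
    """True if a reader using new_fields can decode data written with old_fields.

    Each schema is a dict: {field_name: {"type": str, "default": optional}}.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            if old_fields[name]["type"] != spec["type"]:
                return False          # an existing field changed type
        elif "default" not in spec:
            return False              # new required field that old data lacks
    return True


v1 = {"id": {"type": "string"}, "amount": {"type": "double"}}
# Adds an optional field with a default: old messages can still be read.
v2 = {**v1, "currency": {"type": "string", "default": "USD"}}
# Adds a required field without a default: old messages cannot be read.
v3 = {**v1, "currency": {"type": "string"}}

ok = is_backward_compatible(v1, v2)
rejected = is_backward_compatible(v1, v3)
```

A registry that runs a check like this when a producer or consumer connects can reject the incompatible change before it breaks anything downstream.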

Built-in geo-replication

Replicating messages to remote locations is important to support disaster recovery or to enable applications to operate on a global scale. If the users of your application move around the world, you want them to have the same user experience no matter where they are. With geo-replication, applications can connect to the local cluster but still exchange messages with clusters around the world.

With Pulsar, geo-replication of messages is built in. If you publish a message to a topic in a replicated namespace, that message is automatically replicated to the configured remote geo-location or locations. No complex configurations or add-ons needed.

By User:Mysid, User:Jm smits - Made by Mysid in Inkscape, based on en:Image:Pulsar schematic.jpg by Roy Smits., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2612701

Flexible subscriptions

Pulsar supports four different subscription types: exclusive, failover, shared, and key shared. It also supports multiple subscriptions on a single topic. Using subscriptions, you can easily configure messaging patterns such as queuing, pub-sub, fan-out, and competing consumers.

Pulsar implements the competing consumers pattern using its shared subscriptions. You can scale the number of consumers up and down on a shared subscription seamlessly. There are no partitions involved. Just add a consumer and it starts receiving messages right away.
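The following toy dispatcher sketches how the subscription types differ in who receives each message. It is a simplified model, not Pulsar's broker logic: the `dispatch` function and consumer names are hypothetical, round-robin stands in for shared delivery, and `zlib.crc32` stands in for whatever key hashing the broker actually uses.

```python
import zlib
from itertools import cycle


def dispatch(messages, consumers, subscription_type):
    """Return {consumer: [payloads]} for a list of (key, payload) messages."""
    delivered = {name: [] for name in consumers}
    round_robin = cycle(consumers)
    for key, payload in messages:
        if subscription_type in ("exclusive", "failover"):
            # One active consumer gets everything. (A real exclusive
            # subscription would refuse the second consumer; this toy
            # simply leaves it idle, like a failover standby.)
            target = consumers[0]
        elif subscription_type == "shared":
            # Competing consumers: spread messages across all consumers.
            target = next(round_robin)
        elif subscription_type == "key_shared":
            # The same key always goes to the same consumer.
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown subscription type: {subscription_type}")
        delivered[target].append(payload)
    return delivered


messages = [("user-1", "a"), ("user-2", "b"), ("user-1", "c"), ("user-2", "d")]
exclusive = dispatch(messages, ["c1", "c2"], "exclusive")
shared = dispatch(messages, ["c1", "c2"], "shared")
key_shared = dispatch(messages, ["c1", "c2"], "key_shared")
```

In the shared case, adding a third name to the consumer list is all it takes to spread the load further, which mirrors the "just add a consumer" behavior described above.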

Dead letter topic, negative acknowledgment, delayed delivery

Pulsar supports a variety of advanced messaging capabilities that make it easy to build powerful and flexible applications around it. With negative acknowledgment, a consuming client can put a message back on a topic to process it later or allow another consumer to attempt to process it. If a consumer is unable to process a message, instead of getting blocked, it can send the message to a dead-letter topic to become unblocked and to save the problematic message for later analysis.
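The interplay between negative acknowledgment and a dead-letter topic can be sketched as a retry loop with a cap. This is a toy model under stated assumptions, not Pulsar client code: `consume_with_dead_letter`, `max_redeliveries`, and the sample handler are hypothetical names, and a plain list stands in for the dead-letter topic.

```python
from collections import deque


def consume_with_dead_letter(messages, handler, max_redeliveries=2):
    """Process messages; nack on failure, dead-letter after too many retries."""
    pending = deque((msg, 0) for msg in messages)   # (message, redelivery count)
    processed, dead_letters = [], []
    while pending:
        msg, redeliveries = pending.popleft()
        try:
            handler(msg)
            processed.append(msg)                        # acknowledge
        except ValueError:
            if redeliveries >= max_redeliveries:
                dead_letters.append(msg)                 # park for later analysis
            else:
                pending.append((msg, redeliveries + 1))  # negative acknowledge
    return processed, dead_letters


def handler(msg):
    """Hypothetical handler that only accepts numeric payloads."""
    if not msg.isdigit():
        raise ValueError(f"cannot parse {msg!r}")


processed, dead_letters = consume_with_dead_letter(["1", "oops", "2"], handler)
```

The malformed message is retried a bounded number of times and then set aside, so the messages behind it are never blocked.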

If you want to send a message after a delay, Pulsar can do that using the delayed delivery feature. When you publish a message, you can set an amount of time to wait before the message can be consumed.
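Delayed delivery can be pictured as a queue ordered by delivery time rather than publish time. The sketch below is a toy model with a logical clock passed in as `now`; the `DelayedTopic` class and its method names are hypothetical and do not correspond to Pulsar's client API.

```python
import heapq


class DelayedTopic:
    """Toy topic where each message becomes visible at publish time + delay."""

    def __init__(self):
        self._heap = []   # entries are (deliver_at, sequence, message)
        self._seq = 0     # tie-breaker that preserves publish order

    def publish(self, message, now, deliver_after=0):
        heapq.heappush(self._heap, (now + deliver_after, self._seq, message))
        self._seq += 1

    def receive_ready(self, now):
        """Pop every message whose delivery time has arrived."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


topic = DelayedTopic()
topic.publish("send reminder email", now=0, deliver_after=60)
topic.publish("charge card", now=0)

at_start = topic.receive_ready(now=0)
after_30s = topic.receive_ready(now=30)
after_60s = topic.receive_ready(now=60)
```

The undelayed message is available immediately, while the delayed one stays invisible to consumers until its time has elapsed.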

Integrated streaming functions

Increasingly, we want to get insights from the data we are collecting in real time. Gone are the days when waiting for an overnight batch job to crunch all the data and getting insight the next day was considered good enough. Today we want our insights in real time so we can react in real time.

In order to get real-time insights, data must be processed in real time. With Pulsar, you can seamlessly integrate lightweight functions into the message flow, performing cleaning, enrichment, and analysis of the data in real time. There’s no need to dump everything into a data lake and process it later. With Pulsar functions, you can process the data as it flows through the messaging system. Pulsar functions can be written in Java, Python, or Go and can be configured to run as Kubernetes pods.
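For a sense of how lightweight such a function can be, here is a sketch in Python. It assumes the simplest form of a Python function, a plain `process` function that takes the input message and returns the output message; how it is packaged and deployed is outside this sketch. The sensor payload, its field names, and the alert threshold are made up for illustration.

```python
import json


def process(input):
    """Clean and enrich one message; the return value is the output message."""
    reading = json.loads(input)
    celsius = float(reading["temp_c"])
    enriched = {
        "sensor": reading["sensor"].strip().lower(),   # cleaning
        "temp_c": celsius,
        "temp_f": celsius * 9 / 5 + 32,                # enrichment
        "alert": celsius > 30.0,                       # simple analysis
    }
    return json.dumps(enriched, sort_keys=True)


# Local check with a sample payload; no Pulsar cluster is involved here.
result = json.loads(process('{"sensor": " Roof-1 ", "temp_c": "35"}'))
```

Because the function is just ordinary code with no messaging logic in it, it can be tested locally like this before it is ever attached to a topic.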

IO connectors

One of the main functions of a messaging system is to glue together data-intensive systems like databases, stream-processing engines, and other messaging systems. Since this is common, it makes sense to provide a common framework and connectors to make this easy to do. That’s exactly what Pulsar does with its IO connectors.

Pulsar provides a number of ready-made connectors that run inside the Pulsar cluster and make it easy to glue your systems together. These include connectors for MySQL, MongoDB, Cassandra, RabbitMQ, Kafka, Flume, Redis, and many more.

SQL queries with Presto

If you are storing a lot of data in Pulsar, it can be very useful to run queries on that data while Pulsar is doing its main job of sending and receiving messages. Pulsar makes this possible by leveraging the SQL query engine Presto. Pulsar integrates with Presto so you can run SQL queries on the data stored in your topics. You can even query the data if it is offloaded into tiered storage. And the queries bypass the broker, so they won’t impact the ability of the Pulsar cluster to send and receive messages in real time.

Partitioned and non-partitioned topics

Pulsar supports both partitioned and non-partitioned topics. For lower-performance use cases, you can use a non-partitioned topic to keep things simple. But if you have a high-performance use case where you need to process a high volume of data on a single topic, you can use a partitioned topic to take advantage of parallelism in the processing. You can seamlessly add partitions as performance requirements grow.

Like Kafka, Pulsar is able to guarantee message order if you publish your message with keys. Pulsar will assign messages with the same key to the same partition, guaranteeing order for messages sent to that key.
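Key-based routing can be sketched as "hash the key, take it modulo the number of partitions". The snippet below is a minimal illustration of that idea only: `choose_partition` is a hypothetical helper, `zlib.crc32` is a stand-in for the hash function Pulsar actually uses, and the order events are invented.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("order-1", "created"), ("order-2", "created"),
          ("order-1", "paid"), ("order-1", "shipped")]

# Each partition is an independent, ordered list of messages.
partitions = {p: [] for p in range(4)}
for key, value in events:
    partitions[choose_partition(key, 4)].append((key, value))

# All events for one key sit in one partition, in publish order.
order_1_partition = choose_partition("order-1", 4)
order_1_events = [v for k, v in partitions[order_1_partition] if k == "order-1"]
```

Ordering is guaranteed per key rather than across the whole topic, which is what lets the partitions be processed in parallel.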

Persistent and non-persistent messages

Persistent messages are sent to Apache BookKeeper for storage on disk. These messages are guaranteed to be delivered at-least-once regardless of the failure of the network, application, or even Pulsar itself.

However, there are some cases where this level of guaranteed delivery is not required and at-most-once delivery is sufficient. For those cases, Pulsar supports non-persistent messages. Non-persistent messages are not stored to disk, reducing resource requirements while still delivering high throughput and low latency.

Topic compaction

Sometimes only the latest instance of a piece of data is of interest. You don’t care about all the historical values, just the latest value. If that’s the case, you can use a Pulsar compacted topic to store only the latest value for a particular key in a topic.

All data is published to a compacted topic, but Pulsar will periodically remove the old values for a key, leaving only the latest. Compacted topics prevent the topic from growing indefinitely large and give you quick access to the latest values on a topic.
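The effect of compaction can be shown in a few lines of Python. This is a conceptual sketch of the outcome, not of how Pulsar performs compaction internally; the `compact` function and the ticker prices are hypothetical.

```python
def compact(log):
    """Keep only the most recent value for each key, in last-write order."""
    latest = {}
    for key, value in log:
        latest.pop(key, None)   # drop the older entry so order reflects recency
        latest[key] = value
    return list(latest.items())


# Made-up (key, value) messages: stock ticker updates.
log = [("AAPL", 189.1), ("MSFT", 411.0), ("AAPL", 190.4), ("AAPL", 188.9)]
compacted = compact(log)
```

A consumer that only needs current prices reads two entries from the compacted view instead of replaying all four messages.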

Client libraries

Pulsar has a wide variety of client libraries that are maintained by the core project: Java, Python, C++, Golang, Node.js, and C#. And if you don’t want to use a Pulsar client library at all, Pulsar includes a WebSockets proxy.

There are many other clients being developed by the community, such as Scala and Rust. And if you prefer to use HTTP to send and receive Pulsar messages, you can use Pulsar Beam.

Conclusion

Apache Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Kafka. You get the best of both worlds in a high-performance, cloud-native package. It’s not a surprise that Pulsar has been increasing in popularity since it became an Apache open source project. And given its advantages, it will likely continue to gain popularity in the years to come.


Want to try Pulsar for yourself? Just sign up for the free plan of our fully managed service and give it a try. It only takes a minute to get started.