Understanding Pulsar Message TTL, Backlog, and Retention

I’ve noticed some confusion out there around message deletion and retention in Pulsar. For example, to keep messages indefinitely, should you configure the backlog quota, the retention policy, or both? So I thought I’d put together a quick primer on how Pulsar manages the message life cycle, and a few of the ways you can fine-tune it to manage backlog queues and message retention policies.

Pulsar uses several mechanisms to manage message life cycle. This blog post explains how Pulsar uses these mechanisms together to determine which messages to retain and which messages to delete.

But a quick heads-up: the information here only covers persistent topics, where messages are persisted in non-volatile storage. Message retention and deletion on non-persistent topics is outside the scope of this post.

Message life cycle in Pulsar

Before we dig into the ways you can manage message retention in Pulsar, let’s quickly review how Pulsar determines which messages to keep or discard in an ideal scenario, when Pulsar runs in the default configuration and storage space is unlimited.

Rule #1: Pulsar keeps all messages until they are acknowledged

Pulsar’s subscription keeps track of all message consumption. When a consumer receives a message, it sends an acknowledgement back to Pulsar, to indicate that it no longer needs the message. That tells Pulsar that it can delete this message. Pulsar retains any unacknowledged messages for a particular subscription in a backlog.

This means two things: 

1) Pulsar keeps all the unacknowledged messages in a subscription backlog.

2) When messages in a backlog are acknowledged, Pulsar removes them from the backlog and marks them for deletion. 

Also, since a topic can have multiple subscriptions, that topic can have multiple backlogs. A message must be acknowledged in all subscriptions before Pulsar can consider it ready for deletion.

Rule #2: Pulsar does not keep acknowledged messages or messages in a topic with no subscriptions

If a message isn’t tracked by any subscription, it doesn’t exist in any subscription backlog. In this case, Pulsar considers it ready for deletion.
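The two rules can be sketched in a few lines of Python. This is a toy model rather than Pulsar's implementation: the Topic class and its methods are hypothetical names, and each subscription backlog is modeled as a plain set of unacknowledged message IDs.

```python
class Topic:
    """Toy model of a persistent topic; names are hypothetical, not Pulsar APIs."""

    def __init__(self):
        self.messages = []   # message IDs still stored on the topic
        self.backlogs = {}   # subscription name -> set of unacknowledged IDs

    def subscribe(self, name):
        # In this model a new subscription starts with an empty backlog.
        self.backlogs.setdefault(name, set())

    def publish(self, msg_id):
        self.messages.append(msg_id)
        # Every existing subscription tracks the new message in its backlog.
        for backlog in self.backlogs.values():
            backlog.add(msg_id)

    def acknowledge(self, name, msg_id):
        self.backlogs[name].discard(msg_id)

    def ready_for_deletion(self):
        # Rule #1 and #2: a message is ready once no backlog tracks it.
        return [m for m in self.messages
                if not any(m in b for b in self.backlogs.values())]


topic = Topic()
topic.publish("m0")                 # no subscription yet, so nothing tracks m0
topic.subscribe("sub-a")
topic.subscribe("sub-b")
topic.publish("m1")
topic.acknowledge("sub-a", "m1")
print(topic.ready_for_deletion())   # ['m0'] because sub-b still holds m1
topic.acknowledge("sub-b", "m1")
print(topic.ready_for_deletion())   # ['m0', 'm1']
```

Note that m1 only becomes ready for deletion after both subscriptions have acknowledged it, while m0 was never tracked by any backlog at all.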

Take a look at the following (simple!) diagram for a visual representation of these rules in action. 

  • A message is retained until it is acknowledged.
  • A message is deleted once it is acknowledged.

Backlog quota and TTL

What I described is default Pulsar behavior in an ideal scenario. Now, let’s move into the real world.

Disk storage has limits, and messages can’t be stored forever while Pulsar waits for a consumer to acknowledge them. Pulsar uses two mechanisms to prevent the unlimited growth of the message backlog: a Time-To-Live (TTL) parameter for individual messages, and a backlog quota for the backlog itself.

The TTL parameter is like a stopwatch attached to each message that defines the amount of time a message is allowed to stay in the unacknowledged state. When the TTL expires, Pulsar automatically moves the message to the acknowledged state (and thus makes it ready for deletion). 
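Here is a minimal sketch of that stopwatch, assuming a backlog held as (message ID, publish time) pairs. The function name and data layout are hypothetical and only illustrate the rule; they are not how the broker stores or expires messages.

```python
def expire_by_ttl(backlog, now, ttl_seconds):
    """Split a backlog into (still unacknowledged, expired by TTL).

    `backlog` is a list of (msg_id, publish_time) pairs. All names here are
    hypothetical and only illustrate the TTL rule described above.
    """
    kept, expired = [], []
    for msg_id, published_at in backlog:
        if now - published_at >= ttl_seconds:
            expired.append(msg_id)   # treated as acknowledged from here on
        else:
            kept.append((msg_id, published_at))
    return kept, expired


backlog = [("m1", 0), ("m2", 50), ("m3", 90)]
kept, expired = expire_by_ttl(backlog, now=100, ttl_seconds=60)
print(expired)   # ['m1']
print(kept)      # [('m2', 50), ('m3', 90)]
```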

But TTL only applies to individual messages. The backlog could still swell to an unmanageable size if messages are entering the backlog at a faster rate than individual messages are being acknowledged and expired by TTL combined. To handle this scenario, Pulsar uses a quota to enforce a hard limit on the logical size of the backlogs in a topic.

The backlog keeps track of unacknowledged messages; the backlog quota triggers the configured policy once the quota limit is reached.

Breaching the backlog quota (Caution: details ahead!)

The backlog quota applies per topic, and is defined by the backlogQuotaDefaultLimitGB parameter in the broker.conf file. As the name suggests, it sets a limit on the maximum backlog size permitted for the topic. Since a topic can have multiple backlogs, Pulsar applies the limit to the largest subscription backlog for the topic (that is, the backlog of the slowest consumer).

So what happens when a topic’s backlog exceeds the permitted size? Pulsar can either interrupt message transmission, or start removing older messages from the backlog. Pulsar offers three policies to prevent backlog overflow:

  1. producer_request_hold: Pulsar holds the producer’s send request until the backlog has room for more messages. The producer’s send() method blocks until one of the following conditions is met:
    1. New consumer acknowledgements take messages off the backlog, freeing room in the backlog.
    2. The producer’s send() method times out.
  2. producer_exception: Pulsar rejects the send request, and the producer receives an exception (ProducerBlockedQuotaExceededException in the Java client).
  3. consumer_backlog_eviction: Remember that there is only one logical copy of a message. Both message acknowledgement and TTL expiry work by moving a cursor that tracks message consumption on a backlog. This eviction policy follows the same design: Pulsar moves the subscription cursor forward to skip the oldest messages in the backlog. The skipped messages are still available to the Reader interface if the retention policy is correctly configured (more on that below).

The first two options interrupt message transmission to prevent further backlog growth, but consumers can still receive and acknowledge existing messages.

The third option clears existing messages from the backlog. The consumer_backlog_eviction policy uses a reduction factor of 0.9 for message eviction. That means that the slowest consumer will lose roughly the oldest 10% of the messages in its backlog. Neither producers nor consumers will get any exceptions or errors, and message transmission proceeds as usual.
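The three policies can be compared side by side with a small sketch. This is a toy model under stated assumptions: the handle_publish function and its return values are hypothetical, and the 0.9 reduction factor is read here as "trim the backlog to about 90% of the quota", which comes to roughly the oldest 10% of the backlog when the quota has only just been crossed. The broker's actual eviction logic may differ in detail.

```python
def handle_publish(backlog_bytes, quota_bytes, policy):
    """Return (action, new_backlog_bytes) when a producer sends a message.

    Toy model of the three backlog quota policies. The function and its
    return values are hypothetical, and the 90% eviction target is an
    assumption about how the 0.9 reduction factor is applied.
    """
    if backlog_bytes <= quota_bytes:
        return "accept", backlog_bytes
    if policy == "producer_request_hold":
        # send() blocks until the backlog drains or the send times out
        return "hold", backlog_bytes
    if policy == "producer_exception":
        # the producer sees an exception; the backlog is untouched
        return "reject", backlog_bytes
    if policy == "consumer_backlog_eviction":
        # the cursor skips the oldest messages until the target is reached
        target = quota_bytes * 9 // 10
        return "accept", min(backlog_bytes, target)
    raise ValueError(f"unknown policy: {policy}")


quota = 1000
for policy in ("producer_request_hold", "producer_exception",
               "consumer_backlog_eviction"):
    print(policy, handle_publish(1100, quota, policy))
```

Only the eviction policy changes the backlog size; the other two leave it alone and push the problem back to the producer.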

The default broker option is producer_request_hold. It is the least intrusive option, because it relies on consumers to drain the backlog.

Retention policy

Ok, back to another “real world” scenario. The whole “message will be deleted once it is acknowledged” rule obviously doesn’t satisfy Pulsar’s data streaming use case. Messages must be kept for Pulsar’s Reader interface. That’s where Pulsar’s retention policy comes in: it tells Pulsar to retain acknowledged messages and messages on a topic with no subscription. Pulsar hangs on to these messages based on two configuration parameters: defaultRetentionTimeInMinutes and defaultRetentionSizeInMB in the broker.conf file.

You can specify retention policies at the namespace level, so teams using different namespaces can use different policies.
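As a rough illustration of how a time limit and a size limit could combine, here is a small sketch. It assumes the convention that a value of 0 disables retention (the default behavior described earlier) and that -1 removes a limit, so setting both to -1 keeps acknowledged messages indefinitely. The is_retained function is hypothetical, and it assumes a message is dropped as soon as either limit is exceeded.

```python
UNLIMITED = -1  # assumed convention: -1 removes a limit, 0 disables retention


def is_retained(age_minutes, retained_mb, time_limit_minutes, size_limit_mb):
    """Decide whether an acknowledged message is kept (toy model).

    `retained_mb` is the size of acknowledged data currently kept, including
    this message. The message is dropped once either limit is exceeded.
    """
    if time_limit_minutes == 0 or size_limit_mb == 0:
        return False  # retention disabled: acknowledged messages are not kept
    within_time = (time_limit_minutes == UNLIMITED
                   or age_minutes <= time_limit_minutes)
    within_size = (size_limit_mb == UNLIMITED
                   or retained_mb <= size_limit_mb)
    return within_time and within_size


print(is_retained(30, 5, 60, 100))                        # True
print(is_retained(90, 5, 60, 100))                        # False
print(is_retained(10_000, 10_000, UNLIMITED, UNLIMITED))  # True
```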

To sum up …

Now that we’ve covered the basic rules of backlog and message retention, let’s look at these concepts from a few different angles. 

Implications for storage:

  • The backlog quota and TTL parameters prevent disk size from growing indefinitely, as Pulsar’s default behaviour is to keep unacknowledged messages forever.
  • The retention policy sets aside storage space for messages that Pulsar would otherwise delete by default (acknowledged messages and messages with no subscription).

Implications for message queuing and data streaming:

  • The subscription backlog is designed for message queuing. Once a message is consumed, it is no longer required.
  • Backlog quota is a cap on the queue size.
  • The TTL mechanism automatically removes messages from the queue to prevent queue overflow.
  • The retention policy enables data streaming, so that acknowledged messages and messages without subscriptions on a topic can be streamed over and over again.

Implications for message retention:

  • The backlog quota governs unacknowledged messages. It has no jurisdiction over acknowledged messages.
  • The retention policy only governs acknowledged messages and messages with no subscription. 

Ultimately, to understand (and manage) message retention in Pulsar, you need to look at both the backlog quota and the retention policy configuration.

  • If the retention limit is reached, Pulsar deletes older acknowledged messages. Readers may lose a few older messages, but there is no interruption in message flow.
  • If the backlog quota is exceeded, the consequences are more serious. Either Pulsar halts incoming messages from producers until more backlog space becomes available, or Pulsar removes older messages from the backlog (and the slowest consumers lose messages). So configuring an appropriate TTL for your messages is important to keep the backlog below the quota threshold.

Finally, consider the following when factoring in your available physical storage:

  • The size of your physical storage should accommodate the sum of the backlog quota and the retention size.
  • Only the retention policy governs the physical deletion of messages from storage. Backlog quotas do not result in message deletion.
  • Can consumers keep up with the rate of incoming messages? What is the budgeted backlog size on a topic, and how long can a message stay unacknowledged? (At the time of writing, Pulsar 2.5.1 and earlier versions do not limit backlogs based on time.) If you set a lower TTL, messages expire sooner, so it takes longer to hit the backlog limit.
  • Is there any Reader interface for data streaming needs? Are there any new subscriptions when all messages are acknowledged? These scenarios require that Pulsar keep acknowledged messages through a retention policy.
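The sizing questions above come down to simple arithmetic, sketched below. Both helper functions are hypothetical, the numbers are made up for illustration, and the model ignores replication and any other storage overhead.

```python
def required_storage_gb(backlog_quota_gb, retention_size_gb):
    """Storage needed to hold a full backlog plus fully used retention."""
    return backlog_quota_gb + retention_size_gb


def seconds_until_quota(quota_bytes, publish_bytes_per_s, drain_bytes_per_s):
    """Time for an empty backlog to reach the quota, or None if it never does.

    `drain_bytes_per_s` combines consumer acknowledgements and TTL expiry.
    """
    growth = publish_bytes_per_s - drain_bytes_per_s
    if growth <= 0:
        return None   # consumers (plus TTL) keep up with producers
    return quota_bytes / growth


MIB = 1024 ** 2
GIB = 1024 ** 3

print(required_storage_gb(10, 50))                    # 60
print(seconds_until_quota(10 * GIB, 5 * MIB, 3 * MIB))  # 5120.0
print(seconds_until_quota(10 * GIB, 5 * MIB, 5 * MIB))  # None
```

A lower TTL raises the drain rate in this model, which is why it takes longer to reach the quota.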

Here’s another diagram that puts all these concepts together. It (hopefully) shows how a message can transition through different message flow stages and how the backlog quota and retention policy mechanisms govern its life cycle.

Message retention in the tiered storage

Pulsar offers a tiered storage feature that offloads closed ledgers to blob storage such as AWS S3, because S3-like blob storage costs less than storage nodes and substantially extends the available storage size. But remember that messages offloaded to tiered storage are still governed by the same retention policy. If the deletion requirement is satisfied, the ledger on S3 will be deleted. You should adjust your retention policy accordingly to take advantage of any extra storage.

Conclusion

So those are the nuts and bolts of Pulsar message retention and backlog control. As you can see, Pulsar provides a backlog quota with TTL and retention policy to serve the needs of message queuing and data streaming separately.

Acknowledgments

In writing this blog post, I need to acknowledge Ali Ahmed and Alexander Ursu from the Pulsar community, and Chris and Ben from my team, for reviewing the content. Special thanks to Anne Kavanagh for the copy editing.