What is Pulsar Beam?
Pulsar Beam is a standalone service that allows applications to interact with Apache Pulsar over HTTP. It provides an endpoint to ingest events into Pulsar and a broker to push events to webhooks and Cloud Functions. This broadens the range of applications, across platforms, operating systems, and languages, that can take advantage of Apache Pulsar, as long as they speak HTTP.
Architecture of Pulsar Beam
Pulsar Beam consists of three components: an ingestion endpoint server, a broker, and a RESTful interface that manages webhook and Cloud Function registration. To offer flexible deployment and scalability, these three components can each run in their own process. Since the ingestion server is stateless, multiple servers can be deployed behind a proxy server to handle high-volume traffic with high availability. Multiple brokers can also be deployed to share the load of pushing events to webhooks and Cloud Functions.
The following architecture diagram depicts how the three components can be deployed separately and interact with Pulsar, webhooks, Cloud Functions, and external components.
In a development environment or a light-traffic deployment, all three components can run in the same process to simplify setup. Pulsar Beam is designed for Cloud Native deployment and written in Golang.
Publishing a message using HTTP POST
To produce events, an external application, something as simple as a curl command or Postman, can send events or messages to Pulsar via the ingestion HTTP endpoint, which receives an event in the body of a POST call. The destination topic configuration, such as the Pulsar token and the topic full name, must be specified in the request headers so that Pulsar Beam can construct a Pulsar producer.
Here is an example curl command that produces a message. The bearer token is a Pulsar token, and the topic full name is a mandatory header. The Pulsar URL is optional: Pulsar Beam can be configured to send messages only to the Pulsar cluster it is deployed with.
```
$ curl -X POST -v \
  -H "Authorization: Bearer eyJhbGciOiJSUzI1..." \
  -H "PulsarUrl: pulsar+ssl://useast1.gcp.kafkaesque.io:6651" \
  -H "TopicFn: persistent://ming-luo/local-useast1-gcp/dest-topic" \
  -d "test payload string" \
  https://example.pulsar-beam.net/v1/firehose
```
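The same request can be issued from code. Below is a minimal Go sketch using only the standard library; the endpoint URL, token, and topic names are the placeholders from the curl example, and the stand-in test server exists only so the snippet runs offline.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newFirehoseRequest builds the POST request the Pulsar Beam ingestion
// endpoint expects: the event payload in the body; the Pulsar token,
// cluster URL, and topic full name in the headers.
func newFirehoseRequest(endpoint, token, pulsarURL, topicFn, payload string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("PulsarUrl", pulsarURL) // optional
	req.Header.Set("TopicFn", topicFn)     // mandatory
	return req, nil
}

func main() {
	// Stand-in for the Beam endpoint so the example runs offline;
	// point the request at your real /v1/firehose URL in practice.
	beam := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("TopicFn:", r.Header.Get("TopicFn"))
		w.WriteHeader(http.StatusOK)
	}))
	defer beam.Close()

	req, err := newFirehoseRequest(beam.URL,
		"eyJhbGciOiJSUzI1...", // placeholder Pulsar token
		"pulsar+ssl://useast1.gcp.kafkaesque.io:6651",
		"persistent://ming-luo/local-useast1-gcp/dest-topic",
		"test payload string")
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```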
To offload authentication and authorization work from the Pulsar server, Pulsar Beam can decode any Pulsar token to perform authentication and examine the subject. To be precise, it pre-screens the failure cases so that costly failed authentications never reach Pulsar. It recognizes PKCS8 private and X509 public keys generated by the Pulsar token CLI; all that is required is to load the Pulsar server's public key. In addition, it can issue and manage its own tokens.
Sending a message to a Webhook
To consume events, a conventional Pulsar consumer can subscribe to the topic. The best part of Pulsar Beam, however, is that it can push events to any HTTP server, with the event in the body of a POST request. In other words, you get event push notification to a plain webhook. This lets a wide range of applications and platforms integrate Pulsar into your existing architecture without Pulsar library dependencies and without the OS and language constraints of the current Pulsar clients.
To preserve the original Pulsar message properties, the message ID, published time, event time, and topic full name are sent as HTTP headers.
Handling response from Webhook
The delivered event is only acknowledged back to Pulsar when Pulsar Beam receives a 200 HTTP status code from the webhook or function. This preserves Pulsar's effectively-once guarantee.
The webhook or function can optionally reply to an event by writing to a Pulsar sink topic, provided the topic full name and a Pulsar token are specified in the reply's headers. Upon receiving the 200 status code, Pulsar Beam examines those headers and forwards the event in the reply body to the sink topic.
Managing webhooks with a RESTful interface
A RESTful API provides the interface to manage webhooks and functions, whose configuration is stored in a back-end database. As a reference implementation, three database options are offered: an in-memory cache, MongoDB, and a compacted Pulsar topic. Yes, we even use Pulsar itself as a database; that will be a blog post of its own.
Pulsar Beam and Serverless: AWS Lambda, Azure Functions, Google Cloud Functions
Given the nature of HTTP-triggered Cloud Functions, they fulfill the webhook specification, so any HTTP-triggered Cloud Function can receive events from Pulsar Beam. AWS Lambda, Azure Functions, and Google Cloud Functions all support HTTP triggers. In fact, Pulsar Beam runs a Google Cloud Function in its GitHub CI actions to test the end-to-end event flow on every GitHub Pull Request.
We have also tested AWS Lambda and Azure Functions in the flow described in the architecture diagram earlier. The example below is an AWS Lambda function behind an API Gateway configured to allow an HTTP trigger. The Lambda is a Node function that receives the message payload, modifies it, and replies back to another Pulsar topic. Just as with the ingestion endpoint, the `Authorization` and `TopicFn` headers specify the Pulsar token and the topic full name to which the reply message is destined.
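The actual Lambda is a Node function; for consistency with the other snippets in this post, here is the same logic sketched in Go as a plain function, stripped of the Lambda runtime wiring. The transformation, topic name, and token are all placeholders of my own.

```go
package main

import (
	"fmt"
	"strings"
)

// lambdaReply mirrors the Node Lambda described above as a plain Go
// function: take the pushed payload, modify it, and return the reply
// body plus the headers that route it back to another Pulsar topic.
// The transformation is arbitrary; topic and token are placeholders.
func lambdaReply(payload string) (body string, headers map[string]string) {
	body = strings.ToUpper(payload) + " (seen by lambda)"
	headers = map[string]string{
		"TopicFn":       "persistent://ming-luo/local-useast1-gcp/reply-topic",
		"Authorization": "Bearer eyJhbGciOiJSUzI1...",
	}
	return body, headers
}

func main() {
	body, headers := lambdaReply("test payload string")
	fmt.Println(body)
	fmt.Println(headers["TopicFn"])
}
```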
Cloud providers like AWS offer a variety of service integrations through Lambda, which extends Pulsar's event reach to the entire realm of AWS services. To name a few, Lambda can integrate with S3, Kinesis, DynamoDB, and Splunk.
Complementary to Pulsar Functions, these cloud providers' functions offer additional language support, such as Node, Go, Ruby, C#, and F#. AWS Lambda, Azure Functions, and Google Cloud Functions all provide comprehensive logging and monitoring, along with better dependency management, run-time isolation, and mature operational management.