Need to scale Apache Kafka? Switch to Apache Pulsar

Today, even the simplest web and mobile applications consume a lot of data. The key to exchanging and acting on this data is a messaging system supported by an event-driven architecture.

An event-driven system enables scalable and asynchronous messaging solutions and processing. Asynchronous systems can process more requests because each request is processed in the background.

When a request is made to the server, it is added to a queue where a processor reads it. This allows organizations to build systems that accept hundreds of thousands – or even millions – of requests per second at scale by processing the requests in a separate cluster.
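The queue-and-processor flow described above can be sketched in a few lines of Python. This is only an in-process illustration using the standard library; the names `request_queue` and `run_worker` are made up for the example, and a real deployment would put a broker such as Kafka or Pulsar between the server and a separate processing cluster.

```python
import queue
import threading

def run_worker(request_queue: queue.Queue, results: list) -> None:
    """Background processor: reads requests off the queue until it sees None."""
    while True:
        request = request_queue.get()
        if request is None:  # sentinel value: no more work
            break
        results.append(f"processed {request}")

request_queue: queue.Queue = queue.Queue()
results: list = []
worker = threading.Thread(target=run_worker, args=(request_queue, results))
worker.start()

# The "server" only enqueues each request and moves on; it never waits for processing.
for request_id in range(5):
    request_queue.put(f"request-{request_id}")

request_queue.put(None)  # tell the worker to stop
worker.join()            # wait here only so the example can show the results
print(results)
```

Because accepting a request is just an enqueue, the accepting side stays fast even when processing is slow, which is the property the event-driven approach relies on.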

The industry has developed several message broker systems and topic-driven publish-subscribe (pub-sub) platforms that follow this event and message-driven format. Apache Kafka and Apache Pulsar are two common examples of distributed messaging and streaming systems.

Kafka and Pulsar are both based on a Pub-Sub pattern that allows you to scale messaging to thousands of connected clients. Both offer a persistent storage model to ensure that the messages are not lost, and both use partitions to store and process the messages.
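To make the shared model concrete, here is a toy in-memory topic with partitions and fan-out to subscribers. It is a sketch of the pattern only, not of either system's internals: the `Topic` class is hypothetical, and the CRC32 key hash is an assumption chosen for simplicity (real clients use their own hash functions).

```python
import zlib

class Topic:
    """Toy in-memory topic: partitioned, append-only, with fan-out to subscribers."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, key, value):
        # Hash the key so messages with the same key land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))  # store before delivering
        for callback in self.subscribers:
            callback(index, key, value)
        return index

orders = Topic("orders", num_partitions=3)
inbox_a, inbox_b = [], []
orders.subscribe(lambda partition, key, value: inbox_a.append(value))
orders.subscribe(lambda partition, key, value: inbox_b.append(value))

first = orders.publish("customer-1", "created")
second = orders.publish("customer-1", "paid")
orders.publish("customer-2", "created")
```

Both subscribers receive every message, and the two messages for `customer-1` share a partition, so their relative order is preserved.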

While Kafka and Pulsar are similar in many respects, they have some notable differences in capabilities – particularly when it comes to managing large amounts of data, building real-time applications, and developing at scale.

Kafka offers many benefits, but Pulsar offers stronger support for scalability and growth. Past a certain point of growth, it makes more sense to move away from Kafka than to keep optimizing it. Here we compare Kafka and Pulsar and show why switching to Pulsar is a logical next step when you need to scale beyond Kafka.

Challenges with Apache Kafka apps

Kafka is the de facto standard for the distributed pub-sub pattern in software architecture. An organization using Kafka can process thousands of messages and deliver them to multiple consumers at the same time.

Kafka has several advantages, but also certain limitations when it comes to scaling. Let’s look at some challenges you’ll face when trying to scale applications built with Apache Kafka.

Storage limitations

Kafka’s architecture presents the first challenge you face when scaling your applications in Kafka: storage.

Stateful brokers are the first reason a company finds scaling difficult. In Kafka, each partition is stored on the local disk of the broker that leads it, so the data is bound to specific nodes and the brokers are stateful. This means that once a leader node reaches its maximum storage capacity, it cannot accept any more messages for its partitions unless storage is added. This is challenging because a cluster in an ever-expanding environment requires repeated upgrades.

One way to overcome this challenge is to purchase a large storage cluster, which is very expensive.

Also, based on this architecture, the platform cannot accept new incoming messages once it has reached the maximum storage or memory limit. This can result in a huge loss for mission-critical applications. Kafka’s architecture is designed to accept and send many messages. Long-term data storage is not a priority. As a result, scaling a Kafka application is a major challenge as it cannot provide the storage space it needs – at least not without a hefty price tag.

Message processing problems

Managing Kafka is challenging because it lacks features needed for activity monitoring, message processing, and data persistence.

Kafka shines in messaging systems where messages pass straight through and do not need to be changed before delivery. But suppose you need to process a message before forwarding it to consumers. That requires additional platforms, which makes processing messages with Kafka more difficult and complex.

Bringing in those additional platforms also significantly increases the complexity of your data delivery system, as each component of the streaming platform requires maintenance and has constraints that apply to the entire cluster. Kafka clusters also offer limited data and message persistence as your data needs grow over time.

Complicated client libraries

Companies mainly use Kafka for the streaming services they offer. The Streams API is built on top of pub-sub message delivery to support a distinct business case. Kafka Streams is a separate library that provides advanced functionality for enterprise customers. Its most notable feature, transactions, helps organizations ensure the consistency of the output generated by the message flow. As a result, Kafka has two separate APIs, one for each use case.

For example, the Kafka streaming library allows companies to offer “exactly once” message delivery. The delivery guarantees that both Kafka and Pulsar offer are:

  • At least once
  • At most once
  • Exactly once

“Exactly once” delivery guarantees that every message produces exactly one associated output, so a message is still processed, and processed only once, if a consumer crashes. However, this is not possible with the Consumer API alone, which allows applications to read data streams from topics in the Kafka cluster, so you have to write most of that functionality yourself. This makes it difficult to use a single client library for all the functionality your business needs, which is unsustainable when working at scale.
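The difference between the guarantees is easier to see with a small simulation. The sketch below is not how Kafka or Pulsar implement exactly-once internally; it only illustrates the idea that at-least-once delivery can redeliver a message after a lost acknowledgement, and that tracking message IDs lets a consumer apply each message once. The names `deliver_at_least_once` and `DedupConsumer` are hypothetical.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield each message, and yield it again if its acknowledgement was 'lost'."""
    for message_id, amount in messages:
        yield message_id, amount
        if message_id in lost_acks:
            yield message_id, amount  # broker never saw the ack, so it redelivers

class DedupConsumer:
    """Applies each message ID at most once, even when it is delivered twice."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, message_id, amount):
        if message_id in self.seen:
            return False  # duplicate delivery: skip the side effect
        self.seen.add(message_id)
        self.total += amount
        return True

payments = [(1, 10), (2, 20), (3, 30)]

# A naive consumer counts the redelivered payment twice.
naive_total = sum(amount for _, amount in deliver_at_least_once(payments, lost_acks={2}))

# A deduplicating consumer ends up with the correct total.
consumer = DedupConsumer()
for message_id, amount in deliver_at_least_once(payments, lost_acks={2}):
    consumer.handle(message_id, amount)

print(naive_total, consumer.total)
```

Here the naive total is 80 because payment 2 is applied twice, while the deduplicating consumer arrives at 60. Writing and maintaining this kind of bookkeeping yourself is the extra work the paragraph above refers to.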

Enter Pulsar

For each Kafka constraint highlighted above, Pulsar has a solution. The following sections describe some of the benefits of Pulsar.

Persistent data storage

Pulsar offers the same message streaming and publishing capabilities as Kafka, but adds the ability to store data for longer periods of time.

Pulsar persists data with Apache BookKeeper. BookKeeper manages the data and helps offload data persistence outside of the cluster. You can also use other storage media such as AWS S3 to store data and scale beyond the limits of a local disk, allowing you to expand your applications without storage problems.

In addition, Pulsar includes a tiered storage feature that helps move data between hot and cold storage options. Data can then be kept in cold storage for as long as the business needs, so the cluster does not require continuous infrastructure resizing for storage.

Pulsar also automatically moves older messages from BookKeeper to a cheaper cold storage option by making a segment of the data immutable. The immutable segment can be moved to cheaper storage, effectively allowing Pulsar to hold a practically unlimited amount of data.
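The seal-then-offload idea can be illustrated with a toy log. This is a simplified model and not Pulsar's implementation: the `TieredLog` class, its segment size, and the "keep one sealed segment hot" policy are all assumptions made for the example, and the `cold` list merely stands in for object storage such as S3.

```python
class TieredLog:
    """Toy log that seals full segments and offloads the oldest ones to cold storage."""

    def __init__(self, segment_size, hot_segment_limit):
        self.segment_size = segment_size
        self.hot_segment_limit = hot_segment_limit
        self.open_segment = []  # still being written
        self.hot_sealed = []    # sealed, immutable segments on fast storage
        self.cold = []          # offloaded segments (stand-in for object storage)

    def append(self, message):
        self.open_segment.append(message)
        if len(self.open_segment) == self.segment_size:
            # Sealing makes the segment immutable, which is what makes it safe to move.
            self.hot_sealed.append(tuple(self.open_segment))
            self.open_segment = []
            self._offload()

    def _offload(self):
        while len(self.hot_sealed) > self.hot_segment_limit:
            self.cold.append(self.hot_sealed.pop(0))  # oldest segment goes first

    def read_all(self):
        """Readers see one continuous log regardless of where each segment lives."""
        for segment in self.cold + self.hot_sealed:
            yield from segment
        yield from self.open_segment

log = TieredLog(segment_size=3, hot_segment_limit=1)
for message in range(10):
    log.append(message)

print(log.cold, log.hot_sealed, log.open_segment)
```

After ten messages, the two oldest segments sit in cold storage, one sealed segment stays hot, and the newest message is in the open segment, yet `read_all()` still returns all ten in order.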

Developer experience

From a developer’s perspective, Pulsar offers an integrated, lightweight client library for all major languages (Java, Python, Go, and C#). The libraries help developers get started quickly with the platform, which is key to developing and publishing applications at scale. Pulsar’s binary protocol allows the client libraries’ capabilities to be extended as needed, making them suitable for growth. (The Pulsar documentation lists the available and officially supported client libraries.)

Pulsar Functions

Pulsar Functions is an out-of-the-box feature that allows developers to write custom code that can process messages in the message stream without having to deploy a system like Apache Heron, Apache Flink, or Apache Storm.
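As a rough illustration, a Pulsar Function written in Python can be as small as a module-level `process` function that receives each message and returns the value to publish. The transformation below (trimming whitespace and appending an exclamation mark) is just a placeholder, and deploying the function and wiring it to input and output topics is done through Pulsar's admin tooling, which is not shown here.

```python
def process(input):
    """Transform one incoming message; the return value is published downstream."""
    return f"{input.strip()}!"

# Calling the function locally, outside Pulsar, to see what it would emit.
print(process("  hello  "))
```

Because the function is plain Python, it can be tested locally like any other code before it is deployed into the cluster.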

Pulsar Functions also underpin Pulsar IO, a serverless connector framework that makes it easier to move data into and out of Pulsar. This out-of-the-box system allows Pulsar to connect to external SQL and NoSQL databases such as Apache Cassandra.

Additionally, this message processing is stream-native, which means that the messages are processed and transformed within the cluster before being delivered to the consumers. Because Pulsar Functions are the computing infrastructure of the Pulsar messaging system, they support business-level goals including developer productivity, ease of troubleshooting, and operational simplicity—traits critical to application and team performance when working at scale.

Scalability

In addition to the features and services mentioned above and their impact on scalability, Pulsar offers several features that make it a scalable option for your organization’s message streaming and publishing needs.

Pulsar’s geo-replication capability supports scaling across regions. The cluster replicates data to multiple locations around the world so that it remains available if a disaster brings down the application in one location. Replication is supported both synchronously and asynchronously. Asynchronous replication is faster but offers weaker data consistency guarantees than synchronous replication.

Pulsar uses a broker-per-topic concept, which ensures that the same broker handles all requests for a topic. The Pulsar architecture shows how the broker-based approach improves the system’s performance compared to a Kafka cluster.
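The "one broker per topic" idea can be sketched as a stable mapping from topic name to broker. This is a deliberate simplification: Pulsar's actual assignment works on groups of topics and takes broker load into account, and ownership can move when a broker fails. The `owner_for` function and the CRC32 hash are assumptions made for the illustration.

```python
import zlib
from typing import List

def owner_for(topic: str, brokers: List[str]) -> str:
    """Map a topic name to exactly one broker using a stable hash."""
    return brokers[zlib.crc32(topic.encode("utf-8")) % len(brokers)]

brokers = ["broker-1", "broker-2", "broker-3"]
topics = ["persistent://tenant/ns/orders", "persistent://tenant/ns/payments"]

for topic in topics:
    # Every lookup for the same topic resolves to the same broker.
    print(topic, "->", owner_for(topic, brokers))
```

Since every producer and consumer of a topic resolves to the same owner, all reads and writes for that topic go through a single broker.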

Wrap up

Kafka and Pulsar share some similarities, but there are some fundamental differences that should be considered when choosing which platform to use – especially if you need scalability.

Kafka’s architecture, storage capabilities, and ease of use present numerous challenges that can hamper a company’s ability to grow. Trying to scale your Kafka clusters past a certain point gets expensive and is often more trouble than it’s worth. From the way data is stored to the way it supports message transformation, Pulsar is a next-generation, unified alternative to Kafka that is designed for scalability.

Learn more about DataStax Astra Streaming, built on Apache Pulsar and delivered as a fully managed service.
