Kafka Alternatives: Streaming with Pathway and MinIO

Pathway Team · August 27, 2024

Introduction

Apache Kafka is an event streaming platform for publishing and consuming data streams. It is the de facto standard for building real-time streaming data pipelines.

However, migrating to Kafka can be challenging because of Kafka's complexity and operational overhead. Often, relying on a simpler alternative might be a better solution.

In this article, you will learn a new approach to setting up a streaming pipeline without relying on Kafka: using a pipeline based on Pathway and MinIO.

Migrating to Kafka is a challenge

Despite its popularity, many companies struggle to migrate to Kafka due to its inherent complexity and operational overhead:

  • Substantial Setup and Maintenance: Kafka demands considerable effort in terms of setup and ongoing maintenance and requires specialized expertise to manage effectively.
  • Resource-Intensive Infrastructure: Kafka's infrastructure needs are heavy, often requiring additional components like Zookeeper, which add to both the cost and complexity.
  • Overkill for Smaller Use Cases: For simpler or smaller-scale applications, Kafka can be excessive, making it a less practical choice.
  • Integration Challenges: Adopting Kafka can be challenging when integrating with non-Java systems or legacy applications, further complicating the migration process.

Kafka Alternative: Pathway + MinIO

Instead of using Kafka as a message queue, we propose to use Delta Tables on S3-compatible storage, such as MinIO, to manage your data streams. Pathway will handle the reading and writing of data to and from these Delta Tables, enabling a seamless and efficient streaming architecture. This solution simplifies the infrastructure, reduces operational overhead, and offers a more flexible and cost-effective alternative to Kafka, all while maintaining the robustness and scalability required for real-time data processing.

The following section quickly presents the basics of MinIO, S3, and Delta Tables. If you are already familiar with those, you can go straight into the code.

MinIO as S3-storage for Delta Tables

MinIO is an S3-compatible object storage service, offering scalable storage for unstructured data like files, images, and backups. It supports creating and managing buckets, object versioning, and access controls to keep your data well-organized and secure.

S3 storage, short for Simple Storage Service, is a cloud-based solution that lets you store and retrieve any amount of data from anywhere on the web. You can use MinIO as a drop-in replacement for Amazon S3, enjoying the same features such as scalability, durability, and flexibility. This compatibility allows you to handle data storage needs efficiently without relying on Amazon’s infrastructure, making MinIO a great alternative for your cloud storage requirements.

The main difference between S3 storage and a file system is how data is organized and accessed. In S3 storage, data is represented as objects within buckets, each object having a unique key and metadata. Inside a bucket, objects are accessed in a key-value fashion: you retrieve each object by its key, and there is no real directory hierarchy or ordering.
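
As a quick illustration, here is what this key-value access looks like with boto3 pointed at a MinIO endpoint; the endpoint, bucket, and credentials below are placeholders:

import boto3

# S3-compatible client: only the endpoint_url differs from plain Amazon S3.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.com:9000",
    aws_access_key_id="<MINIO_S3_ACCESS_KEY>",
    aws_secret_access_key="<MINIO_S3_SECRET_ACCESS_KEY>",
)

# Objects are written and read by key; prefixes like "events/" only look like folders.
s3.put_object(Bucket="my-bucket", Key="events/event-1.json", Body=b'{"id": 1}')
obj = s3.get_object(Bucket="my-bucket", Key="events/event-1.json")
print(obj["Body"].read())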

What are Delta Tables?

Delta Tables is an ACID-compliant storage layer implemented through Delta Lake. Delta Tables are a way to make object storage behave like database tables: They track data changes, ensure consistency, support schema evolution, and enable high-performance analytics. Delta Tables are especially useful for real-time data ingestion and processing, providing an append-only mechanism for writing and reading data.
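
The append-and-read behavior is easy to see locally with the deltalake (delta-rs) Python package; the table path and columns below are illustrative:

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Each append is an atomic commit that creates a new table version.
batch = pa.table({"message": ["hello"], "timestamp": [1724403133066]})
write_deltalake("/tmp/demo-table", batch, mode="append")

dt = DeltaTable("/tmp/demo-table")
print(dt.version())    # latest committed version
print(dt.to_pandas())  # current state of the table

The same code works against a MinIO bucket by using an s3:// URI together with storage options, which is exactly what the producer script does later in this article.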

Practical ETL Example: Switching from Kafka to MinIO+Delta Tables

As an example, you will build a simple ETL use case, but instead of Kafka, the data streams will be handled using Delta Tables on MinIO. The ETL pipeline will be the same as one of our Kafka ETL examples: Kafka ETL: Processing Event Streams in Python.

In a nutshell, you have two data sources sending times from different time zones. Pathway will ingest those times and convert them into timestamps. For more information about how the pipeline works, you can read the associated article.
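
For reference, a plausible definition of the InputStreamSchema used by the connectors further down could look like this (the exact column names live in the repository's etl.py and may differ):

import pathway as pw

class InputStreamSchema(pw.Schema):
    date: str     # time string sent by the producer, together with its timezone
    message: str  # payload of the message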

The only difference will be the connectors: Pathway will read from MinIO+Delta Table instead of Kafka.

Step by Step Guide

MinIO settings

First, you need a MinIO instance. You can host your own or use MinIO's managed offering. Keep your credentials at hand: you will need the MINIO_S3_ACCESS_KEY and the MINIO_S3_SECRET_ACCESS_KEY.

Before starting, you must create a bucket to store all the data. This is a crucial step, as buckets are not created on the fly: if you try to create or read a Delta Table in a non-existent bucket, the pipeline will fail.
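
You can create the bucket from the MinIO console, or programmatically with the MinIO Python client; the endpoint, bucket name, and credentials below are placeholders:

from minio import Minio

client = Minio(
    "minio.example.com:9000",
    access_key="<MINIO_S3_ACCESS_KEY>",
    secret_key="<MINIO_S3_SECRET_ACCESS_KEY>",
)

# Create the bucket once, before running the pipeline.
if not client.bucket_exists("my-streaming-bucket"):
    client.make_bucket("my-streaming-bucket")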

Obtaining the sources

First, download the sources from our GitHub repository. The project has two directories:

.
├── minio-ETL/
└── benchmarks/

The ETL project is in the minio-ETL/ folder, organized as follows:

.
├── .env
├── base.py
├── etl.py
├── producer.py
├── read-results.py
└── README.md
  • .env: environment file where you will write your MinIO credentials.
  • base.py: base configuration settings required for accessing the MinIO S3 storage.
  • etl.py: Pathway ETL pipeline that will be executed. It includes the loading and writing to Delta Tables on MinIO. It uses base.py to configure the accesses.
  • producer.py: script generating messages with timestamps from two different timezones (New York and Paris) and writing them to the corresponding Delta Tables.
  • read-results.py: Use this file to read the resulting data stream created by the ETL pipeline. It reads the Delta Table and outputs a local CSV file.

Configuration

You need to put your credentials in the .env file:

MINIO_S3_ACCESS_KEY=*******
MINIO_S3_SECRET_ACCESS_KEY=*******

Update the configurations in base.py and producer.py:

  • In the base.py file, you must fill the bucket and the endpoint.
  • In the producer.py file, you must update the storage_options dictionary. Do not forget the AWS_REGION (see the sketch below).
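
Both configurations could look like the following, assuming Pathway's pw.io.s3.AwsS3Settings for base.py and the standard delta-rs storage option keys for producer.py; all values are placeholders:

import os
import pathway as pw

# base.py: connection settings passed to the Pathway Delta Lake connectors.
s3_connection_settings = pw.io.s3.AwsS3Settings(
    access_key=os.environ["MINIO_S3_ACCESS_KEY"],
    secret_access_key=os.environ["MINIO_S3_SECRET_ACCESS_KEY"],
    bucket_name="my-streaming-bucket",
    region="eu-central-1",
    endpoint="minio.example.com",
)

# producer.py: the delta-rs writer takes a storage_options dictionary instead.
storage_options = {
    "AWS_ACCESS_KEY_ID": os.environ["MINIO_S3_ACCESS_KEY"],
    "AWS_SECRET_ACCESS_KEY": os.environ["MINIO_S3_SECRET_ACCESS_KEY"],
    "AWS_REGION": "eu-central-1",
    "AWS_ENDPOINT_URL": "https://minio.example.com",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",  # needed when no S3 locking provider is configured
}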

ETL Read and Write Updates

Regarding the implementation, the only difference between the ETL pipeline with Kafka and this one is the connectors: how to read and write the data streams. Pathway provides Delta Lake connectors to read and write Delta Tables on S3 easily.

Data Ingestion - Reading and Ingesting Delta Lake Tables with Pathway

timestamps_timezone_1 = pw.io.deltalake.read(
    base_path + "timezone1",
    schema=InputStreamSchema,
    s3_connection_settings=s3_connection_settings,
    autocommit_duration_ms=100,  # Auto-commit duration in milliseconds
)

timestamps_timezone_2 = pw.io.deltalake.read(
    base_path + "timezone2",
    schema=InputStreamSchema,
    s3_connection_settings=s3_connection_settings,
    autocommit_duration_ms=100,  # Auto-commit duration in milliseconds
)

Data Writing - Write Data to Delta Lake with Pathway

pw.io.deltalake.write(
    timestamps_unified,
    base_path + "timezone_unified",
    s3_connection_settings=s3_connection_settings,
    min_commit_frequency=100,  # Minimum time between commits, in milliseconds
)
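
Between the reads and the write, etl.py unifies the two streams and converts the times into timestamps. As a minimal sketch of that intermediate step (the real transformation, timezone handling included, lives in etl.py):

# Union of the two input tables; Pathway reassigns the row ids.
timestamps_unified = timestamps_timezone_1.concat_reindex(timestamps_timezone_2)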

Running the pipeline

Now that the project is ready, all that's left is to run it.

If you started by running etl.py directly, it would fail, as the input Delta Tables are not created yet. So start by running producer.py. Right after, and before producer.py is done, launch etl.py in another terminal. Pathway will process the incoming data stream until it is stopped. Once producer.py is done, you can stop Pathway and read the results using read-results.py.

  1. Run producer.py to generate the data stream.
  2. Run etl.py to launch Pathway. No need to wait for producer.py to end.
  3. Once producer.py is done, wait ~10 seconds and stop Pathway (Ctrl+C works).
  4. Run read-results.py to write the results in a CSV file.
  5. Read the results with cat results.csv

You should obtain something like this:

timestamp,message,time,diff
1724403133066.217,"0",1724403268648,1
1724403138388.917,"1",1724403269748,1
1724403144896.706,"3",1724403270848,1
1724403147095.393,"4",1724403271948,1
1724403149295.165,"5",1724403272948,1
1724403151499.115,"6",1724403274048,1
1724403153736.456,"7",1724403275148,1
1724403140576.744,"2",1724403276248,1
1724403158229.244,"9",1724403276248,1

Congratulations! You have successfully deployed a streaming pipeline on MinIO using Pathway and Delta Tables instead of a traditional messaging system like Kafka. Simple, right?

Benchmarking latency

To explore the viability of using Delta Tables on S3 as an alternative to traditional Kafka message queues, we conducted a series of latency benchmarks under various workloads. Our goal was to assess whether this setup could effectively replace Kafka, especially in scenarios where ultra-low latency isn't a strict requirement.

In these benchmarks, messages were written to a Delta Table hosted on a MinIO instance and then read back from the table. We measured latency as the time elapsed from when a message was sent until it was successfully received. The tests were designed to simulate high-throughput scenarios, with workloads of 10,000, 20,000, and 30,000 messages per second, each running for 10 minutes.

For those interested in reproducing these tests or adapting them to their own needs, the benchmark script is available here. Be aware that the machine configuration (in particular the location) can have an impact on the latency.

After a warm-up period, we observed the following latency distributions:

Workload (messages/sec)    p50 (s)    p75 (s)    p85 (s)    p95 (s)    p99 (s)
10,000                     0.49       0.50       0.50       0.50       1.48
20,000                     0.83       0.96       0.97       1.06       1.35
30,000                     0.65       0.97       1.05       1.08       1.33
  • Workload: Number of messages written per second.
  • Latency: Time measured in seconds.

Explanation: The columns labeled p50, p75, p85, p95, and p99 represent latency percentiles. For instance, p50 indicates that 50% of the messages experienced a latency lower than the value shown, while p95 means that 95% of the messages had a latency lower than the corresponding value.
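
Concretely, given the raw per-message latencies, the percentile columns can be computed as follows (illustrative values; in the benchmark they come from the send and receive timestamps):

import numpy as np

latencies_s = np.array([0.45, 0.49, 0.50, 0.51, 0.52, 1.40])  # example latencies in seconds
for p in (50, 75, 85, 95, 99):
    print(f"p{p}: {np.percentile(latencies_s, p):.2f} s")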

Key Insights

The solution demonstrates stable performance across 10,000 to 30,000 messages per second, with the 99th percentile latency averaging 1.4 ± 0.1 seconds. Performance is primarily influenced by the time required to upload and download objects from S3 or MinIO buckets. Using buckets in geographically closer locations can reduce latency and improve performance.

Our focus was on achieving consistent performance at standard workloads, rather than pushing for maximum throughput. As the workload increases, the primary challenge becomes the volume of data being transferred and downloaded, which can increase the cost of network errors.

The streamer component, written in Python, handles data dispatching effectively at lower rates. However, at higher rates, the increased communication between Python and the Rust core can become a bottleneck. Despite this, the approach performs well within the studied rates.

While this setup using Delta Tables on S3 and MinIO does not quite match Kafka in terms of raw latency performance, it demonstrates strong potential as a more accessible and flexible alternative. This is particularly true for applications where slightly higher latency is acceptable. If your application can tolerate latencies of around 1 second at higher throughputs, using Pathway and MinIO can be an efficient and reliable choice for managing your data streams.

Conclusion

Migrating to Kafka is challenging because of its complexity and resource demands. For organizations already using S3-compatible storage like MinIO, using Pathway with Delta Tables offers a simpler, more efficient way to handle real-time messaging between services: the Delta Tables stored on S3 act as the message queue, replacing Kafka.

It's a game-changer for teams that already use S3 storage and want to add a real-time pipeline to their infrastructure without depending on the Kafka ecosystem.