Kafka alternative: Stream processing in Python with Pathway, Delta Lake and MinIO
Introduction
Apache Kafka is an event streaming platform for publishing and consuming data streams. It is the de facto standard for building real-time streaming data pipelines.
However, migrating to Kafka can be challenging because of Kafka's complexity and operational overhead. Often, relying on a simpler alternative might be a better solution.
In this article, you will learn a new approach to setting up a streaming pipeline without relying on Kafka: using a pipeline based on Pathway and MinIO.
Migrating to Kafka is a challenge
Despite its popularity, many companies struggle to migrate to Kafka due to its inherent complexity and operational overhead:
- Substantial Setup and Maintenance: Kafka demands considerable effort in terms of setup and ongoing maintenance and requires specialized expertise to manage effectively.
- Resource-Intensive Infrastructure: Kafka's infrastructure needs are heavy, often requiring additional components like Zookeeper, which add to both the cost and complexity.
- Overkill for Smaller Use Cases: For simpler or smaller-scale applications, Kafka can be excessive, making it a less practical choice.
- Integration Challenges: Adopting Kafka can be challenging when integrating with non-Java systems or legacy applications, further complicating the migration process.
Kafka Alternative: Pathway + MinIO
Instead of using Kafka as a message queue, we propose to use Delta Tables on S3-compatible storage, such as MinIO, to manage your data streams. Pathway will handle the reading and writing of data to and from these Delta Tables, enabling a seamless and efficient streaming architecture. This solution simplifies the infrastructure, reduces operational overhead, and offers a more flexible and cost-effective alternative to Kafka, all while maintaining the robustness and scalability required for real-time data processing.
The following section quickly presents the basics of MinIO, S3, and Delta Tables. If you are already familiar with those, you can go straight into the code.
MinIO as S3-storage for Delta Tables
The MinIO Enterprise Object Store (EOS) is a high-performance, Kubernetes-native, S3-compatible object store offering scalable storage for unstructured data like files, videos, images, and backups. EOS delivers S3-like infrastructure across public clouds, private clouds, on-prem and the edge. It offers a rich suite of enterprise features targeting security, resiliency, data protection, and scalability. MinIO's EOS is commonly used to build streaming data pipelines and AI data lakes because it is highly scalable, performant and durable but can also handle backup and archival workloads - all from a single platform.
S3 storage, short for Simple Storage Service, is a cloud-based solution that lets you store and retrieve any amount of data from anywhere on the web. You can use MinIO as a drop-in replacement for Amazon S3, enjoying the same features such as scalability, durability, and flexibility. This compatibility allows you to handle data storage needs efficiently without relying on Amazon's infrastructure, making MinIO a great alternative for your cloud storage requirements.
The main difference between S3 storage and a file system is how they manage and access the data. In S3 storage, data is represented as objects within buckets, each object having a unique key and metadata. Inside a bucket, objects are accessed in a key-value store fashion: you access them by their key, and there is no extension or ordering.
What are Delta Tables?
Delta Tables are an ACID-compliant storage layer implemented through Delta Lake. They are a way to make object storage behave like database tables: they track data changes, ensure consistency, support schema evolution, and enable high-performance analytics. Delta Tables are especially useful for real-time data ingestion and processing, providing an append-only mechanism for writing and reading data.
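To make the append-only behavior concrete, here is a minimal, self-contained sketch using the open-source delta-rs Python package (deltalake), independent of the Pathway pipeline shown later; a local path is used instead of S3 so it runs without credentials.

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Each write appends a new versioned commit; previous data is never overwritten.
batch = pd.DataFrame({"message": ["0", "1"], "timestamp": [1724403133.0, 1724403138.3]})
write_deltalake("./events_table", batch, mode="append")

# Readers always see a consistent snapshot and can inspect the commit history.
dt = DeltaTable("./events_table")
print(dt.version())    # latest committed version
print(dt.to_pandas())  # table contents at that version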
Practical ETL Example: Switching from Kafka to MinIO+Delta Tables
As an example, you will build a simple ETL use case, but instead of Kafka, the data streams will be handled using Delta Tables on MinIO. The ETL pipeline will be the same as one of our Kafka ETL examples: Kafka ETL: Processing Event Streams in Python.
In a nutshell, you have two data sources sending times from different time zones. Pathway will ingest those times and convert them into timestamps. For more information about how the pipeline is done, you can read the associated article.
The only difference will be the connectors: Pathway will read from MinIO+Delta Table instead of Kafka.
Step by Step Guide
MinIO settings
First, you need a MinIO instance.
You can host your own or use the MinIO offering.
Keep your credentials: you will need the MINIO_S3_ACCESS_KEY and the MINIO_S3_SECRET_ACCESS_KEY.
Before starting, you must create a bucket to store all the data. This is a crucial step as you cannot create a bucket on the fly: if you try to create/read a document on a non-existing bucket, the pipeline will fail.
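As a reminder of that prerequisite, here is a hedged sketch of creating the bucket up front with the MinIO Python SDK; the endpoint, credentials, and bucket name below are placeholders to replace with your own (you can equally use the MinIO console or the mc CLI).

from minio import Minio  # pip install minio

# Placeholders: substitute your own endpoint, credentials, and bucket name.
client = Minio(
    "minio.example.com:9000",
    access_key="<MINIO_S3_ACCESS_KEY>",
    secret_key="<MINIO_S3_SECRET_ACCESS_KEY>",
    secure=True,
)
if not client.bucket_exists("my-pathway-bucket"):
    client.make_bucket("my-pathway-bucket")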
Obtaining the sources
First, download the sources from our GitHub repository. The project has two directories:
.
├── minio-ETL/
└── benchmarks/
The ETL project you need is in the minio-ETL/ folder (link), organized as follows:
.
├── .env
├── base.py
├── etl.py
├── producer.py
├── read-results.py
└── README.md
- .env: environment file where you will write your MinIO credentials.
- base.py: base configuration settings required for accessing the MinIO S3 storage.
- etl.py: the Pathway ETL pipeline that will be executed. It includes the loading and writing to Delta Tables on MinIO and uses base.py to configure the access.
- producer.py: script generating messages with timestamps from two different time zones (New York and Paris) and writing them to the corresponding Delta Tables.
- read-results.py: use this file to read the resulting data stream created by the ETL pipeline. It reads the Delta Table and outputs a local CSV file.
Configuration
You need to put your credentials in the .env file:
MINIO_S3_ACCESS_KEY = *******
MINIO_S3_SECRET_ACCESS_KEY = *******
Update the configuration in base.py and producer.py:
- In the base.py file, you must fill in the bucket and the endpoint (see the sketch after this list).
- In the producer.py file, you must update the storage_option dictionary. Do not forget the AWS_REGION.
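For orientation, here is a minimal sketch of what base.py can look like, assuming Pathway's pw.io.s3.AwsS3Settings helper and the python-dotenv package; the bucket name, endpoint, and base_path below are placeholders, and the repository's actual base.py may differ.

# Hypothetical sketch of base.py; the repository's version may differ.
import os
import pathway as pw
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads MINIO_S3_ACCESS_KEY / MINIO_S3_SECRET_ACCESS_KEY from .env

bucket = "my-pathway-bucket"    # placeholder: your bucket name
endpoint = "minio.example.com"  # placeholder: your MinIO endpoint

s3_connection_settings = pw.io.s3.AwsS3Settings(
    bucket_name=bucket,
    endpoint=endpoint,
    access_key=os.environ["MINIO_S3_ACCESS_KEY"],
    secret_access_key=os.environ["MINIO_S3_SECRET_ACCESS_KEY"],
    with_path_style=True,  # commonly required for MinIO-style endpoints
)

base_path = f"s3://{bucket}/deltalake/"  # prefix used by the connectors below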
ETL Read and Write Updates
Regarding the implementation, the only difference between the ETL pipeline with Kafka and this one is the connectors: how to read and write the data streams. Pathway provides Delta Lake connectors to read and write Delta Tables on S3 easily.
Data Ingestion - Reading and Ingesting Delta Lake Tables with Pathway
timestamps_timezone_1 = pw.io.deltalake.read(
base_path + "timezone1",
schema=InputStreamSchema,
s3_connection_settings=s3_connection_settings,
autocommit_duration_ms=100, # Auto-commit duration in milliseconds
)
timestamps_timezone_2 = pw.io.deltalake.read(
base_path + "timezone2",
schema=InputStreamSchema,
s3_connection_settings=s3_connection_settings,
autocommit_duration_ms=100, # Auto-commit duration in milliseconds
)
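The calls above reference base_path, s3_connection_settings, and InputStreamSchema, which are defined in base.py and etl.py. For illustration only, the input schema can be declared as a pw.Schema along these lines; the exact field names are an assumption based on the producer's output:

import pathway as pw

# Illustrative schema: the authoritative definition lives in etl.py.
class InputStreamSchema(pw.Schema):
    message: str
    timestamp: float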
Data Writing - Write Data to Delta Lake with Pathway
pw.io.deltalake.write(
timestamps_unified,
base_path + "timezone_unified",
s3_connection_settings=s3_connection_settings,
min_commit_frequency=100, # Auto-commit duration in milliseconds
)
Running the pipeline
Now that the project is ready, all that's left for you to do is run it.
If you were to run etl.py directly, it would fail, as the input Delta Tables are not created yet. So start by running producer.py. Right after, and before producer.py is done, launch etl.py in another terminal. Pathway will process the incoming data stream until it is stopped. Once producer.py is done, you can stop Pathway and read the results using read-results.py.
- Run producer.py to generate the data stream.
- Run etl.py to launch Pathway. No need to wait for producer.py to end.
- Once producer.py is done, wait ~10 seconds and stop Pathway (Ctrl+C works).
- Run read-results.py to write the results to a CSV file.
- Read the results with cat results.csv
You should obtain something like this:
timestamp,message,time,diff
1724403133066.217,"0",1724403268648,1
1724403138388.917,"1",1724403269748,1
1724403144896.706,"3",1724403270848,1
1724403147095.393,"4",1724403271948,1
1724403149295.165,"5",1724403272948,1
1724403151499.115,"6",1724403274048,1
1724403153736.456,"7",1724403275148,1
1724403140576.744,"2",1724403276248,1
1724403158229.244,"9",1724403276248,1
Congratulations! You have successfully deployed a data streaming pipeline on MinIO using Pathway and Delta Tables instead of a traditional messaging system like Kafka. Simple, right?
Benchmarking
The streaming warehousing space is seeing a lot of growth, and Kafka is increasingly being promoted as a data storage solution for low-latency applications. Although the cost of managed Kafka services is decreasing, it can still be significantly more expensive than standard S3 storage.
Pathway provides a seamless ETL layer for both Kafka and S3, and a common question we hear from users building new data stacks is: should I choose Kafka or S3? A few months ago, we conducted some tests to measure the throughput and latency of a Kafka-based data backbone. We found that you can achieve 1 million messages per second with 95% of latencies in the 100-200ms range when using Kafka with Pathway.
Today, we're exploring how far you can go without using Kafka by relying solely on a self-hosted MinIO setup. We'll use Pathway to store and retrieve data in Delta Table format on MinIO, eliminating the need for extra investments. This approach can also be applied to any other S3-compatible storage system.
Initial setup
We conducted a series of latency benchmarks to evaluate the feasibility of using Delta Tables on S3 as a potential alternative to traditional Kafka message queues. Our goal was to determine if this setup could replace Kafka in cases where ultra-low latency isn't essential.
In these tests, messages were written to a Delta Table hosted on a MinIO instance, then read back from the table. Latency was measured as the time from when a message was sent to when it was successfully received. We simulated high-throughput scenarios with workloads ranging from 10,000 to 250,000 entries per second, each test running for 5 minutes.
If you'd like to replicate or modify these tests, the benchmark script is available here. Keep in mind that latency can be influenced by machine configuration.
Key setup considerations included:
- We used a self-hosted MinIO bucket on the same network as the test machine to minimize network impact on latency.
- All Pathway programs were single-threaded, ensuring results are reproducible on a single core.
- The first 30 seconds of each test were excluded from the results to allow for a warm-up period.
There is a reason for the warm-up period. To keep the code simple, it works like this: first, the producer starts, and only after some entries are written to the Delta Table (hosted on MinIO or other S3-compatible storage), the consumer is launched. This creates a situation where the initial entries have higher latency because the consumer needs time to catch up with the data that was written before it started. As a result, the latency of these initial entries is not accurate and should be ignored.
After the warm-up period, we collected key latency data points, which represent latency percentiles measured at the end of a five-minute simulation for each workload. Although the latency remains finite by the end of the process, when latency exceeds the stable threshold - in this case, 2 seconds for the 99th percentile - it indicates that the system is falling behind. This happens because the system can't process messages as quickly as they arrive, leading to an ongoing accumulation of lag.
Workload (messages/sec) | p50 (s) | p75 (s) | p85 (s) | p95 (s) | p99 (s) |
---|---|---|---|---|---|
10,000 | 0.94 | 0.94 | 0.94 | 0.95 | 1.94 |
70,000 | 1.45 | 1.72 | 1.84 | 1.89 | 2.05 |
150,000 | 1.68 | 1.94 | 2.04 | 2.18 | 3.09 |
250,000 | 1.98 | 2.27 | 2.41 | 3.36 | 4.58 |
- Workload: The number of messages written per second.
- Latency: Time measured in seconds.
- The columns labeled p50, p75, p85, p95, and p99 represent latency percentiles. For example, p50 means that 50% of the messages had a latency lower than the shown value, while p95 indicates that 95% of the messages had a latency lower than the corresponding value.
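As a quick illustration of how such percentiles are computed from raw per-message latencies (generic code, not the benchmark script itself):

import numpy as np

# Toy latencies in seconds, standing in for the benchmark's per-message measurements.
latencies_s = np.random.gamma(shape=2.0, scale=0.5, size=100_000)
for p in (50, 75, 85, 95, 99):
    print(f"p{p}: {np.percentile(latencies_s, p):.2f} s")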
Below is the full latency graph for the 50th, 75th, 85th, 95th, and 99th percentiles.
As shown, the 99th percentile of latencies remains stable for rates up to 70,000 records per second but increases as the rate exceeds this threshold. This indicates that, beyond a certain point, the system starts to fall behind, leading to a processing delay. This trend is highlighted in the graph for the 99th percentile at a streaming rate of 250,000 records per second.
At the start, latency is about 2 seconds, which is consistent with lower streaming rates, but it increases as the test progresses. The graphs show gradual jumps in latency, corresponding to times when the system can't process incoming data in full before the next batch arrives. Although latency decreases slightly before each jump, the system never fully catches up. This causes the 99th percentile latency to keep rising indefinitely.
In contrast, let's look at the 99th percentile latency graph for a lower rate of 50,000 entries per second, streamed over a longer period - 30 minutes. While there are some latency spikes, they are relatively small, reaching up to 150 milliseconds in the 99th percentile, and are quickly resolved. This highlights the difference between a system that falls behind, accumulating unresolved latency spikes, and a well-functioning system that processes data steadily and can recover from occasional delays.
Understanding bottlenecks
As shown above, the approach we discussed enables durable messaging at a rate of 50,000 messages per second, with a stable 99th percentile, as illustrated in the graph. Although achieving this rate without a complex system is useful for certain applications, the latency remains significantly higher than what Kafka and other messaging technologies provide.
To understand the reasons for this latency, let's break down what happens in the streamer:
- When a message is emitted, it first passes through the Delta Lake connector, which accumulates batches for up to one second.
- After the one-second threshold, the data is committed to the Delta Lake hosted on S3. Most of the time spent here is due to network communication.
- On the consumer side, the system polls the Delta Table for updates at intervals of less than 100 milliseconds.
- Once the consumer detects a change in the Delta Table version, it retrieves the incremental updates and sends them to the next Pathway computational batch.
- Pathway finalizes computational batches every second, meaning some entries may wait for the next batch to be created.
- Finally, the statistics calculation code receives the messages via a Python connector, which adds latency due to communication between the Rust core and the Python interface.
The final latency is influenced by several factors:
- Time spent creating batches for submission to and retrieval from Delta Lake.
- Network communication time between the server and Pathway code.
- Processing time, including parsing the Parquet data from Delta Lake, Rust-Python communication, and other minor overheads.
The system starts to fall behind when the minibatches communicated through Delta Lake grow large enough that processing them takes longer than the time available between receiving batches.
Another important consideration is scalability and partitioning. While Kafka handles partitioning seamlessly, implementing it with Pathway over Delta Tables requires more effort. You need to set up multiple simultaneous Delta Table readers and writers and define logic to determine partition numbers for each processed entry.
Trading throughput for better latency
Some of the bottlenecks mentioned earlier can be reduced. For instance, you can decrease batch sizes, which will shorten the delay between when Pathway receives a record and when it sends the batch to the remote Delta Table. This can be done by reducing the autocommit_duration_ms and min_commit_frequency parameters in the messaging code by a factor of 10, bringing them down to 100 milliseconds, as sketched below.
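For reference, here is a hedged sketch of where those two parameters sit in the connector calls, mirroring the earlier ETL snippets; the table paths are placeholders and the benchmark script itself may be organized differently.

# Read side: flush input minibatches every 100 ms instead of every second.
messages = pw.io.deltalake.read(
    base_path + "messages",  # placeholder path
    schema=InputStreamSchema,
    s3_connection_settings=s3_connection_settings,
    autocommit_duration_ms=100,
)

# Write side: commit to the output Delta Table at least every 100 ms.
pw.io.deltalake.write(
    messages,
    base_path + "messages_out",  # placeholder path
    s3_connection_settings=s3_connection_settings,
    min_commit_frequency=100,
)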
This adjustment has a clear benefit: Pathway experiences much shorter communication delays. However, it also significantly increases the number of network communications, which limits the possibility of further reducing batching delays.
Here's how this setup affects performance:
Workload (messages/sec) | p50 (s) | p75 (s) | p85 (s) | p95 (s) | p99 (s) |
---|---|---|---|---|---|
10,000 | 0.26 | 0.35 | 0.36 | 0.53 | 0.67 |
30,000 | 0.28 | 0.31 | 0.32 | 0.38 | 0.64 |
70,000 | 1.24 | 16.06 | 24.82 | 35.66 | 41.32 |
As you can see, the optimization significantly improves latency at rates of 10,000 to 30,000 messages per second, and the improvement holds up until around 60,000 messages per second. The timings are stable, with only minor fluctuations (around 50 ms), caused by network and batching delays. The improvement is about four times for median processing time and up to three times for the 99th percentile.
However, at rates above 60,000 messages per second, the system starts to fail:
While latency improves, throughput suffers. The system now makes ten times more network calls and synchronization operations, such as polling for the latest Delta Table versions, which reduces the maximum throughput it can handle. Additionally, this increases the impact of network errors: if access to the Delta Table is lost even briefly, it may be more difficult for the system to recover.
Conclusion
Migrating to Kafka can be difficult due to its complexity and high resource requirements. For organizations that already use S3-compatible storage like MinIO and don't need ultra-low latency, Pathway with Delta Tables offers a simpler and more efficient solution for real-time messaging between services: the Delta Tables store the data on S3 and forward the messages, providing an alternative to Kafka.
This approach is a game-changer for teams that already use S3 storage and want to add a real-time pipeline to their infrastructure without relying on the Kafka ecosystem.