App Templates showcasedata-pipeline

Python Kafka Alternative: Achieve Sub-Second Latency with your S3 Storage without Kafka using Pathway

Get updates on Upcoming App Templates and Blogs

Pathway Team

·Published August 27, 2024.Updated August 27, 2024·0 min read

Apache Kafka is a distributed event store for publishing and consuming data streams. It is a de facto choice for building real-time streaming data pipelines. However, Kafka's complexity and high infrastructure demands make it a costly and challenging solution for message queuing. It requires dedicated server clusters, ongoing maintenance, and fine-tuning, which can add significant expense and operational overhead.

Deploying and maintaining self-managed Apache Kafka clusters for minimal use cases will cost you over $20,000 monthly according to Confluent estimates. Even using managed services like Confluent Cloud will be roughly $1,800 per month—and that's just for the service itself, not including the additional costs of risk management and operations. What if there's a simpler, more cost-effective way to achieve sub-second latency for your streaming applications—all while leveraging your existing S3 storage and keeping your system straightforward?

Introducing a Kafka-free architecture using Pathway (GitHub repo) alongside Delta Tables on S3-compatible storage, such as MinIO. Pathway is a stream processing engine with a Rust engine that allows you to build efficient, real-time data pipelines in Python for both on-premise and cloud environments (AWS, Google Cloud, Azure). In this architecture, Pathway manages the reading and writing of data to and from Delta Tables, creating a seamless, scalable, and cost-effective streaming solution that leverages your existing S3 storage—eliminating the need for Kafka.

In these benchmarks, Pathway achieved stable near-real-time latency for workloads up to 250,000 messages per second. Reducing the batch size provides sub-second latency for workloads up to 60,000 messages per second, demonstrating performance suitable for many real-time applications while significantly reducing complexity and cost compared to Kafka.

When using Pathway in conjunction with Delta Lake and MinIO, you will:

Reduce the number of components you need to manage.
Avoid the high expenses associated with deploying and maintaining Kafka clusters.
Achieve sub-second latency using your existing S3-compatible storage.
Keep it simple.

In this article, we'll guide you through setting up a streaming data pipeline, present benchmark results, and explore how you can do this setup to simplify your infrastructure while delivering the performance you need.

Table of Contents

Challenges with Kafka
Using Pathway as a Python Kafka Alternative for Your S3 Storage
Some Use Cases Where This Kafka Alternative Works
Simplified Architecture for Streaming with S3
1. MinIO as S3-storage for Delta Tables
2. What are Delta Tables?
Building Your Streaming Pipeline Without Kafka
Benchmarks for this S3 Streaming Setup without Kafka
1. Benchmark Setup:
Key Findings
AWS Kafka Alternative
Google Cloud Kafka Alternative
Azure Kafka Alternative
TL;DR
Essential Links

Challenges with Kafka

Kafka is a common tool for handling high-throughput, real-time data streams, but it comes with significant challenges:

Costs

Self-Managed Clusters: Deploying and maintaining Kafka clusters can cost over $20000 monthly. This includes expenses for hardware, maintenance, and skilled personnel.
Managed Services: Just for services like Confluent Cloud you will have to pay around $1800 per month for processing 50k small messages per second, excluding higher additional costs like risk management and operations.

Complexity

Kafka requires setting up and managing multiple components like brokers, zookeepers, producers, and consumers.
Ensuring availability, fault tolerance, and scalability demands continuous effort and expertise.
Your team will need to invest time to learn Kafka's intricacies.

Using Pathway as a Python Kafka Alternative for Your S3 Storage

You don't need to get into the complexities of Kafka to achieve sub-second latency.

Instead of using Kafka as a message queue, we propose to use Delta Tables on S3-compatible storage, such as MinIO, to manage your data streams. Pathway provides the streaming capabilities that Delta Tables alone lack: Pathway continuously monitors, reads, and processes new data entries and updates onto the Delta Tables. By handling the reading and writing of data to and from these Delta Tables, Pathway provides an easy-to-deploy and efficient way to process data streams on S3.

With Pathway in integration with your existing S3 storage, you can handle up to 60,000 messages per second with 0.7s (p99) latency, and even 250k messages per second with near-real-time latency. Here you will see how you will get:

Automatic Data Updates
Sub-Second Latency
Lightweight and Efficient pipelines
Easy learning curve as you can build pipelines in Python
Much lower infrastructure and licensing costs

Some Use Cases Where This Kafka Alternative Works

While the achieved performance of Pathway and Delta Tables over MinIO—sub-second latency for workloads up to 60,000 messages per second and near-real-time latency (1-4 seconds) for workloads up to 250,000 messages per second—is an order of magnitude lower than the best possible with Kafka, it is still more than sufficient for a wide range of applications, such as:

Logistics and IoT: Collecting data from numerous sensors in logistics, industrial automation systems, and agricultural equipment.
Financial Services: Processing transactions, fraud detection, and handling market data feeds in near realtime or realtime.
Web and Mobile Analytics: Monitoring user interactions, managing real-time bidding, and tracking ad impressions.

This is not an exhaustive list, but it illustrates the wide range of scenarios this setup serves as an effective Python Kafka alternative for real-time data processing needs.

Simplified Architecture for Streaming with S3

Using Pathway with S3-compatible storage like MinIO allows you to create a streaming architecture that is both effective and straightforward. Unlike traditional Kafka setups, which involve managing clusters, brokers, and zookeepers, a Pathway-based solution simplifies your infrastructure by relying on your existing S3 storage. This architecture reduces your maintenance overhead, costs, and the need for complex configurations, all while delivering reliable, near-real-time data processing.

This simplified architecture focuses on three main components:

S3 Storage: Messages are stored on scalable, S3-compatible storage systems, such as MinIO, which provide cost-effective and reliable storage for streaming workloads.
Delta Tables: Acting as the data format, Delta Tables ensure efficient and consistent storage of messages, with support for ACID transactions and schema enforcement.
Pathway: Pathway serves as the processing engine that seamlessly reads and writes data to and from Delta Tables, enabling streamlined management of data streams.

MinIO as S3-storage for Delta Tables

The MinIO Enterprise Object Store (EOS) is a high-performance, Kubernetes-native, S3-compatible object store offering scalable storage for unstructured data like files, videos, images, and backups. EOS delivers S3-like infrastructure across public clouds, private clouds, on-prem and the edge. It offers a rich suite of enterprise features targeting security, resiliency, data protection, and scalability. MinIO's EOS is commonly used to build streaming data pipelines and AI data lakes because it is highly scalable, performant and durable but can also handle backup and archival workloads - all from a single platform.

S3 storage, short for Simple Storage Service, is a cloud-based solution that lets you store and retrieve any amount of data from anywhere on the web. You can use MinIO as a drop-in replacement for Amazon S3, enjoying the same features such as scalability, durability, and flexibility. This compatibility allows you to handle data storage needs efficiently without relying on Amazon's infrastructure, making MinIO a great alternative for your cloud storage requirements.

The main difference between S3 storage and a file system is how they manage and access the data. In S3 storage, data is represented as objects within buckets, each object having a unique key and metadata. Inside a bucket, objects are accessed in a key-value store fashion: you access them by their key, and there is no extension or ordering.

What are Delta Tables?

Delta Tables is an ACID-compliant storage layer implemented through Delta Lake. Delta Tables are a way to make object storage behave like database tables: They track data changes, ensure consistency, support schema evolution, and enable high-performance analytics. Delta Tables are especially useful for real-time data ingestion and processing, providing an append-only mechanism for writing and reading data.

Building Your Streaming Pipeline Without Kafka

Here's a step-by-step guide to building this pipeline with Pathway, Delta Lake, and S3/MinIO.

Code Walkthrough

Set Up MinIO

MinIO Instance: Start with a MinIO instance. You can either host your own or use MinIO's managed offering. Be sure to have your credentials ready— MINIO_S3_ACCESS_KEY and the MINIO_S3_SECRET_ACCESS_KEY.
Create a Bucket: Before starting the pipeline, create a bucket in MinIO to store data. This is essential, as attempting to write to a non-existent bucket will cause the pipeline to fail.

Download the Project Sources

Get the sources from our GitHub repository. The project has two directories:

.
├── minio-ETL/
└── benchmarks/

Navigate to minio-ETL/ folder, where you’ll find the pipeline project files:

.
├── .env
├── base.py
├── etl.py
├── producer.py
├── read-results.py
└── README.md

.env is the environment file for setting MinIO credentials.
base.pycontains basic configuration settings for accessing MinIO.
etl.py is the main Pathway ETL pipeline, handling data loading and writing to Delta Tables on MinIO.
producer.py generates messages with timestamps from two time zones (New York and Paris) and writes them to the respective Delta Tables.
read-results.py reads the output data stream from the ETL pipeline and saves it to a CSV file.

Configure the Project

Open .env and enter your MinIO credentials:

MINIO_S3_ACCESS_KEY =  *******
MINIO_S3_SECRET_ACCESS_KEY =  *******

In base.py, add your bucket name and endpoint URL. In producer.py, update the storage_option dictionary with these settings, including the AWS_REGION.

Implement the ETL pipeline

Pathway here reads and writes data streams directly from Delta Tables on S3-compatible storage.

Reading Delta Tables

Define the data ingestion setup in etl.py to read timestamps from Delta Tables:

etl.py

timestamps_timezone_1 = pw.io.deltalake.read(
    base_path + "timezone1",
    schema=InputStreamSchema,
    s3_connection_settings=s3_connection_settings,
    autocommit_duration_ms=1000,
)

timestamps_timezone_2 = pw.io.deltalake.read(
    base_path + "timezone2",
    schema=InputStreamSchema,
    s3_connection_settings=s3_connection_settings,
    autocommit_duration_ms=1000,
)

Transforming Data

Unify timestamps across different time zones with this transformation logic:

def process_timestamps(stream1, stream2):
    unified_stream = pw.transform.unify_timezones(stream1, stream2)
    return unified_stream

timestamps_unified = process_timestamps(timestamps_timezone_1, timestamps_timezone_2)

Writing to Delta Tables

Write the transformed data back to Delta Tables on MinIO:

pw.io.deltalake.write(
    timestamps_unified,
    base_path + "timezone_unified",
    s3_connection_settings=s3_connection_settings,
    min_commit_frequency=1000
)

Running the pipeline

Run the Producer: Start by running producer.py to generate the data stream and write it to Delta Tables.
Start the ETL Pipeline: Open a new terminal and run etl.py. Pathway will process data from the Delta Tables continuously.
Stop the Pipeline: Once producer.py is done, wait about 10 seconds, then stop Pathway by pressing Ctrl+C.
Read and Save Results: Run read-results.py to save the output data to a CSV file. View it with:

cat results.csv

You should obtain something like this:

timestamp,message,time,diff
1724403133066.217,"0",1724403268648,1
1724403138388.917,"1",1724403269748,1
1724403144896.706,"3",1724403270848,1
1724403147095.393,"4",1724403271948,1
1724403149295.165,"5",1724403272948,1
1724403151499.115,"6",1724403274048,1
1724403153736.456,"7",1724403275148,1
1724403140576.744,"2",1724403276248,1
1724403158229.244,"9",1724403276248,1

Bravo! You’ve successfully built a streaming pipeline on MinIO using Pathway and Delta Tables—an effective alternative to traditional messaging systems like Kafka.

Benchmarks for this S3 Streaming Setup without Kafka

Pathway is often used by developers for ultra-low latency use cases, such as Formula 1 racing analytics and mission-critical applications for organizations like Intel and NATO. However, in many scenarios, ultra-low latency isn't essential, and achieving sub-second or near-real-time latency is sufficient (like the use cases listed above) while giving you the advantage of working with the existing infrastructure itself.

The benchmark experiments below are to see if Pathway performs as a Python Kafka alternative with your existing S3-compatible storage.

Benchmark Setup:

In these tests, messages are written to a Delta Table hosted on MinIO (a popular S3-compatible storage system) and then read back. Latency is measured from the moment a message is sent to when it's successfully processed by Pathway.

The workloads simulate scenarios with message rates ranging from 10k to 250k messages per second. Each test runs for 5 minutes, excluding the first 30 seconds, to allow the system to stabilize for more accurate measurements. MinIO is hosted on the same network as the test machine to minimize external network delays. Pathway is configured to run in single-threaded mode for consistency.

The first 30 seconds were excluded from the results because the producer starts writing entries before the consumer begins. This means initial entries have higher latency as the consumer catches up with data already written. Excluding this period provides a more accurate measurement of steady-state performance.

To replicate these tests, you can find the benchmarking script here.

Key Findings

Latency graph for different streaming rates

For higher message rates, up to 250,000 per second, latency increased but remained within acceptable bounds for applications where near-real-time processing is sufficient.

70,000 messages/sec: Median latency around 1.45 seconds
150,000 messages/sec: Median latency around 1.68 seconds
250,000 messages/sec: Median latency around 1.98 seconds

Furthermore, we’ve recommended necessary modifications based on the workloads you’re dealing with.

Understanding Latency Behavior

Latency in this setup is influenced by several factors, including time spent creating and processing batches, network communication between Pathway and S3 storage, and processing overheads such as data parsing, system I/O, and other computational tasks.

Adjusting Parameters for Sub-Second Latency for Specific Workloads

Initially, the system uses default settings for batch sizes and commit intervals. By default, batches may be accumulated for up to one second before being committed to the Delta Table. This batching helps with throughput but can increase latency.

To improve latency, considering your workload, you can adjust autocommit_duration_ms and min_commit_frequency:

autocommit_duration_ms: Set this to 100 milliseconds. Reduce batch sizes and decrease the time messages spend waiting before being processed.
Trade-Off Consideration: Note that while smaller batches can significantly improve latency at moderate message rates, they increase the number of network communications with the storage system. This can limit throughput at higher message rates because the system spends more time handling network operations.

Here's how this setup affects performance:

Workload (messages/sec)	p50 (s)	p75 (s)	p85 (s)	p95 (s)	p99 (s)
10,000	0.26	0.35	0.36	0.53	0.67
30,000	0.28	0.31	0.32	0.38	0.64
70,000	1.24	16.06	24.82	35.66	41.32

As you can see, the optimization significantly improves latency at rates of 10,000 to 30,000 messages per second, and the improvement holds up until around 60,000 messages per second. The timings are stable, with only minor fluctuations (around 50 ms), caused by network and batching delays. The improvement is about four times for median processing time and up to three times for the 99th percentile.

However, at rates above 60,000 messages per second, the system starts to fail:

While latency improves, throughput suffers. The system now makes ten times more network calls and synchronization operations, such as polling for the latest Delta Table versions, which reduces the maximum throughput it can handle. Additionally, this increases the impact of network errors: if access to the Delta Table is lost even briefly, it may be more difficult for the system to recover.

Tips for Your Experiment

Include a warm-up period when running your own benchmarks to allow the system to reach steady-state performance. This ensures your latency measurements are accurate.
Note that the observed latency is with S3-compatible storage itself!
Based on the workload in consideration, experiment with settings like autocommit_duration_ms and min_commit_frequency to find the optimal balance between latency and throughput for your use case. Ensure your testing environment is adequately provisioned, as CPU, memory, and network capacity can affect performance.

AWS Kafka Alternative

Pathway integrates seamlessly with AWS S3, providing a powerful Kafka alternative for AWS users. Available on the AWS Marketplace, Pathway simplifies deployment within your AWS environment.

To set up Pathway with AWS S3, deploy Pathway from the AWS Marketplace and configure it with your AWS credentials. Use Pathway's Delta Lake connectors to read from and write to Delta Tables stored in S3. This allows you to run your streaming pipeline, leveraging AWS's scalability and reliability without the complexity of Kafka.

You can read the documentation for it here.

Google Cloud Kafka Alternative

Pathway serves as an efficient Kafka alternative on the Google Cloud Platform (GCP). By deploying Pathway on GCP, you can build streaming pipelines without the need to manage Kafka clusters.

To set up Pathway on Google Cloud, deploy it using Google Kubernetes Engine (GKE). Configure Pathway to interact with Google Cloud Storage, enabling you to process streaming data effectively. This setup leverages Google's scalable infrastructure without the complexity of Kafka.

You can read more about deploying Pathway on GCP in our GCP deployment guide.

Azure Kafka Alternative

Pathway offers a compelling Kafka alternative for Microsoft Azure users. Deploying Pathway on Azure allows you to build streaming pipelines without the overhead of managing Kafka infrastructure.

To set up Pathway on Azure, deploy it using Azure Container Instances. Configure Pathway to interact with Azure storage services, enabling you to process streaming data efficiently. This setup utilizes Azure's scalable infrastructure without the complexity of Kafka.

You can read more about deploying Pathway on Azure in our Azure deployment guide.

TL;DR

Kafka is widely used for real-time data streaming but is complex and costly to manage, often requiring substantial infrastructure and operational overhead. If you're already using S3-compatible storage like MinIO, you can achieve sub-second latency for your streaming applications without Kafka by using Pathway.

Pathway offers a lightweight alternative that integrates seamlessly with your existing S3 storage, using Delta Tables to manage data streams efficiently. Benchmarks demonstrate that Pathway can handle up to 60,000 messages per second with sub-second latency and up to 250,000 messages per second with near-real-time latency.

This setup simplifies your architecture, reduces costs, and is suitable for various applications like IoT sensor data collection, financial transactions, web analytics, and logistics tracking. The article provides a step-by-step guide to building a streaming pipeline using Pathway and MinIO, and explores how Pathway serves as a Kafka alternative on AWS, Google Cloud Platform, and Azure, leveraging each platform's storage services to process streaming data effectively without the complexity of Kafka. While this Kafka-free architecture relies on Pathway for managing data flows and interacting with Delta Tables, it can be seamlessly extended to handle advanced stream processing tasks. Thanks to Pathway, you can easily turn your S3 storage into a full data streaming processing pipeline.

Need help?

Pathway is trusted by thousands of developers and enterprises, including Intel, NATO, Formula 1 Teams, CMA CGM, and more. Reach out for assistance with your enterprise applications. Contact us here to discuss your project needs or to request a demo.

Essential Links

Pathway Documentation
Pathway GitHub Repository (9k+ stars combined)
Pathway Discord Community (2k+ members)

Pathway Team

Power your RAG and ETL pipelines with Live Data

Get started for free

Etl

Delta Lake ETL with Pathway for Spark Analytics

Rag

Pathway + PostgreSQL + LLM: app for querying financial reports with live document structuring pipeline.