Benchmark for Delta Lake S3 messaging as a Kafka replacement
This repository contains the necessary files to benchmark Delta Lake S3 messaging, as described in this article. Below is a description of each file:
- lib.py: Contains Python wrappers for Pathway classes, which are used to build a message queue on top of Delta Lake.
- producer.py: Implements the producer logic, which generates messages and adds them to the Delta Lake-based message queue at a specified streaming rate.
- consumer.py: Implements the consumer logic, which retrieves messages from the message queue and tracks latencies across various percentiles (50th, 75th, 85th, 95th, and 99th).
- benchmark.py: Coordinates the execution of both the producer and consumer, and saves the benchmark results in the benchmark-results/ directory.
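At a high level, the producer appends timestamped rows to a Delta table and the consumer reads them back to measure end-to-end latency. Here is a minimal sketch of that pattern using the open-source deltalake package; the actual lib.py builds on Pathway classes instead, and the table path and helper names below are illustrative:

```python
import os
import time

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Illustrative table location; the benchmark derives its own from TEST_BUCKET_NAME.
TABLE_URI = "s3://my-benchmark-bucket/message-queue"
STORAGE_OPTIONS = {
    "AWS_ACCESS_KEY_ID": os.environ["MINIO_S3_ACCESS_KEY"],
    "AWS_SECRET_ACCESS_KEY": os.environ["MINIO_S3_SECRET_ACCESS_KEY"],
    # For MinIO, an endpoint override (e.g. "AWS_ENDPOINT_URL") is also needed.
}

def produce(messages_per_second: int, seconds: int) -> None:
    """Producer side: append one timestamped batch per second."""
    for _ in range(seconds):
        batch = pd.DataFrame({
            "payload": ["x"] * messages_per_second,
            "produced_at": [time.time()] * messages_per_second,
        })
        write_deltalake(TABLE_URI, batch, mode="append",
                        storage_options=STORAGE_OPTIONS)
        time.sleep(1)

def latency_percentiles() -> pd.Series:
    """Consumer side: read what was written and compute latency percentiles."""
    table = DeltaTable(TABLE_URI, storage_options=STORAGE_OPTIONS)
    latencies = time.time() - table.to_pandas()["produced_at"]
    return latencies.quantile([0.50, 0.75, 0.85, 0.95, 0.99])
```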
Setting Up the Benchmark
To set up this benchmark, you'll need to specify the name of the S3 bucket and provide your S3 access key and secret access key.
- Set the bucket name, endpoint, and region by assigning the appropriate values to the TEST_BUCKET_NAME, TEST_ENDPOINT, and TEST_REGION constants in lib.py.
- The access key and secret access key will be retrieved from the environment variables MINIO_S3_ACCESS_KEY and MINIO_S3_SECRET_ACCESS_KEY.
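Concretely, the configuration presumably boils down to something like this (only the constant and environment-variable names come from lib.py; the values below are placeholders):

```python
import os

# Constants defined in lib.py; replace the placeholder values with your own.
TEST_BUCKET_NAME = "my-benchmark-bucket"
TEST_ENDPOINT = "https://minio.example.com:9000"
TEST_REGION = "us-east-1"

# Credentials are picked up from the environment at runtime.
S3_ACCESS_KEY = os.environ["MINIO_S3_ACCESS_KEY"]
S3_SECRET_ACCESS_KEY = os.environ["MINIO_S3_SECRET_ACCESS_KEY"]
```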
Usage Instructions
To run the streaming benchmark, use the benchmark.py script. This command-line tool accepts the following parameters:
- --range-start: The starting value of the "messages-per-second" range to be tested.
- --range-end: The ending value (inclusive) of the "messages-per-second" range to be tested.
- --range-step: The increment between values in the tested range.
- --seconds-to-stream: The duration of each streaming test, in seconds.
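A plausible argparse setup matching these flags looks like the following; this is a sketch, and the actual script's argument handling may differ:

```python
import argparse

parser = argparse.ArgumentParser(description="Delta Lake S3 messaging benchmark")
parser.add_argument("--range-start", type=int, required=True,
                    help="first messages-per-second rate to test")
parser.add_argument("--range-end", type=int, required=True,
                    help="last messages-per-second rate to test (inclusive)")
parser.add_argument("--range-step", type=int, required=True,
                    help="increment between tested rates")
parser.add_argument("--seconds-to-stream", type=int, required=True,
                    help="duration of each streaming test, in seconds")
args = parser.parse_args()

# The inclusive range of rates that will be benchmarked.
rates = range(args.range_start, args.range_end + 1, args.range_step)
```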
Example
If you want to benchmark rates of 10,000, 20,000, and 30,000 messages per second, each running for 10 minutes, use the following command:
```
python benchmark.py --range-start 10000 --range-end 30000 --range-step 10000 --seconds-to-stream 600
```
The script will generate several CSV files in the benchmark-results/ directory, with one file per tested rate. Each file will include a report showing the latency percentiles for different points in the streaming process.
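To inspect the results afterwards, something like the following works; the exact CSV schema isn't documented here, so print a sample to discover the columns:

```python
import glob

import pandas as pd

# One CSV per tested rate; column names depend on the report format,
# so print the head of each frame to see them.
for path in sorted(glob.glob("benchmark-results/*.csv")):
    report = pd.read_csv(path)
    print(path)
    print(report.head())
```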