Data Preparation for Spark Analytics
This repository contains the Dockerized example code for the "Data Preparation for Spark Analytics" showcase.
Running the Example
To run this example, follow these steps:
- Get your GitHub Personal Access Token (PAT) from the Personal access tokens page.
- Insert this token into the `personal_access_token` field in the `./github-config.yaml` file. There is a comment there for guidance; a minimal sketch of this file is also shown below the list.
- Build the Docker image with the command:
  docker build --no-cache -t spark-data-preparation .
- Run the Docker image. Note that the Delta Lake connector is available only in the Pathway Scale and Pathway Enterprise tiers. You need to provide the `PATHWAY_LICENSE_KEY` variable when launching the Docker container. Use the following command:
  docker run -e PATHWAY_LICENSE_KEY=YOUR_LICENSE_KEY -t spark-data-preparation
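For reference, here is a minimal sketch of `./github-config.yaml`. Only the `personal_access_token` field is taken from the instructions above; any other fields present in the real file may differ, so follow the comment in the file itself.

```yaml
# ./github-config.yaml — minimal sketch; the real file may contain additional fields
personal_access_token: "ghp_your_token_here"  # paste your GitHub PAT here
```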
Running the Example with S3
You can run this example similarly for the S3 case. To enable S3 output, pass the AWS_S3_OUTPUT_PATH environment variable to the container.
Additionally, you need to specify the following environment variables:
- `AWS_S3_ACCESS_KEY`: Your S3 access key
- `AWS_S3_SECRET_ACCESS_KEY`: Your S3 secret access key
- `AWS_BUCKET_NAME`: The name of your S3 bucket
- `AWS_REGION`: The region of your S3 bucket
The launch command will look like this:
docker run \
-e PATHWAY_LICENSE_KEY=YOUR_LICENSE_KEY \
-e AWS_S3_OUTPUT_PATH=YOUR_OUTPUT_PATH_IN_S3_BUCKET \
-e AWS_S3_ACCESS_KEY=YOUR_S3_ACCESS_KEY \
-e AWS_S3_SECRET_ACCESS_KEY=YOUR_S3_SECRET_ACCESS_KEY \
-e AWS_BUCKET_NAME=YOUR_BUCKET_NAME \
-e AWS_REGION=YOUR_AWS_REGION \
-t spark-data-preparation
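Once the container has written the prepared data to S3, it can be consumed from Spark. The snippet below is a minimal sketch, assuming the pipeline writes a Delta table under the given S3 path and that your Spark session has the Delta Lake (delta-spark) and S3 (hadoop-aws) packages available with AWS credentials configured; the path placeholders mirror the environment variables above.

```python
# Minimal sketch: reading the prepared Delta table with PySpark.
# Assumes delta-spark and hadoop-aws are on the classpath and AWS credentials
# are available to Spark (e.g. via the default credential provider chain).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-data-preparation-read")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Path placeholders mirror AWS_BUCKET_NAME and AWS_S3_OUTPUT_PATH above
df = spark.read.format("delta").load("s3a://YOUR_BUCKET_NAME/YOUR_OUTPUT_PATH_IN_S3_BUCKET")
df.show()
```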