
Running a Pathway Program in Azure with Azure Container Instances

Sergey Kulik · September 9, 2024

If you've already gone through the AWS Deployment tutorial, feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the "Pathway Dockerhub Container" and "Running the Example in Azure Container Instances" sections for more advanced content.

The Pathway framework enables you to define and run various data processing pipelines. You can find numerous tutorials that guide you through building systems like log monitoring, ETL pipelines with Kafka, or data preparation for Spark analytics.

Once you've developed and tested these pipelines locally, the next logical step is to deploy them in the cloud. Cloud deployment allows your code to run remotely, minimizing interruptions from local machine issues. This step is crucial for moving your code into a production-ready environment.

There are several ways to deploy your code to the cloud. You can deploy it on GCP, on Render, or on AWS Fargate, for example. In this tutorial, you will learn how to deploy your Pathway code on Azure Container Instances using Pathway's tools and Dockerhub as image storage.

Running Pathway ETL pipeline in Azure

The tutorial is structured as follows:

  1. Description of the ETL example pipeline.
  2. Instructions for getting the Pathway image from Dockerhub.
  3. Step-by-step guide to setting up a deployment on Azure Container Instances.
  4. Verification of the results.
  5. Conclusions.

Before you continue, please ensure your project meets these basic requirements:

  • The project is hosted on a public GitHub repository.
  • The requirements.txt file in the root directory lists all the Python dependencies for the project.

ETL Example Pipeline

Let's take the "Data Preparation for Spark Analytics" tutorial as an example. This tutorial walks you through building an ETL process that tracks GitHub commit history, removes sensitive data, and loads the results into a Delta Lake. For a detailed explanation, you can refer to the article that covers this task in depth.

Pathway data preparation pipeline for Spark

The tutorial's code is available in a GitHub repository. A few changes have been made to simplify the process:

  • The GitHub PAT (Personal Access Token) can now be read from an environment variable.
  • Spark computations have been removed since they aren't necessary in a cloud-based container.

Additionally, the README file has been updated to offer more guidance on using Pathway CLI tools to run the project.

There's an important point to consider regarding the task's output. Originally, there were two possible output modes: storing data in a locally-based Delta Lake or in an S3-based Delta Lake. In cloud deployment, using a locally-based Delta Lake isn't practical because it only exists within the container on a remote cloud worker and isn't accessible to the user. Therefore, this tutorial uses an S3-based Delta Lake to store the results, as it provides easy access afterward. This approach requires additional environment variables for the container to access the S3 service, which will be discussed further.

Pathway CLI and the BYOL container

Pathway CLI

Pathway provides several tools that simplify both cloud deployment and development in general.

The first tool is the Pathway CLI. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. For instance, the spawn command lets you run code using multiple computational threads or processes: pathway spawn python main.py runs your locally hosted main.py file with Pathway.

This tutorial highlights another feature: the ability to run code directly from a GitHub repository, even if it's not hosted locally.

Take the airbyte-to-deltalake example mentioned earlier. You can run it from the command line by setting two environment variables: GITHUB_PERSONAL_ACCESS_TOKEN for your GitHub PAT and PATHWAY_LICENSE_KEY for your Pathway license key. Then, simply call pathway spawn using --repository-url to define the GitHub repository to run.

This approach allows you to run remotely hosted code as follows:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    pathway spawn --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py

When the --repository-url parameter is provided, the CLI automatically handles checking out the repository, installing any dependencies listed in the requirements.txt file within an isolated environment, and running the specified file.

Pathway CLI

Additionally, you can use the PATHWAY_SPAWN_ARGS environment variable as a shortcut for running pathway spawn. This allows you to run code from a GitHub repository with the following command:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py' \
    pathway spawn-from-env

Pathway Dockerhub Container

Another useful resource from Pathway is the Docker container listed on Dockerhub. This listing offers a ready-to-use Docker image with Pathway and all its dependencies pre-installed, without being tied to a particular cloud ecosystem. You can use the container without a license key, but providing one unlocks the full features of the framework. The listing is free to use, so there's no cost associated with accessing it.

Pathway Dockerhub container

The container runs the pathway spawn-from-env command, so you can execute your program simply by passing PATHWAY_SPAWN_ARGS and the other required environment variables into the container. This gets your code running in the cloud. The next section will guide you through setting up Pathway processes using Azure Container Instances, the recommended Azure solution for this task.

Running the Example in Azure Container Instances

Since the container originates outside the Azure ecosystem, there are no steps required to acquire it from a specific marketplace.

However, several steps are necessary to use the container: logging in, configuring Azure Container Instances, specifying the required variables for S3 data storage, and finally running the Pathway instance. All of these steps can be performed using a single launcher script (referred to as launch.py in this example), which should be run locally. This script is provided in Pathway's repository.

The process involves first configuring the system and obtaining tokens from all related systems, and then using these tokens to run the computation in the cloud.

Step 1: Performing Azure Configuration

The Azure Command-Line Interface (CLI) is a powerful tool for managing Azure services. If you haven’t installed it yet, follow the installation guide here. Once installed, you can move forward with this tutorial.

First, let's go through a few key steps to set up the variables required for Azure.

  1. Log in to Azure: Run the following command to log in to Azure:
    az login
    

    This will open a browser for authentication. After you log in, choose the subscription tenant you want to use. Copy the Subscription ID (in UUID4 format) from the Subscription ID column. Set this value as the AZURE_SUBSCRIPTION_ID variable.

    Note: If you don’t have a subscription, you can create one on the Subscriptions page in the Azure Portal.

  2. Get an Access Token: Run this command to get a session token:
    az account get-access-token
    

    Copy the accessToken value from the resulting JSON and assign it to the AZURE_TOKEN_CREDENTIAL variable.
    Since the token is quite long, it's a good idea to store it as an environment variable and read it from there (see the sketch after the variable listing below). Keep in mind that this token expires after an hour, so make sure it's up to date before starting the container.
  3. Resource Group: To find your resource group, list the existing ones by running:
    az group list --query "[].name" --output tsv
    

    If you need to create a new resource group, use:
    az group create --name myResourceGroup --location eastus
    

    Assign the resource group’s name to the AZURE_RESOURCE_GROUP variable.

The remaining parameters are pre-defined and don’t need to be changed:

  • AZURE_CONTAINER_GROUP_NAME sets the name of the container group (in this case, there’s only one container).
  • AZURE_CONTAINER_NAME specifies the name of the container itself.
  • AZURE_LOCATION indicates the Azure data center location (e.g., "eastus").

With these steps completed, your variables in the code should look as follows:

launch.py
AZURE_SUBSCRIPTION_ID = "YOUR_AZURE_SUBSCRIPTION_ID"
AZURE_TOKEN_CREDENTIAL = "YOUR_AZURE_TOKEN_CREDENTIAL"
AZURE_RESOURCE_GROUP = "YOUR_AZURE_RESOURCE_GROUP"
AZURE_CONTAINER_GROUP_NAME = "pathway-test-container-group"
AZURE_CONTAINER_NAME = "pathway-test-container"
AZURE_LOCATION = "eastus"
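
Since the access token is long and expires hourly, you may prefer not to paste it into the script at all. Here's a minimal optional sketch that reads it from an environment variable instead (the variable name is an assumption; use whatever you export before running the script):

launch.py
import os

# Optional: read the frequently rotated token from the environment instead of
# hard-coding it into the script.
AZURE_TOKEN_CREDENTIAL = os.environ["AZURE_TOKEN_CREDENTIAL"]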

Step 2: Authenticating in Dockerhub

This tutorial uses Dockerhub to store the Pathway Docker image. Azure Container Instances can pull and run containers directly from Dockerhub, which makes it easy to use. Now it's time to set your Dockerhub credentials in the code:

  • First, store your Dockerhub account username in the DOCKER_REGISTRY_USER variable.
  • Next, generate a personal access token from the Personal access tokens page, and assign it to the DOCKER_REGISTRY_TOKEN variable.
  • Finally, define the Docker image to be used. You can use the latest Pathway image: pathwaycom/pathway:latest.

You should end up with the following variables in the Python code:

launch.py
DOCKER_REGISTRY_USER = "YOUR_DOCKER_REGISTRY_USER"
DOCKER_REGISTRY_TOKEN = "YOUR_DOCKER_REGISTRY_TOKEN"
DOCKER_IMAGE_NAME = "pathwaycom/pathway:latest"

Step 3: Configuring Backend for Delta Lake Storage

As mentioned earlier, the results must be stored in durable storage since the container's files will be deleted once it finishes. For this, the tutorial uses Amazon S3 to store the resulting Delta Lake, so you need to configure the S3-related variables.

  1. S3 Output Path, Bucket Name, and Region: Set the full output path, the bucket name, and the region where your S3 bucket is located. Store these values in the following variables:
  • AWS_S3_OUTPUT_PATH (e.g., s3://your-bucket/output-path)
  • AWS_BUCKET_NAME (e.g., your-bucket)
  • AWS_REGION
  2. S3 Credentials: You'll also need to provide your AWS Access Key and Secret Access Key. You can find instructions on how to obtain these credentials here.

Here's what the S3 configuration should look like:

launch.py
AWS_S3_OUTPUT_PATH = "YOUR_AWS_S3_OUTPUT_PATH"
AWS_BUCKET_NAME = "YOUR_AWS_BUCKET_NAME"
AWS_REGION = "YOUR_AWS_REGION"
AWS_S3_ACCESS_KEY = "YOUR_AWS_S3_ACCESS_KEY"
AWS_S3_SECRET_ACCESS_KEY = "YOUR_AWS_S3_SECRET_ACCESS_KEY"

Step 4: Providing Pathway License Key and GitHub PAT

To enable Delta Lake features and parse commits from GitHub, you'll need the last two remaining pieces: the Pathway License Key and a GitHub Personal Access Token.

You can get a free-tier Pathway license key from the Pathway website.

The GitHub token can be generated from the "Personal access tokens" page.

Store both the Pathway License Key and GitHub token in the following variables:

launch.py
PATHWAY_LICENSE_KEY = "YOUR_PATHWAY_LICENSE_KEY"
GITHUB_PERSONAL_ACCESS_TOKEN = "YOUR_GITHUB_PERSONAL_ACCESS_TOKEN"

Step 5: Configuring a Container in Azure Container Instances

In Azure Container Instances (ACI), a Container is a lightweight, standalone, and executable software package that includes everything needed to run an application: code, runtime, libraries, and dependencies. Each container runs in isolation but shares the host system’s kernel. ACI allows you to easily deploy and run containers without managing underlying infrastructure, offering a simple way to run applications in the cloud.

To manage containers and other resources in Azure efficiently, you can use the Azure Python SDK. For this tutorial, you'll need to install the azure-identity and azure-mgmt-containerinstance Python packages. You can install them using pip.

pip install azure-identity
pip install azure-mgmt-containerinstance

Next, import the classes you'll need for the configuration process.

launch.py
from azure.core.credentials import AccessToken
from azure.core.exceptions import HttpResponseError
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container,
    ContainerGroup,
    ContainerGroupRestartPolicy,
    ContainerPort,
    EnvironmentVariable,
    ImageRegistryCredential,
    IpAddress,
    OperatingSystemTypes,
    Port,
    ResourceRequests,
    ResourceRequirements,
)

Now you can configure the container that will be run. This process involves two main steps:

  1. Create the environment_vars list, which holds the variables passed into the container.
  2. Create an instance of the Container class, which includes the settings for the container.

The code for these steps is provided below.

launch.py
environment_vars = [
    EnvironmentVariable(
        name="AWS_S3_OUTPUT_PATH",
        value=AWS_S3_OUTPUT_PATH,
    ),
    EnvironmentVariable(
        name="AWS_S3_ACCESS_KEY",
        value=AWS_S3_ACCESS_KEY,
    ),
    EnvironmentVariable(
        name="AWS_S3_SECRET_ACCESS_KEY",
        value=AWS_S3_SECRET_ACCESS_KEY,
    ),
    EnvironmentVariable(
        name="AWS_BUCKET_NAME",
        value=AWS_BUCKET_NAME,
    ),
    EnvironmentVariable(
        name="AWS_REGION",
        value=AWS_REGION,
    ),
    EnvironmentVariable(
        name="PATHWAY_LICENSE_KEY",
        value=PATHWAY_LICENSE_KEY,
    ),
    EnvironmentVariable(
        name="GITHUB_PERSONAL_ACCESS_TOKEN",
        value=GITHUB_PERSONAL_ACCESS_TOKEN,
    ),
    EnvironmentVariable(
        name="PATHWAY_SPAWN_ARGS",
        value="--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py",
    ),
]

container = Container(
    name=AZURE_CONTAINER_NAME,
    image=DOCKER_IMAGE_NAME,
    resources=ResourceRequirements(
        requests=ResourceRequests(cpu=1, memory_in_gb=1.5)
    ),
    ports=[ContainerPort(port=80)],
    environment_variables=environment_vars,
)

Here's a brief field-by-field explanation of the Container configuration:

  • name: Specifies the name of the container, which is used to identify it within Azure.
  • image: Defines the Docker image that the container will use, which contains the application's code and dependencies.
  • resources: Indicates the compute resources allocated to the container, such as CPU and memory.
    • requests: Specifies the exact amount of resources (e.g., CPU and memory) that the container requests to run. These limits are set fairly low because this simple ETL process doesn't require many resources.
  • ports: Lists the network ports that the container will expose for communication, defining which ports can be accessed externally.
  • environment_variables: Sets the environment variables that will be passed to the container, which can be used to configure the application at runtime.

As you can see, the list of environment variables is rather long. Here is a brief explanation of each:

  • AWS_S3_OUTPUT_PATH: The full path in S3 where the output will be stored.
  • AWS_S3_ACCESS_KEY: Your S3 access key.
  • AWS_S3_SECRET_ACCESS_KEY: Your S3 secret access key.
  • AWS_BUCKET_NAME: The name of your S3 bucket.
  • AWS_REGION: The region where your S3 bucket is located.
  • PATHWAY_LICENSE_KEY: The Pathway license key, required for the Delta Lake features to work. You can get a free license here.
  • GITHUB_PERSONAL_ACCESS_TOKEN: Your GitHub Personal Access Token, which you can obtain from the "Personal access tokens" page.
  • PATHWAY_SPAWN_ARGS: Arguments for the Pathway CLI. For this example, it specifies that the script main.py from the pathway-labs/airbyte-to-deltalake repository should be run.

With the container configured, you can now proceed to the container group configuration.

Step 6: Container Group Creation

In Azure Container Instances (ACI), a Container Group is a collection of one or more containers that share the same lifecycle, network, and storage resources. All containers in a group run on the same host machine and can communicate with each other over a local network. Container groups also share external IP addresses and ports, making it easy to deploy multi-container applications. They are ideal for scenarios where multiple containers need to work together, such as a web app and its supporting service. Each container in the group can be configured with its own resources, such as CPU and memory.

You can configure the container group for the task using the Azure Python SDK as follows:

launch.py
container_group = ContainerGroup(
    location=AZURE_LOCATION,
    containers=[container],
    os_type=OperatingSystemTypes.linux,
    ip_address=IpAddress(ports=[Port(protocol="TCP", port=80)], type="Public"),
    restart_policy=ContainerGroupRestartPolicy.never,
    image_registry_credentials=[
        ImageRegistryCredential(
            server="index.docker.io",
            username=DOCKER_REGISTRY_USER,
            password=DOCKER_REGISTRY_TOKEN,
        )
    ],
)

Below is the parameter-by-parameter explanation of this code:

  • location: Indicates the Azure region where the container group will be deployed.
  • containers: Lists the containers in the group. In this tutorial, it includes a single container created earlier.
  • os_type: Specifies that the container group will use the Linux operating system.
  • ip_address: Sets the IP address configuration for the container group.
  • restart_policy: Specifies that the container group should not be automatically restarted, meaning it will run once and not restart after termination.
  • image_registry_credentials: Supplies the credentials needed to access the container registry, which in this case is Dockerhub.

This code constructs the container group settings, which are sufficient to run the tutorial.

Step 7: Launch The Container

Now that everything is set up, you’re ready to run the task. This involves creating an Azure cloud client instance and calling specific methods on it.

To create the client, you first need to provide credentials. In an isolated environment, such as the Docker image used in this tutorial, the simplest way to handle authentication is with a small wrapper that supplies the access token to the Azure SDK.

launch.py
import time


class TokenCredential:
    def __init__(self, token: str):
        self.token = token

    def get_token(self, *args, **kwargs):
        # AccessToken expects the expiry as a Unix timestamp; the CLI token
        # is valid for roughly an hour.
        return AccessToken(self.token, int(time.time()) + 3600)

Once authentication is set up, you can create an instance of the client.

launch.py
client = ContainerInstanceManagementClient(
    TokenCredential(AZURE_TOKEN_CREDENTIAL), AZURE_SUBSCRIPTION_ID
)
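
Alternatively, since the azure-identity package installed earlier can obtain and refresh the token for you, you can skip the manual wrapper entirely. This is an optional sketch that assumes you are logged in locally via az login:

launch.py
from azure.identity import AzureCliCredential

# Optional alternative: reuse the local `az login` session so the token is
# fetched and refreshed automatically instead of being pasted in manually.
client = ContainerInstanceManagementClient(AzureCliCredential(), AZURE_SUBSCRIPTION_ID)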

Finally, start the container using the begin_create_or_update method.

launch.py
client.container_groups.begin_create_or_update(
    resource_group_name=AZURE_RESOURCE_GROUP,
    container_group_name=AZURE_CONTAINER_GROUP_NAME,
    container_group=container_group,
)

You can now go to the Azure Portal to view the execution stages and related metrics, such as resource usage. You can also stop the execution from the portal.
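
If you'd rather monitor the run from the script than from the portal, the same client can report the container group's state and fetch container logs. A minimal sketch (method names follow the azure-mgmt-containerinstance package used above):

launch.py
# Optional: check the deployment state and fetch the container's logs.
group = client.container_groups.get(AZURE_RESOURCE_GROUP, AZURE_CONTAINER_GROUP_NAME)
print("Provisioning state:", group.provisioning_state)

logs = client.containers.list_logs(
    AZURE_RESOURCE_GROUP, AZURE_CONTAINER_GROUP_NAME, AZURE_CONTAINER_NAME
)
print(logs.content)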

Accessing the Execution Results

After the execution is complete, you can verify that the results are in the S3-based Delta Lake using the delta-rs Python package.

launch.py
from deltalake import DeltaTable


storage_options = {
    "AWS_ACCESS_KEY_ID": AWS_S3_ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": AWS_S3_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "AWS_BUCKET_NAME": AWS_BUCKET_NAME,

    # Disabling DynamoDB sync since there are no parallel writes into this Delta Lake
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

delta_table = DeltaTable(
    AWS_S3_OUTPUT_PATH,
    storage_options=storage_options,
)
pd_table_from_delta = delta_table.to_pandas()

print(pd_table_from_delta.shape[0])
# Output: 700

You can also verify the count: there were indeed 700 commits in the pathwaycom/pathway repository as of the time this text was written.
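
Beyond the row count, you can also peek at a few of the parsed records directly. A small optional addition; the exact columns depend on the pipeline's output schema:

# Optional: inspect a few records of the resulting table.
print(pd_table_from_delta.head())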

Conclusions

Cloud deployment is a key part of developing advanced projects. It lets you deploy solutions that run reliably and predictably, while also allowing for flexible resource management, increased stability, and the ability to choose application availability zones.

However, it can be complex, especially for beginners who might be confronted with containers, cloud services, virtual machines, and many other components at once.

This tutorial showed you how to simplify program deployment on Azure using the Pathway CLI and the Pathway Dockerhub container. In the end, all you need to do is run a container from Dockerhub using Microsoft Azure's powerful tooling.

Feel free to try it out and clone the example repository to develop your own data extraction solutions. We also welcome your feedback in our Discord community!