Deploy to Azure

Sergey Kulik · November 20, 2024

If you've already gone through the AWS Deployment tutorial, feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the Running the Example in Azure section for more advanced content.

The Pathway framework enables you to define and run various data processing pipelines. You can find numerous tutorials that guide you through building systems like log monitoring, ETL pipelines with Kafka, or data preparation for Spark analytics.

Once you've developed and tested these pipelines locally, the next logical step is to deploy them in the cloud. Cloud deployment allows your code to run remotely, minimizing interruptions from local machine issues. This step is crucial for moving your code into a production-ready environment.

There are several ways to deploy your code to the cloud: you can deploy it on GCP, on Render, or on AWS Fargate, for example. In this tutorial, you will learn how to deploy your Pathway code in the Azure ecosystem using either the Azure Marketplace offering or Azure Container Instances, together with Pathway's tools.

Deploying Pathway Programs on Azure Made Easy

The tutorial is structured as follows:

  1. Description of the ETL example pipeline.
  2. Instructions on using the Pathway CLI to run GitHub-hosted code.
  3. Step-by-step guide to setting up a deployment with either the Azure Marketplace offering or Azure Container Instances.
  4. Verification of the results.
  5. Conclusions.

Before you continue, please ensure your project meets these basic requirements:

  • The project is hosted on a public GitHub repository.
  • The requirements.txt file in the root directory lists all the Python dependencies for the project (a minimal example is shown below).
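
The exact dependency list depends on your pipeline. If the project only uses the Pathway framework itself, requirements.txt can be as small as a single line; the snippet below is an illustrative minimum rather than the exact file from the example repository.

requirements.txt
pathway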

ETL Example Pipeline

Let's take the "Data Preparation for Spark Analytics" tutorial as an example. This tutorial walks you through building an ETL process that tracks GitHub commit history, removes sensitive data, and loads the results into a Delta Lake. For a detailed explanation, you can refer to the article that covers this task in depth.

Pathway data preparation pipeline for Spark

The tutorial's code is available in a GitHub repository. A few changes have been made to simplify the process:

  • The GitHub PAT (Personal Access Token) can now be read from an environment variable.
  • Spark computations have been removed since they aren't necessary in a cloud-based container.

Additionally, the README file has been updated to offer more guidance on using Pathway CLI tools to run the project.

There's an important point to consider regarding the task's output. Originally, there were two possible output modes: storing data in a locally-based Delta Lake or in an S3-based Delta Lake. In cloud deployment, using a locally-based Delta Lake isn't practical because it only exists within the container on a remote cloud worker and isn't accessible to the user. Therefore, this tutorial uses an S3-based Delta Lake to store the results, as it provides easy access afterward. This approach requires additional environment variables for the container to access the S3 service, which will be discussed further.
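
For illustration, the S3 access configuration boils down to a handful of environment variables like the ones below, whether the pipeline runs locally or inside a container. The names mirror those used in the verification snippet later in this tutorial; check the example's README for the exact names it expects.

export AWS_S3_ACCESS_KEY=YOUR_AWS_S3_ACCESS_KEY
export AWS_S3_SECRET_ACCESS_KEY=YOUR_AWS_S3_SECRET_ACCESS_KEY
export AWS_REGION=YOUR_AWS_REGION
export AWS_BUCKET_NAME=YOUR_AWS_BUCKET_NAME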

Pathway CLI

Pathway provides several tools that simplify both cloud deployment and development in general.

One of these tools is the Pathway CLI. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. The spawn command, for instance, lets you run code using multiple computational threads or processes: pathway spawn python main.py runs your locally hosted main.py file using Pathway.
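
To use several worker threads, you could pass the spawn command's threads option, as in the sketch below; check pathway spawn --help for the exact flags available in your Pathway version.

pathway spawn --threads 4 python main.py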

This tutorial highlights another feature: the ability to run code directly from a GitHub repository, even if it's not hosted locally.

Take the airbyte-to-deltalake example mentioned earlier. You can run it from the command line by setting two environment variables: GITHUB_PERSONAL_ACCESS_TOKEN for your GitHub PAT and PATHWAY_LICENSE_KEY for your Pathway license key. Then, simply call pathway spawn using --repository-url to define the GitHub repository to run.

This approach allows you to run remotely hosted code as follows:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    pathway spawn --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py

When the --repository-url parameter is provided, the CLI automatically handles checking out the repository, installing any dependencies listed in the requirements.txt file within an isolated environment, and running the specified file.

Pathway CLI

Additionally, you can use the PATHWAY_SPAWN_ARGS environment variable as a shortcut for running pathway spawn. This allows you to run code from a GitHub repository with the following command:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py' \
    pathway spawn-from-env

Running the Example in Azure

The Pathway framework makes it simple to deploy programs on Azure using the Pathway - BYOL listing, available in the Azure Marketplace. This listing is free.

We recommend using the Azure Marketplace offering because it's straightforward: just follow a four-step deployment wizard and set up your project-specific settings. For detailed instructions on using this wizard, refer to the "Easy deployment with Azure Marketplace" section below.

If the Marketplace solution doesn't meet your needs, you can also deploy using Azure Container Instances. This method is outlined in the "Running a publicly available container in Azure Container Instances" section below. However, keep in mind that it's more complex.

Easy deployment with Azure Marketplace

Running a publicly available container in Azure Container Instances
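
If you go the Container Instances route, the deployment essentially comes down to a single az container create call that starts the publicly available Pathway container with the environment variables described earlier. The sketch below is illustrative only: the image name, resource group, and variable values are placeholders, so check the example's README and the Marketplace listing for the exact configuration.

az container create \
    --resource-group YOUR_RESOURCE_GROUP \
    --name pathway-airbyte-to-deltalake \
    --image pathwaycom/pathway:latest \
    --restart-policy Never \
    --command-line "pathway spawn-from-env" \
    --secure-environment-variables \
        GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
        PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
        AWS_S3_ACCESS_KEY=YOUR_AWS_S3_ACCESS_KEY \
        AWS_S3_SECRET_ACCESS_KEY=YOUR_AWS_S3_SECRET_ACCESS_KEY \
    --environment-variables \
        AWS_REGION=YOUR_AWS_REGION \
        AWS_BUCKET_NAME=YOUR_AWS_BUCKET_NAME \
        PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py'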

Accessing the Execution Results

After the service has successfully started and performed the ETL step, you can verify that the results are in the S3-based Delta Lake using the delta-rs Python package.

launch.py
from deltalake import DeltaTable

# The S3 credentials and the s3_output_path used below are assumed to be
# defined earlier, for example read from the same environment variables
# that were passed to the container.
storage_options = {
    "AWS_ACCESS_KEY_ID": AWS_S3_ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": AWS_S3_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "AWS_BUCKET_NAME": AWS_BUCKET_NAME,

    # Disabling DynamoDB sync since there are no parallel writes into this Delta Lake
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

delta_table = DeltaTable(
    s3_output_path,
    storage_options=storage_options,
)
pd_table_from_delta = delta_table.to_pandas()

pd_table_from_delta.shape[0]
862

You can also cross-check this number: there were indeed 862 commits in the pathwaycom/pathway repository at the time this text was written.

Conclusions

Cloud deployment is a key part of developing advanced projects. It lets you deploy solutions that run reliably and predictably, while also allowing for flexible resource management, increased stability, and the ability to choose application availability zones.

However, it can be complex, especially for beginners who have to navigate containers, cloud services, virtual machines, and many other components.

This tutorial explains how to simplify program deployment on the Azure cloud using the Pathway CLI and the Azure Marketplace listing. It also provides guidance on deploying with Azure Container Instances as an alternative if the Marketplace listing isn't an option for you.

Two Deployment Options Comparison

Feel free to try it out and clone the example repository to develop your own data extraction solutions. We also welcome your feedback in our Discord community!