Deploy to Azure

Sergey Kulik · November 20, 2024

If you've already gone through the AWS Deployment tutorial, feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the Running the Example in Azure section for more advanced content.

The Pathway framework enables you to define and run various data processing pipelines. You can find numerous tutorials that guide you through building systems like log monitoring, ETL pipelines with Kafka, or data preparation for Spark analytics.

Once you've developed and tested these pipelines locally, the next logical step is to deploy them in the cloud. Cloud deployment allows your code to run remotely, minimizing interruptions from local machine issues. This step is crucial for moving your code into a production-ready environment.

There are several ways to deploy your code to the cloud: you can deploy it on GCP, on Render, or on AWS Fargate, for example. In this tutorial, you will learn how to deploy your Pathway code in the Azure ecosystem using either the Azure Marketplace offering or Azure Container Instances, together with Pathway's tools.

Deploying Pathway Programs on Azure Made Easy

The tutorial is structured as follows:

  1. Description of the ETL example pipeline.
  2. Instructions on using the Pathway CLI to run GitHub-hosted code.
  3. Step-by-step guide to setting up a deployment with either the Azure Marketplace offering or Azure Container Instances.
  4. Verification of the results.
  5. Conclusions.

Before you continue, please ensure your project meets these basic requirements:

  • The project is hosted on a public GitHub repository.
  • The requirements.txt file in the root directory lists all the Python dependencies for the project (a minimal example is shown below).
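
The exact dependency list depends on your pipeline. If the project only uses the Pathway framework itself, requirements.txt can be as small as a single line; the snippet below is an illustrative minimum rather than the exact file from the example repository.

requirements.txt
pathway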

ETL Example Pipeline

Let's take the "Data Preparation for Spark Analytics" tutorial as an example. This tutorial walks you through building an ETL process that tracks GitHub commit history, removes sensitive data, and loads the results into a Delta Lake. For a detailed explanation, you can refer to the article that covers this task in depth.

Pathway data preparation pipeline for Spark

The tutorial's code is available in a GitHub repository. A few changes have been made to simplify the process:

  • The GitHub PAT (Personal Access Token) can now be read from an environment variable.
  • Spark computations have been removed since they aren't necessary in a cloud-based container.

Additionally, the README file has been updated to offer more guidance on using Pathway CLI tools to run the project.

There's an important point to consider regarding the task's output. Originally, there were two possible output modes: storing data in a locally-based Delta Lake or in an S3-based Delta Lake. In cloud deployment, using a locally-based Delta Lake isn't practical because it only exists within the container on a remote cloud worker and isn't accessible to the user. Therefore, this tutorial uses an S3-based Delta Lake to store the results, as it provides easy access afterward. This approach requires additional environment variables for the container to access the S3 service, which will be discussed further.
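
For illustration, the S3 access configuration boils down to a handful of environment variables like the ones below, whether the pipeline runs locally or inside a container. The names mirror those used in the verification snippet later in this tutorial; check the example's README for the exact names it expects.

export AWS_S3_ACCESS_KEY=YOUR_AWS_S3_ACCESS_KEY
export AWS_S3_SECRET_ACCESS_KEY=YOUR_AWS_S3_SECRET_ACCESS_KEY
export AWS_REGION=YOUR_AWS_REGION
export AWS_BUCKET_NAME=YOUR_AWS_BUCKET_NAME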

Pathway CLI

Pathway provides several tools that simplify both cloud deployment and development in general.

One of these tools is the Pathway CLI. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. The spawn command, for instance, lets you run code using multiple computational threads or processes: pathway spawn python main.py runs your locally hosted main.py file using Pathway.
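
To use several worker threads, you could pass the spawn command's threads option, as in the sketch below; check pathway spawn --help for the exact flags available in your Pathway version.

pathway spawn --threads 4 python main.py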

This tutorial highlights another feature: the ability to run code directly from a GitHub repository, even if it's not hosted locally.

Take the airbyte-to-deltalake example mentioned earlier. You can run it from the command line by setting two environment variables: GITHUB_PERSONAL_ACCESS_TOKEN for your GitHub PAT and PATHWAY_LICENSE_KEY for your Pathway license key. Then, simply call pathway spawn using --repository-url to define the GitHub repository to run.

This approach allows you to run remotely hosted code as follows:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    pathway spawn --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py

When the --repository-url parameter is provided, the CLI automatically handles checking out the repository, installing any dependencies listed in the requirements.txt file within an isolated environment, and running the specified file.

Pathway CLI

Additionally, you can use the PATHWAY_SPAWN_ARGS environment variable as a shortcut for running pathway spawn. This allows you to run code from a GitHub repository with the following command:

GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
    PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
    PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py' \
    pathway spawn-from-env

Running the Example in Azure

The Pathway framework makes it simple to deploy programs on Azure using the Pathway - BYOL listing, available in the Azure Marketplace. This listing is free.

We recommend using the Azure Marketplace offering because it's straightforward: just follow a four-step deployment wizard and set up your project-specific settings. For detailed instructions on using this wizard, refer to the "Easy deployment with Azure Marketplace" section below.

If the Marketplace solution doesn't meet your needs, you can also deploy using Azure Container Instances. This method is outlined in the "Running a publicly available container in Azure Container Instances" section below. However, keep in mind that it's more complex.

Easy deployment with Azure Marketplace

Running a publicly available container in Azure Container Instances
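
If you go the Container Instances route, the deployment essentially comes down to a single az container create call that starts the publicly available Pathway container with the environment variables described earlier. The sketch below is illustrative only: the image name, resource group, and variable values are placeholders, so check the example's README and the Marketplace listing for the exact configuration.

az container create \
    --resource-group YOUR_RESOURCE_GROUP \
    --name pathway-airbyte-to-deltalake \
    --image pathwaycom/pathway:latest \
    --restart-policy Never \
    --command-line "pathway spawn-from-env" \
    --secure-environment-variables \
        GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
        PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
        AWS_S3_ACCESS_KEY=YOUR_AWS_S3_ACCESS_KEY \
        AWS_S3_SECRET_ACCESS_KEY=YOUR_AWS_S3_SECRET_ACCESS_KEY \
    --environment-variables \
        AWS_REGION=YOUR_AWS_REGION \
        AWS_BUCKET_NAME=YOUR_AWS_BUCKET_NAME \
        PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py'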

Accessing the Execution Results

After the service has successfully started and performed the ETL step, you can verify that the results are in the S3-based Delta Lake using the delta-rs Python package.

launch.py
from deltalake import DeltaTable

# The S3 credentials and the s3_output_path used below are assumed to be
# defined earlier, for example read from the same environment variables
# that were passed to the container.
storage_options = {
    "AWS_ACCESS_KEY_ID": AWS_S3_ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": AWS_S3_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "AWS_BUCKET_NAME": AWS_BUCKET_NAME,

    # Disabling DynamoDB sync since there are no parallel writes into this Delta Lake
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

delta_table = DeltaTable(
    s3_output_path,
    storage_options=storage_options,
)
pd_table_from_delta = delta_table.to_pandas()

pd_table_from_delta.shape[0]
862

You can also cross-check this number: there were indeed 862 commits in the pathwaycom/pathway repository at the time this text was written.

Conclusions

Cloud deployment is a key part of developing advanced projects. It lets you deploy solutions that run reliably and predictably, while also allowing for flexible resource management, increased stability, and the ability to choose application availability zones.

However, it can be complex, especially for beginners who have to navigate containers, cloud services, virtual machines, and many other components.

This tutorial explains how to simplify program deployment on the Azure cloud using the Pathway CLI and the Azure Marketplace listing. It also provides guidance on deploying with Azure Container Instances as an alternative if the Marketplace listing isn't an option for you.

Two Deployment Options Comparison

Feel free to try it out and clone the example repository to develop your own data extraction solutions. We also welcome your feedback in our Discord community!