Deploy to Azure
If you've already gone through the AWS Deployment tutorial, feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the Running the Example in Azure section for more advanced content.
The Pathway framework enables you to define and run various data processing pipelines. You can find numerous tutorials that guide you through building systems like log monitoring, ETL pipelines with Kafka, or data preparation for Spark analytics.
Once you've developed and tested these pipelines locally, the next logical step is to deploy them in the cloud. Cloud deployment allows your code to run remotely, minimizing interruptions from local machine issues. This step is crucial for moving your code into a production-ready environment.
There are several ways to deploy your code to the cloud: for example, you can deploy it on GCP, with Render, or on AWS Fargate. In this tutorial, you will learn how to deploy your Pathway code in the Azure ecosystem using either the Azure Marketplace offering or Azure Container Instances, together with Pathway's tools.
The tutorial is structured as follows:
- Description of the ETL example pipeline.
- Instructions on using the Pathway CLI to run GitHub-hosted code.
- Step-by-step guide to setting up a deployment with either the Azure Marketplace offering or Azure Container Instances.
- Verification of the results.
- Conclusions.
Before you continue, please ensure your project meets these basic requirements:
- The project is hosted on a public GitHub repository.
- The requirements.txt file in the root directory lists all the Python dependencies for the project.
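Before pushing the project, the second requirement can be checked locally. The sketch below is only an illustration: the helper name has_root_requirements is hypothetical, and it checks only that the file exists in the given directory and lists at least one dependency, without validating the package names.

```python
from pathlib import Path


def has_root_requirements(project_dir: str) -> bool:
    """Return True if requirements.txt exists in project_dir and lists a dependency."""
    req_file = Path(project_dir) / "requirements.txt"
    if not req_file.is_file():
        return False
    lines = req_file.read_text(encoding="utf-8").splitlines()
    # Blank lines and comment lines do not count as dependencies.
    return any(line.strip() and not line.strip().startswith("#") for line in lines)
```

Calling has_root_requirements(".") from the repository root gives a quick yes/no answer before you attempt a deployment.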
ETL Example Pipeline
Let's take the "Data Preparation for Spark Analytics" tutorial as an example. This tutorial walks you through building an ETL process that tracks GitHub commit history, removes sensitive data, and loads the results into a Delta Lake. For a detailed explanation, you can refer to the article that covers this task in depth.
The tutorial's code is available in a GitHub repository. A few changes have been made to simplify the process:
- The GitHub PAT (Personal Access Token) can now be read from an environment variable.
- Spark computations have been removed since they aren't necessary in a cloud-based container.
Additionally, the README file has been updated to offer more guidance on using Pathway CLI tools to run the project.
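Reading the PAT from an environment variable, as in the first change listed above, could look roughly like the following. This is a minimal sketch and not the repository's actual code: the function name read_github_pat is hypothetical, and the variable name GITHUB_PERSONAL_ACCESS_TOKEN is the one used in the run commands of this tutorial.

```python
import os


def read_github_pat(env=None) -> str:
    """Return the GitHub PAT from the environment, failing early if it is missing."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_PERSONAL_ACCESS_TOKEN", "").strip()
    if not token:
        # Failing here gives a clearer error than a rejected API call later on.
        raise RuntimeError("GITHUB_PERSONAL_ACCESS_TOKEN is not set")
    return token
```

Keeping the token out of the source code also means the same code can run locally and in a cloud container, with only the environment differing.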
There's an important point to consider regarding the task's output. Originally, there were two possible output modes: storing data in a locally-based Delta Lake or in an S3-based Delta Lake. In cloud deployment, using a locally-based Delta Lake isn't practical because it only exists within the container on a remote cloud worker and isn't accessible to the user. Therefore, this tutorial uses an S3-based Delta Lake to store the results, as it provides easy access afterward. This approach requires additional environment variables so that the container can access the S3 service; they are covered in the deployment sections.
Pathway CLI
Pathway provides several tools that simplify both cloud deployment and development in general.
One of these tools is the Pathway CLI. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. For example, the spawn command lets you run code using multiple computational threads or processes: pathway spawn python main.py runs your locally hosted main.py file using Pathway.
This tutorial highlights another feature: the ability to run code directly from a GitHub repository, even if it's not hosted locally.
Take the airbyte-to-deltalake example mentioned earlier. You can run it from the command line by setting two environment variables: GITHUB_PERSONAL_ACCESS_TOKEN for your GitHub PAT and PATHWAY_LICENSE_KEY for your Pathway license key. Then, simply call pathway spawn using --repository-url to define the GitHub repository to run.
This approach allows you to run remotely hosted code as follows:
GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
pathway spawn --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py
When the --repository-url parameter is provided, the CLI automatically handles checking out the repository, installing any dependencies listed in the requirements.txt file within an isolated environment, and running the specified file.
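If you prefer to launch the same command from a Python script, for instance from a deployment helper, it could be wrapped as in the sketch below. The two function names are hypothetical; the command line and the two environment variables are the ones shown above, and the pathway executable is assumed to be on your PATH.

```python
import os
import subprocess


def build_spawn_command(repository_url: str, program: list[str]) -> list[str]:
    """Build the argv for: pathway spawn --repository-url <url> <program...>."""
    return ["pathway", "spawn", "--repository-url", repository_url, *program]


def run_from_repository(repository_url, program, github_pat, license_key):
    """Run the command with the two variables added to a copy of the environment."""
    env = dict(os.environ)
    env["GITHUB_PERSONAL_ACCESS_TOKEN"] = github_pat
    env["PATHWAY_LICENSE_KEY"] = license_key
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(
        build_spawn_command(repository_url, program), env=env, check=True
    )
```

Passing the command as a list avoids shell quoting issues, and copying the environment keeps the secrets out of the parent process's own variables.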
Additionally, you can use the PATHWAY_SPAWN_ARGS environment variable as a shortcut for running pathway spawn. This allows you to run code from a GitHub repository with the following command:
GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \
PATHWAY_LICENSE_KEY=YOUR_PATHWAY_LICENSE_KEY \
PATHWAY_SPAWN_ARGS='--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py' \
pathway spawn-from-env
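When the value of PATHWAY_SPAWN_ARGS is generated by a script, for example while preparing a container configuration, it can be assembled from a list of arguments. The sketch below uses the standard library's shlex.join; the helper name make_spawn_args is hypothetical. For arguments without spaces or special characters, the result is the plain space-separated string used in the command above. How the CLI treats quoted arguments is not covered here, so treat any quoting behaviour as an assumption.

```python
import shlex


def make_spawn_args(repository_url: str, program: list[str]) -> str:
    """Join the spawn arguments into a single string for PATHWAY_SPAWN_ARGS."""
    return shlex.join(["--repository-url", repository_url, *program])


spawn_args = make_spawn_args(
    "https://github.com/pathway-labs/airbyte-to-deltalake", ["python", "main.py"]
)
print(spawn_args)
# --repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py
```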
Running the Example in Azure
The Pathway framework makes it simple to deploy programs on Azure using the Pathway - BYOL listing, available in the Azure Marketplace. This listing is free.
We recommend using the Azure Marketplace offering because it's straightforward: just follow a four-step deployment wizard and set up your project-specific settings. For detailed instructions on using this wizard, refer to the first dropdown section.
If the Marketplace solution doesn't meet your needs, you can also deploy using Azure Container Instances. This method is outlined in the second section of this tutorial. However, keep in mind that it's more complex.
Easy deployment with Azure Marketplace
Running a publicly available container in Azure Container Instances
Accessing the Execution Results
After the service has successfully started and completed the ETL step, you can verify that the results are in the S3-based Delta Lake using the delta-rs Python package.
from deltalake import DeltaTable
storage_options = {
    "AWS_ACCESS_KEY_ID": AWS_S3_ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": AWS_S3_SECRET_ACCESS_KEY,
    "AWS_REGION": AWS_REGION,
    "AWS_BUCKET_NAME": AWS_BUCKET_NAME,
    # Disabling DynamoDB sync since there are no parallel writes into this Delta Lake
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}
delta_table = DeltaTable(
    s3_output_path,
    storage_options=storage_options,
)
pd_table_from_delta = delta_table.to_pandas()
pd_table_from_delta.shape[0]
862
You can also verify the count: there were indeed 862 commits in the pathwaycom/pathway repository as of the time this text was written.
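The snippet above assumes that the credentials are already available as Python variables. One way to supply them without hard-coding is to read them from environment variables, as in the sketch below. The environment variable names and both helper names are hypothetical (they simply mirror the Python variable names used above), and the bucket is assumed to be part of the table URI rather than a separate option.

```python
import os


def storage_options_from_env(env=None) -> dict:
    """Build delta-rs storage options from (hypothetically named) environment variables."""
    env = os.environ if env is None else env
    return {
        "AWS_ACCESS_KEY_ID": env["AWS_S3_ACCESS_KEY"],
        "AWS_SECRET_ACCESS_KEY": env["AWS_S3_SECRET_ACCESS_KEY"],
        "AWS_REGION": env["AWS_REGION"],
        # Same setting as in the snippet above: there are no parallel writers.
        "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
    }


def count_rows(table_uri: str, storage_options: dict) -> int:
    """Load the Delta table into pandas and return its number of rows."""
    # Imported lazily so that the helper above works without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, storage_options=storage_options)
    return len(table.to_pandas())
```

With the variables exported in your shell, count_rows(s3_output_path, storage_options_from_env()) should return the same row count as the check above.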
Conclusions
Cloud deployment is a key part of developing advanced projects. It lets you deploy solutions that run reliably and predictably, while also allowing for flexible resource management, increased stability, and the ability to choose application availability zones.
However, it can be complex, especially for beginners who might face a system with containers, cloud services, virtual machines, and many other components.
This tutorial explains how to simplify program deployment on the Azure cloud using the Pathway CLI and the Azure Marketplace listing. It also provides guidance on deploying with Azure Container Instances as an alternative if the Marketplace listing isn't an option for you.
Feel free to try it out and clone the example repository to develop your own data extraction solutions. We also welcome your feedback in our Discord community!