Core Concepts
A review of the core concepts behind data representation and data transformation in the programming layer of Pathway.
These are the core concepts you need to know about Pathway:
Pipeline construction with connectors and transformations
In Pathway, there is a clear divide between the definition of the pipeline and the execution of the computation. The pipeline defines the sources of the data, the transformations performed on the data, and where the results are sent. The pipeline is a recipe for your processing, defining the ingredients (the data sources) and the different operations (the transformations): it does not contain any actual food (data). Once your pipeline is built, you can run the computation to ingest and process the data.
In this section, you will learn how to build your data pipeline.
Connect to your data sources with input connectors
Connectors are Pathway's interface with external systems, for both extracting the input data and sending the output data. Extracting input data from external data sources is done using input connectors.
Input connectors return a Pathway table which represents a snapshot of the input stream (more information on tables below).
You need to provide a schema to the input connector.
For example, let's consider a data stream made of events containing two fields: name, a string used as the primary key, and age, an integer value.
class InputSchema(pw.Schema):
    name: str = pw.column_definition(primary_key=True)
    age: int
Using the schema, the input connector knows how to format the data. Pathway comes with many input connectors:
input_table = pw.io.csv.read("./input_dir/", schema=InputSchema)
The connector listens for incoming data and updates the resulting table accordingly.
You can find more information about the available connectors here.
Tables: dynamic content with static schema
In Pathway, data is modeled as tables, representing snapshots of the data streams.
All the data is modeled as tables, similar to classical relational tables, organized into columns and rows. Tables have a static schema, but their content is dynamic and is updated whenever a new event is received. A table is a snapshot of a data stream and is the latest state of all the events that have been received up to the current processing time.
For example, if two events have been received:
There is one row per entry. Each new entry is seen as an addition and represented by a new row.
Pathway also supports the removal and the update of entries. An update is represented by removing the previous entry and adding the new version at the same time:
Since the column name is the primary key, the reception of a new event with an already existing name value (Bob here) is an update of the associated row in the table.
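The snapshot semantics described above can be illustrated with a minimal plain-Python sketch. It does not use Pathway and does not reflect how Pathway stores tables internally: the apply_event helper and the sample ages are hypothetical, and only model the idea that an event with an already existing primary key replaces the previous row.

```python
# Toy model of a table keyed by its primary key (here, `name`).
# Plain Python only: `apply_event` and the sample values are hypothetical.

def apply_event(table, name, age):
    """Apply one event to `table` (a dict name -> age).

    Returns the resulting changes as (name, age, change) tuples, where
    change is +1 for an added row and -1 for a removed row.
    """
    changes = []
    if name in table:
        # An update is the removal of the old row plus the addition of the new one.
        changes.append((name, table[name], -1))
    table[name] = age
    changes.append((name, age, 1))
    return changes


table = {}
apply_event(table, "Alice", 25)          # addition
apply_event(table, "Bob", 32)            # addition
changes = apply_event(table, "Bob", 40)  # update of Bob's row

print(table)    # {'Alice': 25, 'Bob': 40}
print(changes)  # [('Bob', 32, -1), ('Bob', 40, 1)]
```

The table always holds one row per primary key, which is the "latest state" view of the stream.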
As Pathway handles potentially infinite and ever-changing streaming data, the number of rows and their content change over time as new data comes into the system.
Processing the data with transformations
Pathway provides operators to modify the tables, such as select or join.
These operators are called transformations.
Pathway has a functional programming approach: each transformation returns a new table, leaving the input table unchanged.
Using the transformations, you can define your processing pipeline, sequentially specifying the transformations your data will go through.
For example, you can define a pipeline filtering on the column age, keeping only the entries with a non-negative value.
Then, you sum all the values and store the result in a single-row table with a single column sum_age.
Here is a way to do it with Pathway, assuming a correct input table called input_table:
filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(
    sum_age=pw.reducers.sum(filtered_table.age)
)
It's okay if you don't understand everything for now. Here are the takeaways:
- Each line produces a new table from the previous one. The first line filters the values and the second does the sum.
- Running this code snippet does not run any computation on data.
The last point, which may seem counter-intuitive at first, is discussed later in this article.
Don't hesitate to read our article about Pathway basic transformations.
External functions and LLMs
Pathway provides many ready-to-use transformations, but they may not be enough for your project. If you don't find what you need, don't worry: you can use any Python function in your pipeline. Pathway allows you to seamlessly integrate with Python machine learning libraries, use LLMs, and call synchronous and asynchronous APIs.
Send the results to external systems using output connectors
The data is sent out of Pathway using output connectors.
Output connectors are used to configure the connection to the chosen location (Kafka, PostgreSQL, etc.):
pw.io.csv.write(table, "./output_file.csv")
The connector forwards the changes to the table.
Pathway comes with many output connectors; you can learn more about them in our dedicated article.
Output connectors send the table to the chosen location (Kafka, PostgreSQL, etc.) as a stream of updates.
The output is a data stream.
The tables are produced out of the system as data streams.
The tables you want to send out of Pathway to external systems are also dynamic. They are updated as new events enter the system. These changes are forwarded to external systems as a data stream: the updates are represented by new events. As before, an event can represent the addition or the removal of an entry, and an update is represented by the removal of the old entry and the addition of the new value.
Pathway handles the data in an incremental way. Instead of sending the entire version of the table whenever there is an update, Pathway only sends the changes to the table to be more efficient. Only the rows that are affected by the changes are sent to the external system.
For example, consider the case of a Pathway pipeline computing the sum of the ages. This value is stored in a single-row table. Before Bob's update, the table holds the sum computed from the earlier events; upon reception of Bob's new value, the sum is updated. This update is propagated to external systems (Kafka, PostgreSQL, etc.) as the removal of the old entry and the insertion of the new one:
The diff column represents whether the value has been added (diff=1) or removed (diff=-1).
Both events are issued at the same time (t_3 in this example), in no particular order.
The time of emission is also included in a column time.
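As a rough illustration of this update stream, here is a minimal plain-Python sketch of the two events emitted when a single-row sum changes. It does not use Pathway: the sum_updates helper is hypothetical, the sums are made-up values, and times are plain integers standing in for processing times.

```python
# Toy model of the update stream of a single-row "sum" table.
# All names and values are hypothetical.

def sum_updates(old_sum, new_sum, time):
    """Return the output events for a change of the sum at a given time."""
    if old_sum == new_sum:
        return []  # nothing changed, so nothing is sent
    return [
        {"sum_age": old_sum, "time": time, "diff": -1},  # removal of the old row
        {"sum_age": new_sum, "time": time, "diff": 1},   # insertion of the new row
    ]


events = sum_updates(old_sum=57, new_sum=65, time=3)
for event in events:
    print(event)
# {'sum_age': 57, 'time': 3, 'diff': -1}
# {'sum_age': 65, 'time': 3, 'diff': 1}
```

Both events carry the same time value, which is why a consumer should not rely on their relative order.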
In practice, not all systems support data streams.
Pathway output connectors adapt the updates to match the system constraints.
For example, the Pathway PostgreSQL connector sends the output into a PostgreSQL table: it will not insert the 1 and -1 values in a separate column, it will update the table directly in real time.
On the other hand, if the results are output to a CSV file, the new events are appended to the end of the file with the columns diff and time.
Writing the results as a log in the CSV file keeps the approach incremental and avoids deleting and rewriting the entire CSV file at each update.
This is why, in this case, the diff and time columns are added.
Note that, for readability of the CSV output, only the previous value is shown. In practice, all the previous updates are written.
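To see why an append-only log with time and diff columns is sufficient, the following plain-Python sketch replays such a log to recover the current content of the table. The CSV content is made up for illustration, and the replay logic is only an assumption about how a consumer could read the file, not part of Pathway.

```python
import csv
import io
from collections import Counter

# Hypothetical content of an output CSV file: each line is one change,
# with the `time` and `diff` columns. The values are made up.
log = io.StringIO(
    "sum_age,time,diff\n"
    "25,1,1\n"
    "25,2,-1\n"
    "57,2,1\n"
    "57,3,-1\n"
    "65,3,1\n"
)

# Replay the log: add `diff` to the multiplicity of each row, then keep the
# rows whose multiplicity is still positive at the end.
counts = Counter()
for row in csv.DictReader(log):
    counts[row["sum_age"]] += int(row["diff"])

current_rows = [value for value, count in counts.items() if count > 0]
print(current_rows)  # ['65']
```

Because the multiplicities are simply summed, the result does not depend on the order of the events that share the same time.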
Dataflow
Transformations and connectors are used to define a pipeline: they are used to build the pipeline, but they do not trigger any computation.
In Pathway, the processing pipeline is modeled using a graph. This graph, called the dataflow, models the different transformation steps performed on the data. Each table is a node, linked with other nodes (tables) by transformations.
For example, the previous Pathway pipeline is represented as follows:
This dataflow is the core of Pathway. The user creates a pipeline which is translated into a dataflow by Pathway. The graph is built by the calls to Pathway operators but, at that point, no computations are done: there is simply no data.
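The separation between building the graph and running it can be sketched in a few lines of plain Python. This is not Pathway's API or implementation: Node, reduce_sum, and run are hypothetical names, and the toy graph is evaluated on a static list rather than on a stream.

```python
# Toy illustration of "build first, run later".
# `Node`, `reduce_sum`, and `run` are hypothetical names for this sketch.

class Node:
    def __init__(self, name, func=None, parent=None):
        self.name = name
        self.func = func      # transformation producing this node from its parent
        self.parent = parent  # upstream node, None for a source

    def filter(self, predicate):
        # Only records the step in the graph; no data is touched here.
        return Node("filter", lambda rows: [r for r in rows if predicate(r)], self)

    def reduce_sum(self):
        return Node("sum", lambda rows: [sum(rows)], self)


def run(node, data):
    """Walk the graph from the source down to `node` and compute its content."""
    if node.parent is None:
        return list(data)
    return node.func(run(node.parent, data))


source = Node("input")
result = source.filter(lambda age: age >= 0).reduce_sum()  # graph only, no data yet

print(run(result, [25, -1, 32]))  # [57]
```

Until run is called, result is only a chain of nodes describing the processing, which mirrors the role of the dataflow before the computation starts.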
Running the computation with the Rust engine
Run the computation with pw.run()
Now that your pipeline is fully ready, with both connectors and transformations, you can run the computation with the command run:
pw.run()
And that's it! With this, running your code will launch the computation. Each update in the input data streams will automatically trigger the update of the relevant data in the pipeline.
For example, consider our complete example:
import pathway as pw


class InputSchema(pw.Schema):
    name: str = pw.column_definition(primary_key=True)
    age: int


# kafka_settings is assumed to be defined beforehand (your Kafka connection settings).
input_table = pw.io.kafka.read(kafka_settings, schema=InputSchema, topic="topic1", format="json")
filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age=pw.reducers.sum(filtered_table.age))
pw.io.kafka.write(result_table, kafka_settings, topic_name="topic2", format="json")

pw.run()
The reception of a new value in Kafka triggers the insertion of a new row in input_table.
This then triggers the update of filtered_table and possibly of result_table.
The changes are propagated until they no longer have any impact, until the altered rows are filtered out by a filter, or until they reach the output connector.
In the last case, the changes are forwarded to the external system.
Pathway listens to the data sources for new updates until the process is terminated: the computation runs forever until the process gets killed. This is the normal behavior of Pathway.
During the whole run, the dataflow maintains the latest version of the data in order to enable quick updates: instead of ingesting all the data from scratch in the graph, only the relevant parts are updated to take into account the new data.
In our example, at the reception of a new value, the sum is not recomputed from scratch but only incremented by that value, making the computation faster.
This dataflow allows the user to focus on the intended behavior of the processing pipeline, as if the data were static, and Pathway handles the updates on its own using the dataflow.
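The incremental update of the sum can be sketched in plain Python as follows. This is a toy model under stated assumptions, not Pathway's engine: the IncrementalSum class and its on_change method are hypothetical, and the ages are made-up values.

```python
# Toy incremental operator: a filter followed by a running sum.
# Hypothetical names; the real engine is far more general.

class IncrementalSum:
    def __init__(self):
        self.total = 0

    def on_change(self, age, diff):
        """Process one change (diff=1 for an addition, diff=-1 for a removal).

        Returns (old_sum, new_sum), or None if the change is filtered out.
        """
        if age < 0:
            return None  # filtered out: the change stops propagating here
        old = self.total
        self.total += diff * age  # adjust the sum instead of recomputing it
        return (old, self.total)


op = IncrementalSum()
print(op.on_change(25, 1))   # (0, 25)
print(op.on_change(-4, 1))   # None
print(op.on_change(32, 1))   # (25, 57)

# Update of an existing entry from 32 to 40: one removal, then one addition.
op.on_change(32, -1)
print(op.on_change(40, 1))   # (25, 65)
```

Each change costs a constant amount of work, whatever the number of rows already received.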
Fast in-memory computations thanks to a powerful Rust Engine
In Pathway, both the storage of the tables and the computations are done in memory. This can raise legitimate concerns about memory and speed. Indeed, Python is not known to be the most efficient language: it is a dynamically typed and interpreted language. Worse, its infamous GIL limits parallelism and concurrency...
Fortunately, Pathway comes with a powerful Rust engine which takes over once the pipeline is ready. Python is used for its accessibility to describe the (typed) pipeline, but the dataflow is built and maintained by the Pathway engine. The Pathway engine removes the limits associated with Python. In particular, Pathway natively supports multithreading and multiprocessing, and can be distributed using Kubernetes. The content of the tables is handled by Pathway's Rust engine, making it very memory-efficient. Similarly, most of the transformations are handled at the Rust level, making the processing very fast.
If you add the incremental nature of the computations, you end up with the fastest data processing engine on the market.
Static mode
With Pathway, it doesn't matter if you are dealing with static or streaming data. The same pipeline can be used for both kinds of data: Pathway's engine provides consistent outputs in both cases. You can combine real-time and historical data in the same code logic.
In the static mode, all the data is loaded and processed at once, and then the process terminates. Unlike in the streaming mode, it does not wait for new data.
You can learn more about the streaming and static mode in our dedicated article.