Core concepts

A review of the core concepts behind data representation and data transformation in the programming layer of Pathway.

These are the core concepts you need to know about Pathway:

Pipeline construction with connectors and transformations

In Pathway, there is a clear divide between the definition of the pipeline and the execution of the computation. The pipeline defines the sources of the data, the transformations performed on the data, and where the results are sent. The pipeline is a recipe for your processing, defining the ingredients (the data sources) and the different operations (the transformations): it does not contain any actual food (the data). Once your pipeline is built, you can run the computation to ingest and process the data.

Pathway pipeline

In this section, you will learn how to build your data pipeline.

Connect to your data sources with input connectors

Connectors are Pathway's interface with external systems, for both extracting the input data and sending the output data. Extracting input data from external data sources is done using input connectors.

The input connector ingests the data into Pathway.

Input connectors return a Pathway table which represents a snapshot of the input stream (more information on tables below).

You need to provide a schema to the input connector. For example, let's consider a data stream made of events containing two fields: name, a string used as the primary key, and age, an integer.

class InputSchema(pw.Schema):
    name: str = pw.column_definition(primary_key=True)
    age: int

Using the schema, the input connector knows how to format the data. Pathway comes with many input connectors:

CSV
input_table = pw.io.csv.read("./input_dir/", schema=InputSchema)
Kafka
input_table = pw.io.kafka.read(rdkafka_settings, schema=InputSchema, topic="topic1", format="json")
CDC
input_table = pw.io.debezium.read(rdkafka_settings, schema=InputSchema, topic_name="pets")

The connector listens for incoming data and updates the resulting table accordingly.

You can find more information about the available connectors here.

Tables: dynamic content with static schema

In Pathway, data is modeled as tables, representing snapshots of the data streams.

All the data is modeled as tables, similar to classical relational tables, organized into rows and columns. Tables have a static schema, but their content is dynamic and is updated whenever a new event is received. A table is a snapshot of a data stream: it is the latest state of all the events that have been received up to the current processing time.

For example, if two events have been received:

A table is a summary of the event stream at the current time.

There is one row per entry. Each new entry is seen as an addition and represented by a new row.

A new entry results in the addition of a new row to the table.

Pathway also supports the removal and the update of entries. An update is represented by removing the previous entry and adding the new version at the same time:

A new update is a removal followed by an addition.

Since the name column is the primary key, receiving a new event with an already existing name value (Bob here) updates the associated row in the table.

As Pathway handles potentially infinite and ever-changing streaming data, the number of rows and the content changes with time as new data comes into the system. You can learn more about the tables in the dedicated article.
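To experiment with this behavior without setting up an external data source, you can simulate a small event stream with Pathway's debug helpers. The snippet below is a minimal sketch, assuming that pw.debug.table_from_markdown accepts the special __time__ and __diff__ columns to declare when entries are added or removed; the names and ages are hypothetical.

import pathway as pw

# Bob's age is updated at time 4: the old row is removed (__diff__ == -1)
# and the new version is added (__diff__ == 1) at the same time.
simulated_stream = pw.debug.table_from_markdown(
    """
    name  | age | __time__ | __diff__
    Alice | 25  | 2        | 1
    Bob   | 32  | 2        | 1
    Bob   | 32  | 4        | -1
    Bob   | 28  | 4        | 1
    """
)

# Prints the final state of the table: one row for Alice and one for Bob (age 28).
pw.debug.compute_and_print(simulated_stream)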

Processing the data with transformations

Pathway provides operators to modify the tables such as select or join. These operators are called transformations. Pathway has a functional programming approach: each transformation returns a new table, leaving the input table unchanged. Using the transformations, you can define your processing pipeline, sequentially specifying the transformations your data will go through.

A transformation returns a new table.

For example, you can define a pipeline filtering on the column age, keeping only the entries with a positive value. Then, you sum all the values and store the result in a single-row table with a single column sum_age. Here is a way to do it with Pathway, assuming a correct input table called input_table:

filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age=pw.reducers.sum(filtered_table.age))

It's okay if you don't understand everything for now. Here are the takeaways:

  • Each line produces a new table from the previous one. The first line filters the values and the second does the sum.
  • Running this code snippet does not run any computation on data.

The last point, which may seem counter-intuitive at first, is discussed later in this article.

Don't hesitate to read our article about Pathway basic transformations.

External functions and LLMs

Pathway provides many ready-to-use transformations, but they may not be enough for your project. If you don't find what you need, don't worry: you can use any Python function in your pipeline. Pathway allows you to seamlessly integrate with Python machine learning libraries, use LLMs, and call synchronous and asynchronous APIs.
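For instance, a regular Python function can be applied to a column with the pw.udf decorator. The snippet below is only a minimal sketch: normalize_name is a hypothetical function, and input_table is assumed to be the table defined earlier.

import pathway as pw

@pw.udf
def normalize_name(name: str) -> str:
    # Any Python code can run here, including calls to ML libraries or external APIs.
    return name.strip().title()

normalized_table = input_table.select(
    name=normalize_name(input_table.name),
    age=input_table.age,
)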

Send the results to external systems using output connectors

The data is sent out of Pathway using output connectors.

Output connectors send the data out of Pathway.

Output connectors are used to configure the connection to the chosen location (Kafka, PostgreSQL, etc.):
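For instance, a minimal sketch with placeholder settings objects, file path, and table name (the exact parameters may differ between connector versions):

CSV
pw.io.csv.write(result_table, "./output_file.csv")
Kafka
pw.io.kafka.write(result_table, rdkafka_settings, topic_name="topic2", format="json")
PostgreSQL
pw.io.postgres.write(result_table, postgres_settings, "sum_table")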

The connector forwards the changes to the table.

Pathway comes with many output connectors; you can learn more about them in our dedicated article.

Output connectors send the table to the chosen location (Kafka, PostgreSQL, etc.) as a stream of updates.

The output is a data stream.

Tables are sent out of the system as data streams.

The tables you want to send out of Pathway to external systems are also dynamic: they are updated as new events enter the system. These changes are forwarded to external systems as a data stream, in which the updates are represented by new events. As previously, an event can represent the addition or the removal of an entry, and an update is represented by the removal of the old entry together with the addition of its new version.

Pathway handles the data in an incremental way. Instead of sending the entire version of the table whenever there is an update, Pathway only sends the changes to the table to be more efficient. Only the rows that are affected by the changes are sent to the external system.

For example, consider the case of a Pathway pipeline computing the sum of the ages. This value is stored in a single-row table. Upon reception of Bob's new value, the sum is updated, and the update is propagated to external systems (Kafka, PostgreSQL, etc.) as the removal of the old entry and the insertion of the new one:

The changes to the output are sent as an event stream.

The diff column indicates whether the value has been added (diff=1) or removed (diff=-1). Both events are issued at the same time (t_3 in this example), with no particular order between them. The time of emission is also included, in a column called time.
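If you want to inspect this update stream during development, the debug helpers can print it directly. The line below is a minimal sketch, assuming the pw.debug.compute_and_print_update_stream helper and the result_table defined earlier; it prints each change together with its time and diff values.

pw.debug.compute_and_print_update_stream(result_table)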

In practice, not all systems support data streams. Pathway output connectors adapt the updates to match the constraints of the target system. For example, the Pathway PostgreSQL connector sends the output to a PostgreSQL table: it does not insert the 1 and -1 values in a separate column, it updates the table directly in real time. On the other hand, if the results are output to a CSV file, the new events are appended to the end of the file together with the diff and time columns. Writing the results as a log keeps the approach incremental and avoids removing and rewriting the entire CSV file at each update. This is why, in this case, the diff and time columns are added.

Different systems will handle the event stream differently.

Note that, for readability of the CSV output, only the previous value is shown. In practice, all the previous updates are written.

Dataflow

Transformations and connectors are used to define a pipeline: they build the pipeline, but they do not trigger any computation.

In Pathway, the processing pipeline is modeled using a graph. This graph, called the dataflow, models the different transformation steps performed on the data. Each table is a node, linked with other nodes (tables) by transformations.

For example, the previous Pathway pipeline is represented as follows:

Dataflow of our example before being run.

This dataflow is the core of Pathway. The user creates a pipeline which is translated into a dataflow by Pathway. The graph is built by the calls to Pathway operators but, at that point, no computations are done: there is simply no data.

Running the computation with the Rust engine

Run the computation with pw.run()

Now that your pipeline is fully ready, with both connectors and transformations, you can run the computation with the command run:

pw.run()

And that's it! With this, running your code will launch the computation. Each update in the input data streams will automatically trigger the update of the relevant data in the pipeline.

For example, consider our complete example:

import pathway as pw


class InputSchema(pw.Schema):
    name: str = pw.column_definition(primary_key=True)
    age: int


input_table = pw.io.kafka.read(kafka_settings, schema=InputSchema, topic="topic1", format="json")
filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age=pw.reducers.sum(filtered_table.age))
pw.io.kafka.write(result_table, kafka_settings, topic="topic2", format="json")

pw.run()
Dataflow of our example

The reception of a new value in Kafka triggers the insertion of a new row in input_table. This then triggers the update of filtered_table and possibly of result_table. The changes are propagated through the dataflow until they have no impact anymore (for example, when the altered rows are filtered out) or until they reach the output connector. In the latter case, the changes are forwarded to the external system.

The reception of the update is propagated in the dataflow.

Pathway listens to the data sources for new updates until the process is terminated: the computation runs forever until the process gets killed. This is the normal behavior of Pathway.

During the whole run, the dataflow maintains the latest version of the data in order to enable quick updates: instead of ingesting all the data from scratch in the graph, only the relevant parts are updated to take into account the new data.

In our example, at the reception of a new value, the sum is not recomputed from scratch but simply incremented by the new value, making the computation faster.

This dataflow allows the user to focus on the intended behavior of the processing pipeline, as if the data were static, and Pathway handles the updates on its own using the dataflow.

Fast in-memory computations thanks to a powerful Rust Engine

In Pathway, both the storage of the tables and the computations are done in memory. This can raise legitimate concerns about memory and speed. Indeed, Python is not known to be the most efficient language: it is a dynamically typed and interpreted language. Worse, its infamous GIL limits parallelism and concurrency...

Fortunately, Pathway comes with a powerful Rust engine which takes over once the pipeline is ready. Python is used, for its accessibility, to describe the (typed) pipeline, but the dataflow is built and maintained by the Pathway engine. The Pathway engine removes the limits associated with Python: in particular, Pathway natively supports multithreading and multiprocessing, and can be distributed using Kubernetes. The content of the tables is handled by Pathway's Rust engine, making it very memory-efficient. Similarly, most of the transformations are handled at the Rust level, making the processing very fast.

If you add the incremental nature of the computations, you end up with the fastest data processing engine on the market.

Static mode

With Pathway, it doesn't matter whether you are dealing with static or streaming data: the same pipeline can be used for both kinds of data, and Pathway's engine provides consistent outputs in both cases. You can combine real-time and historical data in the same code logic.

In the static mode, all the data is loaded and processed at once, and then the process terminates: unlike in the streaming mode, it does not wait for new data.
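For instance, the pipeline from the transformation section can be tested on a static, hard-coded table using the debug helpers. This is a minimal sketch with hypothetical values; in a real project, the static data would typically come from files read in static mode.

import pathway as pw

# A small static table, defined inline for testing purposes.
input_table = pw.debug.table_from_markdown(
    """
    name  | age
    Alice | 25
    Bob   | 32
    """
)

filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age=pw.reducers.sum(filtered_table.age))

# In static mode, compute_and_print runs the whole computation once and prints the result.
pw.debug.compute_and_print(result_table)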

You can learn more about the streaming and static mode in our dedicated article.