A review of the key concepts behind data representation and data transformation in the programming layer of Pathway.
To be able to use Pathway to the full extent of its capacities, it is useful to have an accurate understanding of how Pathway allows you to describe and manipulate data. This text is recommended for advanced users, or if you are simply curious about what operations are possible in Pathway.
Below, we provide a quick review of the following terms: table, index, pointer, id, connector, transformers, and computation graph.
Here is a quick summary of the relationship between those:
The Pathway programming framework is organized around work with data tables - a table is simply called
Table is similar to a classical relational table as it is organized into columns and rows.
Each row is identified by a unique id which is the equivalent of an SQL primary key or of an index in pandas.
Each column is a collection of pairs composed by an id and a value.
❌ id is a reserved column name, don't use it to create a new column.
Table comes with a number of available methods to work on tables such as
join: to call such a method
f on a table
t, the dot notation
t.f() is used.
Similarly, to access a given column, you can use the notation
t.col_name: in particular,
t.id returns the ids.
You can check out our First-steps guide and the full API documentation.
The ids of the rows, also called indexes, are automatically generated by Pathway's engine unless explicitly provided with the function
Reindexing can be costly, so it is better to let Pathway handle the indexes, unless some specific indexes are needed.
Furthermore, the type of the id column is imposed: an id is not a number or a string but a pointer (
pw.Pointer) which points to the corresponding row.
Table operations are:
- Table input/output through connectors (CSV, Kafka, Debezium).
- Table transformations:
- in the simplest case, you can independently map each row of the table to a new value,
- in the most complex case, you can apply a transformer to several tables to produce several tables (See
- the basic version just picks a subset of columns,
- more powerful versions also allow joins on ids and computed columns.
- fast and efficient between tables that share the same
idset, Pathway offers special syntax for this case,
- joins on arbitrary columns, which reset the
ids to the ones derived from the joining columns.
- fast and efficient between tables that share the same
- Group-bys and reduces.
Transformers may be combined into other transformers, as Python functions.
How are these operations meant to work in realtime with streaming data? Pathway's Transformer syntax serves to schedule operations. These operations will be later executed by the runtime engine to process actual data.
Because the outcome of operations can depend on the data being processed, the operations as such need to be defined in a data-independent way.
Supported methods are given in our API docs.
Pathway provides two ways of defining custom logic at the level of data rows.
- Using the
applyfunction to transform a table. This applies the logic described in the provided lambda function to all rows of a table, acting essentially as a map on rows.
- For more advanced operations, you can use a
transformer class. A transformer class describes how computed fields in new columns of a table are defined based on input fields. The logic is expressed in a declarative manner, which allows for recursion or computations spanning multiple fields of different rows or even different tables.
Transformer and transformer classes are used to design a processing pipeline for the data. In Pathway, the processing pipeline is modeled using a graph. This graph, called the computation graph, models the different transformation steps performed on the data. Each node is either a table, an operation performed on one or more tables (e.g. a transformer), or a connector. As an example, let's consider this graph:
In this example graph, two data sources are read using connectors, building two tables T1 and T2.
Those tables are modified using two operations t1 and t2, resulting in two new tables T1' and T2'.
The operations t1 and t2 can be any operation manipulating a single table, a
select with a
apply for instance.
Both tables are then combined by an operation requiring two tables, such as a
join, resulting in a table T3.
Finally, output connectors are used to output both T2' and T3.
This computation graph is the core of Pathway.
The user creates its pipeline which is translated into a computation graph by Pathway.
The graph is built by the calls to Pathway operators but at that point no computations are done: there is simply no data.
The data is ingested by a specific command
pw.run() and, as a streaming system, Pathway never terminates afterwards.
The data is propagated from the input connectors to output connectors.
During the whole run, the computation graph is maintained with the latest version of the data in order to enable quick updates: instead of ingesting all the data from scratch in the graph, the relevant parts are updated to take into account the new updates.
This computation graph allows the user to focus on the intended behavior of the processing pipeline, as if the data was static, and Pathway handles the updates on its own using the computation graph.
Pathway comes with connectors which connect to streaming data sources at input, as well as connectors which send notifications of changes to Pathway outputs. When the computation is run, input connectors listen to input sources for incoming updates (see our article on connectors for more information on this). Similarly, output connectors are writing the output data whenever updates are received. See how to set up your first realtime app and some of our tutorials for an illustration of connector use.