Data representation concepts

A review of the key concepts behind data representation and data transformation in the programming layer of Pathway.

To be able to use Pathway to the full extent of its capacities, it is useful to have an accurate understanding of how Pathway allows you to describe and manipulate data. This text is recommended for advanced users, or if you are simply curious about what operations are possible in Pathway.

Below, we provide a quick review of the following terms: table, index, pointer, id, universe, connector, transformer, and computation graph.

Here is a quick summary of the relationship between those:

Universe

Tables

The Pathway programming framework is organized around work with data tables - a table is simply called Table.

Pathway' Table is similar to a classical relational table as it is organized into columns and rows. Each row is identified by a unique id which is the equivalent of an SQL primary key or of an index in pandas. Each column is a collection of pairs composed by an id and a value.

id is a reserved column name, don't use it to create a new column.

Pathway's Table comes with a number of available methods to work on tables such as select or join: to call such a method f on a table t, the dot notation t.f() is used. Similarly, to access a given column, you can use the notation t.col_name: in particular, t.id returns the ids. You can check out our First-steps guide and the full API documentation.

The ids of the rows, also called indexes, are automatically generated by Pathway's engine unless explicitly provided with the function .with_id(). Reindexing can be costly, so it is better to let Pathway handle the indexes, unless some specific indexes are needed. Furthermore, the type of the id column is imposed: an id is not a number or a string but a pointer (pw.Pointer) which points to the corresponding row.

Universe

The universe of a table is a central concept in Pathway. In a nutshell, the universe of a table is the collection of the ids of the said table. It is used to know if some operations can be performed: if you want to update the cells of one table t1 by the values contained in a table t2, you have to make sure that each id of t2 is also in t1.

Universe

Simple, right?

Simple, but there is a bit more to this than meets the eye! Pathway aims at working on ever-changing data tables: Pathway' engine ensures that each update in the data is taken into account in realtime on each table. In this context, the set of ids is not something static nor stable: universes are used to ensure operations are performed on compatible data tables, despite potential instability.

For example, two data tables from two different data sources may have the same set of ids at some point but may diverge with upcoming updates. By default, tables from different data sources will not have the same universe.

Pathway tries to infer whether the tables have the same universe but depending on how the tables are processed it may fail. You may want to force two tables to have the same universe (or one table have a universe which is a subset of the universe of the second): Pathway might have failed the inference or because you know that the two tables actually derive from the same data source: you can specify it manually with t1.unsafe_promise_universe_is_subset_of(t2).

Transformers

Simple Table operations are:

  • Table input/output through connectors (CSV, Kafka, Debezium).
  • Table transformations:
    • in the simplest case, you can independently map each row of the table to a new value,
    • in the most complex case, you can apply a transformer to several tables to produce several tables (See class_transformer).
  • Selection:
    • the basic version just picks a subset of columns,
    • more powerful versions also allow joins on ids and computed columns.
  • Filtering.
  • Joins:
    • fast and efficient between tables that share the same id set, Pathway offers special syntax for this case,
    • joins on arbitrary columns, which reset the ids to the ones derived from the joining columns.
  • Group-bys and reduces.

Transformers may be combined into other transformers, as Python functions.

How are these operations meant to work in realtime with streaming data? Pathway's Transformer syntax serves to schedule operations. These operations will be later executed by the runtime engine to process actual data.

Because the outcome of operations can depend on the data being processed, the operations as such need to be defined in a data-independent way.

Supported methods are given in our API docs.

Manipulating table rows

Pathway provides two ways of defining custom logic at the level of data rows.

  1. Using the apply function to transform a table. This applies the logic described in the provided lambda function to all rows of a table, acting essentially as a map on rows.
  2. Using a transformer class, which describes how computed fields in new columns of a table are defined based on input fields. The logic is expressed in a declarative manner, which allows for recursion or computations spanning multiple fields of different rows or even different tables.

Transformer classes

Transformer classes provide a way to achieve row-centric operations in Pathway where use of apply is not sufficient or not convenient. Transformer classes provide the easiest use of advanced computation, involving pointers between fields of tables.

Transformer classes are used for easy implementation of data-structure querying operations, defining APIs in Data Products, and on-demand computations. Take a look at the transformer class tutorial to find out more.

Pathway' transformer syntax allows you to express pipelines of transformations on entire (and ever-changing) data tables. In Pathway, transformers behave like functions, whose arguments are Pathway Table.

If you have used Spark SQL or Kafka Streams in the past, the syntax should feel familiar.

Computation graph

Transformer and transformer classes are used to design a processing pipeline for the data. In Pathway, the processing pipeline is modeled using a graph. This graph, called the computation graph, models the different transformation steps performed on the data. Each node is either a table, an operation performed on one or more tables (e.g. a transformer), or a connector. As an example, let's consider this graph:

Universe

In this example graph, two data sources are read using connectors, building two tables T1 and T2. Those tables are modified using two operations t1 and t2, resulting in two new tables T1' and T2'. The operations t1 and t2 can be any operation manipulating a single table, a select with a apply for instance. Both tables are then combined by an operation requiring two tables, such as a join, resulting in a table T3. Finally, output connectors are used to output both T2' and T3.

This computation graph is the core of Pathway. The user creates its pipeline which is translated into a computation graph by Pathway. The graph is built by the calls to Pathway operators but at that point no computations are done: there is simply no data. The data is ingested by a specific command pw.run() and, as a streaming system, Pathway never terminates afterwards. The data is propagated from the input connectors to output connectors. During the whole run, the computation graph is maintained with the latest version of the data in order to enable quick updates: instead of ingesting all the data from scratch in the graph, the relevant parts are updated to take into account the new updates. This computation graph allows the user to focus on the intended behavior of the processing pipeline, as if the data was static, and Pathway handles the updates on its own using the computation graph.

Input/output connectors

Pathway comes with connectors which connect to streaming data sources at input, as well as connectors which send notifications of changes to Pathway outputs. When the computation is run, input connectors listen to input sources for incoming updates (see our article on connectors for more information on this). Similarly, output connectors are writing the output data whenever updates are received. See how to set up your first realtime app and some of our tutorials for an illustration of connector use.

Olivier Ruas

Algorithm and Data Processing Magician