A quick introduction to Pathway's transformer classes.
Pathway's transformer syntax allows you to express pipelines of transformations on entire (and ever-changing) data tables. In Pathway, transformers behave like functions whose arguments are Pathway Tables. If you have used Spark SQL or Kafka Streams in the past, the syntax should feel familiar.
In addition, Pathway natively supports transformers defined on data rows. This is achieved through an object-oriented (ORM) view of rows in data. These are known as Transformer Classes.
Transformer Classes are used for easy implementation of data-structure querying operations, defining APIs in Data Products, and on-demand computations.
Transformer classes provide a way to achieve row-centric operations in Pathway where the use of apply is not sufficient or not convenient.
Using transformer classes is the easiest way to do advanced computations involving pointers between fields of tables.
To create a transformer class, create a class annotated by @pw.transformer.
In that class, you can declare other classes: each class defines one input table and one output table.
First, you can access and use the values of the input table by declaring the corresponding field: val = pw.input_attribute().
Note that the variable val must have the same name as the targeted column of the input table.
You can then define an output field by placing the annotation @pw.output_attribute before a function: the name of the function becomes the column name in the output table, and the return value is the value stored in that column.
As an example, let's consider the following transformer doing a map: it takes a table with a column named col_name as input, applies a given function f to each row, and stores the output values in a new column named col_name_output:
import pathway as pw

@pw.transformer
class my_transformer:
    class my_table(pw.ClassArg):
        col_name = pw.input_attribute()

        @pw.output_attribute
        def col_name_output(self):
            return f(self.col_name)
In this transformer, the class my_table takes one input table, whose columns are matched to the parameters defined using pw.input_attribute(), and outputs a table whose columns are defined by the functions annotated by @pw.output_attribute.
To test our transformer, let's consider this toy table t:
  col_name
0 x
1 y
2 z
Let's apply the transformer to the table t, and extract the resulting table stored in my_table:
t_map = my_transformer(my_table=t).my_table
You obtain the following table:
  col_name_output
0 f(x)
1 f(y)
2 f(z)
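To make the semantics of this map concrete without depending on Pathway itself, here is a minimal plain-Python sketch. It is only an analogy: the dictionary-based table, the helper map_rows, and the choice of str.upper as f are assumptions made for illustration and are not part of the Pathway API.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a table: row id -> {column name: value}.
Table = Dict[int, Dict[str, str]]


def map_rows(table: Table, f: Callable[[str], str]) -> Table:
    """Build an output table with a single column, col_name_output, per input row."""
    return {
        row_id: {"col_name_output": f(row["col_name"])}
        for row_id, row in table.items()
    }


# Toy input mirroring the table t above.
t = {0: {"col_name": "x"}, 1: {"col_name": "y"}, 2: {"col_name": "z"}}

# f is an arbitrary example function; any one-argument callable would do.
t_map = map_rows(t, str.upper)
print(t_map)
```

As in the transformer, each output row keeps the id of the input row it was computed from, and the output column is named after the function that produces it.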
Now that you are familiar with transformer classes and their basic syntax, let's see how they can be useful. Using transformer classes for simple maps is needlessly complicated, since a map can be done in one line with Pathway:
t_map = t.select(col_name_output=pw.apply(f, t.col_name))
So one natural question you might ask yourself is 'why use transformer classes?'.
It is true that for single-row operations, apply is the way to go.
Transformer classes are made for more advanced operations, in particular operations involving different tables.
Whereas apply is limited to row-centric operations, transformer classes can perform look-ups and recursive operations on rows.
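The kind of recursive look-up meant here can be sketched in plain Python. This is an analogy only, not Pathway code: the next_pointer dictionary and the length_from function are hypothetical names, and the sketch assumes the pointers never form a cycle.

```python
from functools import lru_cache
from typing import Dict, Optional

# Hypothetical table: each row id points to the id of the next row, or None.
next_pointer: Dict[int, Optional[int]] = {1: 2, 2: 3, 3: None, 4: 1}


@lru_cache(maxsize=None)
def length_from(row_id: int) -> int:
    """Number of rows reachable from row_id by following the next pointers."""
    nxt = next_pointer[row_id]
    if nxt is None:
        return 1
    # The value of this row depends on the computed value of another row,
    # which a function applied to one row in isolation cannot express.
    return 1 + length_from(nxt)


lengths = {row_id: length_from(row_id) for row_id in next_pointer}
print(lengths)
```

The result for a row is defined in terms of the result for the row it points to, so the computation follows pointers across rows instead of staying within one row.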
Furthermore, inside the transformer class, you can easily access any table referenced by a class by doing self.transformer.my_table.
For instance, if you need to add the values of two different tables, things get more complicated with only standard operations.
It is possible to make a join and then use apply, but this would copy the values into a new table before doing the sum.
This does not scale well on large datasets.
On the other hand, using a transformer class allows you to do it without having to create a new table.
You can check out how easy it is to use transformer classes to combine several tables at once.
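The difference between the two approaches can be illustrated with a small plain-Python sketch. The tables table_a and table_b, their values, and the variable names are hypothetical, and the sketch says nothing about how Pathway executes either approach internally; it only shows where the intermediate table appears.

```python
from typing import Dict

# Hypothetical tables keyed by row id; names and values are illustrative only.
table_a: Dict[int, int] = {0: 10, 1: 20, 2: 30}
table_b: Dict[int, int] = {0: 1, 1: 2, 2: 3}

# Approach 1: join first, which copies both values into a new table,
# then apply the sum to each joined row.
joined = {k: (table_a[k], table_b[k]) for k in table_a.keys() & table_b.keys()}
sum_via_join = {k: a + b for k, (a, b) in joined.items()}

# Approach 2: look up the matching row of the other table directly,
# without building an intermediate joined table.
sum_via_lookup = {k: a + table_b[k] for k, a in table_a.items() if k in table_b}

print(sum_via_join == sum_via_lookup)
```

Both approaches produce the same sums; only the first one materializes the joined rows before computing them.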
While transformer classes allow you to work with different rows from different tables at once, this comes at a price.
Indeed, using transformer classes may have up to quadratic complexity in the number of dependencies.
Here, dependencies refer to the rows responsible for look-ups: all the rows you are accessing and using except for the one referred to by self.
As a rule of thumb, try to limit the number of row dependencies per row to no more than a dozen or so.
Transformer classes are not meant to access too many rows at once.
For complex operations involving many rows simultaneously, you may prefer to use a join to obtain a single row containing all the relevant values and then use the standard apply.
Transformer classes are a key component of the Pathway programming framework.
You can also take a look at our connectors to see how to connect different data sources to Pathway.