A quick introduction to Pathway's transformer classes.
Pathway's transformer syntax allows you to express pipelines of transformations on entire (and ever-changing) data tables. In Pathway, transformers behave like functions, whose arguments are Pathway Tables. If you have used Spark SQL or Kafka Streams in the past, the syntax should feel familiar.
In addition to this, Pathway also natively supports transformers defined on data rows. This is achieved through an object-oriented (ORM) view of rows in the data. These are known as Transformer Classes.
Transformer Classes are used for easy implementation of data-structure querying operations, defining APIs in Data Products, and on-demand computations.
Transformer classes provide a way to achieve row-centric operations in Pathway where the use of apply is not sufficient or not convenient.
Using transformer classes is the easiest way to do advanced computations involving pointers between fields of tables.
To create a transformer class, we create a class annotated with @pw.transformer.
In that class, we can declare other classes: each inner class defines one input table and one output table.
First, we can access and manipulate the values of the input table by declaring a field for each existing column of the table: val = pw.input_attribute(). Note that the variable val has to bear the name of the targeted column of the input table.
We can then define an output field by using the annotation @pw.output_attribute before a function: the name of the function becomes the column name in the output table, and its return value is the value stored in that column.
As an example, let's consider the following transformer doing a map: the transformer takes as input a table which has a column named col_name, applies a given function f to each row, and stores the output values in a new column named col_name_output:
```python
import pathway as pw

@pw.transformer
class my_transformer:
    class my_table(pw.ClassArg):
        col_name = pw.input_attribute()

        @pw.output_attribute
        def col_name_output(self):
            return f(self.col_name)
```
In this transformer, the class my_table takes one input table whose columns will be matched to the fields defined using pw.input_attribute(), and outputs a table whose columns are defined by the functions annotated with @pw.output_attribute.
To test our transformer, we consider this toy table t:
```
   col_name
0  x
1  y
2  z
```
We apply the transformer to the table t, and we extract the resulting table stored in my_table:

```python
t_map = my_transformer(my_table=t).my_table
```
We obtain the following table:
```
   col_name_output
0  f(x)
1  f(y)
2  f(z)
```
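To make the semantics concrete independently of Pathway, here is a plain-Python sketch (not Pathway code; f here is a hypothetical stand-in function) of what this map transformer computes row by row:

```python
# Plain-Python analogy (NOT Pathway API) of what my_transformer computes.
# f is a hypothetical per-row function, used here only for illustration.
def f(x):
    return "f(" + x + ")"

# The toy input table: one dict per row, one column named col_name.
rows = [{"col_name": "x"}, {"col_name": "y"}, {"col_name": "z"}]

# One output row per input row, with a single column col_name_output
# holding f applied to that row's col_name value.
t_map = [{"col_name_output": f(r["col_name"])} for r in rows]
print(t_map)
```

This is exactly the per-row view that the transformer class expresses: each output attribute is computed from the attributes of the current row.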
Now that we have seen the basics of transformer classes, they may look like a rather complicated way of doing a map, which can be done in a single line:

```python
t_map = t.select(col_name_output=pw.apply(f, t.col_name))
```
So a natural question is: why use transformer classes? It is true that for single-row operations, apply is the way to go.
Transformer classes are made for more advanced operations, in particular operations involving different tables. While apply is limited to row-centric operations, transformer classes are able to perform look-ups and recursive operations on the rows.
Furthermore, inside a transformer class, we can easily access any table referenced by a class of the same transformer through self.transformer (e.g. self.transformer.that_class).
For instance, if you need to add the values of two different tables, things get more complicated with only standard operations.
It is possible to do a join and then use apply, but that would copy the values into a new table before doing the sum.
This does not scale well on large datasets.
On the other hand, using a transformer class allows you to do it without having to create a new table.
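As a sketch of the idea, here is a plain-Python analogy (again, not Pathway code; the table and column names are hypothetical) of summing the values of two tables whose rows are keyed by shared ids, following pointers instead of materializing a joined table:

```python
# Plain-Python analogy (NOT Pathway API) of a two-table transformer.
# Both tables are keyed by the same row ids (the "pointers"); left and
# right are hypothetical names chosen for this illustration.
left  = {1: {"val": 10}, 2: {"val": 20}}
right = {1: {"val": 1},  2: {"val": 2}}

# For each row of `right`, follow its id into `left` and add the values,
# without first copying both tables into a joined intermediate table.
total = {i: {"sum": left[i]["val"] + row["val"]} for i, row in right.items()}
print(total)
```

In a transformer class, this per-row look-up into the other table is what the cross-class access described above expresses.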
You can see how easy it is to use transformer classes to combine several tables at once.
Transformer classes are a key component of the Pathway programming framework.
You can also take a look at our connectors to see how to connect different data sources to Pathway!