Introduction to transformer classes

A quick introduction to Pathway' transformer classes.

Pathway' transformer syntax allows you to express pipelines of transformations on entire (and ever-changing) data tables. In Pathway, transformers behave like functions, whose arguments are Pathway Tables. If you have used Spark SQL or Kafka Streams in the past, the syntax should feel familiar.

In addition to this, Pathway also natively supports transformers defined on data rows. This is achieved through an objected-oriented (ORM) view of rows in data. These are known as Transformer Classes.

Transformer Classes are used for easy implementation of data-structure querying operations, defining APIs in Data Products, and on-demand computations.

Transformer classes provide a way to achieve row-centric operations in Pathway where use of apply is not sufficient or not convenient. Using transformer classes is the easiest way do advanced computation, involving pointers between fields of tables.

Transformers 101: how to make a map

To create a transformer class is creating a class which is annotated by @pw.transformer. In that class, you can declare other classes: each class defines one input table and one output table.

First, you can access and use the values of the input table by declaring the field existing in the table: val = pw.input_attribute(). Note that the variable val has to be named with the name of the targeted column of the input table.

You can then define the output field by using the annotation @pw.output_attribute before a function: the name of the function will be the column name in the output column and the return value will be the value stored in that column.

As an example, let's consider the following transformer doing a map: the transformer takes a table which has a column named col_name as input and applies a given function f to each row and the output values are stored in a new column named col_name_output:

import pathway as pw

@pw.transformer
class my_transformer:
    class my_table(pw.ClassArg):
        col_name=pw.input_attribute()

        @pw.output_attribute
        def col_name_output(self):
            return f(self.col_name)

In this transformer, the class my_table takes one input table whose columns will be match to the parameters defined using pw.input_attribute() and will output a table whose columns are defined by functions annotated by @pw.output_attribute.

To test our transformer, let's consider this toy table t:

    col_name
0   x
1   y
2   z

Let's apply the transformer to the table t, and extract the resulting table stored in my_table:

t_map = my_transformer(my_table=t).my_table

You obtain the following table:

    col_name_output
0   f(x)
1   f(y)
2   f(z)

Why using transformers?

Now that you are familiar with transformer classes and their basic syntax, let's see how they can be useful. Using transformer classes to do simple maps is a bit complicated, a map can be done in one line with Pathway:

t_map = t.select(col_name_output=apply(f,t.col_name))

So one natural question you might ask yourself is 'why use transformer classes?'.

It is true that when doing single row operations, using apply is the way to go. Transformer classes are made for more advanced operations, in particular operations involving different tables. While using apply is limited to row-centric operations, transformer classes are able to perform look-ups and recursive operations on rows. Furthermore, inside the transformer class, you can easily access any table referenced by a class by doing self.transformer_name.table_name.

For instance, if you need to add the values of two different tables, things get more complicated with only standard operations. It is possible to make a join and then use apply, but it would result in copying the values in a new table before doing the sum. This does not scale well on large datasets. On the other hand, using a class transformer would allow you to do it without having to create a new table. You can check out how easy it is to use transformer classes to combine several tables at once.

Complexity

While transformer classes allow you to work with different rows from different tables at once, this comes with a price. Indeed, using transformer classes may have up to quadratic complexity in the number of dependencies. Here, dependencies refer to the rows responsible for look-ups: all the rows you are accessing and using except for the one referred by self. As a rule of thumb, try to limit the number of row dependencies per row to not more than a dozen or so.

Transformer classes are not meant to access too many rows at once. For complex operations involving many rows simultaneously, you may prefer to use a join to obtain a single row containing all the relevant values and then use the standard pw.apply.

Going further

Transformer classes are a key component of Pathway programming framework.

If you want to learn more about transformer classes, you can see our basic examples of transformer classes or our advanced example on how to make a tree using transformer classes.

You can also take a look at our connectors to see how to use different data sources to Pathway.