
Pathway: a survival guide

A must-read for first-timers and veterans alike, this guide gathers the most commonly used basic elements of Pathway.

While the Pathway programming framework comes with advanced functionalities such as classifiers or fuzzy-joins, it is essential to master the basic operations at the core of the framework. As part of this survival guide, we are going to walk through the following topics: selecting and indexing, working with multiple tables (union, concatenation, join), updating, and operations.

If you want more information you can see our complete API docs or some of our tutorials.

Prerequisite

First, import Pathway and create a few tables to work with:

import pathway as pw

t_name = pw.debug.table_from_markdown(
    """
    | name
 1  | Alice
 2  | Bob
 3  | Carole
 """
)
t_age = pw.debug.table_from_markdown(
    """
    | age
 1  | 25
 2  | 32
 3  | 28
 """
)
t_name_extra = pw.debug.table_from_markdown(
    """
    | name  | age
 4  | David | 25
 """
)

We can display a snapshot of our table (for debugging purposes) using pw.debug.compute_and_print():

pw.debug.compute_and_print(t_name)
            | name
^2TMTFGY... | Alice
^YHZBTNY... | Bob
^SERVYWW... | Carole

In the rest of this guide we omit this call for clarity, but keep in mind that it is required to print the actual data at a given time.

Selecting and indexing

  • Select: we can use select to select a particular column and we can use the dot notation to specify the name of the column.
t_name_extra.select(t_name_extra.name)
            | name
^8GR6BSX... | David
  • Filtering: we can use filter to keep rows following a given property.
t_age.filter(t_age.age>30)
            | age
^YHZBTNY... | 32
  • Reindexing: you can change the ids (accessible by table.id) by using .with_id_from(). We need a table with new ids:
t_new_ids = pw.debug.table_from_markdown(
    """
    | new_id_source
 1  | 4
 2  | 5
 3  | 6
 """
)
t_name.unsafe_promise_universe_is_subset_of(t_new_ids).with_id_from(t_new_ids.new_id_source)
            | name
^8GR6BSX... | Alice
^76QPWK3... | Bob
^C4S6S48... | Carole

Here we need to use unsafe_promise_universe_is_subset_of; you can find the explanation in our article about Pathway's concepts. XXX: with_id_from() works the same, but take the ids of as new ids, as opposed to a dedicated column as in our previous example.
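
To make select, filter, and reindexing concrete, here is a plain-Python sketch of their semantics. It is not Pathway code: a table is modeled as a dict mapping an id to a row, and the helpers `select`, `filter_rows`, and `with_id_from` below are hypothetical stand-ins written for this illustration. Pathway shows ids as opaque keys (the `^...` values in the outputs), whereas the sketch uses the source values directly as ids.

```python
# Plain-Python model (not Pathway): a table is a dict {id: row},
# and a row is a dict {column: value}.
t_age = {1: {"age": 25}, 2: {"age": 32}, 3: {"age": 28}}
t_name = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Carole"}}
t_new_ids = {1: {"new_id_source": 4}, 2: {"new_id_source": 5}, 3: {"new_id_source": 6}}


def select(table, *columns):
    """Keep only the listed columns; ids are unchanged."""
    return {i: {c: row[c] for c in columns} for i, row in table.items()}


def filter_rows(table, predicate):
    """Keep the rows for which predicate(row) is true; ids are unchanged."""
    return {i: row for i, row in table.items() if predicate(row)}


def with_id_from(table, id_table, column):
    """Re-key each row with the value stored under the same id in id_table."""
    return {id_table[i][column]: row for i, row in table.items()}


names_only = select(t_name, "name")
over_30 = filter_rows(t_age, lambda row: row["age"] > 30)
reindexed = with_id_from(t_name, t_new_ids, "new_id_source")
print(over_30)    # {2: {'age': 32}}
print(reindexed)  # {4: {'name': 'Alice'}, 5: {'name': 'Bob'}, 6: {'name': 'Carole'}}
```

In this model, the lookup `id_table[i]` only succeeds when every id of `table` also exists in `id_table`, which is the kind of property that unsafe_promise_universe_is_subset_of asserts.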

  • ix: uses a column's values as indexes. As an example, if we have a table containing indexes pointing to another table, we can use ix (here through ix_ref) to obtain the corresponding rows:
t_selected_ids = pw.debug.table_from_markdown(
    """
      | selected_id
 100  | 1
 200  | 3
 """
)
t_selected_ids.select(selected=t_name.ix_ref(t_selected_ids.selected_id).name)
            | selected
^M1T2QKJ... | Alice
^9WGHV46... | Carole
  • Group-by: we can use groupby to aggregate data sharing a common property and then use a reducer to compute an aggregated value.
t_spending = pw.debug.table_from_markdown(
    """
    | name  | amount
 1  | Bob   | 100
 2  | Alice | 50
 3  | Alice | 125
 4  | Bob   | 200
 """
)
t_spending.groupby(t_spending.name).reduce(t_spending.name, sum=pw.reducers.sum(t_spending.amount))
            | name  | sum
^TSP7EFT... | Alice | 175
^4PVZ777... | Bob   | 300

You can group by multiple columns at once (e.g. .groupby(t.colA, t.colB)). The list of all the available reducers can be found here (available soon).
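
As an illustration of what grouping by one or several columns computes, here is a plain-Python sketch. It is not Pathway code: `groupby_sum` is a hypothetical helper, and the `shop` column is an assumed addition to the spending table above, introduced only so that a two-column grouping has something to distinguish.

```python
from collections import defaultdict

# The guide's spending table, plus an assumed "shop" column (hypothetical data).
rows = [
    {"name": "Bob", "shop": "A", "amount": 100},
    {"name": "Alice", "shop": "A", "amount": 50},
    {"name": "Alice", "shop": "A", "amount": 125},
    {"name": "Bob", "shop": "B", "amount": 200},
]


def groupby_sum(rows, keys, value):
    """Group rows by the tuple of key columns and sum the value column."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return dict(totals)


by_name = groupby_sum(rows, ["name"], "amount")
by_name_shop = groupby_sum(rows, ["name", "shop"], "amount")
print(by_name)       # {('Bob',): 300, ('Alice',): 175}
print(by_name_shop)  # {('Bob', 'A'): 100, ('Alice', 'A'): 175, ('Bob', 'B'): 200}
```

Grouping by more columns yields finer groups: Bob's two purchases fall into one group when grouping by name, and into two groups when grouping by name and shop.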

Working with multiple tables: union, concatenation, join

  • Union: we can use the operator + or += to compute the union of two tables sharing the same ids.
t_age = t_age.unsafe_promise_same_universe_as(t_name)
t_union = t_name + t_age
            | name   | age
^2TMTFGY... | Alice  | 25
^YHZBTNY... | Bob    | 32
^SERVYWW... | Carole | 28
  • Concatenation: we can use Table.concat(t1,t2) to concatenate two tables, but they need to have the same columns (their ids can differ, as in the example below).
pw.Table.concat(t_union,t_name_extra)
            | name   | age
^531BJZ8... | Alice  | 25
^9SVRC47... | Bob    | 32
^R5XMQ21... | Carole | 28
^C4VQQCA... | David  | 25

As you can see, Pathway may reindex the obtained tables.

Info for Databricks Delta users: Concatenation is highly similar to the SQL MERGE INTO operation.
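
The difference between union and concatenation can be sketched in plain Python. This is not Pathway code: `union` and `concat` are hypothetical helpers over dicts mapping an id to a row. The sketch assumes that concatenation expects matching columns, as in the example above where t_union and t_name_extra both have name and age, and it hands out fresh ids to mirror the reindexing visible in the output.

```python
# Plain-Python model (not Pathway): a table is a dict {id: row}.
t_name = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Carole"}}
t_age = {1: {"age": 25}, 2: {"age": 32}, 3: {"age": 28}}
t_name_extra = {4: {"name": "David", "age": 25}}


def union(left, right):
    """Model of `+`: both tables share the same ids, columns are merged per row."""
    if left.keys() != right.keys():
        raise ValueError("union needs both tables to have the same ids")
    return {i: {**left[i], **right[i]} for i in left}


def concat(top, bottom):
    """Model of concatenation: both tables share the same columns, rows are stacked."""
    rows = list(top.values()) + list(bottom.values())
    columns = set(rows[0]) if rows else set()
    if any(set(row) != columns for row in rows):
        raise ValueError("concat needs both tables to have the same columns")
    # Assumption: fresh ids are assigned, as the reindexed output suggests.
    return dict(enumerate(rows))


t_union = union(t_name, t_age)
t_all = concat(t_union, t_name_extra)
print(t_union[1])  # {'name': 'Alice', 'age': 25}
print(len(t_all))  # 4
```

Union widens a table (more columns, same rows), while concatenation lengthens it (same columns, more rows).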

  • Join: we can do all usual types of joins in Pathway (inner, outer, left, right). The example below presents an inner join:
t_age.join(t_name, t_age.id==t_name.id).select(t_age.age, t_name.name)
            | age | name
^VJ3K9DF... | 25  | Alice
^R0GE4WM... | 28  | Carole
^V1RPZW8... | 32  | Bob

Note that in the equality t_age.id==t_name.id, the left part must be a column of the table on which the join is called, namely t_age in our example. Writing t_name.id==t_age.id would throw an error.
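
The four join types differ only in which unmatched rows they keep. The plain-Python sketch below models a join on ids; it is not Pathway code, `join_on_id` is a hypothetical helper, and the ids of the name table are changed from the guide's example (an assumption made so that some rows have no match).

```python
# Plain-Python model (not Pathway): a table is a dict {id: row}.
t_age = {1: {"age": 25}, 2: {"age": 32}, 3: {"age": 28}}
# Hypothetical variant of t_name: id 2 is missing and id 4 is extra.
t_name = {1: {"name": "Alice"}, 3: {"name": "Carole"}, 4: {"name": "David"}}


def join_on_id(left, right, how="inner"):
    """Pair rows sharing an id; `how` decides which unmatched rows are kept."""
    pairs = []
    for i, left_row in left.items():
        if i in right:
            pairs.append((left_row, right[i]))
        elif how in ("left", "outer"):
            pairs.append((left_row, None))  # left row without a match
    if how in ("right", "outer"):
        for i, right_row in right.items():
            if i not in left:
                pairs.append((None, right_row))  # right row without a match
    return pairs


inner = join_on_id(t_age, t_name, "inner")
left = join_on_id(t_age, t_name, "left")
right = join_on_id(t_age, t_name, "right")
outer = join_on_id(t_age, t_name, "outer")
print(inner)  # [({'age': 25}, {'name': 'Alice'}), ({'age': 28}, {'name': 'Carole'})]
print(len(left), len(right), len(outer))  # 3 3 4
```

The first argument plays the role of the table on which the join is called, which is why the two sides of the join are not interchangeable.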

Updating

  • Renaming with select:
t_name.select(surname=t_name.name)
            | surname
^2TMTFGY... | Alice
^YHZBTNY... | Bob
^SERVYWW... | Carole
  • Renaming with rename_columns:
t_name.rename_columns(surname=t_name.name)
            | surname
^2TMTFGY... | Alice
^YHZBTNY... | Bob
^SERVYWW... | Carole
  • Updating cells: you can update the values of cells using update_cells, which can also be done using the binary operator <<. The ids and column names should be the same.
t_updated_names = pw.debug.table_from_markdown(
    """
    | name
 1  | Alicia
 2  | Bobby
 3  | Caro
 """
)
t_updated_names = t_updated_names.unsafe_promise_same_universe_as(t_name)
t_name.update_cells(t_updated_names)
            | name
^2TMTFGY... | Alicia
^YHZBTNY... | Bobby
^SERVYWW... | Caro
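
The cell update above can be modeled in plain Python as an override of values, row by row. This is a sketch rather than Pathway code: `update_cells` here is a hypothetical dict-based helper, and it assumes that rows missing from the update table keep their original values, which the example above does not exercise since it updates every row.

```python
# Plain-Python model (not Pathway): a table is a dict {id: row}.
t_name = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Carole"}}
# Hypothetical update table touching only two of the three rows.
t_updated_names = {1: {"name": "Alicia"}, 3: {"name": "Caro"}}


def update_cells(table, updates):
    """Return a copy of table where the values found in updates take precedence."""
    unknown_ids = updates.keys() - table.keys()
    if unknown_ids:
        raise KeyError(f"ids not present in the updated table: {sorted(unknown_ids)}")
    result = {}
    for i, row in table.items():
        new_values = updates.get(i, {})
        unknown_columns = new_values.keys() - row.keys()
        if unknown_columns:
            raise KeyError(f"columns not present in the updated table: {sorted(unknown_columns)}")
        # Values from the update win; untouched cells are kept (assumption).
        result[i] = {**row, **new_values}
    return result


updated = update_cells(t_name, t_updated_names)
print(updated)  # {1: {'name': 'Alicia'}, 2: {'name': 'Bob'}, 3: {'name': 'Caro'}}
```

The two checks mirror the requirement stated above that ids and column names match between the two tables.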

Operations

  • Row-centered operations with apply: you can apply a function to each value of a column (or more) by using apply in a select.
t_age.select(thirties=pw.apply(lambda x: x>30, t_age.age))
            | thirties
^2TMTFGY... | False
^SERVYWW... | False
^YHZBTNY... | True

Operations on multiple values of a single row can be easily done this way:

t_multiples_values = pw.debug.table_from_markdown(
    """
    | valA    | valB
 1  | 1       | 10
 2  | 100     | 1000
 """
)
t_multiples_values.select(sum=pw.apply(lambda x,y: x+y, t_multiples_values.valA, t_multiples_values.valB))
            | sum
^2TMTFGY... | 11
^YHZBTNY... | 1100
  • Other operations with transformer classes: Pathway enables complex computations on data streams by using transformer classes. They are a bit advanced for this survival guide, but you can find all the information about transformer classes in our tutorial.
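
To recap the row-centered operations of this section, the behavior of apply can be modeled in plain Python: the function is called once per row, with one argument taken from each column passed in. This is a sketch and not Pathway code; `apply` below is a hypothetical helper working on lists that stand in for columns.

```python
def apply(func, *columns):
    """Call func once per row, taking one argument from each column."""
    return [func(*values) for values in zip(*columns)]


# Columns from the examples above, written as plain lists in row order.
ages = [25, 32, 28]
val_a = [1, 100]
val_b = [10, 1000]

thirties = apply(lambda x: x > 30, ages)
sums = apply(lambda x, y: x + y, val_a, val_b)
print(thirties)  # [False, True, False]
print(sums)      # [11, 1100]
```

The function never sees more than one row at a time, which is what makes these operations row-centered.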

Olivier Ruas

Algorithm and Data Processing Magician