Run In Colab  View in Github

Playing with joins.

A brief explanation on how to perform joins with Pathway.

Join is one of the basic table operation provided in Pathway. A join operation combines columns from two different tables by associating rows with matching some given values both tables.

This guide presents several samples of code using the joins:

The examples demonstrate usual use-cases, explain the behavior of outer joins, and point out some peculiarities you may encounter while using Pathway. In particular, they show how to:

Prerequisites

Be sure to import Pathway.

import pathway as pw

Also, we need some tables to play with. For the sake of presentation, let us consider a following made up scenario: a group of four friends (table friends) goes to grab some breakfast.

friends = pw.debug.table_from_markdown(    """    | name  |budget  1 | Alice | 13  2 | Bob   | 10  3 | Carol | 11  4 | Dan   | 12 """)menu = pw.debug.table_from_markdown(    """     | dish                              | price  11 | pancakes_with_jam                 | 11  12 | pb_jam_sandwich                   | 9  13 | jam_pb_pancakes                   | 12  14 | scrambled_egg                     | 11 """)

However, some of them have allergies (table allergies), and cannot eat everything. Luckily, the restaurant has a list of allergens contained in each dish (table allergens_in_menu).

allergies = pw.debug.table_from_markdown(    """     | person| allergy  21 | Alice | milk  22 | Bob   | eggs  23 | Carol | peanuts  24 | Carol | milk """)allergens_in_menu = pw.debug.table_from_markdown(    """     | dish                              | allergen  31 | pancakes_with_jam                 | milk  32 | pancakes_with_jam                 | eggs  33 | pb_jam_sandwich                   | peanuts  34 | jam_pb_pancakes                   | eggs  35 | jam_pb_pancakes                   | peanuts  36 | jam_pb_pancakes                   | milk  37 | scrambled_egg                     | eggs """)

Simple inner join.

Syntax

Putting it simply, the syntax of join is:

table.join(other, *on)

where

  • table and other are tables to be joined,
  • *on is a list of conditions of form:
table.some_column == other.some_other_column

Remark: the order of tables in the condition matters. That is, a condition of the following form won't be accepted:

other.some_other_column == table.some_column

Examples

As a warm-up, let us see how to make a simple join-queries. Let's begin with simple query that find all pairs person-dish that can put someone in a hospital.

bad_match = allergies.join(    allergens_in_menu, allergies.allergy == allergens_in_menu.allergen).select(allergens_in_menu.dish, allergies.person)pw.debug.compute_and_print(bad_match)
            | dish              | person
^NXS1ZCC... | jam_pb_pancakes   | Alice
^TKSYTBX... | jam_pb_pancakes   | Bob
^C62RFES... | jam_pb_pancakes   | Carol
^BX7DMG3... | jam_pb_pancakes   | Carol
^B7RFVMK... | pancakes_with_jam | Alice
^51FP1WY... | pancakes_with_jam | Bob
^FHCFCY1... | pancakes_with_jam | Carol
^ZQ7DWQ8... | pb_jam_sandwich   | Carol
^0MRRXPF... | scrambled_egg     | Bob

As we can see, the result is a table of life-threatening pairs.

The select function works here similarly as select on a table. The difference is that here, we can use columns of both tables as arguments, e.g.

def make_food_constraint_note(name, dish):    return f"{name} can't eat {dish}."bad_match_note = allergies.join(    allergens_in_menu, allergies.allergy == allergens_in_menu.allergen).select(    note=pw.apply(make_food_constraint_note, allergies.person, allergens_in_menu.dish))pw.debug.compute_and_print(bad_match_note)
            | note
^NXS1ZCC... | Alice can't eat jam_pb_pancakes.
^B7RFVMK... | Alice can't eat pancakes_with_jam.
^TKSYTBX... | Bob can't eat jam_pb_pancakes.
^51FP1WY... | Bob can't eat pancakes_with_jam.
^0MRRXPF... | Bob can't eat scrambled_egg.
^C62RFES... | Carol can't eat jam_pb_pancakes.
^BX7DMG3... | Carol can't eat jam_pb_pancakes.
^FHCFCY1... | Carol can't eat pancakes_with_jam.
^ZQ7DWQ8... | Carol can't eat pb_jam_sandwich.

Remark: note that id is now some auto generated number, which is the usual behavior of join.

On self joins.

In order to perform a self joins (a join of table with itself), we need to create its copy.

same_allergies = (    allergies.join(        allergies_copy := allergies.copy(), allergies.allergy == allergies_copy.allergy    )    .select(        l_name=allergies.person,        r_name=allergies_copy.person,    )    .filter(pw.this.l_name != pw.this.r_name))pw.debug.compute_and_print(same_allergies)
            | l_name | r_name
^V26FYPD... | Alice  | Carol
^YTH8WNY... | Carol  | Alice

Outer joins

The difference between outer joins and joins is that the outer join adds to the result table also entries that didn't match:

pw.debug.compute_and_print()

left.left_join(right, conditions)

includes entries from the left table and

left.right_join(right, conditions)

includes entries from the right table.

Simple example

To demonstrate outer-joins, we can consider a question about forbidden breakfast configurations (i.e. pairs person-dish, such that person is allergic to some ingredients).

people_allergies = friends.left_join(    allergies, friends.name == allergies.person).select(friends.name, allergies.allergy)pw.debug.compute_and_print(people_allergies)forbidden_breakfast = people_allergies.left_join(    allergens_in_menu, people_allergies.allergy == allergens_in_menu.allergen).select(people_allergies.name, allergens_in_menu.dish)pw.debug.compute_and_print(forbidden_breakfast)
            | name  | allergy
^CEESHNP... | Alice | milk
^GKYNQB3... | Bob   | eggs
^MCCDMWQ... | Carol | milk
^28382DM... | Carol | peanuts
^MZ49F0F... | Dan   |
            | name  | dish
^Y4H9YJC... | Alice | jam_pb_pancakes
^MHQRNFF... | Alice | pancakes_with_jam
^4DQ4S80... | Bob   | jam_pb_pancakes
^XPGS5EQ... | Bob   | pancakes_with_jam
^63EGDX6... | Bob   | scrambled_egg
^YZ60CRJ... | Carol | jam_pb_pancakes
^K3K3E16... | Carol | jam_pb_pancakes
^M6FZ6K4... | Carol | pancakes_with_jam
^RN5X9SH... | Carol | pb_jam_sandwich
^2YAV0VF... | Dan   |

Now, as we can see, the table has duplicate rows. This is intended behavior, as a particular person can be allergic to more than one ingredient. In other words, when we add a reason column to our result table, we can see that each row was included in the table above for a different reason.

forbidden_breakfast = people_allergies.left_join(    allergens_in_menu, people_allergies.allergy == allergens_in_menu.allergen).select(    people_allergies.name, allergens_in_menu.dish, reason=allergens_in_menu.allergen)pw.debug.compute_and_print(forbidden_breakfast)
            | name  | dish              | reason
^Y4H9YJC... | Alice | jam_pb_pancakes   | milk
^MHQRNFF... | Alice | pancakes_with_jam | milk
^4DQ4S80... | Bob   | jam_pb_pancakes   | eggs
^XPGS5EQ... | Bob   | pancakes_with_jam | eggs
^63EGDX6... | Bob   | scrambled_egg     | eggs
^YZ60CRJ... | Carol | jam_pb_pancakes   | milk
^K3K3E16... | Carol | jam_pb_pancakes   | peanuts
^M6FZ6K4... | Carol | pancakes_with_jam | milk
^RN5X9SH... | Carol | pb_jam_sandwich   | peanuts
^2YAV0VF... | Dan   |                   |

Removing duplicates

If we really want to have a table without the 'reason' column and without duplicates, we can achieve that with extra groupby and reduce :

forbidden_breakfast = people_allergies.left_join(    allergens_in_menu, people_allergies.allergy == allergens_in_menu.allergen).select(people_allergies.name, allergens_in_menu.dish)forbidden_breakfast_no_duplicates = forbidden_breakfast.groupby(    forbidden_breakfast.name, forbidden_breakfast.dish).reduce(forbidden_breakfast.name, forbidden_breakfast.dish)pw.debug.compute_and_print(forbidden_breakfast_no_duplicates)
            | name  | dish
^M8E4DFK... | Alice | jam_pb_pancakes
^JMHBFKG... | Alice | pancakes_with_jam
^AN7YRTF... | Bob   | jam_pb_pancakes
^GZ0TEZ6... | Bob   | pancakes_with_jam
^W8AZSTG... | Bob   | scrambled_egg
^DBZV8YR... | Carol | jam_pb_pancakes
^N585E7A... | Carol | pancakes_with_jam
^9DJG3GD... | Carol | pb_jam_sandwich
^30JAJCE... | Dan   |

That simple application of groupby-reduce combination essentially selects unique set of pairs (name, dish) from our table. More on the reduce-groupby operations can be found in the reduce-groupby manual.

Computed and omitted columns.

Because left join operation returns rows from the left table, even if there is no match in the right column, some columns can't be computed.

The left join computes values in all columns that can be computed, and leaves all the remaining columns empty.

As an example, we can consider a modified variant of the bad_match_note table. The modification is that we want to include all the people, and additionally we want to display their budget.

To that end, it is enough that we join the friends table with the bad_match table, as together they include all necessary information.

basic_customer_info = friends.left_join(    bad_match, friends.name == bad_match.person).select(    friends.name,    friends.budget,    note=pw.apply(make_food_constraint_note, friends.name, bad_match.dish),)pw.debug.compute_and_print(basic_customer_info)
            | name  | budget | note
^B74N75K... | Alice | 13     | Alice can't eat jam_pb_pancakes.
^YZBNRSA... | Alice | 13     | Alice can't eat pancakes_with_jam.
^Z8QGSEZ... | Bob   | 10     | Bob can't eat jam_pb_pancakes.
^CHHK1Q9... | Bob   | 10     | Bob can't eat pancakes_with_jam.
^DHWEFAB... | Bob   | 10     | Bob can't eat scrambled_egg.
^2SGPETY... | Carol | 11     | Carol can't eat jam_pb_pancakes.
^ADRQJVA... | Carol | 11     | Carol can't eat jam_pb_pancakes.
^JCXPCDF... | Carol | 11     | Carol can't eat pancakes_with_jam.
^H2S7YN7... | Carol | 11     | Carol can't eat pb_jam_sandwich.
^MZ49F0F... | Dan   | 12     |

To once again demonstrate the behavior, let us change friends.name to bad_match.person in the select part of our code.

basic_customer_info = friends.left_join(    bad_match, friends.name == bad_match.person).select(    bad_match.person,    friends.budget,    note=pw.apply(make_food_constraint_note, friends.name, bad_match.dish),)pw.debug.compute_and_print(basic_customer_info)
            | person | budget | note
^MZ49F0F... |        | 12     |
^B74N75K... | Alice  | 13     | Alice can't eat jam_pb_pancakes.
^YZBNRSA... | Alice  | 13     | Alice can't eat pancakes_with_jam.
^Z8QGSEZ... | Bob    | 10     | Bob can't eat jam_pb_pancakes.
^CHHK1Q9... | Bob    | 10     | Bob can't eat pancakes_with_jam.
^DHWEFAB... | Bob    | 10     | Bob can't eat scrambled_egg.
^2SGPETY... | Carol  | 11     | Carol can't eat jam_pb_pancakes.
^ADRQJVA... | Carol  | 11     | Carol can't eat jam_pb_pancakes.
^JCXPCDF... | Carol  | 11     | Carol can't eat pancakes_with_jam.
^H2S7YN7... | Carol  | 11     | Carol can't eat pb_jam_sandwich.

Now, the cell that earlier included Dan is empty, as value of this column is defined by column in bad_match table. Since there is no match, person was not filled in. Budget is still computable though, hence it is still present.

On right joins

In the examples above, we only relied on left_join. The right_join operation is quite similar in its behavior. Namely,

people_allergies = friends.left_join(    allergies, friends.name == allergies.person).select(friends.name, allergies.allergy)pw.debug.compute_and_print(people_allergies)
            | name  | allergy
^CEESHNP... | Alice | milk
^GKYNQB3... | Bob   | eggs
^MCCDMWQ... | Carol | milk
^28382DM... | Carol | peanuts
^MZ49F0F... | Dan   |

is almost (except for auto-generated IDs) equivalent to

people_allergies = allergies.right_join(    friends, allergies.person == friends.name).select(friends.name, allergies.allergy)pw.debug.compute_and_print(people_allergies)
            | name  | allergy
^AHH9C80... | Alice | milk
^2BR8WRG... | Bob   | eggs
^Y5CAVAS... | Carol | milk
^WQENVG9... | Carol | peanuts
^KARHAYN... | Dan   |

When we join two tables, the only difference is in syntax - since allergies is the table on which we call join_right, it must be first argument in the join condition, i.e. it is

allergies.person == friends.name

as opposed to

friends.name == allergies.person

used in the left_join.

On full outer joins.

The outer_join operation is a full outer join, which means that

left.outer_join(right, *on)

not only show the pairs of rows form left and right that meet the condition in *on, but also rows that didn't get matched with any other row, from both left and right table. To demonstrate this operation, let us introduce another group of friends and find out, for each person in a group, whether the other group has any people with the same name.

other_group = pw.debug.table_from_markdown(    """    | name  |budget  5 | Bob   | 12  6 | Carol | 14  7 | Eve   | 12 """)
pw.debug.compute_and_print(    friends.outer_join(other_group, friends.name == other_group.name).select(        l_name=friends.name,        l_id=friends.id,        r_name=other_group.name,        r_id=other_group.id,    ))
            | l_name | l_id        | r_name | r_id
^QRSK6DY... |        |             | Eve    | ^19D0FQ9...
^EZ884R5... | Alice  | ^2TMTFGY... |        |
^1Y14E07... | Bob    | ^YHZBTNY... | Bob    | ^76QPWK3...
^7RXQ0Z1... | Carol  | ^SERVYWW... | Carol  | ^C4S6S48...
^MZ49F0F... | Dan    | ^8GR6BSX... |        |

ID inheritance in join

Full (yet still informal) syntax of join is:

table.join(other, *on, id = None)

where

  • table and other are tables to be joined,
  • *on is a list of conditions of form:
table.some_column == other.some_other_column
  • optional id can be set to either table.id or other.id

Whenever id argument is not none, join will try to use the column passed in id argument as new id in the result of join. This operation will succeed only when there is a guarantee that resulting joined table has no multiple rows with the same id-to-inherit.

Below you can find three examples - one successful and two failed id inheritance. First, let us see what are the id-s of the original tables.

pw.debug.compute_and_print(friends)pw.debug.compute_and_print(allergies)
            | name  | budget
^2TMTFGY... | Alice | 13
^YHZBTNY... | Bob   | 10
^SERVYWW... | Carol | 11
^8GR6BSX... | Dan   | 12
            | person | allergy
^SP3EQAR... | Alice  | milk
^H9SEXTP... | Bob    | eggs
^CWQ381T... | Carol  | milk
^9PXHKBD... | Carol  | peanuts

Successful id inheritance

Let us try a join that inherits id-s from table allergies.

pw.debug.compute_and_print(    allergies.join(friends, allergies.person == friends.name, id=allergies.id).select(        friends.name, allergies.allergy, friends.budget    ))
            | name  | allergy | budget
^SP3EQAR... | Alice | milk    | 13
^H9SEXTP... | Bob   | eggs    | 10
^CWQ381T... | Carol | milk    | 11
^9PXHKBD... | Carol | peanuts | 11

As we can see, the id column is the same as in the friends table, which was not the case when the id parameter is not set.

pw.debug.compute_and_print(    allergies.join(friends, allergies.person == friends.name, id=allergies.id).select(        friends.name, allergies.allergy, friends.budget    ))
            | name  | allergy | budget
^SP3EQAR... | Alice | milk    | 13
^H9SEXTP... | Bob   | eggs    | 10
^CWQ381T... | Carol | milk    | 11
^9PXHKBD... | Carol | peanuts | 11

Failed id inheritance: duplicate id-s

The first possible problem with inheriting id is that one row of the source table could be matched with several entries of the other table. The code below will cause such problem, and will raise a KeyError.

pw.debug.compute_and_print(    allergies.join(friends, allergies.person == friends.name, id=friends.id).select(        friends.name, allergies.allergy    ))[stacktrace...]KeyError: 'duplicate key: ^SERVYWW6KDGEQ2WVZ3ZZB86VSR'

Failed id inheritance: empty id-s

Finally, if we consider outer joins between those two tables, we may encounter a situation in which we need to assign id that is empty in the join result:

pw.debug.compute_and_print(    allergies.right_join(friends, allergies.person == friends.name).select(        friends.name, allergies.allergy, allergies_id=allergies.id    ))
            | name  | allergy | allergies_id
^AHH9C80... | Alice | milk    | ^SP3EQAR...
^2BR8WRG... | Bob   | eggs    | ^H9SEXTP...
^Y5CAVAS... | Carol | milk    | ^CWQ381T...
^WQENVG9... | Carol | peanuts | ^9PXHKBD...
^KARHAYN... | Dan   |         |

As we can see, the allergies_id field is not set for Dan's entry. If we try to use allergies.id as the id parameter, we will encounter TypeError.

pw.debug.compute_and_print(    allergies.right_join(        friends, allergies.person == friends.name, id=allergies.id    ).select(friends.name, allergies.allergy, allergies_id=allergies.id))[stacktrace...]TypeError: type mismatch: expected a pointer, got None

Joins on a foreign key

In Pathway id column is auto-generated, and as such joining over foreign key kept in some other table requires extra care. Let's assume that we have another table likes that indicates that a friend (row in friends) likes some particular dish (row in menu).

likes = pw.debug.table_from_markdown(    """    | f_id  | m_id100 | 1     | 11101 | 1     | 13102 | 2     | 12103 | 2     | 13104 | 3     | 11105 | 3     | 14106 | 3     | 13107 | 4     | 12108 | 4     | 14""")

Without further specification, Pathway treats columns f_id and m_id as numbers:

pw.debug.compute_and_print(likes)
            | f_id | m_id
^M1T2QKJ... | 1    | 11
^4RH155M... | 1    | 13
^YQ30FHF... | 2    | 12
^AMFB2M6... | 2    | 13
^7J00RXZ... | 3    | 11
^NZSWGHA... | 3    | 13
^FCVQG87... | 3    | 14
^V771XE5... | 4    | 12
^0HAWE8Q... | 4    | 14

while the id of tables friends and menu was converted to Pointer.

pw.debug.compute_and_print(friends)pw.debug.compute_and_print(menu)
            | name  | budget
^2TMTFGY... | Alice | 13
^YHZBTNY... | Bob   | 10
^SERVYWW... | Carol | 11
^8GR6BSX... | Dan   | 12
            | dish              | price
^VYA37VV... | jam_pb_pancakes   | 12
^HMETZT8... | pancakes_with_jam | 11
^4B2REY1... | pb_jam_sandwich   | 9
^SE0WWPZ... | scrambled_egg     | 11

To handle joins using those columns, we can use pointer_from function

likes += likes.select(    f_id_ptr=friends.pointer_from(likes.f_id),    m_id_ptr=menu.pointer_from(likes.m_id),)pw.debug.compute_and_print(likes)
            | f_id | m_id | f_id_ptr    | m_id_ptr
^M1T2QKJ... | 1    | 11   | ^2TMTFGY... | ^HMETZT8...
^4RH155M... | 1    | 13   | ^2TMTFGY... | ^VYA37VV...
^YQ30FHF... | 2    | 12   | ^YHZBTNY... | ^4B2REY1...
^AMFB2M6... | 2    | 13   | ^YHZBTNY... | ^VYA37VV...
^7J00RXZ... | 3    | 11   | ^SERVYWW... | ^HMETZT8...
^NZSWGHA... | 3    | 13   | ^SERVYWW... | ^VYA37VV...
^FCVQG87... | 3    | 14   | ^SERVYWW... | ^SE0WWPZ...
^V771XE5... | 4    | 12   | ^8GR6BSX... | ^4B2REY1...
^0HAWE8Q... | 4    | 14   | ^8GR6BSX... | ^SE0WWPZ...

An example of code joining friends with menu using likes is presented the next section on chaining joins.

Chaining joins:

Pathway provides two ways of chaining joins. The first relies on usage of pw.this, the second allows for slightly more compact code. Below, we focus on chaining joins using pw.this.

Simple join chaining

Below, we show how to chain joins using pw.this. To show how it can be used, we revisit the first example of left_join, in which we computed a left_join on a table that was obtained by another left_join. Instead of storing the result of the first left_join in people_allergies, we can use the following:

pw.debug.compute_and_print(    friends.left_join(allergies, friends.name == allergies.person)    .select(friends.name, allergies.allergy, friends.budget)    .left_join(allergens_in_menu, pw.this.allergy == allergens_in_menu.allergen)    .select(pw.this.name, allergens_in_menu.dish))
            | name  | dish
^Y4H9YJC... | Alice | jam_pb_pancakes
^MHQRNFF... | Alice | pancakes_with_jam
^4DQ4S80... | Bob   | jam_pb_pancakes
^XPGS5EQ... | Bob   | pancakes_with_jam
^63EGDX6... | Bob   | scrambled_egg
^YZ60CRJ... | Carol | jam_pb_pancakes
^K3K3E16... | Carol | jam_pb_pancakes
^M6FZ6K4... | Carol | pancakes_with_jam
^RN5X9SH... | Carol | pb_jam_sandwich
^2YAV0VF... | Dan   |

Essentially, in the context of joins pw.this allows us to address the table computed by

  friends.left_join(allergies, friends.name == allergies.person
      ).select(friends.name, allergies.allergy)

without breaking the chain and storing intermediate result in a temporary variable.

More precisely, given a chain:

table.join(...).select(...).join(...).select(...)...

pw.this can be used to address the result of the latest select.

Long chain example

To demonstrate longer chains in action, we go back to the table likes we used to show how to handle (generate) foreign keys.

feasible_choice = (    friends.join(likes, friends.id == likes.f_id_ptr)    .select(friends.name, friends.budget, likes.m_id_ptr)    .join(menu, pw.this.m_id_ptr == menu.id)    .select(pw.this.name, pw.this.budget, menu.dish, menu.price)    .left_join(allergies, pw.this.name == allergies.person)    .select(        pw.this.name, pw.this.budget, pw.this.dish, pw.this.price, allergies.allergy    )    .join(allergens_in_menu, pw.this.dish == allergens_in_menu.dish)    .select(        pw.this.name,        pw.this.budget,        pw.this.dish,        pw.this.price,        pw.this.allergy,        allergens_in_menu.allergen,    )    .filter((pw.this.price <= pw.this.budget) & (pw.this.allergy != pw.this.allergen)))pw.debug.compute_and_print(feasible_choice)
            | name  | budget | dish              | price | allergy | allergen
^C7E41EW... | Alice | 13     | jam_pb_pancakes   | 12    | milk    | eggs
^3C8BYCT... | Alice | 13     | jam_pb_pancakes   | 12    | milk    | peanuts
^V49K9AY... | Alice | 13     | pancakes_with_jam | 11    | milk    | eggs
^5HD9M6D... | Bob   | 10     | pb_jam_sandwich   | 9     | eggs    | peanuts
^3F4G2QJ... | Carol | 11     | pancakes_with_jam | 11    | milk    | eggs
^Q416RZ3... | Carol | 11     | pancakes_with_jam | 11    | peanuts | eggs
^05R585H... | Carol | 11     | pancakes_with_jam | 11    | peanuts | milk
^ZP22G15... | Carol | 11     | scrambled_egg     | 11    | milk    | eggs
^Z3CSET2... | Carol | 11     | scrambled_egg     | 11    | peanuts | eggs
^H08RKW9... | Dan   | 12     | pb_jam_sandwich   | 9     |         | peanuts
^41ZJ9RM... | Dan   | 12     | scrambled_egg     | 11    |         | eggs

As we can see, this table contains all choices of person and dish, such that person likes particular dish, is not allergic to it, and can afford it. We can further simplify the result by adding another groupby-reduce at the end of the chain.

pw.debug.compute_and_print(    feasible_choice.groupby(        pw.this.name, pw.this.dish, pw.this.budget, pw.this.price    ).reduce(pw.this.name, pw.this.dish, pw.this.budget, pw.this.price))
            | name  | dish              | budget | price
^XJV0C2H... | Alice | jam_pb_pancakes   | 13     | 12
^B9PDF7J... | Alice | pancakes_with_jam | 13     | 11
^86P08KH... | Bob   | pb_jam_sandwich   | 10     | 9
^FS0VA1H... | Carol | pancakes_with_jam | 11     | 11
^SFK6S54... | Carol | scrambled_egg     | 11     | 11
^AN8SK6X... | Dan   | pb_jam_sandwich   | 12     | 9
^G13RSRH... | Dan   | scrambled_egg     | 12     | 11

Krzysztof Nowicki

Datastore Research Engineer