TEST

Indexing from a groupby and Single-row Tables using ix_ref

In this tutorial, you will learn how to use ix_ref in a grouped table to access a row from the grouped table and how to manipulate single-row tables.

Accessing a grouped row by value using ix_ref

Imagine you've just been assigned a new project: analyzing salary statistics in your company. In particular, your goal is to determine the number of employees who earn more than the average salary in each department.

Let's consider the following table:

import pathway as pw


salaries = pw.debug.table_from_markdown(
    """
 salary   | department
 1800   | Sales
 2000   | Finance
 2300   | Sales
 2700   | Finance
 1900   | Finance
 """
)
pw.debug.compute_and_print(salaries)
            | salary | department
^X1MXHYY... | 1800   | Sales
^3HN31E1... | 1900   | Finance
^YYY4HAB... | 2000   | Finance
^Z3QWT29... | 2300   | Sales
^3CZ78B4... | 2700   | Finance

You can easily compute the average the salary in using a groupby on the department column:

statistics = salaries.groupby(pw.this.department).reduce(
    pw.this.department, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(statistics)
            | department | average_salary
^R67Q44G... | Finance    | 2200.0
^9Z1MT42... | Sales      | 2050.0

Now, you could do a join to add the average salary column to each entry in the salary table:

salaries_join = salaries.join(
    statistics, pw.left.department == pw.right.department
).select(*pw.left, pw.right.average_salary)
pw.debug.compute_and_print(salaries_join)
            | salary | department | average_salary
^9Z1Y9HX... | 1800   | Sales      | 2050.0
^R67VKGS... | 1900   | Finance    | 2200.0
^R67X4RH... | 2000   | Finance    | 2200.0
^9Z1GYFA... | 2300   | Sales      | 2050.0
^R67RXF8... | 2700   | Finance    | 2200.0

And then, you can filter whether the salary is higher than the average_salary column.

However, this is very tedious.

In Pathway, you can directly access the row of a grouped table from the corresponding value using ix_ref

ix_ref is a value-based indexer: you can access the row corresponding to the value Sales with statistics.ix_ref("Sales").

ix_ref

With ix_ref(value), you access the row associated with value by the groupby, and you can directly access the wanted column with the usual dot notation: to access the average salary of a given department, you can do statistics.ix_ref(pw.this.department).average_salary

statistics = statistics.select(
    average_salary=pw.cast(int, pw.this.average_salary),
    department=pw.this.department,
)
higher_than_average_salaries = salaries.filter(
    pw.this.salary >= statistics.ix_ref(pw.this.department).average_salary
)
pw.debug.compute_and_print(higher_than_average_salaries)
            | salary | department
^Z3QWT29... | 2300   | Sales
^3CZ78B4... | 2700   | Finance

Now you can count the number of entries in each department with a simple groupby:

number_employees = higher_than_average_salaries.groupby(pw.this.department).reduce(
    count=pw.reducers.count()
)
pw.debug.compute_and_print(number_employees)
            | count
^R67Q44G... | 1
^9Z1MT42... | 1

Multi-values indexing

We can also use ix_ref to index using tuples of values by using ix_ref(*args).

For example, let's say we have an extra column position:

salaries_with_position = pw.debug.table_from_markdown(
    """
 salary | department | position
 1800   | Sales | junior
 2000   | Finance | junior
 2300   | Sales | senior
 2700   | Finance | senior
 1900   | Finance | junior
 """
)

We can now make a groupby using both the department and position columns:

grouped_table_with_position = salaries_with_position.groupby(
    pw.this.department, pw.this.position
).reduce(
    pw.this.department, pw.this.position, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(grouped_table_with_position)
            | department | position | average_salary
^2K7JN4G... | Finance    | junior   | 1950.0
^NVRTF1A... | Finance    | senior   | 2700.0
^514BB56... | Sales      | junior   | 1800.0
^74PV169... | Sales      | senior   | 2300.0

Now we can access the row using both department and position columns:

pw.debug.compute_and_print(
    salaries_with_position.select(
        *pw.this,
        average_salary=grouped_table_with_position.ix_ref(
            pw.this.department, pw.this.position
        ).average_salary,
    )
)
            | salary | department | position | average_salary
^X1MXHYY... | 1800   | Sales      | junior   | 1800.0
^3HN31E1... | 1900   | Finance    | junior   | 1950.0
^YYY4HAB... | 2000   | Finance    | junior   | 1950.0
^Z3QWT29... | 2300   | Sales      | senior   | 2300.0
^3CZ78B4... | 2700   | Finance    | senior   | 2700.0

You can also use the values directly:

pw.debug.compute_and_print(
    salaries_with_position.select(
        *pw.this,
        average_salary=grouped_table_with_position.ix_ref(
            "Sales", "junior"
        ).average_salary,
    )
)
            | salary | department | position | average_salary
^X1MXHYY... | 1800   | Sales      | junior   | 1800.0
^3HN31E1... | 1900   | Finance    | junior   | 1800.0
^YYY4HAB... | 2000   | Finance    | junior   | 1800.0
^Z3QWT29... | 2300   | Sales      | senior   | 1800.0
^3CZ78B4... | 2700   | Finance    | senior   | 1800.0

Tables with primary keys

ix_ref is not limited to tables obtained by a groupby/reduce scheme: it works with any table with primary keys. Primary keys are the columns chosen to index the table.

By default, Pathway indexes the table with uuid indexes, except when doing a groupby/reduce where the columns used for the groupby are used to index the table.

You can reindex the table by manually choosing primary keys, using .with_id_from:

indexed_table = pw.debug.table_from_markdown(
    """
colA   | colB
10     | A
20     | B
"""
)
pw.debug.compute_and_print(indexed_table)
reindexed_table = indexed_table.with_id_from(pw.this.colB)
pw.debug.compute_and_print(reindexed_table)
            | colA | colB
^X1MXHYY... | 10   | A
^YYY4HAB... | 20   | B
            | colA | colB
^KHNET5G... | 10   | A
^FFN4QBS... | 20   | B

You can see that indexes have been updated.

With primary keys, the rows can now be accessed using ix_ref:

pw.debug.compute_and_print(
    indexed_table.select(new_val=reindexed_table.ix_ref("A").colA)
)
            | new_val
^YYY4HAB... | 10
^X1MXHYY... | 10

Single-Row Tables

A special case is an empty groupby: all the entries are associated to the same group.

It is the case when you consider global statistics such as the average salary on the entire company. In Pathway, it can be computed with a simple reduce().

Let's compute the sum of all the salaries in the company:

pw.debug.compute_and_print(salaries.reduce(sum_salary=pw.reducers.sum(pw.this.salary)))
            | sum_salary
^PWSRT42... | 10700

As you can see, Pathway returns a single-row table and not the single value.

As tempting as it is, in Pathway, you cannot use the value directly and do:

nb_employees = employee_salary.reduce(pw.reducers.avg(pw.this.salary))

In Pathway, you cannot obtain the value as an int or a float as you could in SQL: you need to use .ix_ref()

In Pathway, we access the value of a single-row table using singlerowtable.ix_ref().

As previously, .ix_ref() access the entire (single) row so you still need to specify the column. You access the average with average_table.ix_ref().average_salary.

Let's see how it goes:

global_statistics = salaries.reduce(average_salary=pw.reducers.avg(pw.this.salary))
pw.debug.compute_and_print(global_statistics)
global_statistics = global_statistics.cast_to_types(average_salary=int)
results = (
    salaries.filter(pw.this.salary >= global_statistics.ix_ref().average_salary)
).reduce(count=pw.reducers.count())
pw.debug.compute_and_print(results)
            | average_salary
^PWSRT42... | 2140.0
            | count
^PWSRT42... | 2

That's it!

You now have the number of employees with a higher salary than average, also contained in a single-row table.

ix_ref() can be used to copy the value in all the rows of the table:

salaries_with_average = salaries.select(
    *pw.this, global_statistics.ix_ref().average_salary
)
pw.debug.compute_and_print(salaries_with_average)
            | salary | department | average_salary
^X1MXHYY... | 1800   | Sales      | 2140
^3HN31E1... | 1900   | Finance    | 2140
^YYY4HAB... | 2000   | Finance    | 2140
^Z3QWT29... | 2300   | Sales      | 2140
^3CZ78B4... | 2700   | Finance    | 2140

Bonus: SQL version

With Pathway's SQL API, you can directly query tables using SQL queries:

sql_result = pw.sql(
    "SELECT COUNT(*) AS count FROM salaries WHERE salary > (SELECT AVG(salary) FROM salaries)",
    salaries=salaries,
)
pw.debug.compute_and_print(sql_result)
            | count
^PWSRT42... | 2

Be careful when using the SQL API: it only supports subqueries on single-row tables. This subquery usage is an exception, and we strongly encourage you to use the Python syntax as much as possible!