groupby
and single-row tables using ix_ref
In this tutorial, you will learn how to use ix_ref
in a grouped table to access a row from the grouped table and how to manipulate single-row tables.
ix_ref
Imagine you've just been assigned a new project: analyzing salary statistics in your company. In particular, your goal is to determine the number of employees who earn more than the average salary in each department.
Let's consider the following table:
import pathway as pw
salaries = pw.debug.table_from_markdown(
"""
salary | department
1800 | Sales
2000 | Finance
2300 | Sales
2700 | Finance
1900 | Finance
"""
)
pw.debug.compute_and_print(salaries)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department
^X1MXHYY... | 1800 | Sales
^3HN31E1... | 1900 | Finance
^YYY4HAB... | 2000 | Finance
^Z3QWT29... | 2300 | Sales
^3CZ78B4... | 2700 | Finance
You can easily compute the average the salary in using a groupby
on the department
column:
statistics = salaries.groupby(pw.this.department).reduce(
pw.this.department, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(statistics)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| department | average_salary
^R67Q44G... | Finance | 2200.0
^9Z1MT42... | Sales | 2050.0
Now, you could do a join to add the average salary column to each entry in the salary
table:
salaries_join = salaries.join(
statistics, pw.left.department == pw.right.department
).select(*pw.left, pw.right.average_salary)
pw.debug.compute_and_print(salaries_join)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department | average_salary
^8Q4Y9HX... | 1800 | Sales | 2050.0
^6F9BKGS... | 1900 | Finance | 2200.0
^3Y7D4RH... | 2000 | Finance | 2200.0
^HX0GYFA... | 2300 | Sales | 2050.0
^TS5RXF8... | 2700 | Finance | 2200.0
And then, you can filter whether the salary is higher than the average_salary
column.
However, this is very tedious.
In Pathway, you can directly access the row of a grouped table from the corresponding value using ix_ref
ix_ref
is a value-based indexer: you can access the row corresponding to the value Sales
with statistics.ix_ref("Sales")
.
With ix_ref(value)
, you access the row associated with value
by the groupby
, and you can directly access the wanted column with the usual dot notation:
to access the average salary of a given department, you can do statistics.ix_ref(pw.this.department).average_salary
statistics = statistics.select(
average_salary=pw.cast(int, pw.this.average_salary),
department=pw.this.department,
)
higher_than_average_salaries = salaries.filter(
pw.this.salary >= statistics.ix_ref(pw.this.department).average_salary
)
pw.debug.compute_and_print(higher_than_average_salaries)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department
^Z3QWT29... | 2300 | Sales
^3CZ78B4... | 2700 | Finance
Now you can count the number of entries in each department with a simple groupby
:
number_employees = higher_than_average_salaries.groupby(pw.this.department).reduce(
count=pw.reducers.count()
)
pw.debug.compute_and_print(number_employees)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| count
^R67Q44G... | 1
^9Z1MT42... | 1
We can also use ix_ref
to index using tuples of values by using ix_ref(*args)
.
For example, let's say we have an extra column position
:
salaries_with_position = pw.debug.table_from_markdown(
"""
salary | department | position
1800 | Sales | junior
2000 | Finance | junior
2300 | Sales | senior
2700 | Finance | senior
1900 | Finance | junior
"""
)
We can now make a groupby using both the department
and position
columns:
grouped_table_with_position = salaries_with_position.groupby(
pw.this.department, pw.this.position
).reduce(
pw.this.department, pw.this.position, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(grouped_table_with_position)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| department | position | average_salary
^2K7JN4G... | Finance | junior | 1950.0
^NVRTF1A... | Finance | senior | 2700.0
^514BB56... | Sales | junior | 1800.0
^74PV169... | Sales | senior | 2300.0
Now we can access the row using both department
and position
columns:
pw.debug.compute_and_print(
salaries_with_position.select(
*pw.this,
average_salary=grouped_table_with_position.ix_ref(
pw.this.department, pw.this.position
).average_salary,
)
)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department | position | average_salary
^X1MXHYY... | 1800 | Sales | junior | 1800.0
^3HN31E1... | 1900 | Finance | junior | 1950.0
^YYY4HAB... | 2000 | Finance | junior | 1950.0
^Z3QWT29... | 2300 | Sales | senior | 2300.0
^3CZ78B4... | 2700 | Finance | senior | 2700.0
You can also use the values directly:
pw.debug.compute_and_print(
salaries_with_position.select(
*pw.this,
average_salary=grouped_table_with_position.ix_ref(
"Sales", "junior"
).average_salary,
)
)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department | position | average_salary
^X1MXHYY... | 1800 | Sales | junior | 1800.0
^3HN31E1... | 1900 | Finance | junior | 1800.0
^YYY4HAB... | 2000 | Finance | junior | 1800.0
^Z3QWT29... | 2300 | Sales | senior | 1800.0
^3CZ78B4... | 2700 | Finance | senior | 1800.0
ix_ref
is not limited to tables obtained by a groupby/reduce scheme: it works with any table with primary keys.
Primary keys are the columns chosen to index the table.
By default, Pathway indexes the table with uuid indexes, except when doing a groupby/reduce where the columns used for the groupby are used to index the table.
You can reindex the table by manually choosing primary keys, using .with_id_from
:
indexed_table = pw.debug.table_from_markdown(
"""
colA | colB
10 | A
20 | B
"""
)
pw.debug.compute_and_print(indexed_table)
reindexed_table = indexed_table.with_id_from(pw.this.colB)
pw.debug.compute_and_print(reindexed_table)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| colA | colB
^X1MXHYY... | 10 | A
^YYY4HAB... | 20 | B
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| colA | colB
^KHNET5G... | 10 | A
^FFN4QBS... | 20 | B
You can see that indexes have been updated.
With primary keys, the rows can now be accessed using ix_ref
:
pw.debug.compute_and_print(
indexed_table.select(new_val=reindexed_table.ix_ref("A").colA)
)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| new_val
^YYY4HAB... | 10
^X1MXHYY... | 10
A special case is an empty groupby
: all the entries are associated to the same group.
It is the case when you consider global statistics such as the average salary on the entire company.
In Pathway, it can be computed with a simple reduce()
.
Let's compute the sum of all the salaries in the company:
pw.debug.compute_and_print(salaries.reduce(sum_salary=pw.reducers.sum(pw.this.salary)))
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| sum_salary
^PWSRT42... | 10700
As you can see, Pathway returns a single-row table and not the single value.
As tempting as it is, in Pathway, you cannot use the value directly and do:
nb_employees = employee_salary.reduce(pw.reducers.avg(pw.this.salary))
In Pathway, you cannot obtain the value as an int or a float as you could in SQL: you need to use .ix_ref()
In Pathway, we access the value of a single-row table using singlerowtable.ix_ref()
.
As previously, .ix_ref()
access the entire (single) row so you still need to specify the column.
You access the average with average_table.ix_ref().average_salary
.
Let's see how it goes:
global_statistics = salaries.reduce(average_salary=pw.reducers.avg(pw.this.salary))
pw.debug.compute_and_print(global_statistics)
global_statistics = global_statistics.cast_to_types(average_salary=int)
results = (
salaries.filter(pw.this.salary >= global_statistics.ix_ref().average_salary)
).reduce(count=pw.reducers.count())
pw.debug.compute_and_print(results)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| average_salary
^PWSRT42... | 2140.0
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| count
^PWSRT42... | 2
That's it!
You now have the number of employees with a higher salary than average, also contained in a single-row table.
ix_ref()
can be used to copy the value in all the rows of the table:
salaries_with_average = salaries.select(
*pw.this, global_statistics.ix_ref().average_salary
)
pw.debug.compute_and_print(salaries_with_average)
[2023-12-08T16:30:33]:INFO:Preparing Pathway computation
| salary | department | average_salary
^X1MXHYY... | 1800 | Sales | 2140
^3HN31E1... | 1900 | Finance | 2140
^YYY4HAB... | 2000 | Finance | 2140
^Z3QWT29... | 2300 | Sales | 2140
^3CZ78B4... | 2700 | Finance | 2140
With Pathway's SQL API, you can directly query tables using SQL queries:
sql_result = pw.sql(
"SELECT COUNT(*) AS count FROM salaries WHERE salary > (SELECT AVG(salary) FROM salaries)",
salaries=salaries,
)
pw.debug.compute_and_print(sql_result)
[2023-12-08T16:30:34]:INFO:Preparing Pathway computation
| count
^PWSRT42... | 2
Be careful when using the SQL API: it only supports subqueries on single-row tables. This subquery usage is an exception, and we strongly encourage you to use the Python syntax as much as possible!