Indexing from a groupby
and Single-row Tables using ix_ref
In this tutorial, you will learn how to use ix_ref
in a grouped table to access a row from the grouped table and how to manipulate single-row tables.
Accessing a grouped row by value using ix_ref
Imagine you've just been assigned a new project: analyzing salary statistics in your company. In particular, your goal is to determine the number of employees who earn more than the average salary in each department.
Let's consider the following table:
import pathway as pw
salaries = pw.debug.table_from_markdown(
"""
salary | department
1800 | Sales
2000 | Finance
2300 | Sales
2700 | Finance
1900 | Finance
"""
)
pw.debug.compute_and_print(salaries)
| salary | department
^X1MXHYY... | 1800 | Sales
^3HN31E1... | 1900 | Finance
^YYY4HAB... | 2000 | Finance
^Z3QWT29... | 2300 | Sales
^3CZ78B4... | 2700 | Finance
You can easily compute the average the salary in using a groupby
on the department
column:
statistics = salaries.groupby(pw.this.department).reduce(
pw.this.department, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(statistics)
| department | average_salary
^R67Q44G... | Finance | 2200.0
^9Z1MT42... | Sales | 2050.0
Now, you could do a join to add the average salary column to each entry in the salary
table:
salaries_join = salaries.join(
statistics, pw.left.department == pw.right.department
).select(*pw.left, pw.right.average_salary)
pw.debug.compute_and_print(salaries_join)
| salary | department | average_salary
^9Z1Y9HX... | 1800 | Sales | 2050.0
^R67VKGS... | 1900 | Finance | 2200.0
^R67X4RH... | 2000 | Finance | 2200.0
^9Z1GYFA... | 2300 | Sales | 2050.0
^R67RXF8... | 2700 | Finance | 2200.0
And then, you can filter whether the salary is higher than the average_salary
column.
However, this is very tedious.
In Pathway, you can directly access the row of a grouped table from the corresponding value using ix_ref
ix_ref
is a value-based indexer: you can access the row corresponding to the value Sales
with statistics.ix_ref("Sales")
.
With ix_ref(value)
, you access the row associated with value
by the groupby
, and you can directly access the wanted column with the usual dot notation:
to access the average salary of a given department, you can do statistics.ix_ref(pw.this.department).average_salary
statistics = statistics.select(
average_salary=pw.cast(int, pw.this.average_salary),
department=pw.this.department,
)
higher_than_average_salaries = salaries.filter(
pw.this.salary >= statistics.ix_ref(pw.this.department).average_salary
)
pw.debug.compute_and_print(higher_than_average_salaries)
| salary | department
^Z3QWT29... | 2300 | Sales
^3CZ78B4... | 2700 | Finance
Now you can count the number of entries in each department with a simple groupby
:
number_employees = higher_than_average_salaries.groupby(pw.this.department).reduce(
count=pw.reducers.count()
)
pw.debug.compute_and_print(number_employees)
| count
^R67Q44G... | 1
^9Z1MT42... | 1
Multi-values indexing
We can also use ix_ref
to index using tuples of values by using ix_ref(*args)
.
For example, let's say we have an extra column position
:
salaries_with_position = pw.debug.table_from_markdown(
"""
salary | department | position
1800 | Sales | junior
2000 | Finance | junior
2300 | Sales | senior
2700 | Finance | senior
1900 | Finance | junior
"""
)
We can now make a groupby using both the department
and position
columns:
grouped_table_with_position = salaries_with_position.groupby(
pw.this.department, pw.this.position
).reduce(
pw.this.department, pw.this.position, average_salary=pw.reducers.avg(pw.this.salary)
)
pw.debug.compute_and_print(grouped_table_with_position)
| department | position | average_salary
^2K7JN4G... | Finance | junior | 1950.0
^NVRTF1A... | Finance | senior | 2700.0
^514BB56... | Sales | junior | 1800.0
^74PV169... | Sales | senior | 2300.0
Now we can access the row using both department
and position
columns:
pw.debug.compute_and_print(
salaries_with_position.select(
*pw.this,
average_salary=grouped_table_with_position.ix_ref(
pw.this.department, pw.this.position
).average_salary,
)
)
| salary | department | position | average_salary
^X1MXHYY... | 1800 | Sales | junior | 1800.0
^3HN31E1... | 1900 | Finance | junior | 1950.0
^YYY4HAB... | 2000 | Finance | junior | 1950.0
^Z3QWT29... | 2300 | Sales | senior | 2300.0
^3CZ78B4... | 2700 | Finance | senior | 2700.0
You can also use the values directly:
pw.debug.compute_and_print(
salaries_with_position.select(
*pw.this,
average_salary=grouped_table_with_position.ix_ref(
"Sales", "junior"
).average_salary,
)
)
| salary | department | position | average_salary
^X1MXHYY... | 1800 | Sales | junior | 1800.0
^3HN31E1... | 1900 | Finance | junior | 1800.0
^YYY4HAB... | 2000 | Finance | junior | 1800.0
^Z3QWT29... | 2300 | Sales | senior | 1800.0
^3CZ78B4... | 2700 | Finance | senior | 1800.0
Tables with primary keys
ix_ref
is not limited to tables obtained by a groupby/reduce scheme: it works with any table with primary keys.
Primary keys are the columns chosen to index the table.
By default, Pathway indexes the table with uuid indexes, except when doing a groupby/reduce where the columns used for the groupby are used to index the table.
You can reindex the table by manually choosing primary keys, using .with_id_from
:
indexed_table = pw.debug.table_from_markdown(
"""
colA | colB
10 | A
20 | B
"""
)
pw.debug.compute_and_print(indexed_table)
reindexed_table = indexed_table.with_id_from(pw.this.colB)
pw.debug.compute_and_print(reindexed_table)
| colA | colB
^X1MXHYY... | 10 | A
^YYY4HAB... | 20 | B
| colA | colB
^KHNET5G... | 10 | A
^FFN4QBS... | 20 | B
You can see that indexes have been updated.
With primary keys, the rows can now be accessed using ix_ref
:
pw.debug.compute_and_print(
indexed_table.select(new_val=reindexed_table.ix_ref("A").colA)
)
| new_val
^YYY4HAB... | 10
^X1MXHYY... | 10
Single-Row Tables
A special case is an empty groupby
: all the entries are associated to the same group.
It is the case when you consider global statistics such as the average salary on the entire company.
In Pathway, it can be computed with a simple reduce()
.
Let's compute the sum of all the salaries in the company:
pw.debug.compute_and_print(salaries.reduce(sum_salary=pw.reducers.sum(pw.this.salary)))
| sum_salary
^PWSRT42... | 10700
As you can see, Pathway returns a single-row table and not the single value.
As tempting as it is, in Pathway, you cannot use the value directly and do:
nb_employees = employee_salary.reduce(pw.reducers.avg(pw.this.salary))
In Pathway, you cannot obtain the value as an int or a float as you could in SQL: you need to use .ix_ref()
In Pathway, we access the value of a single-row table using singlerowtable.ix_ref()
.
As previously, .ix_ref()
access the entire (single) row so you still need to specify the column.
You access the average with average_table.ix_ref().average_salary
.
Let's see how it goes:
global_statistics = salaries.reduce(average_salary=pw.reducers.avg(pw.this.salary))
pw.debug.compute_and_print(global_statistics)
global_statistics = global_statistics.cast_to_types(average_salary=int)
results = (
salaries.filter(pw.this.salary >= global_statistics.ix_ref().average_salary)
).reduce(count=pw.reducers.count())
pw.debug.compute_and_print(results)
| average_salary
^PWSRT42... | 2140.0
| count
^PWSRT42... | 2
That's it!
You now have the number of employees with a higher salary than average, also contained in a single-row table.
ix_ref()
can be used to copy the value in all the rows of the table:
salaries_with_average = salaries.select(
*pw.this, global_statistics.ix_ref().average_salary
)
pw.debug.compute_and_print(salaries_with_average)
| salary | department | average_salary
^X1MXHYY... | 1800 | Sales | 2140
^3HN31E1... | 1900 | Finance | 2140
^YYY4HAB... | 2000 | Finance | 2140
^Z3QWT29... | 2300 | Sales | 2140
^3CZ78B4... | 2700 | Finance | 2140
Bonus: SQL version
With Pathway's SQL API, you can directly query tables using SQL queries:
sql_result = pw.sql(
"SELECT COUNT(*) AS count FROM salaries WHERE salary > (SELECT AVG(salary) FROM salaries)",
salaries=salaries,
)
pw.debug.compute_and_print(sql_result)
| count
^PWSRT42... | 2
Be careful when using the SQL API: it only supports subqueries on single-row tables. This subquery usage is an exception, and we strongly encourage you to use the Python syntax as much as possible!