Groupby - Reduce
In this manu[a]l, you will learn how to aggregate data with the groupby-reduce scheme.
Together, the groupby
and reduce
operations can be used
to aggregate data across the rows of the table. In this guide,
we expand upon a simple demonstration from the
First-steps Guide
and:
- explain syntax of groupby and reduce
- explain two kinds of columns we get from groupby
- explain automatic id generation
- show a few simple applications
Prerequisites
The assumption is that we are familiar with some basic operations explained in the First-steps Guide. As usual, we begin with importing Pathway.
import pathway as pw
To demonstrate the capabilities of groupby and reduce operations, let us consider a made up scenario.
Storyline: let's assume that you made a poll, asking whether a particular food item is a fruit or a vegetable.
An answer to such a question is a tuple (food_item, label, vote, fractional_vote)
.
That is, if someone could tell that tomato is a fruit, but they are not really sure,
it could be registered as two tuples:
- (tomato, fruit, 1, 0.5),
- (tomato, vegetable, 0, 0.5)
Below, we have the results of the poll, stored in the table poll.
poll = pw.debug.table_from_markdown(
"""
| food_item | label | vote | fractional_vote | time
0 | tomato | fruit | 1 | 0.5 | 1669882728
1 | tomato | vegetable | 0 | 0.5 | 1669882728
2 | apple | fruit | 1 | 1 | 1669883612
3 | apple | vegetable | 0 | 0 | 1669883612
4 | pepper | vegetable | 1 | 0.5 | 1669883059
5 | pepper | fruit | 0 | 0.5 | 1669883059
6 | tomato | fruit | 0 | 0.3 | 1669880159
7 | tomato | vegetable | 1 | 0.7 | 1669880159
8 | corn | fruit | 0 | 0.3 | 1669876829
9 | corn | vegetable | 1 | 0.7 | 1669876829
10 | tomato | fruit | 0 | 0.4 | 1669874325
11 | tomato | vegetable | 1 | 0.6 | 1669874325
12 | pepper | fruit | 0 | 0.45 | 1669887207
13 | pepper | vegetable | 1 | 0.55 | 1669887207
14 | apple | fruit | 1 | 1 | 1669874325
15 | apple | vegetable | 0 | 0 | 1669874325
"""
)
To demonstrate a simple groupby-reduce application, let's ask about
the total fractional_vote
that was assigned to any combination of foot_item
, label
.
First we explain the syntax of both groupby
and reduce
.
Then, we show groupby-reduce code in action.
Groupby Syntax
The syntax of the groupby
operation is fairly simple:
table.groupby(*C)
It takes a list of columns *C
as argument and groups the row according to their values in those columns.
In other words, all the rows with the same values, column-wise, in each column of *C
are put into the same group.
As a result, it returns a GroupedTable
object, which stores
a single row for each unique tuple from columns in *C
and a collection
of grouped items corresponding to each column that is not in *C
.
In the example above, if we groupby over a pair of columns food_item
, label
,
the groupby computes a collection of votes
and a collection of fractional_votes
for each unique combination food_item
, label
.
In order to use this object, we need to process those collections
with the reduce
operation.
Reduce Syntax
The reduce
function behaves a little bit like select
, and it also takes
two kinds of arguments:
grouped_table.reduce(*SC, *NC)
- *SC is simply a list of columns that were present in the table before the groupby operation,
- *NC is a list of new columns (i.e. columns with new name), each defined as some function of cells in the row of grouped_table.
It can be used together with groupby, as follows:
table.groupby(*C).reduce(*SC, *NC)
The main difference between reduce
and select
is that each row of grouped_table
has two kinds of entries, simple cells and groups.
The reduce
operation allows us to apply a reducer to transform a group into a value.
Counting Votes With Groupby-Reduce
Below, you can see an example that uses the sum
reducer to compute the sum of all
votes and the sum of all fractional votes.
aggregated_results = poll.groupby(poll.food_item, poll.label).reduce(
poll.food_item,
poll.label,
total_votes=pw.reducers.sum(poll.vote),
total_fractional_vote=pw.reducers.sum(poll.fractional_vote),
)
pw.debug.compute_and_print(aggregated_results)
| food_item | label | total_votes | total_fractional_vote
^8X6FZAN... | apple | fruit | 2 | 2.0
^1PM8GGB... | apple | vegetable | 0 | 0.0
^Y7P7Z85... | corn | fruit | 0 | 0.3
^R5DPRJ1... | corn | vegetable | 1 | 0.7
^MPW1XV7... | pepper | fruit | 0 | 0.95
^TW138S3... | pepper | vegetable | 2 | 1.05
^Z7NB38W... | tomato | fruit | 1 | 1.2
^E3Z6HDV... | tomato | vegetable | 2 | 1.8
Groupby-Reduce Column Constraints
To briefly summarize, if is the set of all columns of a table
- the columns from (used as comparison key) can be used as regular columns in the reduce function
- for all the remaining columns (in ), we need to apply a reducer, before we use them in expressions
In particular, we can mix columns from and reduced columns from in column expressions.
def make_a_note(label: str, tot_votes: int, tot_fractional_vote: float):
return f"{label} got {tot_votes} votes, with total fractional_vote of {round(tot_fractional_vote, 2)}"
aggregated_results_note = poll.groupby(poll.food_item, poll.label).reduce(
poll.food_item,
note=pw.apply(
make_a_note,
poll.label,
pw.reducers.sum(poll.vote),
pw.reducers.sum(poll.fractional_vote),
),
)
pw.debug.compute_and_print(aggregated_results_note)
| food_item | note
^8X6FZAN... | apple | fruit got 2 votes, with total fractional_vote of 2.0
^1PM8GGB... | apple | vegetable got 0 votes, with total fractional_vote of 0.0
^Y7P7Z85... | corn | fruit got 0 votes, with total fractional_vote of 0.3
^R5DPRJ1... | corn | vegetable got 1 votes, with total fractional_vote of 0.7
^MPW1XV7... | pepper | fruit got 0 votes, with total fractional_vote of 0.95
^TW138S3... | pepper | vegetable got 2 votes, with total fractional_vote of 1.05
^Z7NB38W... | tomato | fruit got 1 votes, with total fractional_vote of 1.2
^E3Z6HDV... | tomato | vegetable got 2 votes, with total fractional_vote of 1.8
Auto generated id
The groupby(*C).reduce(...)
operation guarantees that each row of the
output corresponds to a unique tuple of values from *C. Therefore *C
can be used to generate a unique id for the resulting table.
In fact that is the default behavior of Pathway' groupby operation and can be used e.g. to join the results of this table with some other table that has this id as a foreign key.
queries = pw.debug.table_from_markdown(
"""
| food_item | label
1 | tomato | fruit
2 | pepper | vegetable
3 | corn | vegetable
"""
).with_id_from(pw.this.food_item, pw.this.label)
pw.debug.compute_and_print(
queries.join(
aggregated_results_note, queries.id == aggregated_results_note.id, id=queries.id
).select(queries.food_item, aggregated_results_note.note)
)
| food_item | note
^R5DPRJ1... | corn | vegetable got 1 votes, with total fractional_vote of 0.7
^TW138S3... | pepper | vegetable got 2 votes, with total fractional_vote of 1.05
^Z7NB38W... | tomato | fruit got 1 votes, with total fractional_vote of 1.2
More examples of joins (including another example of a join over a foreign key) can be found in the join manual (full article, foreign key example).
More Examples
Recent activity with max reducer
Below, you can see a piece of code that finds the latest votes that were submitted to the poll.
It is done with groupby
-reduce
operations chained with join
and filter
, using pw.this
.
hour = 3600
pw.debug.compute_and_print(
poll.groupby()
.reduce(time=pw.reducers.max(poll.time))
.join(poll, id=poll.id)
.select(
poll.food_item,
poll.label,
poll.vote,
poll.fractional_vote,
poll.time,
latest=pw.left.time,
)
.filter(pw.this.time >= pw.this.latest - 2 * hour)
.select(
pw.this.food_item,
pw.this.label,
pw.this.vote,
pw.this.fractional_vote,
)
)
| food_item | label | vote | fractional_vote
^Z3QWT29... | apple | fruit | 1 | 1.0
^3CZ78B4... | apple | vegetable | 0 | 0.0
^GFDBR0G... | pepper | fruit | 0 | 0.45
^3S2X6B2... | pepper | fruit | 0 | 0.5
^3HN31E1... | pepper | vegetable | 1 | 0.5
^3J2R55X... | pepper | vegetable | 1 | 0.55
^A984WV0... | tomato | fruit | 0 | 0.3
^X1MXHYY... | tomato | fruit | 1 | 0.5
^YYY4HAB... | tomato | vegetable | 0 | 0.5
^6A0QZMJ... | tomato | vegetable | 1 | 0.7
Removing duplicates
Below, duplicates are removed from the table with groupby-reduce.
On its own, selecting food_item
and label
from poll returns duplicate rows:
pw.debug.compute_and_print(poll.select(poll.food_item, poll.label))
| food_item | label
^XTPZRQ2... | apple | fruit
^Z3QWT29... | apple | fruit
^3CZ78B4... | apple | vegetable
^MA4SK6C... | apple | vegetable
^SN0FH7F... | corn | fruit
^9KM937R... | corn | vegetable
^GFDBR0G... | pepper | fruit
^3S2X6B2... | pepper | fruit
^3HN31E1... | pepper | vegetable
^3J2R55X... | pepper | vegetable
^X1MXHYY... | tomato | fruit
^QECDZJF... | tomato | fruit
^A984WV0... | tomato | fruit
^YYY4HAB... | tomato | vegetable
^6A0QZMJ... | tomato | vegetable
^03PYXDQ... | tomato | vegetable
However, we can apply groupby-reduce to select a set of unique rows:
pw.debug.compute_and_print(
poll.groupby(poll.food_item, poll.label).reduce(poll.food_item, poll.label)
)
| food_item | label
^8X6FZAN... | apple | fruit
^1PM8GGB... | apple | vegetable
^Y7P7Z85... | corn | fruit
^R5DPRJ1... | corn | vegetable
^MPW1XV7... | pepper | fruit
^TW138S3... | pepper | vegetable
^Z7NB38W... | tomato | fruit
^E3Z6HDV... | tomato | vegetable
Chained groupby-reduce-join-select
Below, you can find an example of groupby - reduce chained with join -select,
using pw.this
.
To demonstrate that, we can ask our poll about total fractional vote for each pair
food_item
, label
and total fractional vote assigned to rows for each food_item
.
relative_score = (
poll.groupby(poll.food_item)
.reduce(
poll.food_item,
total_fractional_vote=pw.reducers.sum(poll.fractional_vote),
)
.join(aggregated_results, pw.left.food_item == pw.right.food_item)
.select(
pw.left.food_item,
pw.right.label,
label_fractional_vote=pw.right.total_fractional_vote,
total_fractional_vote=pw.left.total_fractional_vote,
)
)
pw.debug.compute_and_print(relative_score)
| food_item | label | label_fractional_vote | total_fractional_vote
^7YTTW8S... | apple | fruit | 2.0 | 2.0
^7YTJZBX... | apple | vegetable | 0.0 | 2.0
^PHR71C0... | corn | fruit | 0.3 | 1.0
^PHR3860... | corn | vegetable | 0.7 | 1.0
^C5046A2... | pepper | fruit | 0.95 | 2.0
^C504ZMA... | pepper | vegetable | 1.05 | 2.0
^NHNP79E... | tomato | fruit | 1.2 | 3.0
^NHNGEC4... | tomato | vegetable | 1.8 | 3.0
Election using argmax reducer
Below, we present a snippet of code, that in the context of a poll, finds the most obvious information: which label got the most votes.
Let's take a look on what exactly is the result of argmax
reducer:
pw.debug.compute_and_print(
relative_score.groupby(relative_score.food_item).reduce(
argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote)
)
)
| argmax_id
^NHNKR02... | ^NHNGEC4...
^7YTG4C2... | ^7YTTW8S...
^C50B2KX... | ^C504ZMA...
^PHR4QTJ... | ^PHR3860...
As you can see, it returns an ID of the row that maximizes label_fractional_vote
for a fixed food_item
.
You can filter interesting rows using those ID-s as follows:
pw.debug.compute_and_print(
relative_score.groupby(relative_score.food_item)
.reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))
.join(relative_score, pw.left.argmax_id == relative_score.id)
.select(
relative_score.food_item,
relative_score.label,
relative_score.label_fractional_vote,
)
)
| food_item | label | label_fractional_vote
^0AVWY3X... | apple | fruit | 2.0
^EF480M4... | corn | vegetable | 0.7
^DV5NNAH... | pepper | vegetable | 1.05
^S8WYQGD... | tomato | vegetable | 1.8
Remark: the code snippet above is equivalent to:
pw.debug.compute_and_print(
relative_score.groupby(relative_score.food_item)
.reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))
.select(
relative_score.ix(pw.this.argmax_id).food_item,
relative_score.ix(pw.this.argmax_id).label,
relative_score.ix(pw.this.argmax_id).label_fractional_vote,
)
)
| food_item | label | label_fractional_vote
^7YTG4C2... | apple | fruit | 2.0
^PHR4QTJ... | corn | vegetable | 0.7
^C50B2KX... | pepper | vegetable | 1.05
^NHNKR02... | tomato | vegetable | 1.8
You can read more about joins, *.ix and ID-s in other places.