Groupby - Reduce

In this manu[a]l, you will learn how to aggregate data with the groupby-reduce scheme.

Together, the groupby and reduce operations can be used to aggregate data across the rows of the table. In this guide, we expand upon a simple demonstration from the First-steps Guide and:

Prerequisites

The assumption is that we are familiar with some basic operations explained in the First-steps Guide. As usual, we begin with importing Pathway.

import pathway as pw

To demonstrate the capabilities of groupby and reduce operations, let us consider a made up scenario.

Storyline: let's assume that you made a poll, asking whether a particular food item is a fruit or a vegetable.

An answer to such a question is a tuple (food_item, label, vote, fractional_vote). That is, if someone could tell that tomato is a fruit, but they are not really sure, it could be registered as two tuples:

  • (tomato, fruit, 1, 0.5),
  • (tomato, vegetable, 0, 0.5)

Below, we have the results of the poll, stored in the table poll.

poll = pw.debug.table_from_markdown(
    """
    | food_item | label     | vote  | fractional_vote   | time
0   | tomato    | fruit     | 1     | 0.5               | 1669882728
1   | tomato    | vegetable | 0     | 0.5               | 1669882728
2   | apple     | fruit     | 1     | 1                 | 1669883612
3   | apple     | vegetable | 0     | 0                 | 1669883612
4   | pepper    | vegetable | 1     | 0.5               | 1669883059
5   | pepper    | fruit     | 0     | 0.5               | 1669883059
6   | tomato    | fruit     | 0     | 0.3               | 1669880159
7   | tomato    | vegetable | 1     | 0.7               | 1669880159
8   | corn      | fruit     | 0     | 0.3               | 1669876829
9   | corn      | vegetable | 1     | 0.7               | 1669876829
10  | tomato    | fruit     | 0     | 0.4               | 1669874325
11  | tomato    | vegetable | 1     | 0.6               | 1669874325
12  | pepper    | fruit     | 0     | 0.45              | 1669887207
13  | pepper    | vegetable | 1     | 0.55              | 1669887207
14  | apple     | fruit     | 1     | 1                 | 1669874325
15  | apple     | vegetable | 0     | 0                 | 1669874325

"""
)

To demonstrate a simple groupby-reduce application, let's ask about the total fractional_vote that was assigned to any combination of foot_item, label.

First we explain the syntax of both groupby and reduce. Then, we show groupby-reduce code in action.

Groupby Syntax

The syntax of the groupby operation is fairly simple:

table.groupby(*C)

It takes a list of columns *C as argument and groups the row according to their values in those columns. In other words, all the rows with the same values, column-wise, in each column of *C are put into the same group.

As a result, it returns a GroupedTable object, which stores a single row for each unique tuple from columns in *C and a collection of grouped items corresponding to each column that is not in *C.

In the example above, if we groupby over a pair of columns food_item, label, the groupby computes a collection of votes and a collection of fractional_votes for each unique combination food_item, label.

In order to use this object, we need to process those collections with the reduce operation.

Reduce Syntax

The reduce function behaves a little bit like select, and it also takes two kinds of arguments:

grouped_table.reduce(*SC, *NC)
  • *SC is simply a list of columns that were present in the table before the groupby operation,
  • *NC is a list of new columns (i.e. columns with new name), each defined as some function of cells in the row of grouped_table.

It can be used together with groupby, as follows:

table.groupby(*C).reduce(*SC, *NC)

The main difference between reduce and select is that each row of grouped_table has two kinds of entries, simple cells and groups.

The reduce operation allows us to apply a reducer to transform a group into a value.

Counting Votes With Groupby-Reduce

Below, you can see an example that uses the sum reducer to compute the sum of all votes and the sum of all fractional votes.

aggregated_results = poll.groupby(poll.food_item, poll.label).reduce(
    poll.food_item,
    poll.label,
    total_votes=pw.reducers.sum(poll.vote),
    total_fractional_vote=pw.reducers.sum(poll.fractional_vote),
)

pw.debug.compute_and_print(aggregated_results)
            | food_item | label     | total_votes | total_fractional_vote
^8X6FZAN... | apple     | fruit     | 2           | 2.0
^1PM8GGB... | apple     | vegetable | 0           | 0.0
^Y7P7Z85... | corn      | fruit     | 0           | 0.3
^R5DPRJ1... | corn      | vegetable | 1           | 0.7
^MPW1XV7... | pepper    | fruit     | 0           | 0.95
^TW138S3... | pepper    | vegetable | 2           | 1.05
^Z7NB38W... | tomato    | fruit     | 1           | 1.2
^E3Z6HDV... | tomato    | vegetable | 2           | 1.8

Groupby-Reduce Column Constraints

To briefly summarize, if is the set of all columns of a table

  • the columns from (used as comparison key) can be used as regular columns in the reduce function
  • for all the remaining columns (in ), we need to apply a reducer, before we use them in expressions

In particular, we can mix columns from and reduced columns from in column expressions.

def make_a_note(label: str, tot_votes: int, tot_fractional_vote: float):
    return f"{label} got {tot_votes} votes, with total fractional_vote of {round(tot_fractional_vote, 2)}"


aggregated_results_note = poll.groupby(poll.food_item, poll.label).reduce(
    poll.food_item,
    note=pw.apply(
        make_a_note,
        poll.label,
        pw.reducers.sum(poll.vote),
        pw.reducers.sum(poll.fractional_vote),
    ),
)

pw.debug.compute_and_print(aggregated_results_note)
            | food_item | note
^8X6FZAN... | apple     | fruit got 2 votes, with total fractional_vote of 2.0
^1PM8GGB... | apple     | vegetable got 0 votes, with total fractional_vote of 0.0
^Y7P7Z85... | corn      | fruit got 0 votes, with total fractional_vote of 0.3
^R5DPRJ1... | corn      | vegetable got 1 votes, with total fractional_vote of 0.7
^MPW1XV7... | pepper    | fruit got 0 votes, with total fractional_vote of 0.95
^TW138S3... | pepper    | vegetable got 2 votes, with total fractional_vote of 1.05
^Z7NB38W... | tomato    | fruit got 1 votes, with total fractional_vote of 1.2
^E3Z6HDV... | tomato    | vegetable got 2 votes, with total fractional_vote of 1.8

Auto generated id

The groupby(*C).reduce(...) operation guarantees that each row of the output corresponds to a unique tuple of values from *C. Therefore *C can be used to generate a unique id for the resulting table.

In fact that is the default behavior of Pathway' groupby operation and can be used e.g. to join the results of this table with some other table that has this id as a foreign key.

queries = pw.debug.table_from_markdown(
    """
    | food_item | label
1   | tomato    | fruit
2   | pepper    | vegetable
3   | corn      | vegetable

"""
).with_id_from(pw.this.food_item, pw.this.label)

pw.debug.compute_and_print(
    queries.join(
        aggregated_results_note, queries.id == aggregated_results_note.id, id=queries.id
    ).select(queries.food_item, aggregated_results_note.note)
)
            | food_item | note
^R5DPRJ1... | corn      | vegetable got 1 votes, with total fractional_vote of 0.7
^TW138S3... | pepper    | vegetable got 2 votes, with total fractional_vote of 1.05
^Z7NB38W... | tomato    | fruit got 1 votes, with total fractional_vote of 1.2

More examples of joins (including another example of a join over a foreign key) can be found in the join manual (full article, foreign key example).

More Examples

Recent activity with max reducer

Below, you can see a piece of code that finds the latest votes that were submitted to the poll. It is done with groupby-reduce operations chained with join and filter, using pw.this.

hour = 3600
pw.debug.compute_and_print(
    poll.groupby()
    .reduce(time=pw.reducers.max(poll.time))
    .join(poll, id=poll.id)
    .select(
        poll.food_item,
        poll.label,
        poll.vote,
        poll.fractional_vote,
        poll.time,
        latest=pw.left.time,
    )
    .filter(pw.this.time >= pw.this.latest - 2 * hour)
    .select(
        pw.this.food_item,
        pw.this.label,
        pw.this.vote,
        pw.this.fractional_vote,
    )
)
            | food_item | label     | vote | fractional_vote
^Z3QWT29... | apple     | fruit     | 1    | 1.0
^3CZ78B4... | apple     | vegetable | 0    | 0.0
^GFDBR0G... | pepper    | fruit     | 0    | 0.45
^3S2X6B2... | pepper    | fruit     | 0    | 0.5
^3HN31E1... | pepper    | vegetable | 1    | 0.5
^3J2R55X... | pepper    | vegetable | 1    | 0.55
^A984WV0... | tomato    | fruit     | 0    | 0.3
^X1MXHYY... | tomato    | fruit     | 1    | 0.5
^YYY4HAB... | tomato    | vegetable | 0    | 0.5
^6A0QZMJ... | tomato    | vegetable | 1    | 0.7

Removing duplicates

Below, duplicates are removed from the table with groupby-reduce. On its own, selecting food_item and label from poll returns duplicate rows:

pw.debug.compute_and_print(poll.select(poll.food_item, poll.label))
            | food_item | label
^XTPZRQ2... | apple     | fruit
^Z3QWT29... | apple     | fruit
^3CZ78B4... | apple     | vegetable
^MA4SK6C... | apple     | vegetable
^SN0FH7F... | corn      | fruit
^9KM937R... | corn      | vegetable
^GFDBR0G... | pepper    | fruit
^3S2X6B2... | pepper    | fruit
^3HN31E1... | pepper    | vegetable
^3J2R55X... | pepper    | vegetable
^X1MXHYY... | tomato    | fruit
^QECDZJF... | tomato    | fruit
^A984WV0... | tomato    | fruit
^YYY4HAB... | tomato    | vegetable
^6A0QZMJ... | tomato    | vegetable
^03PYXDQ... | tomato    | vegetable

However, we can apply groupby-reduce to select a set of unique rows:

pw.debug.compute_and_print(
    poll.groupby(poll.food_item, poll.label).reduce(poll.food_item, poll.label)
)
            | food_item | label
^8X6FZAN... | apple     | fruit
^1PM8GGB... | apple     | vegetable
^Y7P7Z85... | corn      | fruit
^R5DPRJ1... | corn      | vegetable
^MPW1XV7... | pepper    | fruit
^TW138S3... | pepper    | vegetable
^Z7NB38W... | tomato    | fruit
^E3Z6HDV... | tomato    | vegetable

Chained groupby-reduce-join-select

Below, you can find an example of groupby - reduce chained with join -select, using pw.this.

To demonstrate that, we can ask our poll about total fractional vote for each pair food_item, label and total fractional vote assigned to rows for each food_item.

relative_score = (
    poll.groupby(poll.food_item)
    .reduce(
        poll.food_item,
        total_fractional_vote=pw.reducers.sum(poll.fractional_vote),
    )
    .join(aggregated_results, pw.left.food_item == pw.right.food_item)
    .select(
        pw.left.food_item,
        pw.right.label,
        label_fractional_vote=pw.right.total_fractional_vote,
        total_fractional_vote=pw.left.total_fractional_vote,
    )
)

pw.debug.compute_and_print(relative_score)
            | food_item | label     | label_fractional_vote | total_fractional_vote
^7YTTW8S... | apple     | fruit     | 2.0                   | 2.0
^7YTJZBX... | apple     | vegetable | 0.0                   | 2.0
^PHR71C0... | corn      | fruit     | 0.3                   | 1.0
^PHR3860... | corn      | vegetable | 0.7                   | 1.0
^C5046A2... | pepper    | fruit     | 0.95                  | 2.0
^C504ZMA... | pepper    | vegetable | 1.05                  | 2.0
^NHNP79E... | tomato    | fruit     | 1.2                   | 3.0
^NHNGEC4... | tomato    | vegetable | 1.8                   | 3.0

Election using argmax reducer

Below, we present a snippet of code, that in the context of a poll, finds the most obvious information: which label got the most votes.

Let's take a look on what exactly is the result of argmax reducer:

pw.debug.compute_and_print(
    relative_score.groupby(relative_score.food_item).reduce(
        argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote)
    )
)
            | argmax_id
^NHNKR02... | ^NHNGEC4...
^7YTG4C2... | ^7YTTW8S...
^C50B2KX... | ^C504ZMA...
^PHR4QTJ... | ^PHR3860...

As you can see, it returns an ID of the row that maximizes label_fractional_vote for a fixed food_item. You can filter interesting rows using those ID-s as follows:

pw.debug.compute_and_print(
    relative_score.groupby(relative_score.food_item)
    .reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))
    .join(relative_score, pw.left.argmax_id == relative_score.id)
    .select(
        relative_score.food_item,
        relative_score.label,
        relative_score.label_fractional_vote,
    )
)
            | food_item | label     | label_fractional_vote
^0AVWY3X... | apple     | fruit     | 2.0
^EF480M4... | corn      | vegetable | 0.7
^DV5NNAH... | pepper    | vegetable | 1.05
^S8WYQGD... | tomato    | vegetable | 1.8

Remark: the code snippet above is equivalent to:

pw.debug.compute_and_print(
    relative_score.groupby(relative_score.food_item)
    .reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))
    .select(
        relative_score.ix(pw.this.argmax_id).food_item,
        relative_score.ix(pw.this.argmax_id).label,
        relative_score.ix(pw.this.argmax_id).label_fractional_vote,
    )
)
            | food_item | label     | label_fractional_vote
^7YTG4C2... | apple     | fruit     | 2.0
^PHR4QTJ... | corn      | vegetable | 0.7
^C50B2KX... | pepper    | vegetable | 1.05
^NHNKR02... | tomato    | vegetable | 1.8

You can read more about joins, *.ix and ID-s in other places.