Run In Colab  View in Github

Groupby - Reduce

In this manual, you will learn how to aggregate data with the groupby-reduce scheme.

Together, the groupby and reduce operations can be used to aggregate data across the rows of the table. In this guide, we expand upon a simple demonstration from Survival Guide and :

Prerequisites

The assumption is that we are familiar with some basic operations explained in Survival Guide. As usual, we begin with importing Pathway.

import pathway as pw

To demonstrate the capabilities of groupby and reduce operations, let us consider some made up scenario.

Storyline: let's assume that we made a poll, asking whether particular food item is a fruit or a vegetable.

An answer to such question is a tuple (food_item, label, vote, fractional_vote). That is, if someone could tell that tomato is a fruit, but they are not really sure, it could be registered as two tuples:

  • (tomato, fruit, 1, 0.5),
  • (tomato, vegetable, 0, 0.5)

Below, we have the results of the poll, stored in the table poll.

poll = pw.debug.table_from_markdown(    """    | food_item | label     | vote  | fractional_vote   | time0   | tomato    | fruit     | 1     | 0.5               | 16698827281   | tomato    | vegetable | 0     | 0.5               | 16698827282   | apple     | fruit     | 1     | 1                 | 16698836123   | apple     | vegetable | 0     | 0                 | 16698836124   | pepper    | vegetable | 1     | 0.5               | 16698830595   | pepper    | fruit     | 0     | 0.5               | 16698830596   | tomato    | fruit     | 0     | 0.3               | 16698801597   | tomato    | vegetable | 1     | 0.7               | 16698801598   | corn      | fruit     | 0     | 0.3               | 16698768299   | corn      | vegetable | 1     | 0.7               | 166987682910  | tomato    | fruit     | 0     | 0.4               | 166987432511  | tomato    | vegetable | 1     | 0.6               | 166987432512  | pepper    | fruit     | 0     | 0.45              | 166988720713  | pepper    | vegetable | 1     | 0.55              | 166988720714  | apple     | fruit     | 1     | 1                 | 166987432515  | apple     | vegetable | 0     | 0                 | 1669874325""")

To demonstrate a simple groupby-reduce application, let's ask about the total fractional_vote that was assigned to any combination foot_item, label.

First we explain the syntax of both groupby and reduce. Then, we show groupby-reduce code in action.

Groupby Syntax

The syntax of the groupby operation is fairly simple:

table.groupby(*C)

It takes a list of columns *C as argument, and groups the row according to their values in those columns. In other words, all the rows with the same values, column-wise, in each column of *C are put into the same group.

As a result, it returns a GroupedTable object, which stores a single row for each unique tuple from columns in *C, and a collection of grouped items corresponding to each column that is not in *C.

In the example above, if we groupby over pair of columns food_item, label, the groupby computes a collection of votes and a collection of fractional_votes for each unique combination food_item, label.

In order to use this object, we need to process those collections with the reduce operation.

Reduce Syntax

The reduce function behaves a little bit like select, and it also takes two kinds of arguments:

grouped_table.reduce(*SC, *NC)
  • *SC is simply a list of columns that were present in the table before the groupby operation,
  • *NC is a list of new columns (i.e. columns with new name), each defined as some function of cells in the row of grouped_table.

It can be used together with groupby, as follows:

table.groupby(*C).reduce(*SC, *NC)

The main difference between reduce and select is that each row of grouped_table has two kinds of entries, simple cells and groups.

The reduce operation allows us to apply a reducer to transform a group into a value.

Counting Votes With Groupby-Reduce

Below, you can see an example that uses the sum reducer to compute the sum of all votes and the sum of all fractional votes.

aggregated_results = poll.groupby(poll.food_item, poll.label).reduce(    poll.food_item,    poll.label,    total_votes=pw.reducers.sum(poll.vote),    total_fractional_vote=pw.reducers.sum(poll.fractional_vote),)pw.debug.compute_and_print(aggregated_results)
            | food_item | label     | total_votes | total_fractional_vote
^NDDC1CV... | apple     | fruit     | 2           | 2.0
^0QWZR64... | apple     | vegetable | 0           | 0.0
^EY2V9JZ... | corn      | fruit     | 0           | 0.3
^SYKEK0R... | corn      | vegetable | 1           | 0.7
^8J0ZA4Z... | pepper    | fruit     | 0           | 0.95
^6FVPERH... | pepper    | vegetable | 2           | 1.05
^PJ7D6BC... | tomato    | fruit     | 1           | 1.2
^JYAGRWY... | tomato    | vegetable | 2           | 1.8

Groupby-Reduce Column Constraints

To briefly summarize, if is the set of all columns of a table

  • the columns from (used as comparison key) can be used as regular columns in the reduce function
  • for all the remaining columns (in ), we need to apply a reducer, before we use them in expressions

In particular, we can mix columns from and reduced columns from in column expressions.

def make_a_note(label: str, tot_votes: int, tot_fractional_vote: float):    return f"{label} got {tot_votes} votes, with total fractional_vote of {round(tot_fractional_vote, 2)}"aggregated_results_note = poll.groupby(poll.food_item, poll.label).reduce(    poll.food_item,    note=pw.apply(        make_a_note,        poll.label,        pw.reducers.sum(poll.vote),        pw.reducers.sum(poll.fractional_vote),    ),)pw.debug.compute_and_print(aggregated_results_note)
            | food_item | note
^NDDC1CV... | apple     | fruit got 2 votes, with total fractional_vote of 2.0
^0QWZR64... | apple     | vegetable got 0 votes, with total fractional_vote of 0.0
^EY2V9JZ... | corn      | fruit got 0 votes, with total fractional_vote of 0.3
^SYKEK0R... | corn      | vegetable got 1 votes, with total fractional_vote of 0.7
^8J0ZA4Z... | pepper    | fruit got 0 votes, with total fractional_vote of 0.95
^6FVPERH... | pepper    | vegetable got 2 votes, with total fractional_vote of 1.05
^PJ7D6BC... | tomato    | fruit got 1 votes, with total fractional_vote of 1.2
^JYAGRWY... | tomato    | vegetable got 2 votes, with total fractional_vote of 1.8

More Examples

Recent activity with max reducer

Below, we show a piece of code that finds the latest votes that were submitted to our poll. It is done with groupby-reduce operations chained with join and filter, using pw.this.

hour = 3600pw.debug.compute_and_print(    poll.groupby()    .reduce(time=pw.reducers.max(poll.time))    .join(poll, id=poll.id)    .select(        poll.food_item,        poll.label,        poll.vote,        poll.fractional_vote,        poll.time,        latest=pw.this.time,    )    .filter(pw.this.time >= pw.this.latest - 2 * hour)    .select(        pw.this.food_item,        pw.this.label,        pw.this.vote,        pw.this.fractional_vote,    ))
            | food_item | label     | vote | fractional_vote
^YHZBTNY... | apple     | fruit     | 1    | 1.0
^SERVYWW... | apple     | vegetable | 0    | 0.0
^4B2REY1... | pepper    | fruit     | 0    | 0.45
^76QPWK3... | pepper    | fruit     | 0    | 0.5
^8GR6BSX... | pepper    | vegetable | 1    | 0.5
^VYA37VV... | pepper    | vegetable | 1    | 0.55
^C4S6S48... | tomato    | fruit     | 0    | 0.3
^8JFNKVV... | tomato    | fruit     | 1    | 0.5
^2TMTFGY... | tomato    | vegetable | 0    | 0.5
^19D0FQ9... | tomato    | vegetable | 1    | 0.7

Removing duplicates

Below, we show how can we remove duplicates from table with groupby-reduce. On its own, selecting food_item and label from poll returns duplicate rows:

pw.debug.compute_and_print(poll.select(poll.food_item, poll.label))
            | food_item | label
^SE0WWPZ... | apple     | fruit
^YHZBTNY... | apple     | fruit
^NN09ZJR... | apple     | vegetable
^SERVYWW... | apple     | vegetable
^7ZG1GY6... | corn      | fruit
^4A5S1JN... | corn      | vegetable
^4B2REY1... | pepper    | fruit
^76QPWK3... | pepper    | fruit
^VYA37VV... | pepper    | vegetable
^8GR6BSX... | pepper    | vegetable
^8JFNKVV... | tomato    | fruit
^C4S6S48... | tomato    | fruit
^4N5HNMX... | tomato    | fruit
^19D0FQ9... | tomato    | vegetable
^2TMTFGY... | tomato    | vegetable
^HMETZT8... | tomato    | vegetable

However, we can apply groupby-reduce to select a set of unique rows:

pw.debug.compute_and_print(    poll.groupby(poll.food_item, poll.label).reduce(poll.food_item, poll.label))
            | food_item | label
^NDDC1CV... | apple     | fruit
^0QWZR64... | apple     | vegetable
^EY2V9JZ... | corn      | fruit
^SYKEK0R... | corn      | vegetable
^8J0ZA4Z... | pepper    | fruit
^6FVPERH... | pepper    | vegetable
^PJ7D6BC... | tomato    | fruit
^JYAGRWY... | tomato    | vegetable

Chained groupby-reduce-join-select

Below, you can find an example of groupby - reduce chained with join -select, using pw.this.

To demonstrate that, we can ask our poll about total fractional vote for each pair food_item, label and total fractional vote assigned to rows for each food_item.

relative_score = (    poll.groupby(poll.food_item)    .reduce(        poll.food_item,        total_fractional_vote=pw.reducers.sum(poll.fractional_vote),    )    .join(aggregated_results, pw.this.food_item == aggregated_results.food_item)    .select(        pw.this.food_item,        aggregated_results.label,        label_fractional_vote=aggregated_results.total_fractional_vote,        total_fractional_vote=pw.this.total_fractional_vote,    ))pw.debug.compute_and_print(relative_score)
            | food_item | label     | label_fractional_vote | total_fractional_vote
^D5KPC8X... | apple     | fruit     | 2.0                   | 2.0
^E97ATBC... | apple     | vegetable | 0.0                   | 2.0
^VS2ZSXQ... | corn      | fruit     | 0.3                   | 1.0
^GJJGT7Z... | corn      | vegetable | 0.7                   | 1.0
^TJ62GZB... | pepper    | fruit     | 0.95                  | 2.0
^DZ9XSD5... | pepper    | vegetable | 1.05                  | 2.0
^1NC0NHJ... | tomato    | fruit     | 1.2                   | 3.0
^92XHP8W... | tomato    | vegetable | 1.8                   | 3.0

Election using argmax reducer

Below, we present a snippet of code, that in the context of a poll, finds the most obvious information: which label got the most votes.

Let's take a look on what exactly is the result of argmax reducer:

pw.debug.compute_and_print(    relative_score.groupby(relative_score.food_item).reduce(        argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote)    ))
            | argmax_id
^QP417F6... | ^DZ9XSD5...
^REGKSGJ... | ^D5KPC8X...
^5WT76K7... | ^GJJGT7Z...
^J74ZRTM... | ^92XHP8W...

As we can see, it returns an ID of the row, that maximizes label_fractional_vote for a fixed food_item. We can filter interesting rows using those ID-s as follows:

pw.debug.compute_and_print(    relative_score.groupby(relative_score.food_item)    .reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))    .join(relative_score, pw.this.argmax_id == relative_score.id)    .select(        relative_score.food_item,        relative_score.label,        relative_score.label_fractional_vote,    ))
            | food_item | label     | label_fractional_vote
^29S8WD4... | apple     | fruit     | 2.0
^85V47J1... | corn      | vegetable | 0.7
^V9ECPCT... | pepper    | vegetable | 1.05
^DDYF5WX... | tomato    | vegetable | 1.8

Remark: the code snippet above is equivalent to:

pw.debug.compute_and_print(    relative_score.groupby(relative_score.food_item)    .reduce(argmax_id=pw.reducers.argmax(relative_score.label_fractional_vote))    .select(        relative_score.ix[pw.this.argmax_id].food_item,        relative_score.ix[pw.this.argmax_id].label,        relative_score.ix[pw.this.argmax_id].label_fractional_vote,    ))
            | food_item | label     | label_fractional_vote
^REGKSGJ... | apple     | fruit     | 2.0
^5WT76K7... | corn      | vegetable | 0.7
^QP417F6... | pepper    | vegetable | 1.05
^J74ZRTM... | tomato    | vegetable | 1.8

You can read more about joins, *.ix and ID-s in other places.

Krzysztof Nowicki

Datastore Research Engineer