AsyncTransformer

One way of transforming data in Pathway, when simple transformations are not enough, is using UDFs. However, if the flexibility of the UDFs is still not enough, you can use even more general and flexible AsyncTransformer, useful especially for asynchronous computations.

AsyncTransformer is a different mechanism than UDFs. It acts on the whole Pathway Table and returns a new Table. In contrast to UDFs, it is fully asynchronous. It starts the invoke method for every row that arrives, without waiting for the previous batches to finish. When the call is finished, its result is returned to the engine with a new processing time.

To write an AsyncTransformer you need to inherit from pw.AsyncTransformer and implement the invoke method (it is a coroutine). The names of the arguments of the method have to be the same as the columns in the input table. You can use additional arguments but then you have to specify their default value (it might be useful if you want to use the same AsyncTransformer on multiple Pathway tables with different sets of columns). You have to use all columns from the input table. The order of columns/arguments doesn't matter as they are passed to the method as keyword arguments.

You also need to define the schema of a table that is produced. The invoke method has to return a dictionary containing values to put in all columns of the output table. The keys in the dictionary has to match fields from the output schema. Let's create a simple AsyncTransformer that produces a Table with two output columns - value and ret.

import pathway as pw

import asyncio


class OutputSchema(pw.Schema):
    value: int
    ret: int


class SimpleAsyncTransformer(pw.AsyncTransformer, output_schema=OutputSchema):
    async def invoke(self, value: int) -> dict:
        await asyncio.sleep(value / 10)
        return dict(value=value, ret=value + 1)

Let's use the transformer on the example input table. The result table containing only successful calls can be retrieved from the successful property of the transformer.

table = pw.debug.table_from_markdown(
    """
    value
      12
       6
       2
       2
       6

"""
)

result = SimpleAsyncTransformer(input_table=table).successful
pw.debug.compute_and_print(result)
            | value | ret
^Z3QWT29... | 2     | 3
^3CZ78B4... | 2     | 3
^YYY4HAB... | 6     | 7
^3HN31E1... | 6     | 7
^X1MXHYY... | 12    | 13

The result is correct. Now let's take a look at the output times:

pw.debug.compute_and_print_update_stream(result)
            | value | ret | __time__      | __diff__
^Z3QWT29... | 2     | 3   | 1734428452112 | 1
^3CZ78B4... | 2     | 3   | 1734428452312 | 1
^YYY4HAB... | 6     | 7   | 1734428452312 | 1
^3HN31E1... | 6     | 7   | 1734428452712 | 1
^X1MXHYY... | 12    | 13  | 1734428452712 | 1

Even though all values have equal processing times initially, the output times are different between the rows. It is the effect of AsyncTransformer not waiting for other rows to finish. Thanks to that some rows can be processed downstream quicker. If you want some rows to wait for some other rows to finish, take a look at the instance parameter.

Failing calls

The invoke method is usually written by an external user (like you) and it can contain bugs (unless you write bug-free code). When the invoke call raises an exception or times out (see the next section for that), its output won't be included in the successful table. The failed rows are put in the table accessible by the failed property. Let's define a new AsyncTransformer to check that. Maybe we don't like the value and we fail our function whenever we get it as an argument.

class SometimesFailingAsyncTransformer(pw.AsyncTransformer, output_schema=OutputSchema):
    async def invoke(self, value: int) -> dict:
        if value == 12:
            raise ValueError("incorrect value")
        return dict(value=value, ret=value + 1)


t = SometimesFailingAsyncTransformer(input_table=table)
pw.debug.compute_and_print(t.successful)
pw.debug.compute_and_print(t.failed)
            | value | ret
^Z3QWT29... | 2     | 3
^3CZ78B4... | 2     | 3
^YYY4HAB... | 6     | 7
^3HN31E1... | 6     | 7
            | value | ret
^X1MXHYY... |       |

In the failed table you only get the ids of failed rows (other columns contain None). Because the invoke call failed it was impossible to return any values. You can check which values have failed by joining with the input table:

failed = t.failed.join(table, pw.left.id == pw.right.id, id=pw.left.id).select(
    pw.right.value
)
pw.debug.compute_and_print(failed)
            | value
^X1MXHYY... | 12

Now, you can see that the failed row actually has in the value column.

Controlling AsyncTransformer behavior

It is possible to control the behavior of AsyncTransformer using parameters similar to those in UDFs. They can be passed to with_options method. The available options are:

  • capacity - the maximum number of concurrent operations,
  • timeout - the maximum time (in seconds) to wait for the function result,
  • retry_strategy - the strategy for handling retries in case of failures. The same strategies as for asynchronous UDFs can be used. Examples: ExponentialBackoffRetryStrategy, FixedDelayRetryStrategy,
  • cache_strategy - the caching strategy. The same strategies as for UDFs can be used. Examples: DiskCache, InMemoryCache.

In the following example, you add a timeout to the SimpleAsyncTransformer defined above. It is set to seconds.

t = SimpleAsyncTransformer(input_table=table).with_options(timeout=0.9)

pw.debug.compute_and_print(t.successful)
failed = t.failed.join(table, pw.left.id == pw.right.id).select(pw.right.value)
pw.debug.compute_and_print(failed)
            | value | ret
^Z3QWT29... | 2     | 3
^3CZ78B4... | 2     | 3
^YYY4HAB... | 6     | 7
^3HN31E1... | 6     | 7
            | value
^MFBX8ZW... | 12

Recall that the transformer sleeps the invoke method for a time passed as the method argument divided by . That's why calls with value less than were successful, and calls with value greater than failed.

AsyncTransformer consistency

By default, AsyncTransformer preserves order for a given key. It means that if some row is still executed by AsyncTransformer and its update starts being executed and finishes earlier than the original row, it'll wait for the completion of the original row processing before being returned to the engine. The update cannot have an earlier time assigned than the original row as it would break the correctness of the computations.

Let's analyze this case by computing the sums of entries from the stream. You want to compute the sum for each group independently.

table = pw.debug.table_from_markdown(
    """
    group | value | __time__
      1   |   2   |     2
      1   |   3   |     2
      2   |   1   |     2
      1   |  -3   |     4
      2   |   2   |     4
"""
)
sums = table.groupby(pw.this.group).reduce(
    pw.this.group, value=pw.reducers.sum(pw.this.value)
)

pw.debug.compute_and_print_update_stream(sums)
            | group | value | __time__ | __diff__
^YYY4HAB... | 1     | 5     | 2        | 1
^Z3QWT29... | 2     | 1     | 2        | 1
^YYY4HAB... | 1     | 5     | 4        | -1
^Z3QWT29... | 2     | 1     | 4        | -1
^YYY4HAB... | 1     | 2     | 4        | 1
^Z3QWT29... | 2     | 3     | 4        | 1

The sums computed in time are and . They are deleted in time and replaced with sums and . Let's modify SimpleAsyncTransformer to propagate the group column as well and apply it to the sums table.

class OutputWithGroupSchema(pw.Schema):
    group: int
    value: int
    ret: int


class GroupAsyncTransformer(pw.AsyncTransformer, output_schema=OutputWithGroupSchema):
    async def invoke(self, value: int, group: int) -> dict:
        await asyncio.sleep(value / 10)
        return dict(group=group, value=value, ret=value + 1)


result = GroupAsyncTransformer(input_table=sums).successful
pw.debug.compute_and_print_update_stream(result)
            | group | value | ret | __time__      | __diff__
^Z3QWT29... | 2     | 1     | 2   | 1734428456004 | 1
^Z3QWT29... | 2     | 1     | 2   | 1734428456104 | -1
^Z3QWT29... | 2     | 3     | 4   | 1734428456104 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428456304 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428456506 | -1
^YYY4HAB... | 1     | 2     | 3   | 1734428456506 | 1

All rows reach GroupAsyncTransformer at approximately the same time. In group , the value at time is , and at time is . The first value is processed faster and returned to the engine. When a call for the next value finishes, the old value is removed and a new value is returned to the engine.

The situation for group is different. The value at time is greater than the value at time (). Because of that, the second call to invoke finishes earlier and has to wait for the first call to finish. When the first call finishes, they are both returned to the engine. The value from the second call is newer and immediately replaces the old value.

Partial consistency

Sometimes, the consistency for rows with a single key might not be enough for you. If you want to guarantee an order within a group of records, you can use the instance parameter of the AsyncTransformer. Rows within a single instance are ordered. It means that the results for rows with higher initial processing times can't overtake the results for rows with lower initial processing times. All results within a single instance with equal processing times wait for all rows with this time to finish. Using the instance parameter does not block new calls from starting. Only the results of the calls get synchronized. To demonstrate the synchronization, we create a new table with more data:

table = pw.debug.table_from_markdown(
    """
    group | value | __time__
      1   |   2   |     2
      1   |   3   |     2
      2   |   1   |     2
      3   |   1   |     2
      4   |   3   |     2
      1   |  -3   |     4
      2   |   3   |     4
      3   |   1   |     4
      4   |  -1   |     4
"""
)
sums = table.groupby(pw.this.group).reduce(
    pw.this.group, value=pw.reducers.sum(pw.this.value)
)

pw.debug.compute_and_print_update_stream(sums)
            | group | value | __time__ | __diff__
^YYY4HAB... | 1     | 5     | 2        | 1
^Z3QWT29... | 2     | 1     | 2        | 1
^3CZ78B4... | 3     | 1     | 2        | 1
^3HN31E1... | 4     | 3     | 2        | 1
^YYY4HAB... | 1     | 5     | 4        | -1
^Z3QWT29... | 2     | 1     | 4        | -1
^3CZ78B4... | 3     | 1     | 4        | -1
^3HN31E1... | 4     | 3     | 4        | -1
^YYY4HAB... | 1     | 2     | 4        | 1
^Z3QWT29... | 2     | 4     | 4        | 1
^3CZ78B4... | 3     | 2     | 4        | 1
^3HN31E1... | 4     | 2     | 4        | 1

Now, you have four groups, with one row for each group. You want to guarantee consistency separately for even and odd groups. To do that, you need to set the instance of GroupAsyncTransformer appropriately.

result = GroupAsyncTransformer(input_table=sums, instance=pw.this.group % 2).successful
pw.debug.compute_and_print_update_stream(result)
            | group | value | ret | __time__      | __diff__
^Z3QWT29... | 2     | 1     | 2   | 1734428456744 | 1
^3HN31E1... | 4     | 3     | 4   | 1734428456744 | 1
^Z3QWT29... | 2     | 1     | 2   | 1734428457044 | -1
^3HN31E1... | 4     | 3     | 4   | 1734428457044 | -1
^Z3QWT29... | 2     | 4     | 5   | 1734428457044 | 1
^3HN31E1... | 4     | 2     | 3   | 1734428457044 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428457146 | 1
^3CZ78B4... | 3     | 1     | 2   | 1734428457146 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428457244 | -1
^3CZ78B4... | 3     | 1     | 2   | 1734428457244 | -1
^YYY4HAB... | 1     | 2     | 3   | 1734428457244 | 1
^3CZ78B4... | 3     | 2     | 3   | 1734428457244 | 1

The updates for groups are bundled together. Group at time could finish earlier, but it waits for group . Groups are also dependent on each other. Group could finish quicker, but it waits for group to finish.

You can have a look at how the updates would proceed if no instance was specified:

result = GroupAsyncTransformer(input_table=sums).successful
pw.debug.compute_and_print_update_stream(result)
            | group | value | ret | __time__      | __diff__
^Z3QWT29... | 2     | 1     | 2   | 1734428457358 | 1
^3CZ78B4... | 3     | 2     | 3   | 1734428457458 | 1
^3HN31E1... | 4     | 3     | 4   | 1734428457558 | 1
^Z3QWT29... | 2     | 1     | 2   | 1734428457658 | -1
^3HN31E1... | 4     | 3     | 4   | 1734428457658 | -1
^Z3QWT29... | 2     | 4     | 5   | 1734428457658 | 1
^3HN31E1... | 4     | 2     | 3   | 1734428457658 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428457758 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428457858 | -1
^YYY4HAB... | 1     | 2     | 3   | 1734428457858 | 1

As you can see, only ordering within a group is preserved.

Full consistency

By using the instance parameter, it is possible to make the output preserve the temporal ordering of the input. It is enough to set instance to the same value for all rows, for example by using a constant. Then results for rows with a given time will wait for all previous times to finish before being returned to the engine. Rows with a given time are returned all at once and have the same time assigned. The new calls are not blocked from starting. Only the results get synchronized.

Let's use constant instance in the example from the previous section.

result = GroupAsyncTransformer(input_table=sums, instance=0).successful
pw.debug.compute_and_print_update_stream(result)
            | group | value | ret | __time__      | __diff__
^YYY4HAB... | 1     | 5     | 6   | 1734428457982 | 1
^Z3QWT29... | 2     | 1     | 2   | 1734428457982 | 1
^3CZ78B4... | 3     | 1     | 2   | 1734428457982 | 1
^3HN31E1... | 4     | 3     | 4   | 1734428457982 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428458482 | -1
^Z3QWT29... | 2     | 1     | 2   | 1734428458482 | -1
^3CZ78B4... | 3     | 1     | 2   | 1734428458482 | -1
^3HN31E1... | 4     | 3     | 4   | 1734428458482 | -1
^YYY4HAB... | 1     | 2     | 3   | 1734428458482 | 1
^Z3QWT29... | 2     | 4     | 5   | 1734428458482 | 1
^3CZ78B4... | 3     | 2     | 3   | 1734428458482 | 1
^3HN31E1... | 4     | 2     | 3   | 1734428458482 | 1

All rows are returned at the same time. There are also no updates because calls for time are finished later than calls for time . You can play with the data to make time finish before time and see that the update happens once.

Failing calls consistency

If the instance parameter is used and the call for a given instance fails, the instance is in the failed state from this time. AsyncTransformer requires all calls with a given (instance, processing time) pair to finish successfully. If at least one call fails, returning other rows could leave the instance in an inconsistent state. Let's take a look at what happens if group fails at time .

class SuspiciousGroupAsyncTransformer(
    pw.AsyncTransformer, output_schema=OutputWithGroupSchema
):
    async def invoke(self, value: int, group: int) -> dict:
        if group == 4 and value == 2:
            raise ValueError("err")
        await asyncio.sleep(value / 10)
        return dict(group=group, value=value, ret=value + 1)


result = SuspiciousGroupAsyncTransformer(
    input_table=sums, instance=pw.this.group % 2
).successful
pw.debug.compute_and_print_update_stream(result)
            | group | value | ret | __time__      | __diff__
^Z3QWT29... | 2     | 1     | 2   | 1734428458586 | 1
^3HN31E1... | 4     | 3     | 4   | 1734428458586 | 1
^Z3QWT29... | 2     | 1     | 2   | 1734428458888 | -1
^3HN31E1... | 4     | 3     | 4   | 1734428458888 | -1
^YYY4HAB... | 1     | 5     | 6   | 1734428458988 | 1
^3CZ78B4... | 3     | 1     | 2   | 1734428458988 | 1
^YYY4HAB... | 1     | 5     | 6   | 1734428459086 | -1
^3CZ78B4... | 3     | 1     | 2   | 1734428459086 | -1
^YYY4HAB... | 1     | 2     | 3   | 1734428459086 | 1
^3CZ78B4... | 3     | 2     | 3   | 1734428459086 | 1

New values for the even instance (groups ) coming from the entries at time are not inserted because group fails and hence the whole instance fails. None of the entries in the odd instance (groups ) fail so it is updated normally.

Conclusions

In this guide, you've learned how to create your own AsyncTransformer when you need to process the data asynchronously in Pathway. You know how to control its behavior by setting parameters like timeout, cache_strategy and retry_strategy. You can control the tradeoff between the speed and the consistency of the results.

Now, you also understand the difference between asynchronous UDFs and AsyncTransformer. The former is asynchronous only within a single batch of data and can return values only to a single column, while the latter is fully asynchronous and can return multiple columns. It also allows for specifying the consistency level by using the instance parameter.