When moving from batch to streaming you need to understand the nature of data coming over time. Pathway makes it easy to switch from batch to streaming by guaranteeing out-of-the-box the correctness of computations, by updating a result whenever new data points are inputted. This correctness comes with a price, and you may want to trade it for better latency or memory consumption. In Pathway, you can set the behavior of the temporal operations to determine the tradeoff between accuracy, latency, and memory consumption that will suit your application.
When working with streaming data, your events will happen over time and the result of your application will be based on data that has been processed so far. In reality, the situation will be even more complicated - the time at which event occurs will be different than the time at which it is processed, and the difference between those two - latency - will vary from event to event making it hard to predict. It is caused by different speeds and levels of reliability of channels you will use to send data to a streaming system.
When some data is late, what should we do? It's impossible to know if there is no data or if the data is late. Should we wait indefinitely, ready to update the results, or should we finalize the computations as fast as possible and free the resources? Also, when computing aggregated values, such as an average. Should we start the computation as soon as we receive the first data point, or should we wait to have enough data? The default behavior of Pathway is to start the computation right away and wait for potential late data to ensure correctness: by default, the results Pathway produces are always consistent with the received data. However, this has a cost: Pathway needs to keep track of old data in case a late data point comes and the result needs to be recalculated.
For example, if you compute the exact number of unique website visitors over 1-hour intervals and get an entry from 1 day ago, you must remember all users from that time. It is okay, but imagine getting an entry from 1 year ago and keeping all the data! Doing this might be ridiculous, but it is the only way to guarantee the correctness if the maximal latency is unknown. Unfortunately, this can make the memory consumption unbounded. The solution is to specify a temporal behavior to let Pathway know which data is essential for you and which can be ignored.
Because data don't come all simultaneously, there is a tradeoff between latency and efficiency - it could be better to wait for more data before doing the computations. For example, having an output update at each data reception can be costly for regularly and frequently incoming data, and delaying the update might be more efficient. Also, it may be wise to wait to have enough data before computing aggregated statistics.
When you aggregate the data, by default you start getting results when the first record gets processed. It provides low latency but in some situations can be undesirable. In the previous example of counting unique website visitors, you are fine with getting the best estimate on the result before the interval ends. On the other hand, if you want to have alerts based on suspiciously low power usage to detect potential outages, you will trigger alerts, just because you work with incomplete data. The solution would be to wait until most data should have been already collected.
This last issue concerns anomaly detection use cases. Imagine that you aggregate data into windows, i.e., clusters of data within some duration of time from each other. Now, you analyze data inside each window, and if it is suspicious in some way, you raise an alert. Naturally, your alert should only be based on the latest or few latest windows. To accomplish that, you want to forget data if they are no longer relevant.
All these issues can be solved in Pathway by using temporal behaviors! They will allow you to specify when computations should occur, whether to keep updating based on late data, or if you want to remove outdated results from the output of an operator.
Temporal behavior in Pathway is specified using 3 parameters -
The purpose of
delay is to inform the engine to wait for a given amount of time before any computation is done after a new record arrives. There are two reasons to do that. The first one is to avoid recomputation by buffering the data, by which you specify your desired tradeoff between latency and efficiency. The second one is distinguishing between windows for which you want the latest result, even if it is based on incomplete data, or if you want to wait until most of the data has arrived. You can see the example of the latter in the From Jupyter to Deploy tutorial.
cutoff is used to specify how long to wait for late data. It sets the time, after which the results of computation results will no longer be updated, even if late data arrives, thus allowing Pathway to clear memory.
keep_results allows you to specify whether the computation results should be kept after
keep_results=True, which is its default value, the operator's output is kept, but it is no longer updated with late data. When you set
keep_results=False, not only will the results not be updated, but they will be removed from the output. It is useful for the anomaly detection use case, an example of which you can check in
To understand what these arguments exactly mean for specific operators, read our tutorials on how to use behaviors with Interval Joins and Windows .
To set these three parameters, provide them as an argument to
pw.temporal.common_behavior. Pathway also provides a "shortcut" for windows you want to calculate exactly once, after they have already closed, by using
pw.temporal.exactly_once_behavior. You can read more about
pw.temporal.exactly_once_behavior in our tutorial on using behaviors with Windows.