Pathway Premieres at PyData Global

Zuzanna Stamirowska·December 1, 2022

Our CSO Adrian Kosowski was a guest speaker at PyData Global for a talk on Reactive Data Processing in Python.

(Pathway: Reactive Data Processing in Python, PyData Global, December 2022).

Machine Learning models designed to work with streaming systems make decisions on new data points as they arrive. But there is a downside: model decisions can't be easily changed later when the model is updated with fresher data, user feedback, or freshly tuned hyperparameters. This is often a blocker for anomaly detection, recommender systems, process mining, and human-in-the-loop planning.

To deal with this, we demonstrated design patterns that make it easy to express reactive data processing logic. We used Pathway, a scalable data processing framework built around a Python programming interface. Pathway has been battle-tested on operational data in enterprise settings, including graphs and event streams in real-world supply chains.

The talk gave a thorough explanation of the practical engineering challenges behind reactive data processing with a Machine Learning angle to it, and the steps needed to overcome these challenges.

In stream processing, Machine Learning models make decisions on new data points as soon as they arrive. Such immediate decisions are extremely useful, but they are not always the best ones. For example, in anomaly detection on a stream of events, new effects or trends can usually only be detected with confidence some time after they have started. Past decisions then need to be revisited and reclassified, but which ones exactly? Stream processing does not offer a direct answer, and full batch recomputation can be extremely costly. The same problem arises in many contexts: recommender systems, process mining, ontology querying, human-in-the-loop planning systems, and more. How can you gracefully handle data and models that need revisiting over time, without over-complicating even the simplest data transformations?
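To make the "which ones exactly?" question concrete, here is a minimal sketch in plain Python. It is not Pathway code and does not use the Pathway API; the class name `ReactiveThresholdDetector`, its `factor` parameter, and the toy model (an event is anomalous when its value exceeds `factor` times the running mean) are all assumptions made for illustration. Each call returns only the decisions that are new or have changed, rather than recomputing and re-emitting every past decision.

```python
import math
from bisect import bisect_right, insort


class ReactiveThresholdDetector:
    """Toy model (an assumption): a value is anomalous if it exceeds factor * running mean."""

    def __init__(self, factor=2.0):
        self.factor = factor
        self.total = 0.0
        self.ids = []        # event ids in arrival order; list index = sequence number
        self.by_value = []   # sorted (value, sequence number) pairs of earlier events
        self.threshold = math.inf

    def add(self, event_id, value):
        """Ingest one event and return the (event_id, is_anomaly) decisions that are new or changed."""
        seq = len(self.ids)
        self.ids.append(event_id)
        self.total += value

        # Refit the "model": here, just a threshold derived from the running mean.
        old = self.threshold
        self.threshold = self.factor * self.total / len(self.ids)
        lo, hi = min(old, self.threshold), max(old, self.threshold)

        # Only earlier events with a value in (lo, hi] can change label,
        # so we locate them by binary search instead of rescanning everything.
        start = bisect_right(self.by_value, (lo, math.inf))
        stop = bisect_right(self.by_value, (hi, math.inf))
        updates = [(self.ids[s], v > self.threshold) for v, s in self.by_value[start:stop]]

        # The new event is stored after the search so it is not reported twice.
        insort(self.by_value, (value, seq))
        updates.append((event_id, value > self.threshold))
        return updates


detector = ReactiveThresholdDetector(factor=2.0)
for event_id, value in [("e1", 10), ("e2", 12), ("e3", 50), ("e4", 60), ("e5", 2)]:
    print(event_id, detector.add(event_id, value))
```

In this run, `e3` is first flagged as an anomaly, then revised to normal once `e4` shifts the running mean upwards, and `e4` is itself revised to anomalous after `e5` pulls the mean back down. A real reactive framework has to provide this kind of bookkeeping for arbitrary transformations, not just a single threshold.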

During the talk, we covered the key engineering steps needed to deal with such problems through a reactive data processing design. Achieving such a design was our primary motivation for building Pathway. Pathway is a scalable data processing framework centered around a Python programming interface. It is deployed for processing live operational data in enterprise settings, including graphs and event streams in real-world supply chains, and is now becoming publicly available under an open-core model.

We showed design patterns that make it easy to express reactive Machine Learning logic. We highlighted where it is possible to rely on the usual Python data science stack and external libraries, and where special attention is needed. Most design patterns should feel familiar to users of Pandas or PySpark dataframes, so we focused on the key differences and on why they are necessary to achieve efficient reactive operation.

During the talk we ran a code demo showing how to create your own reactive data pipeline and microservice. The example was a reactive app that predicts the future popularity and sentiment of trending topics in a well-known social network, across different geographies. We filled in key steps in the code together, and then saw it in action in full deployment (with source data API integration and a frontend connected with FastAPI).

The talk was addressed to anyone with an interest in building "smart" data pipelines and data products in a real-time or streaming setting, including Machine Learning Engineers, Software Engineers, and Data Engineers. The goal was for attendees to leave with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle, and of the steps needed to overcome these challenges.