What is Pathway?
Pathway is a Python-based data processing framework which takes care of streaming data updates for you. It makes realtime data processing as easy as it can be.
Curious how Pathway resolves key pain points related to streaming data? Our CPO, Adrian, explains this in a quick video introduction to reactive data processing.
(Pathway: Reactive Data Processing in Python, PyData Global, December 2022).
The easiest (Path)way from streaming data to realtime applications
With Pathway, you can focus on building your data pipeline, designing both the analytics and the model you want. Then, you only need to plug the data in and Pathway will build and maintain your pipeline, sending the updates to your app in realtime.
Pathway allows you to build a data pipeline to process your data. The results of your pipeline are then broadcast as a stream to the storage system you want (Kafka, CSV files, PostgreSQL...). Queries are run on those output data streams, outside Pathway's scope. Pathway for Enterprise also offers database-like functionalities, such as persistence, snapshots, and the capacity to answer real-time queries (see the Features page for more information).
As an example, you may want to monitor the logs of your web server and send an alert if there are more than 1000 connections within the last minute. Let's see how to do it in Pathway:
- Set up an input connector to get a table containing the logs. This table is updated whenever a new connection is detected and forwarded by FileBeat/Logstash.
- Use Pathway to compute a sliding window over the last minute.
- Output an alert using an output connector (over Slack or Kibana for example) if the number of connections is too large.
The output - in this case the alerts sent to Slack - will be automatically updated by Pathway whenever new connections are received. If you want to learn more about this use-case, check out our tutorial about it.
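To make the three steps above concrete, here is a minimal sketch of the sliding-window logic in plain Python. It does not use the Pathway API: the `ConnectionMonitor` class and its method names are hypothetical, timestamps are assumed to arrive in non-decreasing order, and in a real pipeline the input and the alert would go through Pathway's connectors.

```python
from collections import deque


class ConnectionMonitor:
    """Hypothetical helper: counts connections seen in the last `window_s` seconds."""

    def __init__(self, window_s=60.0, threshold=1000):
        self.window_s = window_s
        self.threshold = threshold
        self.timestamps = deque()

    def on_connection(self, ts):
        """Record one connection at time `ts` (in seconds).

        Returns True when the number of connections inside the
        window exceeds the threshold, i.e. when an alert is due.
        """
        self.timestamps.append(ts)
        # Drop connections that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= ts - self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


# Small threshold so the demo stays readable.
monitor = ConnectionMonitor(window_s=60.0, threshold=3)
alerts = [monitor.on_connection(t) for t in [0, 10, 20, 30, 95]]
```

Only the fourth connection raises an alert here: by the time the fifth arrives, the earlier ones are more than a minute old and have been dropped from the window.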
Pathway lets you treat the data as if you were in batch mode: you manipulate your tables as if all the data were already available, and Pathway updates the results whenever a row is added or removed. From the developer's point of view, the streaming mode is completely transparent. There is also no need to design a batch layer and a speed layer separately, as you would in a system with a lambda architecture: the same Pathway engine provides realtime processing with the same capabilities, consistency promises, and design throughput as a batch processing framework.
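The idea that results follow row additions and removals can be illustrated with a small plain-Python sketch. This is not how Pathway's engine is implemented; `IncrementalSum` and the `diff` convention (+1 for an added row, -1 for a removed one) are assumptions chosen for illustration.

```python
class IncrementalSum:
    """Hypothetical aggregate kept up to date one row change at a time."""

    def __init__(self):
        self.total = 0

    def apply(self, value, diff):
        """Apply one change: diff=+1 adds a row, diff=-1 removes a row."""
        self.total += diff * value
        return self.total


rows = [5, 7, 3]

# Batch view: all the data is available at once.
batch_result = sum(rows)

# Streaming view: the same rows arrive one by one.
state = IncrementalSum()
for value in rows:
    state.apply(value, +1)
streaming_result = state.total

# A row is later removed; the result is updated without a full recomputation.
after_removal = state.apply(7, -1)
```

Both views agree on the result, and removing a row afterwards yields the same value a batch job would compute over the remaining rows.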
YOBO! (You Only Build Once)
Building a processing pipeline with Pathway is like manipulating static and finite data. Then, when you get really big data and switch to streaming mode, Pathway will manage all the updates for you.
What does it mean for your code?
In practice, you use the provided input connectors to connect Pathway to your data streams. You then build your pipeline in Python using the Pathway API, as if you were manipulating standard static, finite tables such as SQL tables. You only need to build your pipeline once: Pathway's engine maintains it automatically whenever new data points are received. Finally, you can output the resulting tables to your end application using the provided output connectors.
Here is what a Pathway pipeline typically looks like:
    # Import Pathway
    import pathway as pw

    # Connect Pathway to your input data streams
    # using Pathway's input connectors
    input_table = pw.io.csv.read('./sum_input_data/', ["value"], types={"value": pw.Type.INT}, mode="streaming")

    # Build your processing pipeline by considering
    # tables without worrying about the updates
    input_table = input_table.reduce(sum=pw.reducers.sum(input_table.value))

    # Output your resulting tables using Pathway's
    # output connectors: the updates in the tables
    # will be automatically processed and output!
    pw.io.csv.write(input_table, "output_stream.csv")

    # Don't forget to launch the streaming process
    # using the dedicated run function.
    pw.run()
Since you are processing data streams, the processing never ends: everything after the pw.run() call is unreachable code.
The outputs of Pathway are also data streams: they are updated as long as new data points are received.
The reception of a new data point will trigger the updates of the data in the pipeline, and the changes to the output tables will be sent by the output connectors.
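One way to picture an output that is itself a stream is as a sequence of changes to the result. The plain-Python sketch below is an assumption made for illustration, not a description of what Pathway's output connectors actually write: each new value retracts the previous sum (diff -1) and inserts the new one (diff +1). The function name `sum_updates` is hypothetical.

```python
def sum_updates(values):
    """Yield (sum, diff) changes as each value arrives.

    The previous sum is retracted with diff=-1 and the new sum
    is inserted with diff=+1, so a consumer can replay the
    changes to keep its own copy of the result current.
    """
    total = None
    for value in values:
        if total is not None:
            yield (total, -1)
        total = value if total is None else total + value
        yield (total, +1)


changes = list(sum_updates([1, 2, 3]))

# Replaying the changes leaves exactly one live result row.
live = {}
for result, diff in changes:
    live[result] = live.get(result, 0) + diff
current = [result for result, count in live.items() if count > 0]
```

After all three values have been received, the only live row is the final sum; every intermediate result has been retracted.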
How do I start?
Start by installing Pathway.
To start writing your own code quickly, take a look at our First-steps guide.
You can also take a look at some recipes in the Pathway cookbook:
- Detecting suspicious user activity.
- Find the time elapsed between events in an event stream.
- Compute the PageRank of a network.
Building larger applications
First, take a look at our guide on Pathway connectors.
To understand how Pathway can fit into your data processing architecture, take a look at some of the showcases in our pathway-examples repo.
- Learn how to make a realtime Twitter Analysis App with Pathway.
- See realtime Machine Learning in action with a Nearest-Neighbors classifier!
Feature FAQ
Realtime Machine Learning
- Pathway updates your Machine Learning decisions and models automatically, in real time.
- You can rely on a unified view of all current data.
- You can treat streaming data just as if they were ordinary tables.
Reactive design
- Let your users provide feedback, signal data issues, and change settings.
- Have the outputs of your data app react to live input.
- Show up-to-date results with sub-second latency.
Full power of Python
- Write code in Pathway with all the power and flexibility of Python classes and functions.
- Interface with the outside world through SQL, Kafka, Debezium, or REST API.
See the full feature list.
What can be built with Pathway
- Data apps serving responses & insights based on an always up-to-date data model.
- Data engineering pipelines: Smart-transformation in ELT.
- Data apps for simulation (what-if scenario testing).
- Realtime analytics: low-latency analytics on realtime event streaming data.
What data does Pathway work with?
Pathway is specifically designed to store, process, and output any mix of:
- Pure-SQL data.
- Time series data.
- IoT messages (harmonized).
- Event stream data.
- Spatiotemporal data (things that move).
- Graph and ontology data.