Connectors in Pathway
In order to use Pathway, one of the first things you need to do is to access the data you want to manipulate. In Pathway, accessing the data is done using connectors.
Pathway comes with connectors which connect to external data sources at input, as well as connectors which output the data outside of Pathway. Before reading any further, make sure you are already familiar with the streaming and static modes of Pathway as the connectors are different depending on the chosen mode.
Before going into more details about the different connectors and how they work, here is a summary of available connectors, grouped by mode and whether they are input or output connectors:
Streaming mode | Static mode | |
---|---|---|
Input connectors | Kafka CSV Debezium Amazon S3 Plain text | Markdown Pandas CSV |
Output connectors | Kafka PostgreSQL CSV | pw.debug.compute_and_print CSV |
The one you need is not in the table? Don't worry, more are coming and you can always reach out to let us know what you would find helpful!
Streaming mode connectors
In the streaming mode, input connectors wait for incoming updates. Whenever an update is received, it is propagated in the computation graph (more details here) until it reaches the output connectors which output the resulting changes. This is where it becomes interesting: the table created by the input connector, and all the computations based on it, will be automatically updated whenever an update is received (e.g. a new CSV file has been created). All the computations and outputs are automatically updated by Pathway to take into account the updates from the stream, without requiring any operation from your part: this is the magic of Pathway! In practice, the updates are triggered by commits, which ensure the atomicity of each update.
As Pathway deals with never-ending streaming data: the computation runs forever until the process gets killed. This is the normal behavior of Pathway.
Output connectors are the only way to access the results obtained in Pathway in the streaming mode. However, our outputs are not static but are updated with every new received update: we do not output the full table but only the changes. Every change is represented by a row containing the following columns:
- the columns of the table, representing the values which have been modified.
time
, representing the logical time of the update: this time is incremented at each new commit.diff
, representing whether this update is an addition or a removal. Only two values are possible, 1 for addition and -1 for removal. An update is represented by two rows, one for deleting the previous value and one to add the new.
To see how to set up a realtime streaming application with Pathway, you can try our examples with csv and Kafka input connectors.
Connectors in static mode
In the static mode, the computation is done in batch mode: all the data is read, processed, and output at once. There is no notion of update in this mode. This mode is mainly used for debugging and testing.
In addition of a csv connector which dumps the output table into a csv file, Pathway provides a function which allows to build the graph, to ingest all the data and to print a given table contained in the graph: pw.debug.compute_and_print
.
Example
Let's have a quick example on how to manipulate hand-written table in the static mode:
import pathway as pwt = pw.debug.table_from_markdown( """ | name | age 1 | Alice | 15 2 | Bob | 32 3 | Carole| 28 4 | David | 35 """)pw.debug.compute_and_print(t)
| name | age
^YYY4HAB... | Alice | 15
^Z3QWT29... | Bob | 32
^3CZ78B4... | Carole | 28
^3HN31E1... | David | 35
Compatibility issues
Both modes are incompatible: you cannot mix connectors from streaming and static modes. Indeed, the nature of the data manipulated by both types of connectors is very different: data streams versus static data.
For instance, one might want to print the value of their table table
in their pipeline to check if the values are correct.
Between two select
, a pw.debug.compute_and_print(table)
is inserted and the computation is run with streaming input connectors.
What do you think would happen?
The program loops. Indeed, pw.debug.compute_and_print
waits for the whole data to be ingested entirely in the computation graph before printing the table.
This makes sense with finite static data but not in the streaming mode where updates are continuously coming!
Be careful when you want to debug your pipeline with static data!
Tutorials
To learn how to use the different connectors, you can see our tutorials:
Conclusion
Connectors are a vital part of Pathway as they define how your data is accessed to and from Pathway. It is important to make the distinction between input/output and streaming/static connectors as they have very different purposes and cannot be mixed.
Pathway provides several connectors, allowing you to connect to your data in different settings in a simple and efficient way. We will regularly update this section and provide more connectors.
You can see one of our recipes to see how a full data processing pipeline works with connectors.