In order to use Pathway, one of the first things you need to do is to access the data you want to manipulate. In Pathway, accessing the data is done using connectors.
Pathway comes with connectors which connect to external data sources at input, as well as connectors which output the data outside of Pathway. Before reading any further, make sure you are already familiar with the streaming and static modes of Pathway as the connectors are different depending on the chosen mode.
Before going into more details about the different connectors and how they work, here is a summary of available connectors, grouped by mode and whether they are input or output connectors:
The one you need is not in the table? Don't worry, more are coming and you can always reach out to let us know what you would find helpful!
Connectors support different formats, such as CSV or JSON. The supported formats depend on the connectors. However, all the connectors support the binary format.
In the streaming mode, input connectors wait for incoming updates. Whenever an update is received, it is propagated in the computation graph (more details here) until it reaches the output connectors which output the resulting changes. This is where it becomes interesting: the table created by the input connector, and all the computations based on it, will be automatically updated whenever an update is received (e.g. a new CSV file has been created). All the computations and outputs are automatically updated by Pathway to take into account the updates from the stream, without requiring any operation from your part: this is the magic of Pathway! In practice, the updates are triggered by commits, which ensure the atomicity of each update.
As Pathway deals with never-ending streaming data: the computation runs forever until the process gets killed. This is the normal behavior of Pathway.
Output connectors are the only way to access the results obtained in Pathway in the streaming mode. However, our outputs are not static but are updated with every new received update: we do not output the full table but only the changes. Every change is represented by a row containing the following columns:
- the columns of the table, representing the values which have been modified.
time, representing the logical time of the update: this time is incremented at each new commit.
diff, representing whether this update is an addition or a removal. Only two values are possible, 1 for addition and -1 for removal. An update is represented by two rows, one for deleting the previous value and one to add the new.
In the static mode, the computation is done in batch mode: all the data is read, processed, and output at once. There is no notion of update in this mode. This mode is mainly used for debugging and testing.
In addition of a csv connector which dumps the output table into a csv file, Pathway provides a function which allows to build the graph, to ingest all the data and to print a given table contained in the graph:
Regardless of the mode, a connector can persist the data read along with results of some intermediate computations. This feature can be enabled by specifying persistence config in
pw.run method. If the connector is persistent, the Pathway program will preserve its' auxiliary data and will be able to restart from the place where it was terminated last time. This feature may be useful for you in case you need to do re-runs of a program with some added data, or, you want to have a possibility to survive code crashes.
Let's have a quick example on how to manipulate hand-written table in the static mode:
import pathway as pw
t = pw.debug.table_from_markdown(
| name | age
1 | Alice | 15
2 | Bob | 32
3 | Carole| 28
4 | David | 35 """
[2024-03-01T18:27:00]:INFO:Preparing Pathway computation
| name | age
^YYY4HAB... | Alice | 15
^Z3QWT29... | Bob | 32
^3CZ78B4... | Carole | 28
^3HN31E1... | David | 35
Both modes are incompatible: you cannot mix connectors from streaming and static modes. Indeed, the nature of the data manipulated by both types of connectors is very different: data streams versus static data.
For instance, one might want to print the value of their table
table in their pipeline to check if the values are correct.
pw.debug.compute_and_print(table) is inserted and the computation is run with streaming input connectors.
What do you think would happen?
The program loops. Indeed,
pw.debug.compute_and_print waits for the whole data to be ingested entirely in the computation graph before printing the table.
This makes sense with finite static data but not in the streaming mode where updates are continuously coming!
Be careful when you want to debug your pipeline with static data!
To learn how to use the different connectors, you can see our tutorials:
- CSV connectors
- Database connectors
- Kafka connectors
- Switch from Kafka to Redpanda
- Python input connector
- Python output connector
Connectors are a vital part of Pathway as they define how your data is accessed to and from Pathway. It is important to make the distinction between input/output and streaming/static connectors as they have very different purposes and cannot be mixed.
Pathway provides several connectors, allowing you to connect to your data in different settings in a simple and efficient way. We will regularly update this section and provide more connectors.
You can see one of our recipes to see how a full data processing pipeline works with connectors.