Connectors in Pathway
In order to use Pathway, one of the first things you need to do is to access the data you want to manipulate. In Pathway, accessing the data is done using connectors.
Pathway comes with connectors which connect to external data sources at input, as well as connectors which output the data outside of Pathway. Before reading any further, make sure you are already familiar with the streaming and static modes of Pathway as the connectors are different depending on the chosen mode.
Before going into more details about the different connectors and how they work, here is a summary of available connectors, grouped by mode and whether they are input or output connectors:
| | Streaming mode | Static mode |
|---|---|---|
| Input connectors | Airbyte, Amazon S3, CSV, Debezium, Delta Lake, File System, Google Drive, http, JSON Lines, Kafka, NATS, MinIO, MongoDB, Plain text, Python, Redpanda, SharePoint, SQLite | Amazon S3, CSV, Delta Lake, File System, Google Drive, JSON Lines, Markdown, MinIO, Pandas |
| Output connectors | BigQuery, CSV, Delta Lake, Elastic Search, File System, Google PubSub, http, JSON Lines, Kafka, NATS, Logstash, MongoDB, PostgreSQL, Redpanda, Slack | CSV, pw.debug.compute_and_print, pw.debug.compute_and_print_update_stream |
The one you need is not in the table? Don't worry, more are coming and you can always reach out to let us know what you would find helpful!
Generation of artificial data streams
Obtaining a real data stream for testing your application can be challenging in some scenarios.
Fortunately, you can use our demo module to simulate an incoming data stream.
This module lets you create custom data streams, either from scratch or from a CSV file, so that you can test and experiment with real-time data streams.
Formatting
Connectors support different formats, such as CSV or JSON. The supported formats depend on the connectors. However, all the connectors support the binary format.
Streaming mode connectors
In the streaming mode, input connectors wait for incoming updates. Whenever an update is received, it is propagated through the dataflow (more details here) until it reaches the output connectors, which output the resulting changes. This is where it becomes interesting: the table created by the input connector, and all the computations based on it, are automatically updated whenever an update is received (e.g. a new CSV file has been created). Pathway updates all the computations and outputs to take the stream's updates into account, without requiring any operation on your part: this is the magic of Pathway! In practice, the updates are triggered by commits, which ensure the atomicity of each update.
Because Pathway deals with never-ending streaming data, the computation runs forever until the process is killed. This is the normal behavior of Pathway.
Output connectors are the only way to access the results obtained in Pathway in the streaming mode. However, the outputs are not static: they are updated with every received update, and Pathway outputs only the changes rather than the full table. Every change is represented by a row containing the following columns:
- the columns of the table, representing the values which have been modified.
- time, representing the logical time of the update: this time is incremented at each new commit.
- diff, representing whether this update is an addition or a removal. Only two values are possible, 1 for addition and -1 for removal. An update is represented by two rows, one removing the previous value and one adding the new one.
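To make the change representation described above concrete, here is a small plain-Python illustration. It does not use Pathway and is not how Pathway computes its output; the function `diff_rows` and the sample data are hypothetical. It only shows how moving from one table state to the next translates into rows carrying `time` and `diff`.

```python
_MISSING = object()  # marks a key that is absent from a table state


def diff_rows(old, new, time):
    """Return the change rows (key, value, time, diff) turning `old` into `new`."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before = old.get(key, _MISSING)
        after = new.get(key, _MISSING)
        if before == after:
            continue  # unchanged rows produce no output
        if before is not _MISSING:
            changes.append((key, before, time, -1))  # removal of the old value
        if after is not _MISSING:
            changes.append((key, after, time, 1))  # addition of the new value
    return changes


old_state = {"Alice": 15, "Bob": 32}
new_state = {"Alice": 16, "Bob": 32, "Carole": 28}

for change in diff_rows(old_state, new_state, time=2):
    print(change)
# ('Alice', 15, 2, -1)
# ('Alice', 16, 2, 1)
# ('Carole', 28, 2, 1)
```

The update of Alice's age appears as two rows (a removal and an addition), the insertion of Carole as a single addition, and the unchanged row for Bob is not output at all.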
To see how to set up a real-time streaming application with Pathway, you can try our examples with CSV and Kafka input connectors.
Connectors in static mode
In the static mode, the computation is done in batch mode: all the data is read, processed, and output at once. There is no notion of update in this mode. This mode is mainly used for debugging and testing.
In addition to a CSV connector, which dumps the output table into a CSV file, Pathway provides a function that builds the graph, ingests all the data, and prints a given table contained in the graph: pw.debug.compute_and_print.
Persistence in connectors
Regardless of the mode, a connector can persist the data it reads along with the results of some intermediate computations. This feature is enabled by specifying a persistence config in the pw.run method. If the connector is persistent, the Pathway program preserves its auxiliary data and can restart from the place where it was last terminated. This is useful if you need to re-run a program with some added data, or if you want the program to survive crashes.
Example
Here is a quick example of how to manipulate a hand-written table in the static mode:
    import pathway as pw

    t = pw.debug.table_from_markdown(
        """
          | name   | age
        1 | Alice  | 15
        2 | Bob    | 32
        3 | Carole | 28
        4 | David  | 35
        """
    )
    pw.debug.compute_and_print(t)

                | name   | age
    ^YYY4HAB... | Alice  | 15
    ^Z3QWT29... | Bob    | 32
    ^3CZ78B4... | Carole | 28
    ^3HN31E1... | David  | 35
Compatibility issues
The two modes are incompatible: you cannot mix connectors from the streaming and static modes. Indeed, the nature of the data manipulated by the two types of connectors is very different: data streams versus static data.
For instance, you might want to print a table named table in the middle of your pipeline to check that its values are correct.
Suppose a call to pw.debug.compute_and_print(table) is inserted between two select operations and the computation is run with streaming input connectors.
What do you think would happen?
The program never finishes. Indeed, pw.debug.compute_and_print waits for all the data to be ingested into the dataflow before printing the table.
This makes sense with finite static data but not in the streaming mode where updates are continuously coming!
Be careful when you want to debug your pipeline with static data!
Tutorials
To learn how to use the different connectors, you can see our tutorials:
- CSV connectors
- Database connectors
- Kafka connectors
- Switch from Kafka to Redpanda
- Python input connector
- Python output connector
- Google Drive connector
Conclusion
Connectors are a vital part of Pathway as they define how your data gets into and out of Pathway. It is important to distinguish between input and output connectors, and between streaming and static connectors: they serve very different purposes, and streaming and static connectors cannot be mixed.
Pathway provides several connectors, allowing you to connect to your data in different settings in a simple and efficient way. We will regularly update this section and provide more connectors.
To see how a full data processing pipeline works with connectors, take a look at one of our recipes.