Connectors in Pathway

In order to use Pathway, one of the first things you need to do is to access the data you want to manipulate. In Pathway, accessing the data is done using connectors.

Pathway comes with connectors which connect to external data sources at input, as well as connectors which outputs the data outside of Pathway. Before reading any further, make sure you are already familiar with the streaming and static modes of Pathway as the connectors are different depending on the mode.

Before going into more details about the different connectors and how they work, here is a summary of the available connectors, grouped by mode and whether they are input or output connectors:

Streaming modeStatic mode
Input connectorsKafka
CSV
Debezium
Amazon S3
Plain text
Markdown
Pandas
CSV
Ouput connectorsKafka
PostgreSQL
CSV
pw.debug.compute_and_print
CSV

The one you need is not in the table? Don't worry, more are coming!

Streaming mode connectors

In the streaming mode, input connectors wait for incoming updates. Whenever an update is received, it is propagated in the computation graph (more details here) until it reaches the output connectors which output the resulting changes. This is where it becomes interesting: the table created by the input connector, and all the computations based on it, will be automatically updated whenever an update is received (e.g. a new csv file has been created). All the computations and outputs are automatically updated by Pathway to take into account the updates from the stream, without requiring any operation from your part: this is the magic of Pathway! In practice, the updates are triggered by the reception of a message containing *COMMIT*: this message ensures the atomicity of each update.

As Pathway deals with never-ending streaming data: the computation runs forever until the process got killed. This is the normal behavior of Pathway.

Output connectors are the only way to access the results obtained in Pathway in the streaming mode. However, our outputs are not static but are updated with every new received update: we do not output the full table but only the changes. Every change is represented by a row containing the following columns:

  • the columns of the table, representing the values which have been modified.
  • time, representing the logical time of the update: this time is incremented at each new commit.
  • diff, representing whether this update is an addition or a removal. Only two values are possible, 1 for addition and -1 for removal. An update is represented by two rows, one for deleting the previous value and one to add the new.

To see how to setup a realtime streaming application with Pathway, you can try our examples with csv and Kafka input connectors.

Connectors in static mode

In the static mode, the computation is done in batch mode: all the data is read, processed, and output at once. There is no notion of update in this mode. This mode is mainly used for debugging and testing.

In addition of a csv connector which dumps the output table into a csv file, Pathway provides a function which allows to build the graph, to ingest all the data and to print a given table contained in the graph: pw.debug.compute_and_print.

Example

Let's have a quick example on how to manipulate hand written table in the static mode:

import pathway as pwt = pw.debug.table_from_markdown(    """    | name  | age 1  | Alice | 15 2  | Bob   | 32 3  | Carole| 28 4  | David | 35 """)pw.debug.compute_and_print(t)
            | name   | age
^2TMTFGY... | Alice  | 15
^YHZBTNY... | Bob    | 32
^SERVYWW... | Carole | 28
^8GR6BSX... | David  | 35

Compatibility issues

Both modes are incompatible: you cannot mix connector from streaming and static mode. Indeed, the nature of the data manipulated by both types of connectors is very different: data streams versus static data.

For instance, one might want to print the value of her table table in hers pipeline to check if the values are correct. Between two select, a pw.debug.compute_and_print(table) is inserted and the computation is run with streaming input connectors.

What do you think would happen?

The programs loops. Indeed, pw.debug.compute_and_print waits for the whole data to be ingested entirely in the computation graph before printing the table. This makes sense with finite static data but not in the streaming mode where updates are continuously coming!

Be careful when you want to debug your pipeline with static data!

Tutorials

To learn how to use the different connectors, you can see our tutorials:

Conclusion

Connectors are a vital part of Pathway as they define how your data is accessed to and from Pathway. It is important to make the distinction between input/output and streaming/static connectors as they have very different purposes and cannot be mixed.

Pathway provides several connectors, allowing you to connect to your data in different settings in a simple and efficient way. We will regularly update this section and provide more connectors.

You can see one of our recipe to see how a full data processing pipeline works with connectors.

Olivier Ruas

Algorithm and Data Processing Magician