How Pathway Connectors Work

The Pathway framework lets you ingest data from a wide variety of sources, such as message queues (e.g., Kafka, NATS, MQTT, Kinesis), file systems, and databases. This data is then processed efficiently, thanks to a runtime implemented in Rust.

Although the framework's frontend is exposed via Python, having the core in a Rust runtime provides two major advantages. First, Rust avoids GC pauses and provides predictable latency suitable for long-running streaming workloads. Second, it avoids limitations imposed by the Python Global Interpreter Lock (GIL), enabling truly parallel execution.

This article is intended for anyone interested in the design of Pathway connectors, those considering contributing to the connector ecosystem, and anyone curious about the internal architecture of the framework in general.

You will learn:

  • How Pathway abstracts connector functionality,
  • The basic interfaces that need to be implemented on the Rust engine side to create a custom connector and contribute it to Pathway, and
  • Which mechanisms are automatically provided by Pathway that the connector itself does not need to implement.

Understanding the Requirements

Ingesting the input data is one of the first questions you face when building a computational data pipeline — after all, without input, there is nothing to process. This makes defining a solid foundational architecture essential.

Before thinking about the architecture itself, it's important to understand the general properties that a data processing framework such as Pathway should satisfy. This ensures that the architectural decisions are justified.

First, the framework must support a wide variety of data sources. This means the interface needs a clearly separated layer that handles the interaction with each source system, without duplicating the functionality the sources have in common.

Second, the system should be designed so that data formats can be reused across different sources. For example, messages arriving from Kafka may be in JSON format, external datasets read from files may be in JSON Lines, and dataset objects in S3 could follow similar structures. While not always feasible, when it is, there's a clear benefit in separating the logic of data ingestion from the logic of parsing.

Another essential requirement is support for data persistence. Data persistence allows you to stop a pipeline and restart it later without reprocessing the same data from the very beginning. Upon restart, the system must resume exactly where it left off. This is critical to avoid data inconsistencies: no record should be processed more than once. From a fault-tolerance perspective, the framework must guarantee that, within the committed frontier, nothing is reprocessed and nothing is lost.

Ideally, this support should be generalized: features like logging, retry mechanisms, or backpressure control should be handled automatically by the framework, rather than requiring a custom implementation in every connector.

[Figure: Connector architecture overview]

Reading Data in Batches

Next, it makes sense to understand how data ingestion works in a streaming context. In a streaming pipeline, data flows continuously and can be effectively infinite. Processing each record individually would be extremely inefficient.

Pathway addresses this using computational mini-batches. Each connector allows you to specify an autocommit_duration_ms parameter, which defines the time window for batching incoming data. Once this duration elapses, the system treats all records in that window as a single mini-batch. The batch is processed as a unit, and the results are emitted as one transaction.

[Figure: Data is batched according to the given interval]

This approach allows data to enter the system in controlled portions. Users can adjust the granularity in milliseconds, depending on their needs:

  • For low-latency processing, they might set a small duration (e.g., 10 ms), sending data through the pipeline almost immediately.
  • For more efficient batch processing, they can increase the duration, producing fewer but larger mini-batches and taking advantage of batch-based computation benefits.
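
To make the time-window idea concrete, here is a minimal sketch — not the engine's actual code, and with a hypothetical Record type and poll callback — of collecting one time-bounded mini-batch:

use std::thread;
use std::time::{Duration, Instant};

// Hypothetical record type; in a real pipeline this would be a parsed input row.
struct Record(String);

// Collect records from `poll` into a single mini-batch until the autocommit window elapses.
fn collect_minibatch(
    mut poll: impl FnMut() -> Option<Record>,
    autocommit_duration: Duration,
) -> Vec<Record> {
    let deadline = Instant::now() + autocommit_duration;
    let mut batch = Vec::new();
    while Instant::now() < deadline {
        match poll() {
            Some(record) => batch.push(record),
            // Nothing available right now: wait briefly instead of spinning.
            None => thread::sleep(Duration::from_millis(1)),
        }
    }
    batch
}

With a 10 ms window, the loop closes almost immediately and records flow through the pipeline with minimal delay; with a larger window, the same loop yields fewer but larger batches.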

The time-based approach is not the only way to control batch sizes: backpressure management, described later, can also close a batch before the interval elapses when a maximum number of in-flight entries must be guaranteed.

With these foundational considerations and input data mechanisms in place, the next section will describe the minimal interface — or trait, in Rust terminology — that must be implemented to create a custom input connector for Pathway.

Interfaces

The previous section gave an understanding of the kind of system being built and an intuition for the generality the design requires. Now take a look at a simplified trait for the external storage reader.

trait Reader {
    fn read(&mut self) -> Result<ReadResult, ReadError>;
    fn seek(&mut self, frontier: &OffsetAntichain) -> Result<(), ReadError>;
}

Reading Incoming Data

The act of reading incoming data is kept as simple as possible: the read method returns either the result of a read operation or an error. The reading logic itself may vary significantly from system to system. Some connectors read larger chunks of data at once and buffer them internally, which can be beneficial for many types of sources. Because buffering behavior differs between systems, it is implemented inside the respective connectors, while the read method still returns a single record at a time.

It is worth noting that some libraries for reading data from external systems already perform internal buffering. For example, librdkafka provides its own buffering mechanisms, which can be configured via its parameters: queued.min.messages, queued.max.messages.kbytes and others. In such cases, adding an additional buffering layer on top may be redundant.
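
As a sketch only, with simplified stand-in types (the engine's real types carry more information, and the chunk fetch is stubbed out here), an internally buffered reader might look like this:

use std::collections::VecDeque;

// Simplified stand-ins for the engine types discussed in this article.
struct ReadError;
enum ReadResult {
    Finished,
    Data(Vec<u8>, u64), // payload bytes and a simplified numeric offset
}

struct BufferedReader {
    buffer: VecDeque<Vec<u8>>,
    next_offset: u64,
    exhausted: bool,
}

impl BufferedReader {
    // Fetch a chunk of messages from the external system in one call.
    // Stubbed out here; a real connector would call the client library.
    fn fetch_chunk(&mut self) -> Result<Vec<Vec<u8>>, ReadError> {
        Ok(Vec::new())
    }

    // Return a single record per call, refilling the internal buffer when it runs dry.
    fn read(&mut self) -> Result<ReadResult, ReadError> {
        if self.buffer.is_empty() && !self.exhausted {
            let chunk = self.fetch_chunk()?;
            if chunk.is_empty() {
                // Treat an empty chunk as the end of a static source;
                // a streaming connector would keep polling instead.
                self.exhausted = true;
            }
            self.buffer.extend(chunk);
        }
        match self.buffer.pop_front() {
            Some(payload) => {
                let offset = self.next_offset;
                self.next_offset += 1;
                Ok(ReadResult::Data(payload, offset))
            }
            None => Ok(ReadResult::Finished),
        }
    }
}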

The return value is either an error or an event object, denoted ReadResult in the trait definition above: the high-level outcome of the read operation. Rather than being raw data itself, it denotes the different events that may occur during reading. These events include the following.

pub enum ReadResult {
    Finished,
    NewSource(SourceMetadata),
    FinishedSource { commit_possibility: CommitPossibility },
    Data(ReaderContext, Offset),
}

The data stream may be finished, meaning that no new data is expected. For example, when connectors run in static load mode (i.e., a mode in which the connector ingests a finite, static dataset), the stream is considered finished once all data has been consumed and no updates are being observed.

A pair of events, corresponding to the new and the finished sources, may be produced. To understand these, the notion of a source block can be introduced. In addition to input being segmented into temporal computational mini-batches, the data itself may be grouped into transactional blocks required by the source. For example, when a dataset composed of many files is being read, it is generally desired that changes from each file be applied transactionally. If a file is re-read, rows may be removed and added, and it is desired that both removal and addition phases be treated as a single atomic operation.

To support such behavior, a "new source" event is emitted to indicate that a new block of data has begun, and this block should be treated as a single atomic unit. The transactional semantics of such blocks must not depend on automatic mini-batch commits; if a mini-batch is triggered in the middle of such a block, the data must not be committed yet, and a "finished source" event must be awaited from the connector.

Note also the commit_possibility flag. In a simpler design, this parameter wouldn't be required: once a source finished, its block would immediately be considered complete. For flexibility, however, it is desirable to be able to group multiple blocks together (based on metadata), so that not only the data within a block respects transactional boundaries, but the blocks themselves form an atomic transaction. This prevents blocks that logically belong together from being committed separately. The flag indicates whether committing is allowed and/or required at that moment.

Finally, the simplest event is Data. This event indicates that actual data has been read. All previous variants represent meta-events, whereas this one carries the payload. A ReaderContext is included in the Data event.

[Figure: Several metadata events may occur during reading]

The ReaderContext lets connector authors reuse existing parsing logic. Data arrives in a handful of commonly used forms; raw bytes, for example, are very common (messages from Python connectors, dataset lines, MQTT messages, etc.). Because many messages share a common format, parsing them separately in each connector or pipeline would be wasteful. ReaderContext therefore represents the several possible forms in which data may be handed over.

For example:

  • Connectors processing raw streams may emit raw bytes, so that the user can provide a single parser shared across many connectors.
  • Connectors producing domain-specific structured data may emit events directly, specifying the type (insert/upsert/delete), an optional key, and the value payload.

As a result, connector authors can:

  • produce connectors that merely deliver messages without parsing (assuming a suitable parser may already exist), or
  • produce connectors that deliver fully prepared Pathway rows.

Once a new parser is added, it becomes immediately usable for all connectors that emit compatible contexts.
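
As a purely illustrative sketch of this idea, with stand-in types and only two variants (the engine's actual enum is richer), a ReaderContext-like type could be modeled as:

// Illustrative sketch only; `Value` stands in for the engine's internal value representation.
struct Value;

enum EventType {
    Insert,
    Upsert,
    Delete,
}

enum ReaderContext {
    // Raw payload: parsing is delegated to a shared parser chosen by the user.
    RawBytes(Vec<u8>),
    // Fully prepared event: the connector already provides the type, key, and values.
    PreparedEvent {
        kind: EventType,
        key: Option<Vec<Value>>,
        values: Vec<Value>,
    },
}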

Finally, it should be noted that the Data event also includes an offset, which will be discussed in the next subsection.

Denoting Offsets

Data persistence introduces an additional consideration that hasn't been addressed yet. Persistence, by definition, requires knowing the point in the input from which to resume after a restart, without starting from the beginning. In other words, one needs to know the exact position to jump to during recovery.

However, this position is inherently system-specific and cannot be fully abstracted. For every source that supports persistence, each emitted record must carry some form of positional metadata that indicates where the next read should continue from.

This position can take different forms depending on the system:

  • Kafka: a position consists of the topic name, partition number, and a monotonically increasing message index within that partition.
  • Large file streaming: a position is literally a byte cursor, indicating where the next read should begin.
  • Lakehouse systems (e.g. Delta Lake): a position is a version number, because all changes within a version are conceptually transactional, so keeping the version is sufficient to resume correctly.

[Figure: An example of a frontier based on Kafka offsets for three partitions]

An instance of such a position is called an offset in the codebase. It is a data type that encapsulates the various ways of expressing the current read boundary.

This is why the read method returns not only data, but also an offset. The implementation is free to buffer or not buffer data as it wishes, but every row (or event) it returns must also include an offset. With that, the system ensures it knows where it stopped and where to resume.
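
A sketch of how such a position type might be modeled, using hypothetical variants that mirror the examples above (the engine's actual offset type differs in detail):

// Simplified sketch; the real offset type carries more variants and detail.
enum Offset {
    // Kafka: topic, partition, and the message index within that partition.
    Kafka {
        topic: String,
        partition: i32,
        position: i64,
    },
    // File streaming: a byte cursor pointing at where the next read begins.
    FilePosition {
        path: String,
        bytes_offset: u64,
    },
    // Lakehouse-style sources such as Delta Lake: a table version number.
    DeltaTableVersion(i64),
}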

Seeking

So far, this article has described the read method and how reading produces both content (bytes, tokens, etc.) and an offset, the position marker required for persistence. With this, the system can store checkpoints that represent progress.

Now consider a different scenario: the program starts, discovers a checkpoint in persistent storage, and must restore from it. The restoration process includes several steps, not all of which relate to reading from sources, but connector-level logic must also participate in this process.

During recovery, the reader must not start from the beginning. Instead, it must jump to the position encoded in the checkpoint. This is the purpose of the seek method.

The seek method accepts an offset antichain, which is simply a container holding all relevant offsets. Informally, an offset antichain represents the minimal set of offsets sufficient to describe the current recovery frontier. More formally, an antichain is a subset of a partially ordered set in which no two distinct elements are comparable.

But why is this structure needed at all? If the engine naïvely stored every offset it had seen, the set of offsets would grow without bound, making seeking both wasteful and complicated. Instead, the antichain maintains a minimal representation.

For example, consider Kafka. Reading from a Kafka topic introduces two degrees of offset freedom:

  • partition
  • monotonically increasing index within the partition

To avoid storing more than necessary, offsets can be decomposed into an offset-key and an offset-value. Either may be empty, depending on the source. These pairs are stored in a container structurally similar to a hash map: for each offset-key, it stores the latest offset-value observed.

Thus, the seek method in the described interface has a simple contract: for each offset-key, position the reader at the corresponding offset-value. After this, the reader is ready to resume from the correct position during recovery.
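
For a Kafka-like source, a seek implementation could look roughly like the sketch below; the types are simplified stand-ins, and the antichain is modeled as a plain map from offset-key to offset-value:

use std::collections::HashMap;

// Simplified stand-ins for the purposes of this sketch.
struct ReadError;
type Partition = i32;
type Position = i64;
// The antichain is structurally close to a map: the latest offset-value per offset-key.
type OffsetAntichain = HashMap<Partition, Position>;

struct KafkaLikeReader {
    // Where to resume reading in each partition.
    resume_positions: HashMap<Partition, Position>,
}

impl KafkaLikeReader {
    fn seek(&mut self, frontier: &OffsetAntichain) -> Result<(), ReadError> {
        // For each offset-key (partition), position the reader at the stored offset-value.
        for (partition, position) in frontier {
            // A real connector would also call the client library here,
            // e.g. assign the partition and set its starting position.
            self.resume_positions.insert(*partition, *position);
        }
        Ok(())
    }
}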

With multiple workers, each worker writes its own Write-Ahead Log (WAL) to persistent storage. During recovery, the engine determines the last fully covered transaction by examining the WALs of all workers: it may need to read the offsets saved by every worker and merge them to obtain the boundaries up to which reading had previously progressed, taking the distributed execution into account.

Since the work may be partitioned differently after a restart (for example, if the number of workers has changed), the complete frontier must be obtained by combining the partial frontiers from all of the workers. A standard offset antichain merging implementation is provided for the connector, but if the merging logic for a certain offset type is different, it can be overridden.

Intermediate Summary

The above describes a simplified, yet fully functional, architecture of input connectors in Pathway.

Suppose you want to implement a connector that reads data from a new source. At a minimum, you need to:

  1. Implement a class for your connector that implements the Reader trait. The model described above is intentionally simplified. In the real trait, there are additional methods (e.g., related to retry behavior or how the source is logged), but they are optional to override, as default implementations are provided.
  2. Implement the read method, which returns read events and offsets. Each read result that carries data must wrap it in one of the available ReaderContext formats.
  3. Implement the seek method, which uses the set of offsets to position the reader at the correct location for recovery.

[Figure: The full cycle of reading in Pathway]

Once this is done, you have two options:

  • You can combine the result with one of the existing parsers in the system.
  • Or you can emit fully parsed key/value pairs directly from the read method.

Note that implementing new parsers is also possible: implement the Parser trait, which contains a parse method. It receives a ReaderContext as input, and your task is to return the parsed data as a key and a value.
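
As a sketch under simplified stand-in types (the real trait has a richer signature), a custom parser that splits a raw line on a tab character might look like this:

// Simplified stand-ins for this sketch.
struct ParseError;
type Key = String;
type Value = String;
enum ReaderContext {
    RawBytes(Vec<u8>),
}

trait Parser {
    fn parse(&mut self, context: &ReaderContext) -> Result<(Key, Value), ParseError>;
}

// A toy parser that splits a raw line on the first tab character.
struct TabSeparatedParser;

impl Parser for TabSeparatedParser {
    fn parse(&mut self, context: &ReaderContext) -> Result<(Key, Value), ParseError> {
        match context {
            ReaderContext::RawBytes(bytes) => {
                let line = String::from_utf8(bytes.clone()).map_err(|_| ParseError)?;
                let (key, value) = line.split_once('\t').ok_or(ParseError)?;
                Ok((key.to_string(), value.to_string()))
            }
        }
    }
}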

Output Connectors

For completeness, it is worth describing how output connectors are implemented in Pathway. Note that this process is completely independent from input connectors: you may implement only the input part or only the output part for a given system. They do not depend on each other.

The design and implementation of output connectors are generally simpler, because there is no offset tracking. All that is required from an output connector is to take the data it receives and persist it to an external system.

A simplified trait for an output connector might look as follows.

pub trait Writer: Send {
    fn write(&mut self, data: FormatterContext) -> Result<(), WriteError>;
    fn flush(&mut self) -> Result<(), WriteError> {
        Ok(())  // Assumes there is no buffering. Needs to be overridden in the buffering scenarios.
    }
}

There are two main methods worth implementing: write and flush.

write method

This method receives data ready for output. It may either write the data immediately to the target system or buffer it.

Buffering is often a good idea. For example, when writing to a database, it can be more efficient to collect multiple rows and write them in a single transaction. This mechanism should be implemented directly in the write method. Some libraries already do buffering on their end: in this case it may be redundant to implement a buffer yourself in the connector.

If buffering is used, whether implemented by you directly or provided by an existing mechanism, you also need to make sure all the data eventually reaches the target destination. This is done using the flush method.

flush method

The flush method is optional but strongly recommended: as you can see, by default, the implementation is empty. It is invoked to ensure that all buffered data is persisted to the target system, similar in semantics to file system flush operations.

If your write method writes data immediately and guarantees its availability, flush may not be necessary. However, when buffering is used, flush is the place to commit all buffered data to the external system.

It may be asked how often flush is called. There is no need for a separate thread to trigger it at intervals. In Pathway, flush is called once per transactional mini-batch. For example, when data for a given time T has been processed and all outputs for that mini-batch are ready, flush is invoked to persist all data for that mini-batch.

Thus, the frequency of flush calls corresponds directly to the frequency of transactional mini-batch updates described in the previous sections.
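
Putting the two methods together, here is a sketch (with simplified stand-in types, and the actual commit to the external system stubbed out) of a writer that buffers payloads in write and commits them in flush, once per mini-batch:

// Simplified stand-ins for this sketch; the real FormatterContext and error types differ.
struct WriteError;
struct FormatterContext {
    payload: Vec<u8>,
}

pub trait Writer: Send {
    fn write(&mut self, data: FormatterContext) -> Result<(), WriteError>;
    fn flush(&mut self) -> Result<(), WriteError> {
        Ok(())
    }
}

struct BufferingWriter {
    buffer: Vec<Vec<u8>>,
}

impl BufferingWriter {
    // Stubbed out: a real connector would send the batch to the external system here,
    // e.g. as a single database transaction.
    fn commit_batch(&mut self, batch: Vec<Vec<u8>>) -> Result<(), WriteError> {
        let _ = batch;
        Ok(())
    }
}

impl Writer for BufferingWriter {
    fn write(&mut self, data: FormatterContext) -> Result<(), WriteError> {
        // Buffer the prepared payload instead of writing it immediately.
        self.buffer.push(data.payload);
        Ok(())
    }

    fn flush(&mut self) -> Result<(), WriteError> {
        // Called once per transactional mini-batch: commit everything buffered so far.
        let batch = std::mem::take(&mut self.buffer);
        self.commit_batch(batch)
    }
}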

Recap: What it Takes to Implement a Native Output Connector

So suppose you want to implement an output connector. Here is what you need:

  1. Implement your connector class implementing the Writer trait, including at least the write method and, most likely, the flush method. The write method receives a context and saves it to a database or other external storage. Alternatively, you may buffer data in write and commit it to the external system in flush.
  2. Other methods from the trait may also be implemented. These are not required but can control retry logic and error handling policies in general, logging, or other behaviors if needed.
  3. Similar to reading, the context is simply an object containing data, prepared in one of the available formats. You may implement a custom formatter by implementing the Formatter trait. This split is similar to the one you saw in the input case: it allows you to apply common transformations across a group of connectors before writing the output, without duplicating the logic in each connector — for example, creating JSON documents before they're sent anywhere.

[Figure: The full cycle of output in Pathway]

In brief, you can turn a buffering Rust wrapper over an external system into a functional and performant connector.

Python Bridge

Connectors in Pathway are written in Rust, as is the Pathway runtime. This raises the question: how are these Rust connectors exposed to Python?

Pathway uses a bridging layer implemented with PyO3, which allows certain Rust classes to be visible and usable from Python. This layer lives on the Rust side in the Python API module, where you can find all the wrappers that are exposed through special decorators and can be used from Python. On the Python side, the interface for accessing these wrappers is described in a separate .pyi file.

These wrapper classes:

  • Store user-provided parameters in Python.
  • Pass the parsed data to Rust to create the actual objects used by the Rust engine for reading or writing data.

Finally, the full cycle across the connector layers can be summarized:

  1. User-facing Python API. The user creates an object in Python (e.g., pw.my_source.read(...) or pw.my_destination.write(...)) and provides all required connection parameters. This step happens entirely on the Python side.
  2. Wrapper construction. Inside the connector, two Rust-side wrapper classes (exposed to Python via PyO3) are instantiated: one for the data storage parameters and one for the format parameters. These classes must contain everything needed to connect to your system.
  3. Rust-side parsing. The Rust engine parses the wrapper class and recognizes that it corresponds to your custom storage type or parser/formatter.
  4. Connector execution. Finally, the Rust class you implemented is instantiated and performs the logic you defined for reading or writing data.

It should be noted that performing calls across different runtimes using tools such as PyO3 or FFI bindings in general is a relatively expensive operation for the hot path, as it involves costly cross-boundary calls and, in the case of Python, Global Interpreter Lock (GIL) acquisition. However, this integration does not occur on the hot path. Instead, it is only required when constructing the connector object before the data processing pipeline even starts. Once the data pipeline is running, no interaction with the foreign runtime occurs, so it does not interfere with the hot data path.

It is also important to note that the framework contains scenarios in which Python may be called from different execution threads. A typical example is when a User-Defined Function (UDF) is implemented in Python. In these scenarios, cross-runtime calls are unavoidable. However, they are not part of the typical connector execution path, and, furthermore, such calls are batched and organized into chunks. For these batched requests, the GIL is acquired once, and the entire batch is executed under the same lock. This significantly reduces the number of lock acquisitions and expensive cross-runtime transitions.

Putting Everything Together

The sections above describe how classes are implemented to either read data from a source or write data to a destination. In principle, this is already sufficient for implementing a native connector. However, it is also useful to understand how this works together inside the platform.

A Pathway program consists of several parts, and for the purposes of this article two of them are relevant:

  • A connector, which receives data with a read() method or sends it out with a write(...) method.
  • A computational worker, which operates on computation graphs constructed by the user and translated into Differential Dataflow. This part is purely CPU-bound and will be described in more detail in a separate article.

The computational worker schedules computations of a given graph and can perform these computations for a specified time interval. The graph itself must receive some input data, and this receiving of input data is also part of the graph.

Here a problem arises: there are many different sources, and their robustness is not controlled by the engine. For example, one may request data from a remote database, but it may be overloaded and not respond. It would be wrong to mix such uncontrollable operations into a strictly computational and timed task that the system controls internally.

For this reason, connectors run in separate threads. The mechanism works as follows: a separate thread is created for the source, and that thread continuously receives data from the source and pushes it into a queue. The computational worker then takes elements from the queue, if it is not empty, and processes them, performing the required computations. This way, computational threads are dedicated to computations only, executing a graph that is predefined and fully controlled.
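
The pattern itself is a standard producer/consumer setup; here is a conceptual sketch (not the engine's code) using a channel between an ingestion thread and a worker loop:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (sender, receiver) = mpsc::channel::<String>();

    // Connector thread: talks to the (possibly slow) external source
    // and pushes whatever it reads into the queue.
    let producer = thread::spawn(move || {
        for i in 0..5 {
            // Stand-in for a blocking call to the external system.
            thread::sleep(Duration::from_millis(100));
            sender.send(format!("record-{i}")).expect("worker hung up");
        }
        // Dropping the sender closes the channel and lets the worker finish.
    });

    // Computational worker: only consumes from the queue, never blocks on the source itself.
    for record in receiver {
        println!("processing {record}");
    }

    producer.join().expect("connector thread panicked");
}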

A similar scheme is used for output tasks as well: when data needs to be sent somewhere, the process of sending data may take a significant amount of time. Therefore, these tasks also run in a separate thread that performs that sending logic, while being connected to the computational worker by a queue. This isolates the connector and protects computation from unstable external storages.

If you're interested in the implementations of the current connectors, or want to draft a new native connector and need an example to start with, feel free to check the trait Reader and trait Writer implementors in Pathway's open-source repository.

I/O Performance

Before discussing performance benchmarks, it is important to note that input performance depends on how the data source is configured. If the upstream system is slow, performance can be partially improved through Pathway's internal algorithms, but the bottleneck may still reside in the storage itself.

To get an estimate for an I/O-bound problem, you can refer to the Pathway performance benchmarks that compare the framework to other stream-processing systems such as Spark Streaming and Flink. One of the benchmark scenarios was the classical WordCount task, which measures the frequency of word occurrences. The benchmark article also describes the setup, the hardware used, and the methodology for end-to-end latency measurement.

Results

On this task, Pathway demonstrated a throughput of up to 2 million messages per second while keeping the 95th percentile latency below 50 ms, a result that surpassed the performance of other compared systems. In all of the benchmarked systems, four pinned cores were used for the computation processes.

Notes on Tail Latency Improvements

Earlier microbenchmarks showed issues with increasing tail latency due to frequent context switches between the streaming thread and the compute worker thread when a small autocommit duration was used. To address this, a mechanism was implemented to minimize the number of such switches when the incoming data rate is low. As a result, users no longer need to manually tune autocommit behavior to mitigate tail latency effects.

Tooling Automatically Provided for Each Connector

The connector architecture is designed to be quite minimalist: only the read and seek methods are needed for reading, and the write and flush methods for writing, plus a few optional helper methods. The benefit of this minimalism is that a common set of tools can be built once and made available to all connectors, including those implemented in this manner. The sections below describe the mechanisms that are shared.

Backpressure Control

With the input mechanisms described, one can see how backpressure control is put in place. Backpressure is configured with the max_backlog_size parameter, present in the parameters of every input connector. This parameter indicates the maximum number of data entries (in other words, rows) that may be participating in computations at any given time.

By default, this parameter is set to None. This means that reading is not limited. However, this also implies that if there are spikes in input volume — for example, if the program initially processes a large number of data batches and only later receives a small number of incremental updates — memory usage may grow without bound. Therefore, in many cases it's a good idea to restrict it.

The correct value of this parameter depends on the use case. There is no universal setting, because there is a significant difference between, for example, a program that performs a full join with quadratic memory complexity and a program that simply reads data, applies a simple map or filter, and forwards it. For this reason, the parameter must be tuned based on workload characteristics.

At the engine level, the mechanism works as follows. As stated before, incoming data is divided into transactional mini-batches. The system tracks which of these batches are currently in computation. A batch is considered "in computation" if at least one of its related outputs hasn't been produced yet. Once all expected outputs for that batch have been emitted, the batch is considered to have exited the engine, as it no longer participates in any computation.

During execution, the total size of all batches currently in computation is tracked. If introducing a new batch would cause this total to exceed the configured limit, reading is paused. Reading resumes only when at least one batch exits the engine and frees capacity for new data to be admitted.
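
Conceptually, the admission check can be sketched as follows; the names are illustrative and not taken from the engine:

// A sketch of the admission logic; names are illustrative.
struct BacklogTracker {
    max_backlog_size: Option<usize>, // None means reading is never paused.
    in_flight_rows: usize,           // total size of all batches still in computation
}

impl BacklogTracker {
    // Should reading pause before admitting a batch of `batch_size` rows?
    fn must_wait(&self, batch_size: usize) -> bool {
        match self.max_backlog_size {
            None => false,
            Some(limit) => self.in_flight_rows + batch_size > limit,
        }
    }

    fn on_batch_admitted(&mut self, batch_size: usize) {
        self.in_flight_rows += batch_size;
    }

    // Called once all expected outputs of a batch have been emitted.
    fn on_batch_finished(&mut self, batch_size: usize) {
        self.in_flight_rows -= batch_size;
    }
}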

Observability

Pathway provides services and tools for observability, allowing users to monitor how the system operates and detect issues or failures. In particular, dashboards are available on top of a Grafana and OpenTelemetry combination, and browser-based tools can be used to monitor the computational graph as a whole and its performance over time.

These observability tools maintain several internal metrics to help understand the behavior of the system. For example, the system tracks and logs the number of entries ingested and emitted in each five-second period. This is available in all kinds of output: the native console dashboard, the standard Pathway logs, as well as the more advanced tools. These numbers help you quickly understand whether input is arriving and at what rate.

It is also important to pay attention to the latency metrics, which indicate the difference between the wall-clock time and the logical mini-batch time at which the earliest not-yet-processed data was ingested. In other words, these metrics track the time lag between the arrival of data and its processing in the engine.

If you are trying to understand whether the pipeline is falling behind, the most important metric is latency. Under normal operation, latency should remain bounded, because the system is able to process incoming data before the next batch arrives. If latency grows, it indicates that requests are accumulating within the system. Without the limits described in the previous section, this can lead to increased memory usage.

It is also normal to observe temporary spikes in latency when a surge of input data occurs. As the engine processes this surge, latency gradually decreases, and memory is released progressively as data is processed and freed: the resource usage will adapt dynamically to workload fluctuations.

Persistence

Since Pathway is designed to operate online for extended periods and process data over days or months, it is important to minimize state recomputation and allow recovery after restarts. In practice, this means that after a restart, the system only needs to load the latest checkpoint and continue execution from the point where it left off.

Checkpointing is implemented on top of a set of key-value storage systems: a new storage can generally be added as a persistence backend if it offers a key-value interface and certain guarantees. Checkpointing can be split into two parts:

  • The checkpoint captures the internal state of computations. This part happens outside of the connector logic.
  • The checkpoint also includes the positions in all of the input sources where the computation has left off, so that upon restart, a seek to the corresponding positions is possible.

The latter is achieved by collecting all offsets returned by the read method, then selecting an antichain: the minimal set of offsets that guarantees consistency (briefly mentioned in the Seeking section), such that removing any element from this set would change the input frontier and make it incorrect.

These offsets are then stored in the chosen persistent storage.
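
Under the offset-key/offset-value decomposition described in the Seeking section, building that minimal set is, as a sketch, a fold that keeps the latest offset-value seen for each offset-key:

use std::collections::HashMap;

// Simplified: offset-keys and offset-values as opaque strings and numbers.
type OffsetKey = String;
type OffsetValue = i64;

// Keep only the latest observed offset-value for each offset-key.
fn build_antichain(observed: impl IntoIterator<Item = (OffsetKey, OffsetValue)>)
    -> HashMap<OffsetKey, OffsetValue>
{
    let mut frontier: HashMap<OffsetKey, OffsetValue> = HashMap::new();
    for (key, value) in observed {
        let slot = frontier.entry(key).or_insert(value);
        if value > *slot {
            *slot = value;
        }
    }
    frontier
}

For offsets with a natural order, such as the Kafka example, merging the partial frontiers written by several workers reduces to the same fold applied to their combined offsets.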

The system processes data in operational mini-batches. Incoming input is continuously cut into these mini-batches, so whenever the program terminates, for whatever reason, some mini-batches have been fully processed, with their output reflected in the destination storage, and some have not. Only when a batch is fully processed does the system write the corresponding checkpoint to persistent storage.

The checkpoint stores:

  1. The difference between the current and the previous saved computational state.
  2. A set of offset antichains for all data that has been read, exactly up to the point where the corresponding mini-batch stopped.

To save the complex diffs and ensure integrity, a two-phase commit is used: this guarantees that the checkpoint chosen for recovery has been written in full.

You can learn how to use persistence in Pathway here.

Backtesting with Multiple Actors

In complex data streams, it is often necessary to run experiments. For example, after changing some logic in a program, you may want to run it on historical data to see how it would have behaved if the data had arrived in real time. This allows evaluating program results on historical inputs.

This is possible using mechanisms that are common to all connectors, called synchronization groups.

By default, you could simply restart your program and begin reading from the start. However, this is not always feasible, especially when multiple sources or connectors must be synchronized. Event rates across sources may vary significantly:

  • Some sources may produce tens or hundreds of thousands of events per second.
  • Other sources may produce events only once per hour.

If the pipeline is run without synchronization, sources with rare events will be read almost immediately, while high-frequency sources will lag behind, resulting in misaligned data: historical records from high-volume sources could end up being processed together with much later data from low-volume sources, which are small and therefore fast to load.

You can address this by:

  • Assigning connectors to a synchronization group.
  • Specifying a time interval, ensuring that no source advances ahead of the others by more than the allowed delay.

For more advanced cases, priority mechanisms exist to define the order in which new values from connectors are processed, ensuring that some events always occur before others.

Finally, a unified approach can be defined for a seamless switch from the backtest to the stream of real-time data. To handle the switch gracefully, you can also specify idle durations for each source in the group. For example, if a high-priority source temporarily has no data, it can be excluded from the group until new data arrives, preventing unnecessary computational stalling for the other sources.

These mechanisms allow testing program changes against realistically scaled data, consistent with real-world scenarios, and enable seamless switching from processing historical data to processing live, real-time data.

Conclusion

At this point, you have a clear understanding of how connectors are integrated into Pathway and how the platform handles them. The information from the first part of this overview should already be enough for you to write and contribute your own native Pathway connector, if you want to.

Even if implementing something new in the system is not a frequent scenario, the article also covers the main questions you may have about I/O in the framework.

Broadly, it has covered:

  • How the system ingests data,
  • What capabilities it provides,
  • General recommendations, and
  • The architectural principles it follows.

If you have any questions, feel free to open an issue on our GitHub or reach out via our Discord!