pw.io.kafka

class SchemaRegistryHeader(key, value)

Represents an additional header to be used in Confluent Schema Registry HTTP requests.

Parameters
- key (str) – The header key.
- value (str) – The header value.
Returns
The constructed header object

class SchemaRegistrySettings(urls, token_authorization=None, username=None, password=None, headers=None, proxy=None, timeout=None)

Connection settings for the Confluent Schema Registry.

Parameters
- urls (list[str]) – A list of URLs for connecting to the schema registry. If multiple URLs are provided, they will be used in the specified order.
- token_authorization (str | None) – Token used for token-based authorization.
- username (str | None) – Username for simple authorization.
- password (str | None) – Password for simple authorization. If specified, a username must also be provided.
- headers (list[SchemaRegistryHeader] | None) – Additional headers to include in HTTP requests to the schema registry.
- proxy (str | None) – Proxy address for registry requests.
- timeout (timedelta | None) – Timeout duration for network requests, in seconds.
Returns
The configuration object.

read(rdkafka_settings, topic=None, *, schema=None, mode='streaming', format='raw', schema_registry_settings=None, debug_data=None, autocommit_duration_ms=1500, json_field_paths=None, autogenerate_key=False, with_metadata=False, start_from_timestamp_ms=None, parallel_readers=None, name=None, max_backlog_size=None, **kwargs)

Generalized method to read data from the given topic in Kafka.

Three formats are currently supported: "plaintext", "raw", and "json". With the "raw" format, the key and the payload are read from the topic as raw bytes and used in the table “as is”. With the "plaintext" format, they are decoded from UTF-8 into strings. In both cases, the table consists of a primary key and two columns, "key" and "data", holding the key and the payload of each message.

If "json" is chosen, the connector first parses the payload of the message according to the JSON format and then creates the columns corresponding to the schema defined by the schema parameter. The values of these columns are taken from the respective parsed JSON fields.
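The difference between the three formats can be illustrated without Kafka at all. The following is a minimal sketch in plain Python, not Pathway's implementation: the helper `interpret` and the sample message are hypothetical and only mimic how a key and a payload would be interpreted under each format.

```python
import json


def interpret(key: bytes, payload: bytes, fmt: str, schema_fields=None):
    """Mimic how a message becomes row values for each format (illustration only)."""
    if fmt == "raw":
        # Bytes are kept untouched.
        return {"key": key, "data": payload}
    if fmt == "plaintext":
        # Bytes are decoded from UTF-8 into strings.
        return {"key": key.decode("utf-8"), "data": payload.decode("utf-8")}
    if fmt == "json":
        # The payload is parsed as JSON; the columns follow the schema fields.
        parsed = json.loads(payload)
        return {field: parsed[field] for field in schema_fields}
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical message, used only for this sketch.
key = b"42"
payload = b'{"owner": "Alice", "pet": "cat"}'

raw_row = interpret(key, payload, "raw")
text_row = interpret(key, payload, "plaintext")
json_row = interpret(key, payload, "json", schema_fields=["owner", "pet"])
```

In the "raw" and "plaintext" cases the row has the same two columns and only the value types differ; in the "json" case the columns come from the schema.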

Parameters
- rdkafka_settings (dict) – Connection settings in the format of librdkafka.
- topic (str | list[str] | None) – Name of the Kafka topic, or a list of topic names, from which the data should be read.
- schema (type[Schema] | None) – Schema of the resulting table.
- mode (Literal['streaming', 'static']) – Specifies how the engine retrieves data from the topic. The default value is "streaming", which means the engine will constantly wait for new messages, process them as they arrive, and send them into the engine. Alternatively, if set to "static", the engine will only read and process the data that is already available at the time of execution.
- format (Literal['plaintext', 'raw', 'json']) – format of the input data, "raw", "plaintext", or "json".
- schema_registry_settings (SchemaRegistrySettings | None) – settings for connecting to the Confluent Schema Registry, if this type of registry is used.
- debug_data – Static data that replaces the original data when debug mode is active.
- autocommit_duration_ms (int | None) – the maximum time between two commits. Every autocommit_duration_ms milliseconds, the updates received by the connector are committed and pushed into Pathway’s computation graph.
- json_field_paths (dict[str, str] | None) – If the format is "json", this parameter maps field names to paths within the JSON document. For each field that requires such a mapping, provide an entry in the form <field_name>: <path to be mapped>, where the path is a JSON Pointer (RFC 6901).
- autogenerate_key (bool) – If True, Pathway automatically generates a unique primary key for each entry read. Otherwise, it first tries to use the key from the message. This parameter is used only if the format is “raw” or “plaintext”.
- with_metadata (bool) – When set to True, the connector will add an additional column named _metadata to the table. This column will be a JSON field. It’ll contain an optional field timestamp_millis denoting the UNIX timestamp of a record in milliseconds, if available. It will also contain fields topic, partition and offset denoting the topic, partition and offset respectively, that correspond to the Kafka message that produced this row.
- start_from_timestamp_ms (int | None) – If defined, the read starts from entries with the given timestamp in the past, specified in milliseconds.
- parallel_readers (int | None) – Number of copies of the reader to run in parallel. If the number is not specified, min{pathway_threads, total number of partitions} is used. The number can’t be greater than the number of Pathway engine threads; if it is, it is reduced to the number of engine threads.
- name (str | None) – A unique name for the connector. If provided, this name will be used in logs and monitoring dashboards. Additionally, if persistence is enabled, it will be used as the name for the snapshot that stores the connector’s progress.
- max_backlog_size (int | None) – Limit on the number of entries read from the input source and kept in processing at any moment. Reading pauses when the limit is reached and resumes as processing of some entries completes. Useful with large sources that emit an initial burst of data to avoid memory spikes.
Returns
Table – The table read.
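Because start_from_timestamp_ms is given in milliseconds, a relative moment such as "one hour ago" has to be converted first. Below is a minimal sketch using only the standard library. The helper name `millis_ago` is hypothetical and not part of Pathway, and the sketch assumes the parameter is a UNIX timestamp, like timestamp_millis in the _metadata column.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def millis_ago(delta: timedelta, now: Optional[datetime] = None) -> int:
    """Return the UNIX timestamp, in milliseconds, of the moment `delta` before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    # timestamp() returns seconds as a float; scale to milliseconds and round.
    return round((now - delta).timestamp() * 1000)


# Value that could be passed as start_from_timestamp_ms to read entries from the last hour.
start_ms = millis_ago(timedelta(hours=1))
```

Using a timezone-aware `datetime` avoids surprises from the local timezone of the machine running the pipeline.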

When using the format "raw" or "plaintext", the connector will produce a two-column table: all the payloads are saved into a column named data, while the keys are saved into a column key.

For other formats, the schema is required and defines the columns.
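The rule for parallel_readers given in the parameter list can be restated as a small function. This is only a sketch of the documented rule, not the engine's code; the function and argument names are hypothetical.

```python
from typing import Optional


def effective_parallel_readers(
    requested: Optional[int], engine_threads: int, partitions: int
) -> int:
    """Number of reader copies according to the documented parallel_readers rule."""
    if requested is None:
        # Default: min{pathway_threads, total number of partitions}.
        return min(engine_threads, partitions)
    # An explicit value is capped by the number of engine threads.
    return min(requested, engine_threads)


# With 4 engine threads and 12 partitions:
default_readers = effective_parallel_readers(None, engine_threads=4, partitions=12)
capped_readers = effective_parallel_readers(8, engine_threads=4, partitions=12)
explicit_readers = effective_parallel_readers(2, engine_threads=4, partitions=12)
```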

Example:

Consider a Kafka queue running locally on port 9092. For demonstration purposes, our queue uses simple SASL/PLAIN authentication. You can set up a Kafka cluster with similar parameters in Confluent Cloud or run it locally using Docker or Docker Compose. The rdkafka settings in our example will look as follows:

import os
rdkafka_settings = {
    "bootstrap.servers": "localhost:9092",
    "security.protocol": "sasl_ssl",
    "sasl.mechanism": "PLAIN",
    "sasl.username": os.environ["KAFKA_USERNAME"],
    "sasl.password": os.environ["KAFKA_PASSWORD"]
}

To connect to the topic “animals” and accept messages, the connector must be used as follows, depending on the format:

Raw version:

import pathway as pw
t = pw.io.kafka.read(
    rdkafka_settings,
    topic="animals",
    format="raw",
)

All the payload data will be accessible in the column data; the keys of the messages will be stored in the column key.

JSON version:

class InputSchema(pw.Schema):
    owner: str
    pet: str
t = pw.io.kafka.read(
    rdkafka_settings,
    topic="animals",
    format="json",
    schema=InputSchema,
)

For the JSON connector, you can send these two messages:

{"owner": "Alice", "pet": "cat"}
{"owner": "Bob", "pet": "dog"}

This way, you get a table which looks as follows:

pw.debug.compute_and_print(t, include_id=False)

Now consider that the data about pets comes in a more complex form. For instance, each message contains an owner, the kind and name of an animal, and some physical measurements.

The JSON payload in this case may look as follows:

{
    "name": "Jack",
    "pet": {
        "animal": "cat",
        "name": "Bob",
        "measurements": [100, 200, 300]
    }
}

Suppose you need to extract the name of the pet and its height, which is the second element of the measurements array (index 1 in 0-based numbering). You can then use JSON Pointers and configure the connector as follows:

class InputSchema(pw.Schema):
    pet_name: str
    pet_height: int
t = pw.io.kafka.read(
    rdkafka_settings,
    topic="animals",
    format="json",
    schema=InputSchema,
    json_field_paths={
        "pet_name": "/pet/name",
        "pet_height": "/pet/measurements/1"
    },
)
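To see which values the two JSON Pointers above select, the paths can be resolved by hand. The resolver below is a minimal sketch of RFC 6901 lookup in plain Python; `resolve_pointer` is a hypothetical helper written for this illustration, not a Pathway function, and it skips the special "-" array token.

```python
import json


def resolve_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON (dicts and lists)."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not token.isdigit():
                raise ValueError(f"invalid array index: {token!r}")
            current = current[int(token)]  # array indices are 0-based
        else:
            current = current[token]
    return current


payload = json.loads(
    '{"name": "Jack", "pet": {"animal": "cat", "name": "Bob",'
    ' "measurements": [100, 200, 300]}}'
)

pet_name = resolve_pointer(payload, "/pet/name")
pet_height = resolve_pointer(payload, "/pet/measurements/1")
```

Here `pet_name` is taken from the nested "pet" object rather than from the top-level "name" field, and `pet_height` is the element at index 1 of the array.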

Note that a Kafka message contains a key and a payload. By default, the schema fields are parsed from the payload, but this behavior can be changed. To do that, you need to specify the source_component parameter for the target fields. For example, if the schema is similar to the example above, but there is also a unique pet ID stored in the key JSON at the path /pet/identification/id, you can read it by first modifying the schema:

class InputSchema(pw.Schema):
    pet_id: int = pw.column_definition(primary_key=True, source_component="key")
    pet_name: str
    pet_height: int

And then by also providing a JSON Pointer to this field in the read method:

t = pw.io.kafka.read(
    rdkafka_settings,
    topic="animals",
    format="json",
    schema=InputSchema,
    json_field_paths={
        "pet_id": "/pet/identification/id",
        "pet_name": "/pet/name",
        "pet_height": "/pet/measurements/1"
    },
)

Note that you would not need to provide the JSON Pointer for pet_id if it were at the top level of the key JSON.

write(table, rdkafka_settings, topic_name, *, format='json', schema_registry_settings=None, subject=None, delimiter=',', key=None, value=None, headers=None, name=None, sort_by=None)

Write a table to a given topic on a Kafka instance.

Each produced message consists of a key, corresponding to the row’s key; a value, containing the values of the table serialized according to the chosen format; and two headers: pathway_time, the logical time of the entry, and pathway_diff, which is either 1 or -1. Both header values are provided as UTF-8 encoded strings.

There are several serialization formats supported: ‘json’, ‘dsv’, ‘plaintext’ and ‘raw’. The format defines how the message is formed. In case of JSON and DSV (delimiter separated values), the message is formed in accordance with the respective data format.

If the selected format is either ‘plaintext’ or ‘raw’, you also need to specify which columns of the table correspond to the key and the value of the produced Kafka message. This can be done with the key and value parameters. To output extra values from the table in these formats, you can use Kafka headers: column references passed in the headers parameter are serialized into UTF-8 strings and sent as additional Kafka headers.

Parameters
- table (Table) – the table to output.
- rdkafka_settings (dict) – Connection settings in the format of librdkafka.
- topic_name (str | ColumnReference) – The Kafka topic where data will be written. This can be a specific topic name or a reference to a column whose values will be used as the topic for each message. If using a column reference, the column must contain string values.
- format (Literal['raw', 'plaintext', 'json', 'dsv']) – Format in which the data is put into Kafka. Currently “json”, “plaintext”, “raw” and “dsv” are supported. If the “raw” format is selected, the table must either contain exactly one binary column, which is written as is into the Kafka message, or the target binary column must be specified explicitly in the value parameter. Similarly, if “plaintext” is chosen, the table must either consist of a single column of the string type, or the target string column must be specified explicitly in the value parameter.
- schema_registry_settings (SchemaRegistrySettings | None) – settings for connecting to the Confluent Schema Registry, if this type of registry is used.
- subject (str | None) – the subject name for the schema in the Confluent Schema Registry, if the registry is used.
- delimiter (str) – field delimiter to be used in case of delimiter-separated values format ‘dsv’.
- key (ColumnReference | None) – reference to the column that should be used as a key in the produced message. If left empty, an internal primary key will be used.
- value (ColumnReference | None) – reference to the column that should be used as a value in the produced message in ‘plaintext’ or ‘raw’ format. It can be deduced automatically if the table has exactly one column. Otherwise it must be specified directly. It also has to be explicitly specified, if key is set. The type of the column must correspond to the format used: str for the ‘plaintext’ format and binary for the ‘raw’ format.
- headers (Optional[Iterable[ColumnReference]]) – references to the table fields that must be provided as message headers. These headers are named in the same way as fields that are forwarded and correspond to the string representations of the respective values encoded in UTF-8. If a binary column is requested, it will be produced “as is” in the respective header.
- name (str | None) – A unique name for the connector. If provided, this name will be used in logs and monitoring dashboards.
- sort_by (Optional[Iterable[ColumnReference]]) – If specified, the output will be sorted in ascending order based on the values of the given columns within each minibatch. When multiple columns are provided, the corresponding value tuples will be compared lexicographically.
Returns
None

Examples:

import os
rdkafka_settings = {
    "bootstrap.servers": "localhost:9092",
    "security.protocol": "sasl_ssl",
    "sasl.mechanism": "PLAIN",
    "sasl.username": os.environ["KAFKA_USERNAME"],
    "sasl.password": os.environ["KAFKA_PASSWORD"]
}

You want to send a Pathway table t to the Kafka instance.

import pathway as pw
t = pw.debug.table_from_markdown("age owner pet \n 1 10 Alice dog \n 2 9 Bob cat \n 3 8 Alice cat")

To connect to the topic “animals” and send messages, the connector must be used as follows, depending on the format:

JSON version:

pw.io.kafka.write(
    t,
    rdkafka_settings,
    "animals",
    format="json",
)

All the updates of table t will be sent to the Kafka instance.
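On the consuming side, each update can be interpreted through its pathway_diff header. The sketch below folds a sequence of messages into the current state of the table, reading 1 as an insertion and -1 as a retraction of a row. It is a plain-Python illustration: the `(key, value, headers)` tuples, the sample keys, and the `apply_updates` helper are hypothetical and stand in for whatever your Kafka client returns.

```python
def apply_updates(messages):
    """Fold (key, value, headers) messages into a dict of current rows."""
    state = {}
    for key, value, headers in messages:
        # Header values arrive as UTF-8 encoded strings, e.g. b"1" or b"-1".
        diff = int(dict(headers)["pathway_diff"].decode("utf-8"))
        if diff == 1:
            state[key] = value
        elif diff == -1 and state.get(key) == value:
            # Remove the row only if the retracted value is the one stored,
            # so a retraction of an old value never drops a newer one.
            del state[key]
    return state


# Hypothetical stream: Alice's row is inserted, then replaced at a later time.
messages = [
    (b"k1", b'{"owner": "Alice", "pet": "dog"}',
     [("pathway_time", b"2"), ("pathway_diff", b"1")]),
    (b"k2", b'{"owner": "Bob", "pet": "cat"}',
     [("pathway_time", b"2"), ("pathway_diff", b"1")]),
    (b"k1", b'{"owner": "Alice", "pet": "dog"}',
     [("pathway_time", b"4"), ("pathway_diff", b"-1")]),
    (b"k1", b'{"owner": "Alice", "pet": "cat"}',
     [("pathway_time", b"4"), ("pathway_diff", b"1")]),
]

state = apply_updates(messages)
```

A change to an existing row shows up here as a retraction of the old value followed by an insertion of the new one, both with the same pathway_time.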

Another thing to demonstrate is the use of the ‘raw’ format in the output. Note that the same rules apply to the ‘plaintext’ format, the only difference being that the columns must have the string type.

Now consider that a table t2 contains two binary columns foo and bar, and a numerical column baz. That is, the schema of this table looks as follows:

class T2Schema(pw.Schema):
    foo: bytes
    bar: bytes
    baz: int

This table can be generated with a Python input connector as follows:

class T2GenerationSubject(pw.io.python.ConnectorSubject):
    def run(self) -> None:
        # TODO: define generation logic
        pass
t2 = pw.io.python.read(T2GenerationSubject(), schema=T2Schema)

Since there is more than one column, you need to specify which one to use in the output when using the ‘raw’ format. If it is the column foo, you may output this table as follows:

pw.io.kafka.write(
    t2,
    rdkafka_settings,
    "test",
    format="raw",
    value=t2.foo,
)

If you would also like the key of the produced messages to be taken from another binary column, bar, you can use the key parameter as follows:

pw.io.kafka.write(
    t2,
    rdkafka_settings,
    "test",
    format="raw",
    key=t2.bar,
    value=t2.foo,
)

Still, the table has three fields, and the field baz is not yet part of the output. You can send it using headers. To pass it in a header with the same name, baz, specify it in the headers parameter:

pw.io.kafka.write(
    t2,
    rdkafka_settings,
    "test",
    format="raw",
    key=t2.bar,
    value=t2.foo,
    headers=[t2.baz],
)
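To make the mapping concrete, the sketch below assembles the key, value, and extra headers for a single row of t2 by following the rules stated for the key, value, and headers parameters. It is a plain-Python illustration, not the connector's code: `build_message` and the sample row are hypothetical, the use of `str()` for the string representation is an assumption, and the pathway_time and pathway_diff headers are left out.

```python
def build_message(row: dict, key_column: str, value_column: str, header_columns: list):
    """Return (key, value, headers) for one row in the 'raw' format."""
    headers = []
    for name in header_columns:
        field = row[name]
        if isinstance(field, bytes):
            # Binary columns are passed to the header as is.
            encoded = field
        else:
            # Other values become UTF-8 encoded string representations.
            encoded = str(field).encode("utf-8")
        # The header is named after the column it comes from.
        headers.append((name, encoded))
    return row[key_column], row[value_column], headers


# Hypothetical row matching T2Schema.
row = {"foo": b"\x00\x01", "bar": b"id-1", "baz": 7}
key, value, headers = build_message(row, "bar", "foo", ["baz"])
```

The numerical column baz ends up as the string "7" in a header named baz, while foo and bar are used unchanged as the value and the key.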