Automatic Generation of Schema in Pathway Live Data Framework

In this article, you will learn how to easily generate Schemas, which are used to define the structure of a Pathway Live Data Framework Table.

In Pathway Live Data Framework, Schemas are used to define the structure of a Table, that is, names and types of columns. To learn the basics of Schemas in Pathway Live Data Framework, and how you can create them by writing a class definition in Python, read our user guide. In this tutorial, you will learn alternative ways to create Schemas, either by providing a dictionary of column definitions or sample input data in CSV format.

To run the examples in this tutorial, you can download the needed files by uncommenting and executing the following code:

%%capture --no-display
# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/schema-generation-sample-data.csv -O data.csv
# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/schema-generation-schema.json -O schema.json

Schema generation from JSON

In Pathway Live Data Framework, you can build a Schema from a dictionary using the pw.schema_builder function, described in our user guide. However, the values in the dictionary passed to pw.schema_builder are instances of pw.column_definition, which has some limitations: for example, they cannot be easily serialized to a JSON file. For this purpose, Pathway Live Data Framework provides the pw.schema_from_dict function, which takes a dictionary whose keys are the names of the columns and whose values are either:

the type of the column, or
a dictionary with the keys "dtype", "primary_key", and "default_value", which define, respectively, the type of the column, whether it is a primary key of the Table, and its default value.
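
The two accepted value forms can be illustrated with a small plain-Python sketch. The helper normalize_column and the constant ALLOWED_KEYS are hypothetical names introduced here for illustration; they are not part of Pathway, and the check only mirrors the description above rather than the library's own validation.

```python
# Hypothetical helper, not part of Pathway: rewrites the short form
# ("int") into the long form ({"dtype": "int"}) and rejects unknown keys.
ALLOWED_KEYS = {"dtype", "primary_key", "default_value"}


def normalize_column(name, spec):
    if isinstance(spec, dict):
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"column {name!r}: unexpected keys {sorted(unknown)}")
        return dict(spec)
    # Short form: the value is just the type of the column.
    return {"dtype": spec}


definition = {
    "key": {"dtype": "int", "primary_key": True},
    "name": {"dtype": "str", "default_value": ""},
    "value": "int",
}

normalized = {name: normalize_column(name, spec) for name, spec in definition.items()}
primary_keys = [name for name, spec in normalized.items() if spec.get("primary_key")]

print(normalized["value"])
print(primary_keys)
```

A check like this can catch a misspelled key before the definition is stored in a JSON file.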

To see a working example of pw.schema_from_dict, start by creating a JSON file with a definition of the Schema you want to generate.

{
    "key": {
        "dtype": "int",
        "primary_key": true
    },
    "name": {
        "dtype": "str",
        "default_value": ""
    },
    "value": "int"
}
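
If you did not download schema.json, one way to create it is to write the same definition from Python using the standard json module. This is a minimal sketch: the variable names are illustrative, and it assumes the file should be placed in the current working directory. Note that Python's True is serialized as JSON's true.

```python
import json
from pathlib import Path

# The same definition as the JSON above, written as a Python dict.
schema_definition = {
    "key": {"dtype": "int", "primary_key": True},
    "name": {"dtype": "str", "default_value": ""},
    "value": "int",
}

schema_path = Path("schema.json")
schema_path.write_text(json.dumps(schema_definition, indent=4))

# Round-trip check: reading the file back yields the same dict.
reloaded = json.loads(schema_path.read_text())
print(reloaded == schema_definition)
```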

Now, you need to load it into Python, where it will be represented as a dict.

import json

with open("schema.json") as f:
    schema_definition = json.load(f)

Then all you need to do is pass it as an argument to pw.schema_from_dict, which returns a Schema object.

import pathway as pw


schema = pw.schema_from_dict(schema_definition)

# Check the resulting Schema
schema, schema.primary_key_columns(), schema.default_values()

(<pathway.Schema types={'key': <class 'int'>, 'name': <class 'str'>, 'value': <class 'int'>}, id_type=pathway.engine.Pointer[int]>,
 ['key'],
 {'name': ''})

You can now use the created Schema as an argument to the connector you want to use.

Schema generation from data in a CSV file

If you have a CSV file with sample data, you can use it to generate a Schema in Pathway Live Data Framework. The names of the columns are taken from the header of the CSV file, whereas the types of the columns are inferred by checking whether the values in a given column can be parsed as int or float. This method of Schema generation does not support choosing primary keys or setting default values. Still, once you generate a Schema, you can generate a class definition using the method described later in this tutorial and then edit it to suit your needs.
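
To build intuition for the inference rule just described, here is a plain-Python sketch that tries int, then float, and falls back to str. It is an illustration only, not Pathway's implementation: the functions infer_type and infer_column_types are hypothetical, and the sample string mirrors the tutorial's data.csv.

```python
import csv
import io


def infer_type(values):
    """Return int if every value parses as int, else float, else str."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except (ValueError, TypeError):
            continue  # at least one value failed, try the next type
        return candidate
    return str


def infer_column_types(text, delimiter=",", num_parsed_rows=None):
    """Infer a type per column from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for index, row in enumerate(reader):
        if num_parsed_rows is not None and index >= num_parsed_rows:
            break
        for name in columns:
            columns[name].append(row[name])
    return {name: infer_type(values) for name, values in columns.items()}


sample = "age,owner,pet\n10,Alice,dog\n9,Bob,dog\n8,Alice,cat\n7,Bob,dog\n"
inferred = infer_column_types(sample)
print(inferred)
```

In this sketch, a single value that does not parse is enough to move the whole column to the next candidate type, which is why the owner and pet columns end up as str.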

To generate a Schema based on a CSV file, use the pw.schema_from_csv function. To see how it works, you will use the following example data in CSV format:

age,owner,pet
10,Alice,dog
9,Bob,dog
8,Alice,cat
7,Bob,dog

To use pw.schema_from_csv, you only need to provide the path to the CSV file, but the function also accepts several optional arguments:

name - the name of the resulting Schema class
properties - an instance of SchemaProperties
delimiter - the delimiter used in the CSV file
comment_character - the character used to denote that a row is a comment
escape - the escape character used in the CSV file
quote - the character used to quote fields
enable_double_quote_escapes - enables escaping quotes by using double quotes
num_parsed_rows - how many rows should be parsed. If None, all rows will be parsed.

For the example, you can use the default values of the optional arguments, so the Python code is:

schema = pw.schema_from_csv("data.csv")

# Check the resulting Schema
schema

<pathway.Schema types={'age': <class 'int'>, 'owner': <class 'str'>, 'pet': <class 'str'>}, id_type=<class 'pathway.engine.Pointer'>>

Persisting generated Schema as a Python class definition

If you have a Schema object, you may choose to generate a class definition from it, either to make the codebase independent of the files used to generate the Schema or to modify the Schema, for example, by adding default values to a Schema generated from a CSV file.

To do that, you can use the Schema method generate_class, which returns a string with a class definition, or generate_class_to_file, which generates a class definition and saves it to a file.

Let's go through an example of using generate_class and generate_class_to_file. In the example, you will work with the schema generated in the section on schema generation from JSON.

with open("schema.json") as f:
    schema_definition = json.load(f)
schema = pw.schema_from_dict(schema_definition)

The first method, generate_class, has no required arguments and two optional arguments: class_name, the name of the class in the generated definition, and generate_imports, which specifies whether imports of the modules used in the class definition should be included at the beginning of the string. If class_name is not provided, the schema's name is used; if that name is not a valid identifier, the default name CustomSchema is used.

print(schema.generate_class(class_name="MySchema"))

class MySchema(pw.Schema):
    key: int = pw.column_definition(primary_key=True)
    name: str = pw.column_definition(default_value='')
    value: int
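
The naming rule for generate_class described above can be sketched in plain Python. The function resolve_class_name is hypothetical and only restates that rule; it assumes that "correct identifier" corresponds to Python's str.isidentifier check, which may differ from what Pathway does internally.

```python
def resolve_class_name(class_name, schema_name):
    """Pick the class name following the fallback rule described above."""
    if class_name is not None:
        return class_name  # an explicitly provided name is used as-is
    if schema_name.isidentifier():
        return schema_name  # fall back to the schema's own name
    return "CustomSchema"  # last resort when the name is not an identifier


print(resolve_class_name("MySchema", "anything"))
print(resolve_class_name(None, "Pets"))
print(resolve_class_name(None, "my schema"))
```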

The generate_class_to_file method has one required argument: the path where the class definition is to be saved. Its optional arguments, class_name and generate_imports, are the same as for the generate_class method.

schema.generate_class_to_file(
    "myschema.py", class_name="MySchema", generate_imports=True
)
with open("myschema.py") as f:
    print(f.read())

import pathway as pw

class MySchema(pw.Schema):
    key: int = pw.column_definition(primary_key=True)
    name: str = pw.column_definition(default_value='')
    value: int