Data Types and Schemas
In this guide, you will explore how to effectively utilize data types and schemas.
Understanding Data Types and Schemas
In Pathway, data is represented in the form of tables. The structure of each table is defined by a schema, which serves as a blueprint for the data. The schema ensures that the column types are correctly preserved, regardless of variations in the incoming data.
Typically, Pathway automatically infers the schema, but there are cases where enforcing a specific schema for input proves useful.
Here is a simple example on how to define a schema in Pathway:
import pathway as pw
class InputSchema(pw.Schema):
colA: int
colB: float
colC: str
Schema Usage in Pathway
Schemas play an important role in Pathway by allowing you to declare constraints on tables via input connectors. You can declare the following attributes within a schema:
- Columns: Select the desired columns for your table; any undeclared columns will be ignored by the connectors.
- Columns Types: Define the data type for each column, Pathway will automatically convert the input data accordingly.
- Primary Keys: Set primary keys to determine the indexes. If no primary keys are defined, indexes will be generated automatically.
- Default values: Specify default values for columns, making it easier to handle missing data.
How to Define and Use a Schema
To create a schema, you need to define a class that inherits from pathway.Schema
or pw.Schema
, depending on your import.
In the following we will use pw.Schema
.
Each column is declared as an attribute of the class. The schema is then passed as a parameter to the input connector:
class InputSchema(pw.Schema):
value: int
table = pw.io.csv.read("./input/", schema=InputSchema)
The above example defines a table with only one column named value
of type int
.
Defining Multiple Columns
You can declare multiple columns in a schema by simply adding them as attributes to your class:
class MyFirstTwoColumnsSchema(pw.Schema):
colA: int
colB: int
Typing the Columns
To assign data types to columns, simply specify the desired types for the associated attributes in your class:
class TypedSchema(pw.Schema):
colA: int
colB: float
With pw.Schema
, you have to type the columns.
If you don't know the type of the input column, you can type the column as typing.Any
:
class TypedSchema(pw.Schema):
colA: typing.Any
⚠️ While tempting, any
is not a Python type but a function.
Be careful to use typing.Any
as using any
will raise a ValueError
.
Defining Primary Keys
To designate primary keys, use the column_definition
function and the primary_key
parameter:
class PrimarySchema(pw.Schema):
colA: int = pw.column_definition(primary_key=True)
colB: float
In this example, the index will be based on the colA
column.
You can select multiple columns to be a part of primary key:
class MultiplePrimarySchema(pw.Schema):
colA: int = pw.column_definition(primary_key=True)
colB: float
colC: str = pw.column_definition(primary_key=True)
Defining Default Values
Similar to primary keys, you can set default values using the column_definition
function and the default_value
parameter:
class DefaultValueSchema(pw.Schema):
colA: int = pw.column_definition(default_value=0)
colB: float
colC: str = pw.column_definition(default_value="Empty")
Inline Schemas Definitions
When it may not be practical to define a class, when automating schema definitions for example, Pathway offers an alternative approaches using inline schema definition.
Schema from Dictionary
You can define a schema using a dictionary through the schema_builder
function.schema_builder
takes as argument a parameter columns
which is a dictionary that maps column names to column definitions created using the column_definition
function.
The dtype
parameter of column_definition
is used to specify the types, if not provided it defaults to typing.Any
.
Additionally, if desired, you can assign a name to the schema using the optional name
parameter.
schema = pw.schema_builder(columns={
'key': pw.column_definition(dtype=int, primary_key=True),
'data': pw.column_definition(dtype=int, default_value=0)
}, name="my_schema")
This resulting schema is equivalent to the following class-based schema:
class InputSchema(pw.Schema):
key: int = pw.column_definition(primary_key=True)
data: int = pw.column_definition(default_value=0)
With schema_builder
, defining the type is optional. The default type is typing.Any
:
schema = pw.schema_builder(columns={
'key': pw.column_definition(dtype=int, primary_key=True),
'data': pw.column_definition()
}, name="my_schema")
table = pw.io.csv.read("./input/", schema=schema)
print(table.typehints())
{'key': <class 'int'>, 'data': <class 'typing.Any'>}
Schema from Types
For the simple cases where you only need to define types and not default values nor primary keys, you can use schema_from_types
.
schema_from_types
simply takes the types as field=type
kwargs:
schema = pw.schema_from_types(key=int, data=int)
This resulting schema is equivalent to the following class-based schema:
class InputSchema(pw.Schema):
key: int
data: int
Accessing Table Types
During debugging, you may need to assess the schema of a table.
You can achieve this by printing its typehints
:
print(table.typehints())
This will display the data types of each column in the table. For example:
{'age': <class 'int'>, 'owner': <class 'str'>, 'pet': <class 'str'>}
You can also print the schema of a single column by using schema
:
print(table.schema['age'])
<class 'int'>
⚠️ Please note that these functions are executed during the creation of the pipeline before any computation is launched by pw.run().
Type Casting an Existing Table
You can also want to cast the data of an existing table.
This can be done using cast
:
table = table.select(value = pw.cast(int, pw.this.value))
This will cast the values of the column value
to int
.
Typing a Column Created with apply
If Pathway fails to infer the correct column's type created with apply
,
you can enforce the resulting type with apply_with_type
:
table = table.select(
value = pw.apply_with_type(lambda x: int(x)+1, int, pw.this.value)
)
This will cast the values to integers and increment each value by one,
resulting in the value column being of type int
.
This is only a workaround since Pathway should be able to correctly infer your data type.
Conclusions
Mastering data types and schemas is essential for effectively managing Tables. By leveraging schemas, you can define the structure of your tables, improving the efficiency of your data pipeline. If you encounter typing issues, please contact us on discord, so we can help you.