Basic Data Types

This guide is about basic data types in Pathway: it covers the list of basic data types that can be used in Pathway, explores several available conversion methods, and wraps up with examples of operators that require a column of specific data type as input.

Currently, Pathway allows using the following basic Python types: bool, str, int, float, and bytes. Additionally, there is support for types datetime and duration from datetime module, distinguishing between utc datetimes and naive datetimes. Finally, Pathway also introduces an additional type for columns representing pointers, called Pointer. Below, you can find an example table with six columns, one example column for each of the basic Python types, and one column of type Pointer. The complex types (as datetime) need some conversion, and they are covered in the later part of the article. The standard way to define a type in Pathway is to use a schema (you can learn more about schemas in this article):

import datetime

import pathway as pw



class SimpleTypesInputSchema(pw.Schema):
    bool_column: bool
    str_column: str
    bytes_column: bytes
    int_column: int
    float_column: float

example_table = pw.debug.table_from_markdown(
    '''
      | bool_column | str_column | bytes_column | int_column | float_column
    1 | True        | example    | example      | 42         | 42.16
    2 | False       | text       | text         | 16         | -16.42
    ''', schema = SimpleTypesInputSchema
).with_columns(id_in_column = pw.this.id)

pw.debug.compute_and_print(example_table, include_id = False)
bool_column | str_column | bytes_column | int_column | float_column | id_in_column
False       | text       | b'text'      | 16         | -16.42       | ^Z3QWT29...
True        | example    | b'example'   | 42         | 42.16        | ^YYY4HAB...
print(example_table.schema)
id          | bool_column | str_column | bytes_column | int_column | float_column | id_in_column
ANY_POINTER | BOOL        | STR        | BYTES        | INT        | FLOAT        | ANY_POINTER 

Implicit Typing

By default, you don't need to worry about they types of columns created with select or with_columns. The expressions used in those operators have defined output type, Pathway knows it, and assigns the types of new columns automatically. In the example below, the new column is of type float, as it is a result of multiplication of int column with a float constant.

example_table = pw.debug.table_from_markdown(
    '''
      | int_number
    1 | 16
    2 | 42
    '''
)
example_table += example_table.select(should_be_float_number = example_table.int_number*0.5)
pw.debug.compute_and_print(example_table, include_id=False)
int_number | should_be_float_number
16         | 8.0
42         | 21.0

As you can see, the type of int_number is int, and the new column is of type float.

print(example_table.schema)
id          | int_number | should_be_float_number
ANY_POINTER | INT        | FLOAT                 

Similarly, the special columns produced by some of the Pathway operators (examples in the later part of the article) have fixed types and as such, you don't need to bother with the types of those columns.

Apply With Type

Sometimes you may want to compute a value of a column, using e.g. a function from an external library, that does not define the output type explicitly. In this case, you can use either pw.apply or pw.apply_with_type. The first creates a new column of type any and the other requires you to specify the type of the output of function that is applied.

Data types for columns storing text and unstructured data

In Pathway you can store unstructured data either as str or as bytes. Both can be converted to other data types, either by built in methods (some examples in this article) or by user defined functions (i.e. via pw.apply or pw.apply_with_type).

Type str

class StrExampleInputSchema(pw.Schema):
    text: str

str_table = pw.debug.table_from_markdown(
    '''
      | text
    1 | cd
    2 | dd
    ''', schema = StrExampleInputSchema
)

Below is an example of conversion from str to bytes. Currently, there is no built-in conversion method. The recommended way is to use apply_with_type.

str_table = str_table.with_columns(text_as_bytes = pw.apply_with_type(lambda x: x.encode("utf8"), bytes, str_table.text))

pw.debug.compute_and_print(str_table, include_id=False)
text | text_as_bytes
cd   | b'cd'
dd   | b'dd'
print(str_table.schema)
id          | text | text_as_bytes
ANY_POINTER | STR  | BYTES        

Module str

Furthermore, Pathway provides a string module containing string operations. Among other things, it provides several methods that allow parsing converting str to other simple types, accessible via the str namespace of column (e.g. table_name.column_name.str.parse_*). You can find examples of usage of those methods in the remaining part of this article.

Type bytes

class BytesExampleInputSchema(pw.Schema):
    bytes_from_markdown: bytes

bytes_table = pw.debug.table_from_markdown(
    '''
      | bytes_from_markdown
    1 | cd
    2 | dd
    ''', schema = BytesExampleInputSchema
)

Below is an example of conversion from bytes to str. Currently, there is no built-in conversion method. The recommended way is to use apply_with_type. Remark: the to_string function does not decode the bytes, but shows a string representation of byte numbers.

bytes_table = bytes_table.with_columns(
    text_from_bytes = pw.apply_with_type(lambda x: x.decode("utf8"), str, bytes_table.bytes_from_markdown),
    text_representation_of_bytes = bytes_table.bytes_from_markdown.to_string()
)

pw.debug.compute_and_print(bytes_table, include_id=False)
bytes_from_markdown | text_from_bytes | text_representation_of_bytes
b'cd'               | cd              | [99, 100]
b'dd'               | dd              | [100, 100]
print(bytes_table.schema)
id          | bytes_from_markdown | text_from_bytes | text_representation_of_bytes
ANY_POINTER | BYTES               | STR             | STR                         

Numerical Data Types

Pathway supports operations on Python int and float types, and on their numpy counterparts. Below, you can find a few short examples that read and convert numbers in Pathway.

Type int

class IntExampleInputSchema(pw.Schema):
    int_number: int

int_table = pw.debug.table_from_markdown(
        '''
    | int_number
  1 | 2
  2 | 3
  ''', schema = IntExampleInputSchema
)

Similarly, as in the conversion between str and bytes, you can use apply_with_type to convert a column of type int into a column of type float. Furthermore, it can be expressed in a more concise way, with apply. Moreover, in this case you can also use the built-in cast function. All mentioned examples can be found in the code snippet below:

int_table = int_table.with_columns(
    int_as_float = pw.apply_with_type(lambda x: float(x), float, int_table.int_number),
    int_as_float_via_constructor = pw.apply(float, int_table.int_number),
    int_as_float_casted = pw.cast(float, int_table.int_number)
)


pw.debug.compute_and_print(int_table, include_id=False)
int_number | int_as_float | int_as_float_via_constructor | int_as_float_casted
2          | 2.0          | 2.0                          | 2.0
3          | 3.0          | 3.0                          | 3.0
print(int_table.schema)
id          | int_number | int_as_float | int_as_float_via_constructor | int_as_float_casted
ANY_POINTER | INT        | FLOAT        | FLOAT                        | FLOAT              

Type float

class FloatExampleInputSchema(pw.Schema):
    float_number: float
    another_float_number: float

float_table = pw.debug.table_from_markdown(
        '''
    | float_number | another_float_number
  1 | 2            | -5.7
  2 | 3            | 6.6
  ''', schema = FloatExampleInputSchema
)

As in the case of conversion from int to float, you can use pw.cast to convert data from type float to int.

float_table = float_table.with_columns(another_number_as_int = pw.cast(int, float_table.another_float_number))
print(float_table.schema)
id          | float_number | another_float_number | another_number_as_int
ANY_POINTER | FLOAT        | FLOAT                | INT                  

Parse numbers from str

Below, you can find an application of the parsing methods from the str namespace (parse_int and parse_float) to parse ints and floats for columns of type str.

class StrNumberExampleInputSchema(pw.Schema):
    number: str

str_number_table = pw.debug.table_from_markdown(
        '''
    | number
  1 | 2
  2 | 3
  ''', schema = StrNumberExampleInputSchema
)

str_number_table = str_number_table.with_columns(
    number_as_int = str_number_table.number.str.parse_int(),
    number_as_float = str_number_table.number.str.parse_float(),
    number_with_extra_text = str_number_table.number + "a"
)

pw.debug.compute_and_print(str_number_table)
            | number | number_as_int | number_as_float | number_with_extra_text
^YYY4HAB... | 2      | 2             | 2.0             | 2a
^Z3QWT29... | 3      | 3             | 3.0             | 3a

As you can see, the schema shows that the original column was of type str, and each new column has a different type, as expected.

print(str_number_table.schema)
id          | number | number_as_int | number_as_float | number_with_extra_text
ANY_POINTER | STR    | INT           | FLOAT           | STR                   

Numerical Module

In case you need to use some basic operations on columns of numerical type, Pathway provides a module containing functions over numerical data types such as abs or round.

Temporal Data Types

In Pathway, temporal data types (datetime.datetime) are complex data types with some representation as some simple type (as int or str). As such, you first need to load the input as simple type, and only then convert it to temporal type. Similarly to Python, Pathway distinguishes between naive datetime (not aware of timezones) and UTC datetime (aware of time zones). Below, you can find examples of reading both kinds of datetime, initially provided as str and int, using methods from the Pathway dt module:

class DatetimeNaiveExampleInputSchema(pw.Schema):
    t1: str
    t2: int

naive_datetime = pw.debug.table_from_markdown(
        """
      |         t1          |      t2
    0 | 2023-05-15T10:13:00 | 1684138380000
    """, schema = DatetimeNaiveExampleInputSchema
)
fmt = "%Y-%m-%dT%H:%M:%S"
naive_datetime = naive_datetime.with_columns(
    dt1 = naive_datetime.t1.dt.strptime(fmt=fmt),
    dt2 = naive_datetime.t2.dt.from_timestamp("ms")
)

naive_datetime = naive_datetime.with_columns(
    difference = naive_datetime.dt1 - naive_datetime.dt2
)

pw.debug.compute_and_print(naive_datetime)

print(naive_datetime.schema)
            | t1                  | t2            | dt1                 | dt2                 | difference
^X1MXHYY... | 2023-05-15T10:13:00 | 1684138380000 | 2023-05-15 10:13:00 | 2023-05-15 08:13:00 | 0 days 02:00:00
id          | t1  | t2  | dt1             | dt2             | difference
ANY_POINTER | STR | INT | DATE_TIME_NAIVE | DATE_TIME_NAIVE | DURATION  
utc_datetime = pw.debug.table_from_markdown(
        """
      |         t1                |      t2
    0 | 2023-05-15T10:13:00+01:00 | 1684138380000
    """, schema = DatetimeNaiveExampleInputSchema
)

fmt = "%Y-%m-%dT%H:%M:%S%z"
utc_datetime = utc_datetime.with_columns(
    dt1 = utc_datetime.t1.dt.strptime(fmt=fmt),
    dt2 = utc_datetime.t2.dt.utc_from_timestamp("ms")
)

utc_datetime = utc_datetime.with_columns(
    difference = utc_datetime.dt1 - utc_datetime.dt2
)

pw.debug.compute_and_print(utc_datetime)

print(utc_datetime.schema)
            | t1                        | t2            | dt1                       | dt2                       | difference
^X1MXHYY... | 2023-05-15T10:13:00+01:00 | 1684138380000 | 2023-05-15 09:13:00+00:00 | 2023-05-15 08:13:00+00:00 | 0 days 01:00:00
id          | t1  | t2  | dt1           | dt2           | difference
ANY_POINTER | STR | INT | DATE_TIME_UTC | DATE_TIME_UTC | DURATION  

Type bool

Below, you can find a piece of code reading and converting boolean data.

class BoolExampleInputSchema(pw.Schema):
    boolean_column: bool

bool_table = pw.debug.table_from_markdown(
        '''
    | boolean_column
  1 | True
  2 | False
  ''', schema = BoolExampleInputSchema
)


bool_table = bool_table.with_columns(bool_as_str = bool_table.boolean_column.to_string())
bool_table = bool_table.with_columns(bool_as_str_as_bool_parse = bool_table.bool_as_str.str.parse_bool())
pw.debug.compute_and_print(bool_table, include_id=False)
print(bool_table.schema)
boolean_column | bool_as_str | bool_as_str_as_bool_parse
False          | False       | False
True           | True        | True
id          | boolean_column | bool_as_str | bool_as_str_as_bool_parse
ANY_POINTER | BOOL           | STR         | BOOL                     

Warning: please do not use cast to convert boolean data type. While it is possible to call it, its behavior is counterintuitive and will be deprecated. Below, we demonstrate the odd behavior.

bool_table = bool_table.with_columns(bool_as_str_as_bool_cast = pw.cast(bool, bool_table.bool_as_str))
pw.debug.compute_and_print(bool_table, include_id=False)
print(bool_table.schema)
boolean_column | bool_as_str | bool_as_str_as_bool_parse | bool_as_str_as_bool_cast
False          | False       | False                     | True
True           | True        | True                      | True
id          | boolean_column | bool_as_str | bool_as_str_as_bool_parse | bool_as_str_as_bool_cast
ANY_POINTER | BOOL           | STR         | BOOL                      | BOOL                    

Optional Data Types

Sometimes, you don't have a guarantee that the data is always present. To accommodate for such columns, Pathway provides support for the Optional data type. More precisely, whenever you expect the column to have values of type T, but not necessarily always present, the type of Pathway column to store this data should be Optional[T] which can also be denoted as T | None. Below, you can find a short example of the column with optional floats and two conversion methods.

class OptInputSchema(pw.Schema):
    opt_float_num: float | None

t = pw.debug.table_from_markdown(
    """
    | opt_float_num
1   | 1
2   | 2
3   | None
""",
    schema=OptInputSchema,
)

pw.debug.compute_and_print(t, include_id=False)
print(t.schema)
opt_float_num
None
1.0
2.0
id          | opt_float_num
ANY_POINTER | Optional(FLOAT)

To obtain a column with a non-optional type, you can filter the non-empty values using filter and is_not_none:

t1 = t.filter(t.opt_float_num.is_not_none()).rename_columns(float_num = t.opt_float_num)
pw.debug.compute_and_print(t1, include_id=False)
print(t1.schema)
float_num
1.0
2.0
id          | float_num
ANY_POINTER | FLOAT    

The more general way of making the type non-optional is via unwrap. The code below is equivalent to the application of filter and is_not_none() above.


t2 = t.filter(t.opt_float_num != None)
t2 = t2.with_columns(float_num = pw.unwrap(t2.opt_float_num)).without(t2.opt_float_num)
pw.debug.compute_and_print(t2, include_id=False)
print(t2.schema)
float_num
1.0
2.0
id          | float_num
ANY_POINTER | FLOAT    

Operators with Type Constraints

Pathway provides several operators requiring input columns to have specific types. The input types are constrained because the functions are not defined for all types, e.g., temporal operators require time-like input columns, sort operator requires data to be sortable, and diff requires that we can subtract two elements of considered type.

Temporal operators

An example of a temporal operator is the windowby operator. Its first argument is time_expr - the operator uses this column to store time associated with each row and then uses it according to window type and temporal behavior defined in other parameters. Since this column is supposed to represent time, we accept the types int, float, datetime, as they can be reasonably used to do so. In the example below, the windowby operator uses a column with naive datetime.

fmt = "%Y-%m-%dT%H:%M:%S"

table = pw.debug.table_from_markdown(
    """
    | time                  | number
 0  | 2023-06-22T09:12:34   | 2
 1  | 2023-06-22T09:23:56   | 2
 2  | 2023-06-22T09:45:20   | 1
 3  | 2023-06-22T09:06:30   | 1
 4  | 2023-06-22T10:11:42   | 2
"""
).with_columns(time=pw.this.time.dt.strptime(fmt))

result = table.windowby(
    table.time,
    window=pw.temporal.tumbling(duration=datetime.timedelta(minutes=30)),
).reduce(
    window_start = pw.this._pw_window_start,
    chocolate_bars=pw.reducers.sum(pw.this.number),
)

pw.debug.compute_and_print(result, include_id=False)
window_start        | chocolate_bars
2023-06-22 09:00:00 | 5
2023-06-22 09:30:00 | 1
2023-06-22 10:00:00 | 2

Sorting Operator

Another example of an operator that accepts type-constrained columns is sort. It requires that the values in the column can be sorted (i.e., the column has type with total order). Currently, it can be used with all simple types, however please take into account that comparing elements of type str or bytes may be slow, so it's generally not recommended.

table_to_sort = pw.debug.table_from_markdown('''
    value  | value_str
    1      | de
    2      | fg
    3      | cd
    4      | ab
    5      | ef
    6      | bc
''')

sorted_by_value = table_to_sort.sort(table_to_sort.value) + table_to_sort
print(sorted_by_value.schema)
id          | prev                  | next                  | value | value_str
ANY_POINTER | Optional(ANY_POINTER) | Optional(ANY_POINTER) | INT   | STR      
sorted_by_value_str = table_to_sort.sort(table_to_sort.value_str) + table_to_sort
print(sorted_by_value_str.schema)
id          | prev                  | next                  | value | value_str
ANY_POINTER | Optional(ANY_POINTER) | Optional(ANY_POINTER) | INT   | STR      

Diff

Below are a few examples demonstrating the diff operator. Essentially, it sorts the table with respect to one column, and then, for each row and some other column, it subtracts the previous value from the current value. As such, it has two types of constrained columns, one with constraints for the sort operator, and the other requires that we can subtract the elements. Currently, among simple types, the subtraction can be done on elements of type int, float and datetime.

table = pw.debug.table_from_markdown('''
    timestamp | values | values_str
    1         | 1      | fg
    2         | 2      | ef
    3         | 4      | de
    4         | 7      | cd
    5         | 11     | bc
    6         | 16     | ab
    ''')
table1 = table + table.diff(pw.this.timestamp, pw.this.values)
print(table1.schema)
id          | timestamp | values | values_str | diff_values
ANY_POINTER | INT       | INT    | STR        | Optional(INT)
pw.debug.compute_and_print(table1, include_id=False)
timestamp | values | values_str | diff_values
1         | 1      | fg         |
2         | 2      | ef         | 1
3         | 4      | de         | 2
4         | 7      | cd         | 3
5         | 11     | bc         | 4
6         | 16     | ab         | 5
table = table.with_columns(date = table.values.dt.from_timestamp("ms"))

table2 = table + table.diff(pw.this.timestamp, pw.this.date)
print(table2.schema)
id          | timestamp | values | values_str | date            | diff_date
ANY_POINTER | INT       | INT    | STR        | DATE_TIME_NAIVE | Optional(DURATION)
pw.debug.compute_and_print(table2, include_id=False)
timestamp | values | values_str | date                       | diff_date
1         | 1      | fg         | 1970-01-01 00:00:00.001000 |
2         | 2      | ef         | 1970-01-01 00:00:00.002000 | 0 days 00:00:00.001000
3         | 4      | de         | 1970-01-01 00:00:00.004000 | 0 days 00:00:00.002000
4         | 7      | cd         | 1970-01-01 00:00:00.007000 | 0 days 00:00:00.003000
5         | 11     | bc         | 1970-01-01 00:00:00.011000 | 0 days 00:00:00.004000
6         | 16     | ab         | 1970-01-01 00:00:00.016000 | 0 days 00:00:00.005000
table3 = table + table.diff(pw.this.values_str, pw.this.values)
print(table3.schema)
id          | timestamp | values | values_str | date            | diff_values
ANY_POINTER | INT       | INT    | STR        | DATE_TIME_NAIVE | Optional(INT)
pw.debug.compute_and_print(table3, include_id=False)
timestamp | values | values_str | date                       | diff_values
1         | 1      | fg         | 1970-01-01 00:00:00.001000 | -1
2         | 2      | ef         | 1970-01-01 00:00:00.002000 | -2
3         | 4      | de         | 1970-01-01 00:00:00.004000 | -3
4         | 7      | cd         | 1970-01-01 00:00:00.007000 | -4
5         | 11     | bc         | 1970-01-01 00:00:00.011000 | -5
6         | 16     | ab         | 1970-01-01 00:00:00.016000 |

In particular, calling diff on elements from values_str, which cannot be subtracted, causes the following error:

TypeError: Pathway does not support using binary operator sub on columns of types <class 'str'>, <class 'str'>.