Using SQL with Pathway

Perform SQL commands using Pathway's pw.sql function.


Pathway provides a very simple way to use SQL commands directly in your Pathway application: the use of pw.sql. Pathway is significantly different from a usual SQL database, and not all SQL operations are available in Pathway. In the following, we present the SQL operations which are compatible with Pathway and how to use pw.sql.

This article is a summary of dos and don'ts on how to use Pathway to execute SQL queries, this is not an introduction to SQL.

Usage

You can very easily execute a SQL command by doing the following:

pw.sql(query, tab=t)

This will execute the SQL command query where the Pathway table t (Python local variable) can be referred to as tab (SQL table name) inside query. More generally, you can pass an arbitrary number of tables associations name, table using **kwargs: pw.sql(query, tab1=t1, tab2=t2,.., tabn=tn).

Example

import pathway as pw

t = pw.debug.table_from_markdown(
    """
    | a | b
 1  | 1 | 2
 2  | 4 | 3
 3  | 4 | 7
 """
)

ret = pw.sql("SELECT * FROM tab WHERE a<b", tab=t)
pw.debug.compute_and_print(ret)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^YYY4HAB... | 1 | 2
^3CZ78B4... | 4 | 7

Column names

Unlike in Pathway, column names are not case sensitive in SQL so column names are NOT case sensitive when using pw.sql. To improve the compatibility with Pathway, we encourage the use of the standard Python naming convention for column and table names: no special character other than "_", alphanumeric characters, and not starting with a number. Using a space will work in the SQL layer (e.g. returning a column called "a column"), though it may be impractical afterwards, when using the dot notation in Pathway: t.a column will raise an error. You can still use the t["a column"] syntax in Pathway.

List of Available SQL Operations

SELECT

Using select, you can access the different columns of a table:

result_select = pw.sql("SELECT a FROM tab", tab=t)
pw.debug.compute_and_print(result_select)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a
^YYY4HAB... | 1
^Z3QWT29... | 4
^3CZ78B4... | 4

⚠️ Pathway does not preserve the order of columns!

Star notation

Pathway supports the star notation * to select all the columns:

result_star = pw.sql("SELECT * FROM tab", tab=t)
pw.debug.compute_and_print(result_star)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^YYY4HAB... | 1 | 2
^Z3QWT29... | 4 | 3
^3CZ78B4... | 4 | 7

⚠️ Every Pathway table has a special column id: this column is NOT captured by * expressions in SQL.

WHERE

In a SELECT query, the WHERE clause can be used to select rows satisfying a given condition:

result_where = pw.sql("SELECT a,b FROM tab WHERE b>2", tab=t)
pw.debug.compute_and_print(result_where)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^Z3QWT29... | 4 | 3
^3CZ78B4... | 4 | 7

Boolean and Arithmetic Expressions

With the SELECT ... and WHERE ... clauses, you can use the following operators:

  • boolean operators: AND, OR, NOT
  • arithmetic operators: +, -, *, /, DIV, MOD, ==, !=, <, >, <=, >=, <>
  • NULL
result_bool = pw.sql("SELECT a,b FROM tab WHERE b-a>0 AND a>3", tab=t)
pw.debug.compute_and_print(result_bool)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^3CZ78B4... | 4 | 7

Both != and <> can be used to check non-equality.

result_neq = pw.sql("SELECT a,b FROM tab WHERE a != 4 OR b <> 3", tab=t)
pw.debug.compute_and_print(result_neq)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^YYY4HAB... | 1 | 2
^3CZ78B4... | 4 | 7

NULL can be used to filter out rows with missing values:

t_null = pw.debug.table_from_markdown(
    """
    | a | b
 1  | 1 | 2
 2  | 4 |
 3  | 4 | 7
 """
)

result_null = pw.sql("SELECT a, b FROM tab WHERE b IS NOT NULL ", tab=t_null)
pw.debug.compute_and_print(result_null)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | b
^YYY4HAB... | 1 | 2
^3CZ78B4... | 4 | 7

You can use single row result subqueries in the WHERE clause to filter a table based on the subquery results:

t_subqueries = pw.debug.table_from_markdown(
    """
    | employee | salary
 1  | 1        | 10
 2  | 2        | 11
 3  | 3        | 12
 """
)
result_subqueries = pw.sql(
    "SELECT employee, salary FROM t WHERE salary >= (SELECT AVG(salary) FROM t)",
    t=t_subqueries,
)
pw.debug.compute_and_print(result_subqueries)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | employee | salary
^Z3QWT29... | 2        | 11
^3CZ78B4... | 3        | 12

⚠️ For now, only single row result subqueries are supported. Correlated subqueries and the associated operations ANY, NONE, and EVERY (or its alias ALL) are currently not supported.

GROUP BY

You can use GROUP BY to group rows with the same value for a given column, and to use an aggregate function over the grouped rows.

result_groupby = pw.sql("SELECT a, SUM(b) FROM tab GROUP BY a", tab=t)
pw.debug.compute_and_print(result_groupby)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | _col_1
^YYY4HAB... | 1 | 2
^3HN31E1... | 4 | 10

⚠️ GROUP BY and JOIN should not be used together in a single SELECT.

Aggregation functions

With GROUP BY, you can use the following aggregation functions:

  • AVG
  • COUNT
  • MAX
  • MIN
  • SUM

⚠️ Pathway reducers (pw.count, pw.sum, etc.) aggregate over None values, while traditional SQL aggregate functions skip NULL values: be careful to remove all the undefined values before using an aggregate function.

HAVING

result_having = pw.sql("SELECT a, SUM(b) FROM tab GROUP BY a HAVING SUM(b)>5", tab=t)
pw.debug.compute_and_print(result_having)
[2024-04-23T20:17:56]:INFO:Preparing Pathway computation


            | a | _col_1
^3HN31E1... | 4 | 10

AS (alias)

Pathway supports both notations: old_name as new_name and old_name new_name.

result_alias = pw.sql("SELECT b, a AS c FROM tab", tab=t)
pw.debug.compute_and_print(result_alias)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | b | c
^YYY4HAB... | 2 | 1
^Z3QWT29... | 3 | 4
^3CZ78B4... | 7 | 4
result_alias = pw.sql("SELECT b, a c FROM tab", tab=t)
pw.debug.compute_and_print(result_alias)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | b | c
^YYY4HAB... | 2 | 1
^Z3QWT29... | 3 | 4
^3CZ78B4... | 7 | 4

UNION

Pathway provides the standard UNION SQL operator. Note that UNION requires matching column names.

t_union = pw.debug.table_from_markdown(
    """
    | a | b
 4  | 9 | 3
 5  | 2 | 7
 """
)

result_union = pw.sql("SELECT * FROM tab UNION SELECT * FROM tab2", tab=t, tab2=t_union)
pw.debug.compute_and_print(result_union)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | a | b
^KYCVNKF... | 1 | 2
^856GZ16... | 2 | 7
^H3J0A0V... | 4 | 3
^GX1QVN0... | 4 | 7
^7HC68KR... | 9 | 3

INTERSECT

Pathway provides the standard INTERSECT SQL operator. Note that INTERSECT requires matching column names.

t_inter = pw.debug.table_from_markdown(
    """
    | a | b
 4  | 9 | 3
 5  | 2 | 7
 6  | 1 | 2
 """
)

result_inter = pw.sql(
    "SELECT * FROM tab INTERSECT SELECT * FROM tab2", tab=t, tab2=t_inter
)
pw.debug.compute_and_print(result_inter)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | a | b
^KYCVNKF... | 1 | 2

⚠️ INTERSECT does not support INTERSECT ALL (coming soon).

JOIN

Pathway provides different join operations: INNER JOIN, LEFT JOIN (or LEFT OUTER JOIN), RIGHT JOIN (or RIGHT OUTER JOIN), SELF JOIN, and CROSS JOIN.

t_join = pw.debug.table_from_markdown(
    """
    | b | c
 4  | 4 | 9
 5  | 3 | 4
 6  | 7 | 5
 """
)
result_join = pw.sql(
    "SELECT * FROM left_table INNER JOIN right_table ON left_table.b==right_table.b",
    left_table=t,
    right_table=t_join,
)
pw.debug.compute_and_print(result_join)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | a | b | c
^J1AVR2S... | 4 | 3 | 4
^8V184A9... | 4 | 7 | 5

⚠️ GROUP BY and JOIN should not be used together in a single SELECT.

⚠️ NATURAL JOIN and FULL JOIN are not supported (coming soon).

WITH

In addition to being placed inside a WHERE clause, subqueries can also be performed using the WITH keyword:

result_with = pw.sql(
    "WITH group_table (a, sumB) AS (SELECT a, SUM(b) FROM tab GROUP BY a) SELECT sumB FROM group_table",
    tab=t,
)
pw.debug.compute_and_print(result_with)
[2024-04-23T20:17:57]:INFO:Preparing Pathway computation


            | sumB
^YYY4HAB... | 2
^3HN31E1... | 10

Differences from the SQL standard

First of all, not all SQL queries can be executed in Pathway. This stems mainly from the fact that Pathway is built to process streaming and dynamic data efficiently.

No ordering

In Pathway, indexes are separately generated and maintained by the engine, which does not guarantee any row order: SQL operations like LIMIT, ORDER BY or SELECT TOP don't always make sense in this context. In the future, we will support an ORDER BY ... LIMIT ... keyword combination, which is typically meaningful in Pathway. The column id is reserved and should not be used as a column name, this column is not captured by * expressions.

Furthermore, there is no order on the columns and the column order used in a SELECT query need not be preserved.

Immutability

Pathway tables are immutable: operations such as INSERT INTO are not supported.

Limits

Correlated subqueries are currently not supported and keywords such as LIKE, ANY, ALL, or EXISTS are not supported. COALESCE andIFNULL are not supported but should be soon. We strongly suggest not to use anonymous columns: they might work but we cannot guarantee their behavior.

Conclusion

Pathway provides a powerful API to ease the transition of SQL data transformations and pipelines into Pathway. However, Pathway and SQL serve different purposes. To benefit from all the possibilities Pathway has to offer we strongly encourage you to use the Python syntax directly, as much as you can. Most of the time, this syntax is at least as easy to follow as SQL - see for example our join and groupby manuals.