Run In Colab  View in Github

Using SQL with Pathway

Perform SQL commands using Pathway's pw.sql function.


⚠️ As of January 9, 2023, the pw.sql feature is experimental; please bear with us for updates.


Pathway provides a very simple way to use SQL commands directly in your Pathway application: the use of pw.sql. Pathway is significantly different from a usual SQL database, and not all SQL operations are available in Pathway. In the following, we present which SQL operations are compatible with Pathway and how to use pw.sql.

This article is a summary of dos and don'ts on how to use Pathway to execute SQL queries, this is not an introduction to SQL.

Usage

We can very easily execute a SQL command by doing the following:

pw.sql(query, tab=t)

This will execute the SQL command query where the Pathway table t (Python local variable) can be referred to as tab (SQL table name) inside query. More generally, we can pass an arbitrary number of tables associations name, table using **kwargs: pw.sql(query, tab1=t1, tab2=t2,.., tabn=tn).

Example

import pathway as pwt = pw.debug.table_from_markdown(    """    | a | b 1  | 1 | 2 2  | 4 | 3 3  | 4 | 7 """)ret = pw.sql("SELECT * FROM tab WHERE a<b", tab=t)pw.debug.compute_and_print(ret)
            | a | b
^2TMTFGY... | 1 | 2
^SERVYWW... | 4 | 7

Column names

In Pathway, column names are case sensitive and pw.sql preserves the case so column names are case sensitive when using pw.sql. To improve the compatibility with Pathway, we encourage the use of the standard Python naming convention for column and table names: no special character other than "_", alphanumeric characters, and not starting with a number. Using a space will work in the SQL layer (e.g. returning a column called "a column"), though it may be impractical afterwards, when using the dot notation in Pathway: t.a column will raise an error. You can still use the t["a column"] syntax in Pathway.

List of Available SQL Operations

SELECT

Using select, you can access the different columns of a table:

result_select = pw.sql("SELECT a FROM tab", tab=t)pw.debug.compute_and_print(result_select)
            | a
^2TMTFGY... | 1
^YHZBTNY... | 4
^SERVYWW... | 4

⚠️ Pathway does not preserve the order of columns!

Star notation

Pathway supports the star notation * to select all the columns:

result_star = pw.sql("SELECT * FROM tab", tab=t)pw.debug.compute_and_print(result_star)
            | a | b
^2TMTFGY... | 1 | 2
^YHZBTNY... | 4 | 3
^SERVYWW... | 4 | 7

⚠️ Every Pathway table has a special column id: this column is NOT captured by * expressions in SQL.

WHERE

In a SELECT query, the WHERE clause can be used to select rows satisfying a given condition:

result_where = pw.sql("SELECT a,b FROM tab WHERE b>2", tab=t)pw.debug.compute_and_print(result_where)
            | a | b
^YHZBTNY... | 4 | 3
^SERVYWW... | 4 | 7

Boolean and Arithmetic Expressions

With the SELECT ... and WHERE ... clauses, we can use the following operators:

  • boolean operators: AND, OR, NOT
  • arithmetic operators: +, -, *, /, DIV, MOD, ==, !=, <, >, <=, >=, <>
  • NULL
result_bool = pw.sql("SELECT a,b FROM tab WHERE b-a>0 AND a>3", tab=t)pw.debug.compute_and_print(result_bool)
            | a | b
^SERVYWW... | 4 | 7

Both != and <> can be used to check non-equality.

result_neq = pw.sql("SELECT a,b FROM tab WHERE a != 4 OR b <> 3", tab=t)pw.debug.compute_and_print(result_neq)
            | a | b
^2TMTFGY... | 1 | 2
^SERVYWW... | 4 | 7

NULL can be used to filter out rows with missing values:

t_null = pw.debug.table_from_markdown(    """    | a | b 1  | 1 | 2 2  | 4 |  3  | 4 | 7 """)result_null = pw.sql("SELECT a, b FROM tab WHERE b IS NOT NULL ", tab=t_null)pw.debug.compute_and_print(result_null)
            | a | b
^2TMTFGY... | 1 | 2
^SERVYWW... | 4 | 7

Subqueries in a WHERE clause

⚠️ Subquery support is limited (fragile). If Pathway is unable to parse a given query, it should be rephrased to simplify or avoid subqueries.

We can use subqueries in the WHERE clause to filter a table based on the subquery results:

t_Q = pw.debug.table_from_markdown(    """    | q | Q 1  | 1 | 10 2  | 2 | 11 3  | 3 | 12 """)t_A = pw.debug.table_from_markdown(    """    | q | A 1  | 2 | 100 2  | 3 |  """)result_sub = pw.sql("SELECT * FROM questions WHERE q = (SELECT q FROM answers)", questions=t_Q, answers=t_A)pw.debug.compute_and_print(result_sub)
            | q | Q
^89XK2SN... | 2 | 11
^VP9768X... | 3 | 12

GROUP BY

You can use GROUP BY to group rows with the same value for a given column, and to use an aggregate function over the grouped rows.

result_groupby = pw.sql("SELECT a, SUM(b) FROM tab GROUP BY a", tab=t)pw.debug.compute_and_print(result_groupby)
            | a | _col_1
^2TMTFGY... | 1 | 2
^8GR6BSX... | 4 | 10

⚠️ GROUP BY and JOIN should not be used together in a single SELECT.

Agregation functions

With GROUP BY, you can use the following aggregation functions:

  • AVG
  • COUNT
  • MAX
  • MIN
  • SUM

⚠️ Pathway reducers (pw.count, pw.sum, etc.) aggregate over None values, while tranditional SQL aggregate functions skip NULL values: be careful to remove all the undefined values before using an aggregate function.

HAVING

result_having = pw.sql("SELECT a, SUM(b) FROM tab GROUP BY a HAVING SUM(b)>5", tab=t)pw.debug.compute_and_print(result_having)
            | a | _col_1
^8GR6BSX... | 4 | 10

AS (alias)

Pathway supports both notations: old_name as new_name and old_name new_name.

result_alias = pw.sql("SELECT b, a AS c FROM tab", tab=t)pw.debug.compute_and_print(result_alias)
            | b | c
^2TMTFGY... | 2 | 1
^YHZBTNY... | 3 | 4
^SERVYWW... | 7 | 4
result_alias = pw.sql("SELECT b, a c FROM tab", tab=t)pw.debug.compute_and_print(result_alias)
            | b | c
^2TMTFGY... | 2 | 1
^YHZBTNY... | 3 | 4
^SERVYWW... | 7 | 4

UNION

Pathway provides the standard UNION SQL operator. Note that UNION requires matching column names.

t_union = pw.debug.table_from_markdown(    """    | a | b 4  | 9 | 3 5  | 2 | 7 """)result_union = pw.sql("SELECT * FROM tab UNION SELECT * FROM tab2", tab=t, tab2=t_union)pw.debug.compute_and_print(result_union)
            | a | b
^PK2K47A... | 1 | 2
^B7PGE3E... | 2 | 7
^0A3551Q... | 4 | 3
^QKQ86JV... | 4 | 7
^08XM0XT... | 9 | 3

INTERSECT

Pathway provides the standard INTERSECT SQL operator. Note that INTERSECT requires matching column names.

t_inter = pw.debug.table_from_markdown(    """    | a | b 4  | 9 | 3 5  | 2 | 7 6  | 1 | 2 """)result_inter = pw.sql("SELECT * FROM tab INTERSECT SELECT * FROM tab2", tab=t, tab2=t_inter)pw.debug.compute_and_print(result_inter)
            | a | b
^PK2K47A... | 1 | 2

⚠️ INTERSECT does not support INTERSECT ALL (coming soon).

JOIN

Pathway provides different join operations: INNER JOIN, LEFT JOIN (or LEFT OUTER JOIN), RIGHT JOIN (or RIGHT OUTER JOIN), SELF JOIN, and CROSS JOIN.

t_join = pw.debug.table_from_markdown(    """    | b | c 4  | 4 | 9 5  | 3 | 4 6  | 7 | 5 """)result_join = pw.sql("SELECT * FROM left_table INNER JOIN right_table ON left_table.b==right_table.b", left_table=t, right_table=t_join)pw.debug.compute_and_print(result_join)
            | a | b | c
^1Y14E07... | 4 | 3 | 4
^7RXQ0Z1... | 4 | 7 | 5

⚠️ GROUP BY and JOIN should not be used together in a single SELECT.

⚠️ NATURAL JOIN and FULL JOIN are not supported (coming soon).

WITH

In addition to being placed inside a WHERE clause, subqueries can also be performed using the WITH keyword:

result_with = pw.sql("WITH group_table (a, sumB) AS (SELECT a, SUM(b) FROM tab GROUP BY a) SELECT sumB FROM group_table", tab=t)pw.debug.compute_and_print(result_with)
            | sumB
^2TMTFGY... | 2
^8GR6BSX... | 10

Differences from the SQL standard

First of all, not all SQL queries can be executed in Pathway. This stems mainly from the fact that Pathway is built to process streaming and dynamic data efficiently.

No ordering

In Pathway, indexes are separately generated and maintained by the engine, which does not guarantee any row order: SQL operations like LIMIT, ORDER BY or SELECT TOP don't always make sense in this context. In the future, we will support an ORDER BY ... LIMIT ... keyword combination, which is typically meaningful in Pathway. The column id is reserved and should not be used as a column name, this column is not captured by * expressions.

Furthermore, there is no order on the columns and the column order used in a SELECT query need not be preserved.

Immutability

Pathway tables are immutable: operations such as INSERT INTO are not supported.

Limits of subqueries and existing operations

Subqueries are suppported but are fragile. Similarly, LIKE, ANY, ALL, EXISTS keywords are not formally supported even though they may work in some cases. COALESCE andIFNULL are not supported but should be soon. We strongly suggest not to use anonymous columns: they might work but we cannot guarantee their behavior.

Conclusion

Pathway provides a powerful API to ease the transition of SQL data transformations and pipelines into Pathway. However, Pathway and SQL serve different purposes. To benefit from all the possibilities Pathway has to offer we strongly encourage you to use the Python syntax directly, as much as you can. Most of the time, this syntax is at least as easy to follow as SQL - see for example our join and groupby manuals.