pathway.stdlib.indexing package

class pw.indexing.BruteForceKnn(data_column, metadata_column, *, dimensions, reserved_space, auxiliary_space=131072, metric, embedder=None)


Interface for a brute force implementation of a nearest neighbors index.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the space used remains proportional to the size of the index
    • metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
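
To illustrate what a brute force nearest neighbors index computes, here is a minimal pure-Python sketch. It is not Pathway's implementation: the helper names cosine_distance and brute_force_knn are hypothetical, and cosine distance is picked only as an example metric. Every stored vector is scored against the query, and the closest number_of_matches entries are returned as (ID, distance) pairs.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_knn(entries, query, number_of_matches=3):
    """Score every stored vector against the query and keep the closest ones.

    `entries` maps a row ID to its vector; the result is a tuple of
    (id, distance) pairs sorted from nearest to farthest.
    """
    scored = [(row_id, cosine_distance(vec, query)) for row_id, vec in entries.items()]
    scored.sort(key=lambda pair: pair[1])
    return tuple(scored[:number_of_matches])


# Hypothetical toy data: three 2-dimensional vectors keyed by row ID.
entries = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = brute_force_knn(entries, [1.0, 0.1], number_of_matches=2)
```

The cost of each query grows linearly with the number of entries, which is the trade-off a brute force index makes in exchange for exact answers.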

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the brute force KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
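
The as-of-now contract described above can be sketched with a toy pure-Python class. This is only an illustration under stated assumptions: AsOfNowIndex, its add method, the substring matching rule, and the constant score 1.0 are all hypothetical and not part of Pathway. Each query is answered once, from the state of the index at that moment, and the stored reply is never revised.

```python
class AsOfNowIndex:
    """Toy substring-match index that answers each query once and never revises it."""

    def __init__(self):
        self.entries = {}  # row ID -> indexed text
        self.answers = []  # frozen replies, one per query

    def add(self, row_id, text):
        # Changing the internal state does not touch earlier answers.
        self.entries[row_id] = text

    def query_as_of_now(self, query, number_of_matches=3):
        # Placeholder score 1.0 for every match; a real index would rank matches.
        matches = [(row_id, 1.0) for row_id, text in self.entries.items() if query in text]
        reply = {"query": query, "_pw_index_reply": tuple(matches[:number_of_matches])}
        self.answers.append(reply)
        return reply


index = AsOfNowIndex()
index.add(1, "stream processing")
first = index.query_as_of_now("stream")
index.add(2, "data stream")  # arrives after the first query was answered
second = index.query_as_of_now("stream")
```

Here the first reply keeps a single match even though a second matching entry is added later; only the second query sees both entries.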

class pw.indexing.BruteForceKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)


Factory for creating BruteForceKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the space used remains proportional to the size of the index
    • metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

class pw.indexing.DataIndex(data_table, inner_index)


A class that, given an implementation of an index, provides methods that augment the search results with supplementary data.

  • Parameters
    • data_table (pw.Table) – table containing supplementary data, joined by matching IDs (the ID from data_table against the ID from the response of inner_index)
    • inner_index (InnerIndex) – a data structure that accepts data from some data_column and, for each query, answers with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column

query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder on it and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table it computes a tuple of values that are in the rows that have IDs indicated by the response of the InnerIndex. It returns a JoinResult of a left join between query table (a table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if set of matching IDs is empty).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The answers to the old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple consisting of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter function specified in metadata_filter returns True when run against the data in inner_index.metadata_column in a potentially matched row. Passing None as the value in the column defined in the parameter metadata_filter indicates that all possible matches corresponding to this query pass the filtering step.
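
The two output formats selected by collapse_rows can be sketched in plain Python. This is a hedged illustration, not Pathway code: join_with_data and its dictionary-based inputs are hypothetical stand-ins for the query table, the reply of the inner index, and data_table, and None is used here to represent values that are not present.

```python
def join_with_data(queries, index_reply, data_table, collapse_rows=True):
    """Left-join queries with data rows selected by the index reply.

    queries: query ID -> query text
    index_reply: query ID -> list of matched row IDs
    data_table: row ID -> dict of column values
    """
    columns = sorted({col for row in data_table.values() for col in row})
    result = []
    for query_id, query in queries.items():
        matched = index_reply.get(query_id, [])
        if collapse_rows:
            # One row per query; each data column becomes a tuple of matched values.
            row = {"query": query}
            for col in columns:
                row[col] = tuple(data_table[r][col] for r in matched) if matched else None
            result.append(row)
        elif matched:
            # One row per match.
            for r in matched:
                result.append({"query": query, **data_table[r]})
        else:
            # Left join: a query without matches still yields one row.
            result.append({"query": query, **{col: None for col in columns}})
    return result


queries = {"q1": "cats", "q2": "dogs"}
index_reply = {"q1": [10, 11]}  # q2 has no matches
data_table = {10: {"doc": "cat facts"}, 11: {"doc": "more cats"}}

collapsed = join_with_data(queries, index_reply, data_table, collapse_rows=True)
expanded = join_with_data(queries, index_reply, data_table, collapse_rows=False)
```

With collapse_rows=True the sketch yields two rows (one per query); with collapse_rows=False it yields three (two matches for the first query plus one row for the unmatched query).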

query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder on it and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table it computes a tuple of values that are in the rows that have IDs indicated by the response of the InnerIndex. It returns a JoinResult of a left join between query table (a table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if set of matching IDs is empty).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query_as_of_now (all predefined indices support it; this information is relevant for third-party extensions).

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple consisting of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter function specified in metadata_filter returns True when run against the data in inner_index.metadata_column in a potentially matched row. Passing None as the value in the column defined in the parameter metadata_filter indicates that all possible matches corresponding to this query pass the filtering step.

class pw.indexing.HybridIndex(retrievers, k=60)


Hybrid index that composes any number of other indices and combines their results using Reciprocal Rank Fusion (RRF). It queries each index, and each retrieved row d is assigned the score 1/(k+rank(d)), which is then summed over all indices. HybridIndex returns the best rows from the indexed data according to this score.

  • Parameters
    • retrievers (list[InnerIndex]) – list of indices to be used to compose the hybrid index.
    • k (float) – constant used for calculating ranking score.
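
The scoring rule above can be worked through in a few lines of plain Python. This is a sketch rather than Pathway's implementation: reciprocal_rank_fusion is a hypothetical helper, the input rankings are made up, and ranks are assumed to start at 1 (the description does not state the starting rank).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60.0, number_of_matches=3):
    """Fuse ranked ID lists coming from several indices.

    An ID placed at position `rank` (1-based, an assumption of this sketch)
    by one index contributes 1 / (k + rank); contributions are summed
    over all indices.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:number_of_matches]


# Hypothetical replies of two retrievers, best match first.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits], k=60)
```

Row d2 ends up first because it is ranked highly by both retrievers (1/62 + 1/61), ahead of d1 (1/61 + 1/63). A larger k flattens the differences between ranks, so agreement between indices matters more than the exact position in any single one.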

query(query_column, *, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators that indicate otherwise).

The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.HybridIndexFactory(retriever_factories, k=60)


Factory for creating hybrid indices.

  • Parameters
    • retriever_factories (list[InnerIndexFactory]) – list of factories of indices that will be used in the hybrid index
    • k (float) – constant used for calculating ranking score.

class pw.indexing.LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean', embedder=None)


Interface for Pathway’s implementation of KNN via LSH.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of metadata as dictionary, in JSON format.
    • dimensions (int) – number of dimensions in the data
    • n_or (int) – number of ORs
    • n_and (int) – number of ANDs
    • bucket_length (float) – bucket length (after projecting on a line)
    • distance_type (str) – “euclidean” and “cosine” metrics are supported.
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
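
To give some intuition for n_or, n_and and bucket_length, here is a pure-Python sketch of the textbook Euclidean LSH scheme. It is an assumption that Pathway's LshKnn follows the same general idea; the class EuclideanLsh and its methods are hypothetical. Each of the n_or tables hashes a vector with n_and random projections, and each projection is cut into buckets of width bucket_length.

```python
import math
import random
from collections import defaultdict


class EuclideanLsh:
    """Toy LSH: n_or hash tables, each keyed by n_and quantized random projections."""

    def __init__(self, dimensions, n_or=20, n_and=10, bucket_length=10.0, seed=0):
        rng = random.Random(seed)
        self.bucket_length = bucket_length
        # For every table, draw n_and random lines (a direction plus an offset).
        self.tables = [
            [
                ([rng.gauss(0.0, 1.0) for _ in range(dimensions)],
                 rng.uniform(0.0, bucket_length))
                for _ in range(n_and)
            ]
            for _ in range(n_or)
        ]
        self.buckets = [defaultdict(set) for _ in range(n_or)]

    def _signature(self, table, vector):
        # Project on each line and record which bucket of width bucket_length is hit.
        return tuple(
            math.floor(
                (sum(d * x for d, x in zip(direction, vector)) + offset)
                / self.bucket_length
            )
            for direction, offset in table
        )

    def add(self, row_id, vector):
        for table, buckets in zip(self.tables, self.buckets):
            buckets[self._signature(table, vector)].add(row_id)

    def candidates(self, vector):
        # AND within a table (all n_and hashes must agree), OR across the tables.
        found = set()
        for table, buckets in zip(self.tables, self.buckets):
            found |= buckets.get(self._signature(table, vector), set())
        return found


lsh = EuclideanLsh(dimensions=2, n_or=4, n_and=2, bucket_length=10.0)
lsh.add("p", [1.0, 2.0])
lsh.add("q", [500.0, -300.0])
close = lsh.candidates([1.0, 2.0])
```

In this scheme, raising n_and makes each table more selective (fewer false candidates), while raising n_or adds tables and so reduces the chance of missing a true neighbor. The candidate set would then typically be re-ranked by exact distance to pick the final number_of_matches results.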

query(query_column, number_of_matches=3, metadata_filter=None)


  • Parameters
    • query_column (pw.ColumnExpression) – column containing data that is used to query the index;
    • number_of_matches (pw.ColumnExpression [int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)


  • Parameters
    • query_column (pw.ColumnExpression) – column containing data that is used to query the index;
    • number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

class pw.indexing.LshKnnFactory(*, dimensions=None, embedder=None, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')


Factory for creating LshKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions in the data
    • n_or (int) – number of ORs
    • n_and (int) – number of ANDs
    • bucket_length (float) – bucket length (after projecting on a line)
    • distance_type (str) – “euclidean” and “cosine” metrics are supported.
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

class pw.indexing.SortedIndex


clear() -> None. Remove all items from D.

copy() -> a shallow copy of D

fromkeys()

Create a new dictionary with keys from iterable and values set to value.

get()

Return the value for key if key is in the dictionary, else default.

items() -> a set-like object providing a view on D's items

keys() -> a set-like object providing a view on D's keys

pop(k[, d]) -> v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) -> None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() -> an object providing a view on D's values

class pw.indexing.TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)


Interface for full text index based on BM25, provided via tantivy.

  • Parameters
    • data_column (pw.ColumnExpression[str]) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
    • in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage
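
For readers unfamiliar with BM25, the following pure-Python sketch shows the kind of score such a full text index ranks by. It uses the common Okapi BM25 formula with the frequently used defaults k1=1.2 and b=0.75 and naive whitespace tokenization; these are assumptions of the sketch and say nothing about how tantivy tokenizes or configures its scoring. The function bm25_scores is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(documents, query, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula.

    documents: row ID -> text; tokenization is plain whitespace splitting.
    """
    tokenized = {row_id: text.lower().split() for row_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for row_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores[row_id] = score
    return scores


documents = {
    1: "pathway processes data streams",
    2: "streams streams and more streams",
    3: "batch jobs only",
}
scores = bm25_scores(documents, "streams")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Document 2 ranks first because the query term occurs in it three times, document 1 follows with a single occurrence, and document 3 scores zero.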

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the Tantivy BM25 index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.TantivyBM25Factory(ram_budget=52428800, in_memory_index=True)


Factory for creating a TantivyBM25 index.

  • Parameters
    • ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
    • in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage

class pw.indexing.USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=0, expansion_add=0, expansion_search=0, embedder=None)


Interface for the usearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm (described in the HNSW white paper).

Understanding some of the parameter descriptions below may require familiarity with either the HNSW algorithm or its implementation provided by USearch.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • metric (USearchMetricKind) – metric kind that is used to determine distance
    • connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells usearch to configure it on its own
    • expansion_add (int) – amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells usearch to configure it on its own
    • expansion_search (int) – amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells usearch to configure it on its own
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the usearch KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.UsearchKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, metric=<pathway.engine.USearchMetricKind object>, connectivity=0, expansion_add=0, expansion_search=0)


Factory for creating USearchKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • metric (USearchMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
    • connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells usearch to configure it on its own
    • expansion_add (int) – amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells usearch to configure it on its own
    • expansion_search (int) – amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells usearch to configure it on its own
    • embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

pw.indexing.default_brute_force_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of BruteForceKnn. This method chooses some parameters of BruteForceKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parametrized instance of BruteForceKnn. Look up DataIndex constructor to see how to make data index parametrized by custom data structure, and the constructor of BruteForceKnn to see the parameters that can be adjusted.

pw.indexing.default_full_text_document_index(data_column, data_table, *, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (that supports text queries), which is not necessarily the best choice of index and its parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

pw.indexing.default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parametrized instance of LshKnn. Look up DataIndex constructor to see how to make data index parametrized by custom data structure, and the constructor of LshKnn to see the parameters that can be adjusted.

pw.indexing.default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parametrized instance of USearchKnn. Look up DataIndex constructor to see how to make data index parametrized by custom data structure, and the constructor of USearchKnn to see the parameters that can be adjusted.

pw.indexing.default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (that supports queries on vectors), which is not necessarily the best choice of index and its parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

pw.indexing.retrieve_prev_next_values(ordered_table, value=None)

Retrieve, for each row, a pointer to the first row in the ordered_table that contains a non-“None” value, based on the orders defined by the prev and next columns.

  • Parameters
    • ordered_table (pw.Table) – Table with three columns: value, prev, next. The prev and next columns contain pointers to other rows.
    • value (Optional[pw.ColumnReference]) – Column reference pointing to the column containing values. If not provided, assumes the column name is “value”.
  • Returns
    pw.Table
    Table with two columns: prev_value and next_value.
      The prev_value column contains the values of the first row, according to the order defined by the column next, with a value different from None.
      The next_value column contains the values of the first row, according to the order defined by the column prev, with a value different from None.
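
One possible reading of this behavior is sketched below in plain Python. The function retrieve_prev_next is a hypothetical stand-in that works on a dictionary instead of a pw.Table, and it makes two assumptions the description does not spell out: a row whose own value is not None points to itself, and prev_value is found by following prev pointers (next_value by following next pointers).

```python
def retrieve_prev_next(rows):
    """rows: row ID -> {"value": ..., "prev": ID or None, "next": ID or None}.

    Returns row ID -> {"prev_value": ID or None, "next_value": ID or None},
    where each entry points to the closest row holding a non-None value.
    """

    def first_with_value(start, direction):
        # Walk along the chosen pointer column until a row with a value is found.
        current = start
        while current is not None:
            if rows[current]["value"] is not None:
                return current
            current = rows[current][direction]
        return None

    return {
        row_id: {
            "prev_value": first_with_value(row_id, "prev"),
            "next_value": first_with_value(row_id, "next"),
        }
        for row_id in rows
    }


# A chain a -> b -> c -> d in which only the end points carry values.
rows = {
    "a": {"value": 1, "prev": None, "next": "b"},
    "b": {"value": None, "prev": "a", "next": "c"},
    "c": {"value": None, "prev": "b", "next": "d"},
    "d": {"value": 4, "prev": "c", "next": None},
}
pointers = retrieve_prev_next(rows)
```

In this example the two middle rows, which hold no value, point back to row a and forward to row d.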
    

Submodules

pathway.stdlib.indexing.bm25 module

class pw.indexing.bm25.TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)


Interface for full text index based on BM25, provided via tantivy.

  • Parameters
    • data_column (pw.ColumnExpression[str]) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
    • in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the Tantivy BM25 index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.bm25.TantivyBM25Factory(ram_budget=52428800, in_memory_index=True)


Factory for creating a TantivyBM25 index.

  • Parameters
    • ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
    • in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage

pathway.stdlib.indexing.data_index module

class pw.indexing.data_index.DataIndex(data_table, inner_index)


A class that, given an implementation of an index, provides methods that augment the search results with supplementary data.

  • Parameters
    • data_table (pw.Table) – table containing supplementary data, joined by matching IDs (the ID from data_table against the ID from the response of inner_index)
    • inner_index (InnerIndex) – a data structure that accepts data from some data_column and, for each query, answers with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column

query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder on it and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table it computes a tuple of values that are in the rows that have IDs indicated by the response of the InnerIndex. It returns a JoinResult of a left join between query table (a table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if set of matching IDs is empty).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The answers to the old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple consisting of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter function specified in metadata_filter returns True when run against the data in inner_index.metadata_column in a potentially matched row. Passing None as the value in the column defined in the parameter metadata_filter indicates that all possible matches corresponding to this query pass the filtering step.

query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder on it and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table it computes a tuple of values that are in the rows that have IDs indicated by the response of the InnerIndex. It returns a JoinResult of a left join between query table (a table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if set of matching IDs is empty).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query_as_of_now (all predefined indices support it; this information is relevant for third-party extensions).

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, each column of the right side of the resulting JoinResult contains a tuple consisting of values from matched rows of corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_index, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter function specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column defined by metadata_filter indicates that all possible matches for this query pass the filtering step.
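
The two output shapes selected by collapse_rows can be illustrated without Pathway. The following is a minimal plain-Python sketch, not the library's implementation: the dictionaries data_table and replies, the helper names collapsed and expanded, and the use of None for "value not present" are all assumptions made for illustration.

```python
# Hypothetical data: row ID -> row of the indexed table.
data_table = {
    "d1": {"text": "apple pie"},
    "d2": {"text": "apple tart"},
    "d3": {"text": "pear cake"},
}
# Hypothetical inner-index replies: query ID -> matched row IDs.
replies = {"q1": ["d1", "d2"], "q2": []}


def collapsed(replies, data_table):
    """Mimics collapse_rows=True: exactly one row per query, columns hold tuples."""
    out = []
    for qid, ids in replies.items():
        # A query without matches still gets a row; its value is absent (None here).
        text = tuple(data_table[i]["text"] for i in ids) if ids else None
        out.append({"query": qid, "text": text})
    return out


def expanded(replies, data_table):
    """Mimics collapse_rows=False: one row per match, plus one row per unmatched query."""
    out = []
    for qid, ids in replies.items():
        if not ids:
            out.append({"query": qid, "text": None})
        for i in ids:
            out.append({"query": qid, "text": data_table[i]["text"]})
    return out


collapsed_rows = collapsed(replies, data_table)
expanded_rows = expanded(replies, data_table)
```

Here collapsed_rows has two rows (one per query), while expanded_rows has three: two for the matches of q1 and one for the unmatched q2.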

class pw.indexing.data_index.GeneralJoin(*args, **kwargs)

[source]

class pw.indexing.data_index.IdScoreSchema

[source]

class pw.indexing.data_index.InnerIndex(data_column, metadata_column)

[source]

Abstract class representing a data structure that accepts data (in self.data_column) with optional metadata (in self.metadata_column), and answers queries with a set of ‘matching’ IDs from the data structure (optionally filtered with JMESPath query run against stored metadata). The IDs are taken from the table that contains the data_column column. Which IDs are considered as matched is defined in particular implementations of subclasses of this class. Can be used as index argument of DataIndex, which is a wrapper that augments the matching IDs with some additional data.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.

abstract query(query_column, *, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators that state otherwise).

The resulting table with results needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

abstract query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table with results needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
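
The difference between the two contracts can be sketched with a toy index in plain Python. This is only an illustration under stated assumptions, not Pathway code: ToyIndex, its add method, and the use of absolute difference as the score (lower is better) are hypothetical.

```python
class ToyIndex:
    """Hypothetical 1-D nearest-value index; not Pathway's implementation."""

    def __init__(self):
        self.data = {}      # row ID -> number
        self.standing = {}  # query ID -> (value, number_of_matches)
        self.answers = {}   # query ID -> tuple of (matched ID, score) pairs

    def _match(self, value, number_of_matches):
        # Score every entry by absolute difference and keep the closest ones.
        scored = sorted((abs(v - value), rid) for rid, v in self.data.items())
        return tuple((rid, dist) for dist, rid in scored[:number_of_matches])

    def query(self, qid, value, number_of_matches=3):
        # Standing query: remembered so that later index changes revise the answer.
        self.standing[qid] = (value, number_of_matches)
        self.answers[qid] = self._match(value, number_of_matches)

    def query_as_of_now(self, qid, value, number_of_matches=3):
        # Answered once against the current state and never revisited.
        self.answers[qid] = self._match(value, number_of_matches)

    def add(self, rid, value):
        self.data[rid] = value
        # Only standing queries are re-evaluated when the index changes.
        for qid, (qvalue, n) in self.standing.items():
            self.answers[qid] = self._match(qvalue, n)


index = ToyIndex()
index.add("a", 10.0)
index.query("live", 4.0, number_of_matches=1)
index.query_as_of_now("frozen", 4.0, number_of_matches=1)
index.add("b", 5.0)  # a closer entry arrives after both queries
```

After the second add, index.answers["live"] points at "b", while index.answers["frozen"] still points at "a".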

pw.indexing.full_text_document_index.default_full_text_document_index(data_column, data_table, *, metadata_column=None)

[source]

Returns an instance of DataIndex with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports text queries), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

pathway.stdlib.indexing.hybrid_index module

class pw.indexing.hybrid_index.HybridIndex(retrievers, k=60)

[source]

Hybrid Index that composes any number of other indices and combines them using the Reciprocal Rank Fusion (RRF). It queries each index, and each retrieved row d is assigned score 1/(k+rank(d)), which is then summed over all indices. HybridIndex returns best rows from indexed data according to this score.

  • Parameters
    • retrievers (list[InnerIndex]) – list of indices to be used to compose the hybrid index.
    • k (float) – constant used for calculating ranking score.
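
The scoring rule 1/(k+rank(d)) can be sketched in plain Python as follows. This is a minimal illustration rather than HybridIndex itself: the function name reciprocal_rank_fusion, the row IDs, ranks starting at 1, and breaking ties by row ID are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60.0, number_of_matches=3):
    """Sum 1 / (k + rank) over all rankings; ranks are assumed to start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by row ID to keep the order stable.
    best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return best[:number_of_matches]


# Hypothetical results of two retrievers for the same query, best match first.
vector_hits = ["d1", "d2", "d3"]
text_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([vector_hits, text_hits], k=60)
```

A row that appears near the top of several rankings (here d2) outranks a row that is first in only one of them. A larger k flattens the differences between ranks.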

query(query_column, *, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators that state otherwise).

The resulting table with results needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table with results needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.hybrid_index.HybridIndexFactory(retriever_factories, k=60)

[source]

Factory for creating hybrid indices.

  • Parameters
    • retriever_factories (list[InnerIndexFactory]) – list of factories of indices that will be used in the hybrid index
    • k (float) – constant used for calculating ranking score.

class pw.indexing.nearest_neighbors.BruteForceKnn(data_column, metadata_column, *, dimensions, reserved_space, auxiliary_space=131072, metric, embedder=None)

[source]

Interface for a brute force implementation of a nearest neighbors index.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the space used is still proportional to the size of the index.
    • metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.
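
Conceptually, a brute force nearest neighbors search compares the query against every stored vector and keeps the closest ones. The plain-Python sketch below illustrates the idea only; it is not the engine's implementation, and the helper names, the sample vectors, and the choice of cosine distance are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def brute_force_knn(entries, query, number_of_matches=3):
    """Score every entry against the query and return the closest (ID, distance) pairs."""
    scored = sorted((cosine_distance(vec, query), rid) for rid, vec in entries.items())
    return tuple((rid, dist) for dist, rid in scored[:number_of_matches])


# Hypothetical 2-dimensional entries: row ID -> vector.
entries = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = brute_force_knn(entries, [1.0, 0.1], number_of_matches=2)
```

Every query computes one distance per stored entry, which is why the amount of memory needed for distances grows with the size of the index.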

query(query_column, number_of_matches=3, metadata_filter=None)

[source]

Currently, the brute force KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table with results needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.nearest_neighbors.BruteForceKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)

[source]

Factory for creating BruteForceKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the space used is still proportional to the size of the index.
    • metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.

class pw.indexing.nearest_neighbors.KnnIndexFactory(*, dimensions=None, embedder=None)

[source]

class pw.indexing.nearest_neighbors.LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean', embedder=None)

[source]

Interface for Pathway’s implementation of KNN via LSH.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of metadata as dictionary, in JSON format.
    • dimensions (int) – number of dimensions in the data
    • n_or (int) – number of ORs
    • n_and (int) – number of ANDs
    • bucket_length (float) – bucket length (after projecting on a line)
    • distance_type (str) – “euclidean” and “cosine” metrics are supported.
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.
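
The roles of n_or, n_and and bucket_length can be illustrated with a common Euclidean LSH scheme: each hash projects a vector on a random line and cuts the line into buckets of length bucket_length, n_and such hashes are combined into one bucket key, and n_or independent keys are computed. The sketch below is plain Python under these assumptions; whether Pathway's LshKnn follows exactly this scheme is not confirmed here, and all function names are hypothetical.

```python
import math
import random


def make_tables(dimensions, n_or=20, n_and=10, bucket_length=10.0, seed=0):
    """Draw n_or tables, each with n_and random (line, offset) projections."""
    rng = random.Random(seed)
    return [
        [
            (
                [rng.gauss(0.0, 1.0) for _ in range(dimensions)],
                rng.uniform(0.0, bucket_length),
            )
            for _ in range(n_and)
        ]
        for _ in range(n_or)
    ]


def bucket_keys(vector, tables, bucket_length=10.0):
    """One key per table: the tuple of bucket numbers along each projection line."""
    keys = []
    for table in tables:
        key = tuple(
            math.floor(
                (sum(a * x for a, x in zip(line, vector)) + offset) / bucket_length
            )
            for line, offset in table
        )
        keys.append(key)
    return keys


def is_candidate(u, v, tables, bucket_length=10.0):
    """AND within a table (all n_and buckets agree), OR across the n_or tables."""
    keys_u = bucket_keys(u, tables, bucket_length)
    keys_v = bucket_keys(v, tables, bucket_length)
    return any(ku == kv for ku, kv in zip(keys_u, keys_v))


tables = make_tables(dimensions=2, n_or=4, n_and=3, bucket_length=10.0)
keys = bucket_keys([1.0, 2.0], tables)
```

In this scheme, raising n_and makes each key stricter (fewer false candidates), while raising n_or gives nearby vectors more chances to share a key.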

query(query_column, number_of_matches=3, metadata_filter=None)

[source]

  • Parameters
    • query_column (pw.ColumnExpression) – column containing data that is used to query the index;
    • number_of_matches (pw.ColumnExpression [int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

[source]

  • Parameters
    • query_column (pw.ColumnExpression) – column containing data that is used to query the index;
    • number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

class pw.indexing.nearest_neighbors.LshKnnFactory(*, dimensions=None, embedder=None, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')

[source]

Factory for creating LshKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions in the data
    • n_or (int) – number of ORs
    • n_and (int) – number of ANDs
    • bucket_length (float) – bucket length (after projecting on a line)
    • distance_type (str) – “euclidean” and “cosine” metrics are supported.
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.

class pw.indexing.nearest_neighbors.USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=0, expansion_add=0, expansion_search=0, embedder=None)

[source]

Interface for the usearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm (see the HNSW white paper).

Understanding the explanations of some of the parameters may require some familiarity with either the HNSW algorithm or its implementation provided by USearch.

  • Parameters
    • data_column (pw.ColumnExpression) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • metric (USearchMetricKind) – metric kind that is used to determine distance
    • connectivity (int) – maximum number of edges for a node in the HNSW index, setting this value to 0 tells usearch to configure it on its own
    • expansion_add (int) – indicates amount of work spent while adding elements to the index (higher = more accurate placement, more work), setting this value to 0 tells usearch to configure it on its own
    • expansion_search (int) – indicates amount of work spent while searching for elements in the index (higher = more accurate results, more work), setting this value to 0 tells usearch to configure it on its own
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.

query(query_column, number_of_matches=3, metadata_filter=None)

[source]

Currently, the usearch KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

[source]

An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table with results needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class pw.indexing.nearest_neighbors.UsearchKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, metric=<pathway.engine.USearchMetricKind object>, connectivity=0, expansion_add=0, expansion_search=0)

[source]

Factory for creating USearchKnn indices.

  • Parameters
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in the number of entries) of the index
    • metric (USearchMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
    • connectivity (int) – maximum number of edges for a node in the HNSW index, setting this value to 0 tells usearch to configure it on its own
    • expansion_add (int) – indicates amount of work spent while adding elements to the index (higher = more accurate placement, more work), setting this value to 0 tells usearch to configure it on its own
    • expansion_search (int) – indicates amount of work spent while searching for elements in the index (higher = more accurate results, more work), setting this value to 0 tells usearch to configure it on its own
    • embedder (UDF | None) – UDF used for calculating embeddings of string. It is needed, if index is used for indexing texts.

pathway.stdlib.indexing.sorting module

class pw.indexing.sorting.Candidate

[source]

class pw.indexing.sorting.Hash

[source]

class pw.indexing.sorting.Instance

[source]

class pw.indexing.sorting.Key

[source]

class pw.indexing.sorting.LeftRight

[source]

class pw.indexing.sorting.Node

[source]

class pw.indexing.sorting.Parent

[source]

class pw.indexing.sorting.PrevNext

[source]

class pw.indexing.sorting.SortedIndex

[source]

clear() → None. Remove all items from D.

copy() → a shallow copy of D

fromkeys()

Create a new dictionary with keys from iterable and values set to value.

get()

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]

values() → an object providing a view on D's values

pw.indexing.sorting.retrieve_prev_next_values(ordered_table, value=None)

[source]

Retrieve, for each row, a pointer to the first row in the ordered_table that contains a non-“None” value, based on the orders defined by the prev and next columns.

  • Parameters
    • ordered_table (pw.Table) – Table with three columns: value, prev, next. The prev and next columns contain pointers to other rows.
    • value (Optional[pw.ColumnReference]) – Column reference pointing to the column containing values. If not provided, assumes the column name is “value”.
  • Returns
    pw.Table
    Table with two columns: prev_value and next_value.
      The prev_value column contains the values of the first row, according to the order defined by the column next, with a value different from None.
      The next_value column contains the values of the first row, according to the order defined by the column prev, with a value different from None.
    

pathway.stdlib.indexing.typecheck_utils module

pathway.stdlib.indexing.vector_document_index module

pw.indexing.vector_document_index.default_brute_force_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

[source]

Returns an instance of DataIndex whose inner index (data structure) is an instance of BruteForceKnn. This method chooses some parameters of BruteForceKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, use DataIndex with a parametrized instance of BruteForceKnn. See the DataIndex constructor for how to parametrize a data index with a custom data structure, and the constructor of BruteForceKnn for the parameters that can be adjusted.

pw.indexing.vector_document_index.default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

[source]

Returns an instance of DataIndex whose inner index (data structure) is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, use DataIndex with a parametrized instance of LshKnn. See the DataIndex constructor for how to parametrize a data index with a custom data structure, and the constructor of LshKnn for the parameters that can be adjusted.

pw.indexing.vector_document_index.default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

[source]

Returns an instance of DataIndex whose inner index (data structure) is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, use DataIndex with a parametrized instance of USearchKnn. See the DataIndex constructor for how to parametrize a data index with a custom data structure, and the constructor of USearchKnn for the parameters that can be adjusted.

pw.indexing.vector_document_index.default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

[source]

Returns an instance of DataIndex with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports queries on vectors), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.