pw.indexing

class BruteForceKnn(data_column, metadata_column, *, dimensions, reserved_space, auxiliary_space=131072, metric, embedder=None)

Interface for a brute force implementation of a nearest neighbors index.

Parameters
- data_column (pw.ColumnExpression) – the column expression representing the data.
- metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
- dimensions (int) – number of dimensions of vectors that are used by the index and queries
- reserved_space (int) – initial capacity (in the number of entries) of the index
- auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the auxiliary space is still proportional to the size of the index.
- metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
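To make "brute force" concrete, here is a minimal pure-Python sketch that compares the query against every stored vector and keeps the closest ones. It does not use Pathway; the helper names (`cosine_distance`, `brute_force_knn`) and the choice of cosine distance are assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def brute_force_knn(entries, query, number_of_matches=3):
    """Hypothetical helper: entries maps an ID to its vector.

    Every entry is scored against the query, so the work per query
    grows linearly with the number of entries.
    """
    scored = [(entry_id, cosine_distance(vec, query)) for entry_id, vec in entries.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:number_of_matches]


entries = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
matches = brute_force_knn(entries, [1.0, 0.1], number_of_matches=2)
print(matches)
```

The list of all distances built inside `brute_force_knn` is roughly the quantity that a parameter such as auxiliary_space bounds in a real implementation.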

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the brute force KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
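The shape of the reply described above can be sketched in plain Python. This is an illustration only, not Pathway code: the function name `query_as_of_now_reply` is hypothetical, a Python predicate stands in for the JMESPath metadata filter, and the sketch assumes that a higher score means a better match.

```python
import json


def query_as_of_now_reply(scored_entries, number_of_matches=3, metadata_predicate=None):
    """Hypothetical helper building one reply tuple.

    scored_entries: list of (entry_id, score, metadata_json).
    Returns a tuple of (entry_id, score) pairs, best score first.
    """
    kept = []
    for entry_id, score, metadata_json in scored_entries:
        metadata = json.loads(metadata_json) if metadata_json else {}
        # The predicate plays the role of the metadata filter.
        if metadata_predicate is None or metadata_predicate(metadata):
            kept.append((entry_id, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return tuple(kept[:number_of_matches])


scored = [
    ("doc1", 0.91, '{"lang": "en"}'),
    ("doc2", 0.87, '{"lang": "de"}'),
    ("doc3", 0.55, '{"lang": "en"}'),
]
# One result row: the reply tuple is stored under the _pw_index_reply column name.
row = {"_pw_index_reply": query_as_of_now_reply(scored, 2, lambda m: m.get("lang") == "en")}
print(row)
```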

class BruteForceKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)

Factory for creating BruteForceKnn indices.

Parameters
- dimensions (int) – number of dimensions of vectors that are used by the index and queries. This is only needed if the embedder is not provided.
- reserved_space (int) – initial capacity (in the number of entries) of the index
- auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the value given in this parameter is ignored and the auxiliary space is still proportional to the size of the index.
- metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

class DataIndex(data_table, inner_index)

A class that given an implementation of an index provides methods that augment the search results with supplementary data.

Parameters
- data_table (pw.Table) – table containing supplementary data, joined by matching IDs (the ID from data_table and the ID from the response of inner_index)
- inner_index (InnerIndex) – a data structure that accepts data from some data_column and for each query answers with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column

query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table it computes a tuple of values that are in the rows that have IDs indicated by the response of the InnerIndex. It returns a JoinResult of a left join between query table (a table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if set of matching IDs is empty).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The answers to the old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.

Parameters
- query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
- number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
- collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column of the right side of the resulting JoinResult contains a tuple consisting of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
- metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for that query pass the filtering step.
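The two output formats selected by collapse_rows can be illustrated with a small pure-Python sketch. It does not call Pathway: plain dicts stand in for tables, `matches` stands in for the reply of the inner index, and the function name `left_join_with_matches` is hypothetical.

```python
def left_join_with_matches(queries, data_table, matches, collapse_rows=True):
    """Hypothetical stand-in for the join performed after the index lookup.

    queries: {query_id: query_text}
    data_table: {row_id: {"doc": text}}
    matches: {query_id: [row_id, ...]} as returned by some inner index
    """
    result = []
    for query_id, query_text in queries.items():
        matched_ids = matches.get(query_id, [])
        if collapse_rows:
            # Exactly one row per query; matched values are gathered into a tuple.
            docs = tuple(data_table[i]["doc"] for i in matched_ids)
            result.append({"query": query_text, "doc": docs if docs else None})
        elif matched_ids:
            # One row per match.
            for i in matched_ids:
                result.append({"query": query_text, "doc": data_table[i]["doc"]})
        else:
            # Left join: a query without matches still yields one row.
            result.append({"query": query_text, "doc": None})
    return result


queries = {"q1": "cats", "q2": "zebras"}
data_table = {1: {"doc": "Cats purr."}, 2: {"doc": "Cats nap."}}
matches = {"q1": [1, 2]}

collapsed = left_join_with_matches(queries, data_table, matches, collapse_rows=True)
flat = left_join_with_matches(queries, data_table, matches, collapse_rows=False)
print(collapsed)
print(flat)
```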

query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (being a match depends on the implementation and the internal state of the InnerIndex).

Optionally, the method can skip the tupling step, and return a JoinResult of a left join between query table, and self.data_table, using the result of InnerIndex to indicate when the IDs match (exactly one row per match plus one row per query with no matches).

The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query_as_of_now (all predefined indices support it; this is information for third-party extensions).

Parameters
- query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
- number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
- collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column of the right side of the resulting JoinResult contains a tuple consisting of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
- metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for that query pass the filtering step.
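The difference between query (old answers are updated) and query_as_of_now (old answers are left alone) can be sketched with a toy index. This is a conceptual illustration only: `ToyIndex` and its word-overlap scoring are hypothetical and unrelated to how Pathway implements either method.

```python
class ToyIndex:
    """Hypothetical index; the match score is the number of shared words."""

    def __init__(self):
        self.entries = {}         # entry_id -> text
        self.live_queries = []    # queries registered through query()
        self.frozen_answers = {}  # query -> answer computed once by query_as_of_now()

    def _best_match(self, query):
        words = set(query.split())
        scored = [
            (len(words & set(text.split())), entry_id)
            for entry_id, text in self.entries.items()
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        return max(scored)[1] if scored else None

    def add(self, entry_id, text):
        self.entries[entry_id] = text

    def query(self, query):
        # Remember the query; its answer follows later changes of the index.
        self.live_queries.append(query)

    def query_as_of_now(self, query):
        # Answer against the current state and never revisit it.
        self.frozen_answers[query] = self._best_match(query)

    def answers(self):
        live = {q: self._best_match(q) for q in self.live_queries}
        return live, dict(self.frozen_answers)


index = ToyIndex()
index.add("d1", "red apple")
index.query("green apple pie")
index.query_as_of_now("green apple pie")
index.add("d2", "green apple pie recipe")  # arrives after both queries

live, frozen = index.answers()
print(live, frozen)
```

After the second entry arrives, the live answer moves to the better match while the as-of-now answer keeps the result computed at query time.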

class DefaultKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)

Default factory for creating Knn index - uses the BruteForceKnn index.

Parameters
- dimensions (int) – number of dimensions of vectors that are used by the index and queries
- reserved_space (int) – initial capacity (in the number of entries) of the index
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

class HybridIndex(retrievers, k=60)

Hybrid index that composes any number of other indices and combines their results using Reciprocal Rank Fusion (RRF). It queries each index, and each retrieved row d is assigned the score 1/(k + rank(d)), which is then summed over all indices. HybridIndex returns the rows of the indexed data with the highest summed score.

Parameters
- retrievers (list[InnerIndex]) – list of indices to be used to compose the hybrid index.
- k (float) – constant used for calculating ranking score.
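The RRF scoring rule described above can be sketched in a few lines of plain Python. The function name `reciprocal_rank_fusion` is hypothetical, and the sketch assumes that ranks start at 1 for the best row of each index; the source does not state the starting rank.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, number_of_matches=3):
    """Hypothetical helper: rankings is a list of ID lists, best match first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Assumption: the top row of each ranking has rank 1.
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:number_of_matches]


vector_ranking = ["d1", "d2", "d3"]  # e.g. reply of a KNN index
text_ranking = ["d2", "d4", "d1"]    # e.g. reply of a full-text index
fused = reciprocal_rank_fusion([vector_ranking, text_ranking], k=60, number_of_matches=2)
print(fused)
```

A larger k flattens the difference between neighboring ranks, so rows that appear in several indices gain weight relative to rows ranked first in only one.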

query(query_column, *, number_of_matches=3, metadata_filter=None)

An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators that indicate otherwise).

The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class HybridIndexFactory(retriever_factories, k=60)

Factory for creating hybrid indices.

Parameters
- retriever_factories (list[InnerIndexFactory]) – list of factories of indices that will be used in the hybrid index
- k (float) – constant used for calculating ranking score.

class LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean', embedder=None)

Interface for Pathway’s implementation of KNN via LSH.

Parameters
- data_column (pw.ColumnExpression) – the column expression representing the data.
- metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of metadata as dictionary, in JSON format.
- dimensions (int) – number of dimensions in the data
- n_or (int) – number of ORs
- n_and (int) – number of ANDs
- bucket_length (float) – bucket length (after projecting on a line)
- distance_type (str) – “euclidean” and “cosine” metrics are supported.
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
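For readers unfamiliar with LSH, the sketch below shows the classic random-projection scheme for Euclidean distance, which is what the parameter names suggest. It is an assumption that n_or corresponds to the number of hash tables, n_and to the number of projections combined into one bucket key, and bucket_length to the bucket width on each projection line; Pathway's implementation may differ, and all function names here are hypothetical.

```python
import math
import random


def make_tables(dimensions, n_or, n_and, bucket_length, seed=0):
    """Each table holds n_and random lines (direction, offset)."""
    rng = random.Random(seed)
    return [
        [
            ([rng.gauss(0.0, 1.0) for _ in range(dimensions)], rng.uniform(0.0, bucket_length))
            for _ in range(n_and)
        ]
        for _ in range(n_or)
    ]


def bucket_key(vector, table, bucket_length):
    """AND step: project on every line of the table and concatenate the bucket numbers."""
    return tuple(
        math.floor((sum(a * x for a, x in zip(direction, vector)) + offset) / bucket_length)
        for direction, offset in table
    )


def candidates(query, data, tables, bucket_length):
    """OR step: an entry is a candidate if it shares the full key in at least one table."""
    found = set()
    for table in tables:
        query_key = bucket_key(query, table, bucket_length)
        for entry_id, vector in data.items():
            if bucket_key(vector, table, bucket_length) == query_key:
                found.add(entry_id)
    return found


data = {"near": [1.0, 1.0], "far": [1000.0, 1000.0]}
tables = make_tables(dimensions=2, n_or=20, n_and=10, bucket_length=10.0)
found = candidates([1.0, 1.0], data, tables, bucket_length=10.0)
print(found)
```

Raising n_and makes each table stricter (fewer false candidates), while raising n_or gives nearby points more chances to collide in some table.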

query(query_column, number_of_matches=3, metadata_filter=None)

Parameters
- query_column (pw.ColumnExpression) – column containing data that is used to query the index;
- number_of_matches (pw.ColumnExpression [int] | int) – number of nearest neighbors in the index response; defaults to 3
- metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

Parameters
- query_column (pw.ColumnExpression) – column containing data that is used to query the index;
- number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3
- metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

class LshKnnFactory(*, dimensions=None, embedder=None, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')

Factory for creating LshKnn indices.

Parameters
- dimensions (int) – number of dimensions in the data. This is only needed if the embedder is not provided.
- n_or (int) – number of ORs
- n_and (int) – number of ANDs
- bucket_length (float) – bucket length (after projecting on a line)
- distance_type (str) – “euclidean” and “cosine” metrics are supported.
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

class TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)

Interface for full text index based on BM25, provided via tantivy.

Parameters
- data_column (pw.ColumnExpression[str]) – the column expression representing the data.
- metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
- ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
- in_memory_index (bool) – indicates, whether the whole index is stored in RAM; if set to false, the index is stored in some default Pathway disk storage
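As background on the ranking itself, the following pure-Python sketch computes the textbook Okapi BM25 score. It is not tantivy's code: the whitespace tokenizer, the function name `bm25_scores`, and the values k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(documents, query, k1=1.2, b=0.75):
    """Hypothetical helper: documents maps an ID to its text; returns {ID: score}."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len(tokens) / avg_len))
        scores[doc_id] = score
    return scores


documents = {
    "d1": "pathway streams data",
    "d2": "index data with bm25",
    "d3": "cats purr",
}
scores = bm25_scores(documents, "bm25 index")
print(scores)
```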

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the Tantivy BM25 index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class TantivyBM25Factory(ram_budget=52428800, in_memory_index=True)

Factory for creating a TantivyBM25 index.

Parameters
- ram_budget (int) – maximum capacity in bytes. When reached, the index moves a block of data to storage (hence, larger budget means faster index operations, but higher memory cost)
- in_memory_index (bool) – indicates, whether the whole index is stored in RAM; if set to false, the index is stored in some default Pathway disk storage

class USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=0, expansion_add=0, expansion_search=0, embedder=None)

Interface for the USearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm (described in the HNSW white paper).

To understand the explanations of some of the parameters, you might need some familiarity with either the HNSW algorithm or its implementation provided by USearch.

Parameters
- data_column (pw.ColumnExpression) – the column expression representing the data.
- metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
- dimensions (int) – number of dimensions of vectors that are used by the index and queries
- reserved_space (int) – initial capacity (in the number of entries) of the index
- metric (USearchMetricKind) – metric kind that is used to determine distance
- connectivity (int) – maximum number of edges for a node in the HNSW index, setting this value to 0 tells usearch to configure it on its own
- expansion_add (int) – indicates amount of work spent while adding elements to the index (higher = more accurate placement, more work), setting this value to 0 tells usearch to configure it on its own
- expansion_search (int) – indicates amount of work spent while searching for elements in the index (higher = more accurate results, more work), setting this value to 0 tells usearch to configure it on its own
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
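To give a feel for what expansion_search and connectivity control, here is a simplified best-first search over a single proximity graph, which is the core step of an HNSW layer search. It is a sketch under assumptions, not USearch's implementation: real HNSW uses several layers, and the function name `search_layer` and the toy graph are hypothetical. In the sketch, expansion_search is the size of the result list kept while exploring, and connectivity corresponds to the length of each adjacency list.

```python
import heapq
import math


def search_layer(vectors, neighbors, entry_point, query, expansion_search, number_of_matches):
    """Best-first search; vectors: {node: vector}, neighbors: {node: [adjacent nodes]}."""
    start = math.dist(vectors[entry_point], query)
    visited = {entry_point}
    candidates = [(start, entry_point)]  # min-heap: closest unexplored node first
    best = [(-start, entry_point)]       # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= expansion_search and dist > -best[0][0]:
            break  # the closest candidate is worse than every kept result
        for nxt in neighbors[node]:
            if nxt in visited:
                continue
            visited.add(nxt)
            d = math.dist(vectors[nxt], query)
            if len(best) < expansion_search or d < -best[0][0]:
                heapq.heappush(candidates, (d, nxt))
                heapq.heappush(best, (-d, nxt))
                if len(best) > expansion_search:
                    heapq.heappop(best)  # drop the current worst result
    return sorted((-neg, node) for neg, node in best)[:number_of_matches]


# Toy graph: six points on a line, each linked to its direct neighbors.
vectors = {i: (float(i), 0.0) for i in range(6)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
result = search_layer(vectors, neighbors, entry_point=0, query=(4.2, 0.0),
                      expansion_search=3, number_of_matches=2)
print(result)
```

A larger expansion_search keeps more intermediate results alive, which costs more distance computations but makes it less likely that the search stops in a poor local neighborhood.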

query(query_column, number_of_matches=3, metadata_filter=None)

Currently, the USearch KNN index is supported only in the as-of-now variant.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

class UsearchKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, metric=<pathway.engine.USearchMetricKind object>, connectivity=0, expansion_add=0, expansion_search=0)

Factory for creating USearchKnn indices.

Parameters
- dimensions (int) – number of dimensions of vectors that are used by the index and queries. This is only needed if the embedder is not provided.
- reserved_space (int) – initial capacity (in the number of entries) of the index
- metric (USearchMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
- connectivity (int) – maximum number of edges for a node in the HNSW index, setting this value to 0 tells usearch to configure it on its own
- expansion_add (int) – indicates amount of work spent while adding elements to the index (higher = more accurate placement, more work), setting this value to 0 tells usearch to configure it on its own
- expansion_search (int) – indicates amount of work spent while searching for elements in the index (higher = more accurate results, more work), setting this value to 0 tells usearch to configure it on its own
- embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.

default_brute_force_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of BruteForceKnn. This method chooses some parameters of BruteForceKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parameterized instance of BruteForceKnn. Look up DataIndex constructor to see how to make data index parameterized by custom data structure, and the constructor of BruteForceKnn to see the parameters that can be adjusted.

default_full_text_document_index(data_column, data_table, *, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports text queries), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.

default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parameterized instance of LshKnn. Look up DataIndex constructor to see how to make data index parameterized by custom data structure, and the constructor of LshKnn to see the parameters that can be adjusted.

default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) that is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.

Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest some better default values). To have fixed configuration, you can use DataIndex with a parameterized instance of USearchKnn. Look up DataIndex constructor to see how to make data index parameterized by custom data structure, and the constructor of USearchKnn to see the parameters that can be adjusted.

default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports queries on vectors), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.