pw.indexing

class pw.indexing.DataIndex(data_table, inner_index, embedder=None)

A class that, given an implementation of an index, provides methods that augment the search results with supplementary data.
  • Parameters
    • data_table (pw.Table) – table containing supplementary data, matched by ID (the ID from data_table against the ID from the response of inner_index).
    • inner_index (InnerIndex) – a data structure that accepts data from some data_column and, for each query, answers with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column.
    • embedder (pw.UDF | None) – optional; if set, the index applies the embedder to the column passed to inner_index.

query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (what counts as a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs appear in the response of the InnerIndex. It returns a JoinResult of a left join between the query table (the table that holds query_column) and this table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).

Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate which IDs match (exactly one row per match, plus one row per query with no matches).

The answers to the old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple consisting of the values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional; contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for this query pass the filtering step.
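
The two output shapes selected by collapse_rows can be sketched in plain Python. This is only an illustration of the semantics described above, not Pathway code: the dictionaries, the `index_reply` mapping, and the helper names `collapsed` and `flat` are all hypothetical.

```python
# Plain-Python illustration of the two output shapes; not Pathway code.
queries = {"q1": "apple", "q2": "zebra"}  # query id -> query text
data_table = {  # row id -> row of supplementary data
    "d1": {"title": "Apple pie", "year": 2020},
    "d2": {"title": "Apple cider", "year": 2021},
}
# Hypothetical reply of an inner index: query id -> matched row ids.
index_reply = {"q1": ["d1", "d2"], "q2": []}
COLUMNS = ["title", "year"]


def collapsed(queries, data_table, index_reply):
    """Mimics collapse_rows=True: one row per query, columns hold tuples."""
    out = []
    for qid, text in queries.items():
        ids = index_reply.get(qid, [])
        row = {"query": text}
        for col in COLUMNS:
            # No matches: the value is not present (modelled as None here).
            row[col] = tuple(data_table[i][col] for i in ids) if ids else None
        out.append(row)
    return out


def flat(queries, data_table, index_reply):
    """Mimics collapse_rows=False: one row per match, plus one row per
    query that has no matches (left-join behaviour)."""
    out = []
    for qid, text in queries.items():
        ids = index_reply.get(qid, [])
        if not ids:
            out.append({"query": text, **{col: None for col in COLUMNS}})
        for i in ids:
            out.append({"query": text, **{col: data_table[i][col] for col in COLUMNS}})
    return out


one_per_query = collapsed(queries, data_table, index_reply)
one_per_match = flat(queries, data_table, index_reply)
```

Here `one_per_query` has two rows (one per query), while `one_per_match` has three: two for the matches of "apple" and one padded row for "zebra".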

query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)

This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (what counts as a match depends on the implementation and the internal state of the InnerIndex).

For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs appear in the response of the InnerIndex. It returns a JoinResult of a left join between the query table (the table that holds query_column) and this table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).

Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate which IDs match (exactly one row per match, plus one row per query with no matches).

The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query_as_of_now (all predefined indices support it; this note is for third-party extensions).

  • Parameters
    • query_column (pw.ColumnReference) – A column containing the queries, needs to be in the format compatible with self.inner_index (or self.embedder).
    • number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
    • collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple consisting of the values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
    • metadata_filter (pw.ColumnExpression [str | None] | pw.ColumnExpression [str] | None) – Optional; contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for this query pass the filtering step.
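
The difference between query (old answers are updated when the index changes) and query_as_of_now (answers are fixed at query time) can be sketched with a toy index. The class `ToyIndex` and its word-matching rule are hypothetical stand-ins for illustration; they do not reflect how Pathway implements either method.

```python
class ToyIndex:
    """Toy word-match index; a hypothetical stand-in for an InnerIndex."""

    def __init__(self):
        self.docs = {}          # doc id -> text
        self.live_queries = {}  # query id -> word; re-answered on every change
        self.answers = {}       # query id -> tuple of matching doc ids

    def _match(self, word):
        return tuple(sorted(d for d, text in self.docs.items() if word in text.split()))

    def query(self, qid, word):
        # Registered as "live": the answer is recomputed when the index changes.
        self.live_queries[qid] = word
        self.answers[qid] = self._match(word)

    def query_as_of_now(self, qid, word):
        # Answered once against the current state and never revisited.
        self.answers[qid] = self._match(word)

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for qid, word in self.live_queries.items():
            self.answers[qid] = self._match(word)


index = ToyIndex()
index.add("d1", "red apple")
index.query("q_live", "apple")
index.query_as_of_now("q_frozen", "apple")
index.add("d2", "green apple")  # the index state changes after both queries

live_answer = index.answers["q_live"]      # sees d1 and d2
frozen_answer = index.answers["q_frozen"]  # still only d1
```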

class pw.indexing.LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')

Interface for Pathway’s implementation of KNN via LSH.
  • Parameters
    • data_column (pw.ColumnExpression [list[float]]) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of metadata as dictionary, in JSON format.
    • dimensions (int) – number of dimensions in the data
    • n_or (int) – number of ORs
    • n_and (int) – number of ANDs
    • bucket_length (float) – bucket length (after projecting on a line)
    • distance_type (str) – “euclidean” and “cosine” metrics are supported.
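
As a rough intuition for n_or, n_and, and bucket_length, the sketch below shows one textbook random-projection LSH scheme for Euclidean distance. It assumes that n_and projections are combined into one bucket key (all must agree) and that n_or independent keys are tried (any may agree). Whether Pathway's LshKnn uses exactly this construction is an assumption; the function names are hypothetical.

```python
import math
import random


def make_tables(dimensions, n_or, n_and, bucket_length, seed=0):
    """Draw n_or groups of n_and random lines (direction plus offset)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_or):
        table = []
        for _ in range(n_and):
            direction = [rng.gauss(0.0, 1.0) for _ in range(dimensions)]
            offset = rng.uniform(0.0, bucket_length)
            table.append((direction, offset))
        tables.append(table)
    return tables


def bucket_keys(vector, tables, bucket_length):
    """Project the vector on every line and cut each line into buckets."""
    keys = []
    for table in tables:
        key = tuple(
            math.floor(
                (sum(d * x for d, x in zip(direction, vector)) + offset) / bucket_length
            )
            for direction, offset in table
        )
        keys.append(key)
    return keys


def is_candidate(u, v, tables, bucket_length):
    """OR over groups of (AND over projections): u and v are candidate
    neighbors if they share a complete bucket key in at least one group."""
    keys_u = bucket_keys(u, tables, bucket_length)
    keys_v = bucket_keys(v, tables, bucket_length)
    return any(ku == kv for ku, kv in zip(keys_u, keys_v))


tables = make_tables(dimensions=3, n_or=4, n_and=2, bucket_length=10.0)
keys = bucket_keys([1.0, 2.0, 3.0], tables, 10.0)
same_point = is_candidate([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], tables, 10.0)
```

In this scheme, raising n_and makes each key more selective (fewer false candidates), raising n_or gives nearby points more chances to collide (fewer missed neighbors), and a larger bucket_length makes each bucket coarser.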

query(query_column, number_of_matches=3, metadata_filter=None)


  • Parameters
    • query_column (pw.ColumnExpression [list[float]]) – column containing the data used to query the index.
    • number_of_matches (pw.ColumnExpression [int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)


  • Parameters
    • query_column (pw.ColumnExpression [list[float]]) – column containing the data used to query the index.
    • number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3
    • metadata_filter (pw.ColumnExpression [str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.

class pw.indexing.SortedIndex

clear() → None. Remove all items from D.

copy() → a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

pop(k[, d]) → v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D's values
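
The methods listed for SortedIndex read like the standard mapping methods of Python's built-in dict. The sketch below assumes that they behave the same way and demonstrates them on a plain dict rather than on a SortedIndex instance.

```python
# Demonstrated on a plain dict; assumes SortedIndex follows dict semantics.
d = {"a": 1, "b": 2}

missing = d.get("z", 0)        # key absent, so the default is returned
popped = d.pop("a")            # removes "a" and returns its value
fallback = d.pop("a", None)    # "a" is already gone, so the default is returned
d.setdefault("c", 3)           # inserts "c" because it is not present
d.update({"b": 20}, e=5)       # E (the dict) is applied first, then F (keywords)
last = d.popitem()             # LIFO: the most recently inserted pair

fresh = dict.fromkeys(["x", "y"], 0)  # new dict, every key mapped to 0
```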

class pw.indexing.TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)

Interface for full text index based on [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), provided via tantivy.

  • Parameters
    • data_column (pw.ColumnExpression[list[float]]) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • ram_budget (int) – maximum capacity in bytes (the default, 52428800, is 50 MiB). When it is reached, the index moves a block of data to storage (hence, a larger budget means faster index operations, but a higher memory cost).
    • in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage.

query(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators indicating otherwise).

The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
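
To make the reply shape concrete (a tuple of (ID, score) pairs per query, capped by number_of_matches), here is a minimal plain-Python sketch of Okapi BM25 ranking. It is not tantivy's implementation: the whitespace tokenizer, the parameter values k1=1.2 and b=0.75, the particular IDF variant, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25.
    docs maps a document id to its text and must not be empty."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = term_freq[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1.0) / norm
        scores[doc_id] = score
    return scores


def top_matches(query, docs, number_of_matches=3):
    """Return a tuple of (id, score) pairs, best first, dropping zero scores."""
    scores = bm25_scores(query, docs)
    ranked = sorted(
        ((doc_id, s) for doc_id, s in scores.items() if s > 0),
        key=lambda pair: -pair[1],
    )
    return tuple(ranked[:number_of_matches])


docs = {
    "d1": "the quick brown fox",
    "d2": "lazy dog",
    "d3": "quick quick fox jumps",
}
reply = top_matches("quick", docs)
```

With these three documents, "d3" ranks above "d1" because it has the same length but contains the query term twice, and "d2" is absent from the reply because it does not contain the term at all.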

class pw.indexing.USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=2, expansion_add=2, expansion_search=2)

Interface for the usearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm [white paper](https://arxiv.org/abs/1603.09320).

To understand the explanation of some of the parameters, you might need some familiarity with either the HNSW algorithm or its implementation provided by USearch.

  • Parameters
    • data_column (pw.ColumnExpression [list[float]]) – the column expression representing the data.
    • metadata_column (pw.ColumnExpression [str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
    • dimensions (int) – number of dimensions of vectors that are used by the index and queries
    • reserved_space (int) – initial capacity (in number of entries) of the index
    • metric (USearchMetricKind) – metric kind that is used to determine distance
    • connectivity (int) – maximum number of edges for a node in the HNSW index
    • expansion_add (int) – indicates amount of work spent while adding elements to the index (higher = more accurate placement, more work)
    • expansion_search (int) – indicates amount of work spent while searching for elements in the index (higher = more accurate results, more work)
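
HNSW is an approximate method; the result it approximates is an exact k-nearest-neighbors search under the chosen metric. As a point of comparison, the sketch below computes that exact baseline in plain Python with cosine distance. It does not use USearch, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_knn(query, vectors, number_of_matches=3):
    """Brute-force search: compare the query with every stored vector
    and keep the closest ones as (id, distance) pairs."""
    ranked = sorted(
        ((vec_id, cosine_distance(query, vec)) for vec_id, vec in vectors.items()),
        key=lambda pair: pair[1],
    )
    return tuple(ranked[:number_of_matches])


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = exact_knn([1.0, 0.1], vectors, number_of_matches=2)
```

The brute-force search visits every vector. An HNSW index avoids that by walking a graph, and parameters such as connectivity, expansion_add, and expansion_search trade extra work for results closer to this exact baseline.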

query(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators indicating otherwise).

The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.

query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)

An abstract method. For each entry in query_column, any implementation of query_as_of_now in a subclass is supposed to return a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).

The implementation of the index should not update the answers to old queries when its internal state is modified.

The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.

pw.indexing.default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex with an inner index (data structure) that is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice will not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant for development, demonstrations, as a starting point for a larger project, etc.

Remark: you can also build a DataIndex around LshKnn yourself. Look up LshKnn to see the parameters that can be adjusted.

pw.indexing.default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)

Returns an instance of DataIndex with an inner index (data structure) that is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice will not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant for development, demonstrations, as a starting point for a larger project, etc.

Remark: you can also build a DataIndex around USearchKnn yourself. Look up USearchKnn to see the parameters that can be adjusted.

pw.indexing.default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)

Returns an instance of DataIndex with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports queries on vectors), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant for development, demonstrations, as a starting point for a larger project, etc.

pw.indexing.retrieve_prev_next_values(ordered_table, value=None)

Retrieve, for each row, a pointer to the first row in the ordered_table that contains a non-None value, based on the orders defined by the prev and next columns.

  • Parameters
    • ordered_table (pw.Table) – Table with three columns: value, prev, next. The prev and next columns contain pointers to other rows.
    • value (Optional[pw.ColumnReference]) – Column reference pointing to the column containing values. If not provided, assumes the column name is “value”.
  • Returns
    pw.Table
    Table with two columns: prev_value and next_value.
      The prev_value column contains the values of the first row, according to the order defined by the column next, with a value different from None.
      The next_value column contains the values of the first row, according to the order defined by the column prev, with a value different from None.
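
The lookup described above can be sketched in plain Python as a walk along the prev and next pointers. This is an illustration only, not Pathway's implementation. It assumes that a row whose own value is not None counts as its own match, which the text above does not state, and that the pointers form a chain without cycles. The function name and the dict-based table are hypothetical.

```python
def retrieve_prev_next(rows):
    """rows maps a row id to {"value": ..., "prev": id or None, "next": id or None}.
    Returns, per row, pointers to the nearest rows holding a non-None value."""

    def first_non_none(start, pointer):
        # Assumption: the starting row itself qualifies if its value is set.
        current = start
        while current is not None:
            if rows[current]["value"] is not None:
                return current
            current = rows[current][pointer]
        return None  # reached the end of the chain without finding a value

    return {
        row_id: {
            "prev_value": first_non_none(row_id, "prev"),
            "next_value": first_non_none(row_id, "next"),
        }
        for row_id in rows
    }


rows = {
    1: {"value": None, "prev": None, "next": 2},
    2: {"value": 10, "prev": 1, "next": 3},
    3: {"value": None, "prev": 2, "next": None},
}
pointers = retrieve_prev_next(rows)
```

In this example only row 2 holds a value, so row 1 finds it by following next, row 3 finds it by following prev, and the opposite directions yield None.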