pathway.stdlib.indexing package
class pw.indexing.BruteForceKnn(data_column, metadata_column, *, dimensions, reserved_space, auxiliary_space=131072, metric, embedder=None)
Interface for a brute force implementation of a nearest neighbors index.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - dimensions (int) – number of dimensions of the vectors used by the index and the queries.
  - reserved_space (int) – initial capacity of the index, in number of entries.
  - auxiliary_space (int) – auxiliary space, in number of entries: the maximum number of distances stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the given value is ignored and the space used stays proportional to the size of the index.
  - metric (BruteForceKnnMetricKind) – metric kind used to determine distance.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the brute force KNN index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
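The reply format described above (a tuple of ID and score pairs per query) can be illustrated with a plain-Python brute force search. This is a minimal sketch and not Pathway's implementation: the `index` dict, the helper names, and the use of cosine distance as the score are all assumptions made for illustration.

```python
import heapq
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def brute_force_knn(index, query, number_of_matches=3):
    """Scan every entry and return a tuple of (id, distance) pairs, closest first.

    `index` is a hypothetical dict mapping row IDs to vectors; it stands in
    for the data column and is not a Pathway object.
    """
    scored = ((row_id, cosine_distance(vec, query)) for row_id, vec in index.items())
    # nsmallest retains only `number_of_matches` candidates while scanning.
    return tuple(heapq.nsmallest(number_of_matches, scored, key=lambda pair: pair[1]))


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
reply = brute_force_knn(index, [1.0, 0.1], number_of_matches=2)
print([row_id for row_id, _ in reply])
```

Every query compares against every stored vector, which is why the cost of this kind of index grows linearly with the number of entries.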
class pw.indexing.BruteForceKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)
Factory for creating BruteForceKnn indices.
- Parameters
  - dimensions (int) – number of dimensions of the vectors used by the index and the queries.
  - reserved_space (int) – initial capacity of the index, in number of entries.
  - auxiliary_space (int) – auxiliary space, in number of entries: the maximum number of distances stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the given value is ignored and the space used stays proportional to the size of the index.
  - metric (BruteForceKnnMetricKind) – metric kind used to determine distance. Defaults to cosine similarity.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
class pw.indexing.DataIndex(data_table, inner_index)
A class that, given an implementation of an index, provides methods that augment the search results with supplementary data.
- Parameters
  - data_table (pw.Table) – table containing supplementary data, joined using match-by-id (the ID from data_table and the ID from the response of inner_index).
  - inner_index (InnerIndex) – a data structure that accepts data from some data_column and answers each query with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column.
query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)
This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (what counts as a match depends on the implementation and the internal state of the InnerIndex).
For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs are indicated by the response of the InnerIndex.
It returns a JoinResult of a left join between the query table (the table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).
Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate when the IDs match (exactly one row per match, plus one row per query with no matches).
The answers to old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.
- Parameters
  - query_column (pw.ColumnReference) – a column containing the queries; it needs to be in a format compatible with self.inner_index (or self.embedder).
  - number_of_matches (pw.ColumnExpression | int) – the maximum number of matches returned for each query.
  - collapse_rows (bool) – indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
  - metadata_filter (pw.ColumnExpression[str | None] | pw.ColumnExpression[str] | None) – optional; contains a boolean JMESPath query used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for this query pass the filtering step.
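The two output shapes selected by collapse_rows can be sketched in plain Python. This is only an illustration of the row counts and tuple layout described above, under the assumption that missing values are represented as None; the function `join_with_matches` and its dict-based inputs are hypothetical stand-ins, not Pathway tables or APIs.

```python
def join_with_matches(queries, data_table, index_reply, collapse_rows=True):
    """Mimic the two output shapes of a left join between queries and matched rows.

    queries: dict query_id -> query text
    data_table: dict row_id -> dict of column values
    index_reply: dict query_id -> list of matched row IDs (possibly empty)
    """
    columns = sorted({col for row in data_table.values() for col in row})
    result = []
    for query_id, query in queries.items():
        matched = index_reply.get(query_id, [])
        if collapse_rows:
            # One row per query; each data column holds a tuple of matched values.
            row = {"query": query}
            for col in columns:
                row[col] = tuple(data_table[m][col] for m in matched) if matched else None
            result.append(row)
        elif matched:
            # One row per (query, match) pair.
            for m in matched:
                result.append({"query": query, **data_table[m]})
        else:
            # Left join: a query with no matches still yields one row.
            result.append({"query": query, **{col: None for col in columns}})
    return result


queries = {"q1": "cats", "q2": "zebras"}
data_table = {"r1": {"text": "cat facts"}, "r2": {"text": "more cats"}}
index_reply = {"q1": ["r1", "r2"], "q2": []}

collapsed = join_with_matches(queries, data_table, index_reply, collapse_rows=True)
expanded = join_with_matches(queries, data_table, index_reply, collapse_rows=False)
print(len(collapsed), len(expanded))
```

With collapse_rows=True the result has one row per query (two here); with collapse_rows=False it has one row per match plus one row for the unmatched query (three here).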
query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)
This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (what counts as a match depends on the implementation and the internal state of the InnerIndex).
For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs are indicated by the response of the InnerIndex.
It returns a JoinResult of a left join between the query table (the table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).
Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate when the IDs match (exactly one row per match, plus one row per query with no matches).
The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query (all predefined indices support it; this information is intended for third-party extensions).
- Parameters
  - query_column (pw.ColumnReference) – a column containing the queries; it needs to be in a format compatible with self.inner_index (or self.embedder).
  - number_of_matches (pw.ColumnExpression | int) – the maximum number of matches returned for each query.
  - collapse_rows (bool) – indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
  - metadata_filter (pw.ColumnExpression[str | None] | pw.ColumnExpression[str] | None) – optional; contains a boolean JMESPath query used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for this query pass the filtering step.
class pw.indexing.HybridIndex(retrievers, k=60)
Hybrid index that composes any number of other indices and combines them using Reciprocal Rank Fusion (RRF). It queries each index, and each retrieved row d is assigned the score 1/(k+rank(d)), which is then summed over all indices. HybridIndex returns the best rows from the indexed data according to this score.
- Parameters
  - retrievers (list[InnerIndex]) – list of indices used to compose the hybrid index.
  - k (float) – constant used for calculating the ranking score.
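The Reciprocal Rank Fusion score 1/(k+rank(d)) can be worked through on plain lists. The sketch below is not the HybridIndex implementation; it assumes ranks start at 1 and that each retriever returns its IDs ordered best-first, and the function and variable names are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60.0, number_of_matches=3):
    """Fuse ranked ID lists using score(d) = sum over lists of 1 / (k + rank(d)).

    Ranks are assumed to start at 1; each inner list is ordered best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] += 1.0 / (k + rank)
    # Sort by descending fused score; ties are broken by ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:number_of_matches]


vector_hits = ["doc1", "doc2", "doc3"]
keyword_hits = ["doc2", "doc4", "doc1"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print([row_id for row_id, _ in fused])
```

Here doc2 scores 1/62 + 1/61 and doc1 scores 1/61 + 1/63, so doc2 comes first even though doc1 tops one of the lists. A larger k flattens the difference between adjacent ranks.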
query(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators stating otherwise).
The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.
query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
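The difference between the two contracts (query revises old answers when the index changes, query_as_of_now does not) can be shown with a toy in-memory index. This is a conceptual sketch only: `ToyIndex` is a hypothetical class over 1-D points and does not reflect how Pathway propagates updates.

```python
class ToyIndex:
    """Keeps 1-D points and answers nearest-neighbor queries in two modes."""

    def __init__(self):
        self.points = {}        # row_id -> value
        self.live_queries = {}  # query_id -> query value, re-evaluated on change
        self.answers = {}       # query_id -> current answer

    def _nearest(self, value):
        if not self.points:
            return None
        return min(self.points, key=lambda row_id: abs(self.points[row_id] - value))

    def query(self, query_id, value):
        # Registered queries are re-answered whenever the index changes.
        self.live_queries[query_id] = value
        self.answers[query_id] = self._nearest(value)

    def query_as_of_now(self, query_id, value):
        # Answered once against the current state and never revisited.
        self.answers[query_id] = self._nearest(value)

    def add(self, row_id, value):
        self.points[row_id] = value
        for query_id, query_value in self.live_queries.items():
            self.answers[query_id] = self._nearest(query_value)


index = ToyIndex()
index.add("a", 0.0)
index.query("live", 9.0)
index.query_as_of_now("frozen", 9.0)
index.add("b", 10.0)  # a closer point arrives after both queries
print(index.answers)
```

After the second insertion the "live" answer moves to the closer point "b", while the "frozen" answer stays at "a".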
class pw.indexing.HybridIndexFactory(retriever_factories, k=60)
Factory for creating hybrid indices.
- Parameters
  - retriever_factories (list[InnerIndexFactory]) – list of factories of the indices that will be used in the hybrid index.
  - k (float) – constant used for calculating the ranking score.
class pw.indexing.LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean', embedder=None)
Interface for Pathway’s implementation of KNN via LSH.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of metadata as a dictionary, in JSON format.
  - dimensions (int) – number of dimensions in the data.
  - n_or (int) – number of ORs.
  - n_and (int) – number of ANDs.
  - bucket_length (float) – bucket length (after projecting on a line).
  - distance_type (str) – “euclidean” and “cosine” metrics are supported.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
query(query_column, number_of_matches=3, metadata_filter=None)
- Parameters
  - query_column (pw.ColumnExpression) – column containing the data used to query the index.
  - number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3.
  - metadata_filter (pw.ColumnExpression[str] | None) – optional column expression evaluating to the text representation of a boolean JMESPath query. The index considers only the entries whose metadata satisfies the condition in the filter.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
- Parameters
  - query_column (pw.ColumnExpression) – column containing the data used to query the index.
  - number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3.
  - metadata_filter (pw.ColumnExpression[str] | None) – optional column expression evaluating to the text representation of a boolean JMESPath query. The index considers only the entries whose metadata satisfies the condition in the filter.
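The roles of n_or, n_and and bucket_length can be illustrated with a textbook random-projection LSH scheme for Euclidean distance. This is a sketch under that assumption, not a description of Pathway's internals: the helper names are hypothetical, and the real index may draw projections and combine buckets differently.

```python
import math
import random


def make_hash_tables(dimensions, n_or=20, n_and=10, bucket_length=10.0, seed=0):
    """Draw n_or tables, each made of n_and random (direction, offset) projections."""
    rng = random.Random(seed)
    return [
        [
            ([rng.gauss(0.0, 1.0) for _ in range(dimensions)], rng.uniform(0.0, bucket_length))
            for _ in range(n_and)
        ]
        for _ in range(n_or)
    ]


def signatures(vector, tables, bucket_length=10.0):
    """Return one bucket signature per table for `vector`."""
    result = []
    for table in tables:
        # Project on each line, shift by the offset, and cut into buckets.
        sig = tuple(
            math.floor((sum(a * x for a, x in zip(direction, vector)) + offset) / bucket_length)
            for direction, offset in table
        )
        result.append(sig)
    return result


def is_candidate(u, v, tables, bucket_length=10.0):
    """AND within a table (all n_and buckets equal), OR across the n_or tables."""
    return any(
        su == sv
        for su, sv in zip(signatures(u, tables, bucket_length), signatures(v, tables, bucket_length))
    )


tables = make_hash_tables(dimensions=3)
print(is_candidate([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], tables))
```

In this scheme, raising n_and makes each table stricter (fewer false candidates), raising n_or adds more chances to collide (fewer missed neighbors), and a larger bucket_length makes nearby projections more likely to share a bucket.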
class pw.indexing.LshKnnFactory(*, dimensions=None, embedder=None, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')
Factory for creating LshKnn indices.
- Parameters
  - dimensions (int) – number of dimensions in the data.
  - n_or (int) – number of ORs.
  - n_and (int) – number of ANDs.
  - bucket_length (float) – bucket length (after projecting on a line).
  - distance_type (str) – “euclidean” and “cosine” metrics are supported.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
class pw.indexing.SortedIndex
clear() -> None. Remove all items from D.
copy() -> a shallow copy of D
fromkeys()
Create a new dictionary with keys from iterable and values set to value.
get()
Return the value for key if key is in the dictionary, else default.
items() -> a set-like object providing a view on D's items
keys() -> a set-like object providing a view on D's keys
pop(k[, d]) -> v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
setdefault()
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k].
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v.
In either case, this is followed by: for k in F: D[k] = F[k].
values() -> an object providing a view on D's values
class pw.indexing.TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)
Interface for a full text index based on BM25, provided via tantivy.
- Parameters
  - data_column (pw.ColumnExpression[str]) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - ram_budget (int) – maximum capacity in bytes. When it is reached, the index moves a block of data to storage (hence, a larger budget means faster index operations but a higher memory cost).
  - in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage.
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the tantivy BM25 index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
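For readers unfamiliar with the kind of score a BM25 index produces, the following is a minimal sketch of the standard Okapi BM25 formula in plain Python. It is not tantivy's code: the parameter values k1=1.2 and b=0.75, the whitespace tokenization, and the function name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(documents, query, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "pathway builds streaming indexes".split(),
    "bm25 ranks text by term frequency".split(),
    "streaming text search with bm25 and bm25 variants".split(),
]
scores = bm25_scores(docs, ["bm25", "streaming"])
best = max(range(len(docs)), key=scores.__getitem__)
print(best)
```

The third document wins because it contains both query terms, even though it is the longest of the three.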
class pw.indexing.TantivyBM25Factory(ram_budget=52428800, in_memory_index=True)
Factory for creating a TantivyBM25 index.
- Parameters
  - ram_budget (int) – maximum capacity in bytes. When it is reached, the index moves a block of data to storage (hence, a larger budget means faster index operations but a higher memory cost).
  - in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage.
class pw.indexing.USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=0, expansion_add=0, expansion_search=0, embedder=None)
Interface for the usearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm (white paper).
To understand the explanation of some of the parameters, you might need some familiarity with either the HNSW algorithm or its implementation provided by USearch.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - dimensions (int) – number of dimensions of the vectors used by the index and the queries.
  - reserved_space (int) – initial capacity of the index, in number of entries.
  - metric (USearchMetricKind) – metric kind used to determine distance.
  - connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells usearch to configure it on its own.
  - expansion_add (int) – indicates the amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells usearch to configure it on its own.
  - expansion_search (int) – indicates the amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells usearch to configure it on its own.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the usearch KNN index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
class pw.indexing.UsearchKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, metric=<pathway.engine.USearchMetricKind object>, connectivity=0, expansion_add=0, expansion_search=0)
Factory for creating USearchKnn indices.
- Parameters
  - dimensions (int) – number of dimensions of the vectors used by the index and the queries.
  - reserved_space (int) – initial capacity of the index, in number of entries.
  - metric (USearchMetricKind) – metric kind used to determine distance. Defaults to cosine similarity.
  - connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells usearch to configure it on its own.
  - expansion_add (int) – indicates the amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells usearch to configure it on its own.
  - expansion_search (int) – indicates the amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells usearch to configure it on its own.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
pw.indexing.default_brute_force_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)
Returns an instance of DataIndex whose inner index (data structure) is an instance of BruteForceKnn. This method chooses some parameters of BruteForceKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of BruteForceKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of BruteForceKnn to see the parameters that can be adjusted.
pw.indexing.default_full_text_document_index(data_column, data_table, *, metadata_column=None)
Returns an instance of DataIndex with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports text queries), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.
pw.indexing.default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)
Returns an instance of DataIndex whose inner index (data structure) is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of LshKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of LshKnn to see the parameters that can be adjusted.
pw.indexing.default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)
Returns an instance of DataIndex whose inner index (data structure) is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of USearchKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of USearchKnn to see the parameters that can be adjusted.
pw.indexing.default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)
Returns an instance of DataIndex with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports queries on vectors), which is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point of a larger project, etc.
pw.indexing.retrieve_prev_next_values(ordered_table, value=None)
Retrieve, for each row, a pointer to the first row in ordered_table that contains a non-None value, based on the orders defined by the prev and next columns.
- Parameters
  - ordered_table (pw.Table) – table with three columns: value, prev, next. The prev and next columns contain pointers to other rows.
  - value (Optional[pw.ColumnReference]) – column reference pointing to the column containing values. If not provided, the column name is assumed to be “value”.
- Returns
  pw.Table – table with two columns: prev_value and next_value. The prev_value column contains the values of the first row, according to the order defined by the column next, with a value different from None. The next_value column contains the values of the first row, according to the order defined by the column prev, with a value different from None.
Submodules
pathway.stdlib.indexing.bm25 module
class pw.indexing.bm25.TantivyBM25(data_column, metadata_column, ram_budget=52428800, in_memory_index=True)
Interface for a full text index based on BM25, provided via tantivy.
- Parameters
  - data_column (pw.ColumnExpression[str]) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - ram_budget (int) – maximum capacity in bytes. When it is reached, the index moves a block of data to storage (hence, a larger budget means faster index operations but a higher memory cost).
  - in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage.
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the tantivy BM25 index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
class pw.indexing.bm25.TantivyBM25Factory(ram_budget=52428800, in_memory_index=True)
Factory for creating a TantivyBM25 index.
- Parameters
  - ram_budget (int) – maximum capacity in bytes. When it is reached, the index moves a block of data to storage (hence, a larger budget means faster index operations but a higher memory cost).
  - in_memory_index (bool) – indicates whether the whole index is stored in RAM; if set to False, the index is stored in some default Pathway disk storage.
pathway.stdlib.indexing.data_index module
class pw.indexing.data_index.DataIndex(data_table, inner_index)
A class that, given an implementation of an index, provides methods that augment the search results with supplementary data.
- Parameters
  - data_table (pw.Table) – table containing supplementary data, joined using match-by-id (the ID from data_table and the ID from the response of inner_index).
  - inner_index (InnerIndex) – a data structure that accepts data from some data_column and answers each query with a list of IDs, one ID per matched row from data_column. The IDs are taken from the table that contains the data_column column.
query(query_column, *, number_of_matches=3, collapse_rows=True, metadata_filter=None)
This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain matching entries stored in the InnerIndex (what counts as a match depends on the implementation and the internal state of the InnerIndex).
For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs are indicated by the response of the InnerIndex.
It returns a JoinResult of a left join between the query table (the table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).
Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate when the IDs match (exactly one row per match, plus one row per query with no matches).
The answers to old queries are updated when the state of the index changes. To work properly, the inner_index has to be an instance of InnerIndex supporting query.
- Parameters
  - query_column (pw.ColumnReference) – a column containing the queries; it needs to be in a format compatible with self.inner_index (or self.embedder).
  - number_of_matches (pw.ColumnExpression | int) – the maximum number of matches returned for each query.
  - collapse_rows (bool) – indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column on the right side of the resulting JoinResult contains a tuple of values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding query_column and self.data_table, using the results from self.inner_index to indicate the matches between the IDs.
  - metadata_filter (pw.ColumnExpression[str | None] | pw.ColumnExpression[str] | None) – optional; contains a boolean JMESPath query used to filter the potential answers inside self.inner_index. Matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column given as metadata_filter indicates that all possible matches for this query pass the filtering step.
query_as_of_now(query_column, number_of_matches=3, collapse_rows=True, metadata_filter=None)
This method takes the query from query_column, optionally applies self.embedder to it, and passes it to the inner index to obtain the matching entries stored in the InnerIndex (whether an entry is a match depends on the implementation and the internal state of the InnerIndex). For each query and for each column in self.data_table, it computes a tuple of the values found in the rows whose IDs are indicated by the response of the InnerIndex. It returns a JoinResult of a left join between the query table (the table that holds query_column) and the mentioned table of tuples (exactly one row per query, with values not present if the set of matching IDs is empty).
Optionally, the method can skip the tupling step and return a JoinResult of a left join between the query table and self.data_table, using the result of the InnerIndex to indicate when the IDs match (exactly one row per match, plus one row per query with no matches).
The index answers according to the current state of the data structure and does not revisit old answers. To work properly, the inner_index has to be an instance of InnerIndex supporting query (all predefined indices support it; this information is for third-party extensions).
- Parameters
  - query_column (pw.ColumnReference) – A column containing the queries; it needs to be in a format compatible with self.inner_index (or self.embedder).
  - number_of_matches (pw.ColumnExpression | int) – The maximum number of matches returned for each query.
  - collapse_rows (bool) – Indicates the format of the output. If set to True, the resulting table has exactly one row for each query, and each column of the right side of the resulting JoinResult contains a tuple consisting of the values from the matched rows of the corresponding column in self.data_table. If set to False, the result is a left join between the table holding the query_column and self.data_index, using the results from self.inner_index to indicate the matches between the IDs.
  - metadata_filter (pw.ColumnExpression[str | None] | pw.ColumnExpression[str] | None) – Optional, contains a boolean JMESPath query that is used to filter the potential answers inside self.inner_index: matching entries are included only when the filter specified in metadata_filter returns True when run against the data in inner_index.metadata_column of a potentially matched row. Passing None as the value in the column defined in the parameter metadata_filter indicates that all possible matches corresponding to this query pass the filtering step.
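The two output shapes selected by collapse_rows can be pictured with a small plain-Python sketch. It does not use Pathway: the dictionaries below and the helper names column_names, join_collapsed and join_flat are hypothetical stand-ins for the query table, self.data_table and the reply of the inner index, and "not present" values are modelled as None.

```python
# Plain-Python illustration of the two output shapes; this is not Pathway code.
data_table = {  # row ID -> row of the indexed table (hypothetical data)
    "d1": {"text": "apple pie"},
    "d2": {"text": "apple tart"},
}
# Hypothetical inner-index reply: query ID -> tuple of (matched ID, score) pairs.
index_reply = {
    "q1": (("d1", 0.9), ("d2", 0.7)),
    "q2": (),  # this query has no matches
}


def column_names(data):
    """Collect the column names used by the rows of the data table."""
    return sorted({name for row in data.values() for name in row})


def join_collapsed(reply, data):
    """collapse_rows=True: exactly one row per query, columns hold tuples."""
    result = []
    for query_id, matches in reply.items():
        row = {"query": query_id}
        for name in column_names(data):
            # None models "value not present" when the set of matches is empty
            row[name] = tuple(data[i][name] for i, _ in matches) if matches else None
        result.append(row)
    return result


def join_flat(reply, data):
    """collapse_rows=False: one row per match, plus one row per unmatched query."""
    result = []
    for query_id, matches in reply.items():
        if not matches:
            result.append({"query": query_id, **{n: None for n in column_names(data)}})
        for data_id, _score in matches:
            result.append({"query": query_id, **data[data_id]})
    return result


collapsed = join_collapsed(index_reply, data_table)
flat = join_flat(index_reply, data_table)
```

Here collapsed has two rows (one per query) and flat has three (two matches for q1 and one placeholder row for q2).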
class pw.indexing.data_index.GeneralJoin(*args, **kwargs)
class pw.indexing.data_index.IdScoreSchema
class pw.indexing.data_index.InnerIndex(data_column, metadata_column)
Abstract class representing a data structure that accepts data (in self.data_column) with optional metadata (in self.metadata_column), and answers queries with a set of ‘matching’ IDs from the data structure (optionally filtered with a JMESPath query run against the stored metadata). The IDs are taken from the table that contains the data_column column. Which IDs are considered matched is defined by the particular subclasses of this class. It can be used as the index argument of DataIndex, which is a wrapper that augments the matching IDs with some additional data.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
abstract query(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators stating otherwise).
The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.
abstract query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
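The difference between the two contracts above (query keeps old answers up to date, query_as_of_now freezes them) can be illustrated with a toy, plain-Python index. This is only a sketch of the semantics, not a Pathway InnerIndex: the class ToyIndex, its one-dimensional data, and the use of the negated distance as the score are all assumptions made for the example.

```python
class ToyIndex:
    """Toy 1-D nearest-value index illustrating the two query contracts."""

    def __init__(self):
        self.data = {}          # ID -> number stored in the index
        self.live_queries = {}  # query ID -> (value, number_of_matches)
        self.live_answers = {}  # answers that are revised when the data changes

    def _answer(self, value, number_of_matches):
        # Score is the negated distance, so a higher score means a better match.
        scored = sorted(
            ((-abs(stored - value), id_) for id_, stored in self.data.items()),
            reverse=True,
        )
        # Reply shape: a tuple of (matched ID, score) pairs.
        return tuple((id_, score) for score, id_ in scored[:number_of_matches])

    def query(self, query_id, value, number_of_matches=3):
        """Register a query whose answer follows later changes of the index."""
        self.live_queries[query_id] = (value, number_of_matches)
        self.live_answers[query_id] = self._answer(value, number_of_matches)

    def query_as_of_now(self, value, number_of_matches=3):
        """Answer from the current state only; the answer is never revised."""
        return self._answer(value, number_of_matches)

    def add(self, id_, value):
        self.data[id_] = value
        # Old answers of "query" are adjusted; as-of-now answers are left alone.
        for query_id, (q_value, n) in self.live_queries.items():
            self.live_answers[query_id] = self._answer(q_value, n)


index = ToyIndex()
index.add("a", 1.0)
index.add("b", 10.0)
frozen = index.query_as_of_now(4.0, number_of_matches=1)
index.query("q", 4.0, number_of_matches=1)
index.add("c", 5.0)  # closer to 4.0 than "a"
```

After the last add, frozen still points at "a", while index.live_answers["q"] has been revised to point at "c".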
pw.indexing.full_text_document_index.default_full_text_document_index(data_column, data_table, *, metadata_column=None)
Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports text queries), but it is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.
pathway.stdlib.indexing.hybrid_index module
class pw.indexing.hybrid_index.HybridIndex(retrievers, k=60)
Hybrid index that composes any number of other indices and combines them using Reciprocal Rank Fusion (RRF). It queries each index, and each retrieved row d is assigned the score 1/(k+rank(d)), which is then summed over all indices. HybridIndex returns the best rows from the indexed data according to this score.
- Parameters
  - retrievers (list[InnerIndex]) – list of indices to be used to compose the hybrid index.
  - k (float) – constant used for calculating the ranking score.
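The RRF scoring described above can be sketched in a few lines of plain Python. This is an illustration of the formula 1/(k+rank(d)), not HybridIndex's implementation: the function name reciprocal_rank_fusion is hypothetical, ranks are assumed to start at 1 (the text above does not say whether they are 0- or 1-based), and ties are broken by ID only to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60.0, number_of_matches=3):
    """Fuse several ranked ID lists by summing 1 / (k + rank) per ID."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Assumption: the best hit of each retriever has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return best[:number_of_matches]


# Hypothetical replies of two retrievers, best match first.
vector_hits = ["d1", "d2", "d3"]
text_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

A row that appears near the top of several retrievers (d1 and d3 here) outranks a row that only one retriever returned, which is the effect that the constant k moderates.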
query(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
Whenever the index changes (via new entries in self.data_column), it should adjust all old answers to the queries (which is the default behavior of Pathway code, as long as it does not use operators stating otherwise).
The resulting table needs to contain a column _pw_index_reply (name defined in _INDEX_REPLY), in which the resulting tuples are stored.
query_as_of_now(query_column, *, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
class pw.indexing.hybrid_index.HybridIndexFactory(retriever_factories, k=60)
Factory for creating hybrid indices.
- Parameters
  - retriever_factories (list[InnerIndexFactory]) – list of factories of indices that will be used in the hybrid index.
  - k (float) – constant used for calculating the ranking score.
class pw.indexing.nearest_neighbors.BruteForceKnn(data_column, metadata_column, *, dimensions, reserved_space, auxiliary_space=131072, metric, embedder=None)
Interface for a brute force implementation of a nearest neighbors index.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - dimensions (int) – number of dimensions of the vectors that are used by the index and queries.
  - reserved_space (int) – initial capacity (in the number of entries) of the index.
  - auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the space used is still proportional to the size of the index (the value given in this parameter is ignored).
  - metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
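As a rough picture of what a brute force nearest neighbors search does, the sketch below compares a query against every stored vector and keeps the closest ones. It is plain Python rather than the engine-backed BruteForceKnn: the helper names cosine_distance and brute_force_knn are hypothetical, cosine distance is chosen only as an example metric, and returning the distance as the score is an assumption.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def brute_force_knn(index, query, number_of_matches=3):
    """Compare the query with every entry and keep the closest ones."""
    # One distance per stored entry: the cost grows linearly with the index size.
    scored = sorted((cosine_distance(vector, query), id_) for id_, vector in index.items())
    return tuple((id_, distance) for distance, id_ in scored[:number_of_matches])


# Hypothetical 2-dimensional index: ID -> vector.
index = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0)}
nearest = brute_force_knn(index, (1.0, 0.1), number_of_matches=2)
```

The query (1.0, 0.1) points almost exactly along "a", so "a" comes first and "c" second; "b" is nearly orthogonal to the query and is dropped.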
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the brute force KNN index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
class pw.indexing.nearest_neighbors.BruteForceKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, auxiliary_space=131072, metric=<pathway.engine.BruteForceKnnMetricKind object>)
Factory for creating BruteForceKnn indices.
- Parameters
  - dimensions (int) – number of dimensions of the vectors that are used by the index and queries.
  - reserved_space (int) – initial capacity (in the number of entries) of the index.
  - auxiliary_space (int) – auxiliary space (in the number of entries): the maximum number of distances that are stored in memory while evaluating queries. If auxiliary_space is set to a value smaller than the current number of entries in the index, the space used is still proportional to the size of the index (the value given in this parameter is ignored).
  - metric (BruteForceKnnMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
class pw.indexing.nearest_neighbors.KnnIndexFactory(*, dimensions=None, embedder=None)
class pw.indexing.nearest_neighbors.LshKnn(data_column, metadata_column, *, dimensions, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean', embedder=None)
Interface for Pathway’s implementation of KNN via LSH.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of metadata as a dictionary, in JSON format.
  - dimensions (int) – number of dimensions in the data.
  - n_or (int) – number of ORs.
  - n_and (int) – number of ANDs.
  - bucket_length (float) – bucket length (after projecting on a line).
  - distance_type (str) – “euclidean” and “cosine” metrics are supported.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
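To give the n_or, n_and and bucket_length parameters some intuition, here is a plain-Python sketch of the classic random-projection LSH scheme for the Euclidean case. It is not Pathway's LshKnn implementation, which may differ in its details: the class ToyLsh is hypothetical, and the mapping assumed here is that each of n_or hash tables combines n_and projections (AND), a vector is a candidate if it shares a bucket with the query in at least one table (OR), and bucket_length is the width of a bucket on each projection line.

```python
import math
import random
from collections import defaultdict


class ToyLsh:
    """Random-projection LSH: n_or tables, each keyed by n_and bucket numbers."""

    def __init__(self, dimensions, n_or=20, n_and=10, bucket_length=10.0, seed=0):
        rng = random.Random(seed)
        self.bucket_length = bucket_length
        # Each table holds n_and (direction, offset) pairs defining projection lines.
        self.tables = [
            [
                ([rng.gauss(0.0, 1.0) for _ in range(dimensions)],
                 rng.uniform(0.0, bucket_length))
                for _ in range(n_and)
            ]
            for _ in range(n_or)
        ]
        self.buckets = [defaultdict(set) for _ in range(n_or)]

    def _key(self, table, vector):
        # AND: all n_and bucket numbers must agree for two vectors to collide.
        return tuple(
            math.floor(
                (sum(a * x for a, x in zip(direction, vector)) + offset)
                / self.bucket_length
            )
            for direction, offset in table
        )

    def add(self, id_, vector):
        for table, buckets in zip(self.tables, self.buckets):
            buckets[self._key(table, vector)].add(id_)

    def candidates(self, vector):
        # OR: a collision in any single table makes an entry a candidate.
        found = set()
        for table, buckets in zip(self.tables, self.buckets):
            found |= buckets.get(self._key(table, vector), set())
        return found


lsh = ToyLsh(dimensions=2)
lsh.add("origin", (0.0, 0.0))
lsh.add("far", (1000.0, 1000.0))
close_candidates = lsh.candidates((0.0, 0.0))
```

Under these assumptions, raising n_and makes buckets more selective (fewer false candidates), raising n_or adds tables and so recovers neighbors that a single table would miss, and a larger bucket_length makes more distant vectors collide. The candidates would then be ranked by their exact distance to the query.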
query(query_column, number_of_matches=3, metadata_filter=None)
- Parameters
  - query_column (pw.ColumnExpression) – column containing data that is used to query the index.
  - number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3.
  - metadata_filter (pw.ColumnExpression[str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
- Parameters
  - query_column (pw.ColumnExpression) – column containing data that is used to query the index.
  - number_of_matches (pw.ColumnExpression[int] | int) – number of nearest neighbors in the index response; defaults to 3.
  - metadata_filter (pw.ColumnExpression[str] | None) – optional, column expression evaluating to the text representation of a boolean JMESPath query. The index will consider only the entries with metadata that satisfies the condition in the filter.
class pw.indexing.nearest_neighbors.LshKnnFactory(*, dimensions=None, embedder=None, n_or=20, n_and=10, bucket_length=10.0, distance_type='euclidean')
Factory for creating LshKnn indices.
- Parameters
  - dimensions (int) – number of dimensions in the data.
  - n_or (int) – number of ORs.
  - n_and (int) – number of ANDs.
  - bucket_length (float) – bucket length (after projecting on a line).
  - distance_type (str) – “euclidean” and “cosine” metrics are supported.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
class pw.indexing.nearest_neighbors.USearchKnn(data_column, metadata_column, *, dimensions, reserved_space, metric, connectivity=0, expansion_add=0, expansion_search=0, embedder=None)
Interface for the USearch nearest neighbors index, an implementation of k nearest neighbors based on the HNSW algorithm (see the HNSW white paper).
To understand the explanation of some of the parameters, you might need some familiarity with either the HNSW algorithm or its implementation provided by USearch.
- Parameters
  - data_column (pw.ColumnExpression) – the column expression representing the data.
  - metadata_column (pw.ColumnExpression[str] | None) – optional column expression, string representation of some auxiliary data, in JSON format.
  - dimensions (int) – number of dimensions of the vectors that are used by the index and queries.
  - reserved_space (int) – initial capacity (in the number of entries) of the index.
  - metric (USearchMetricKind) – metric kind that is used to determine distance.
  - connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells USearch to configure it on its own.
  - expansion_add (int) – indicates the amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells USearch to configure it on its own.
  - expansion_search (int) – indicates the amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells USearch to configure it on its own.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
query(query_column, number_of_matches=3, metadata_filter=None)
Currently, the USearch KNN index is supported only in the as-of-now variant.
query_as_of_now(query_column, number_of_matches=3, metadata_filter=None)
An abstract method. Any implementation of query_as_of_now in a subclass is supposed to return, for each entry in query_column, a tuple of pairs, each pair consisting of the matched ID and a score indicating the quality of the match (taking into account the number_of_matches and metadata_filter parameters).
The implementation of the index should not update the answers to old queries when its internal state is modified.
The resulting table needs to contain a column _pw_index_reply (name defined in pathway.stdlib.indexing.colnames._INDEX_REPLY), in which the resulting tuples are stored.
class pw.indexing.nearest_neighbors.UsearchKnnFactory(*, dimensions=None, embedder=None, reserved_space=400, metric=<pathway.engine.USearchMetricKind object>, connectivity=0, expansion_add=0, expansion_search=0)
Factory for creating USearchKnn indices.
- Parameters
  - dimensions (int) – number of dimensions of the vectors that are used by the index and queries.
  - reserved_space (int) – initial capacity (in the number of entries) of the index.
  - metric (USearchMetricKind) – metric kind that is used to determine distance. Defaults to cosine similarity.
  - connectivity (int) – maximum number of edges for a node in the HNSW index; setting this value to 0 tells USearch to configure it on its own.
  - expansion_add (int) – indicates the amount of work spent while adding elements to the index (higher = more accurate placement, more work); setting this value to 0 tells USearch to configure it on its own.
  - expansion_search (int) – indicates the amount of work spent while searching for elements in the index (higher = more accurate results, more work); setting this value to 0 tells USearch to configure it on its own.
  - embedder (UDF | None) – UDF used for calculating embeddings of strings. It is needed if the index is used for indexing texts.
pathway.stdlib.indexing.sorting module
class pw.indexing.sorting.Candidate
class pw.indexing.sorting.Hash
class pw.indexing.sorting.Instance
class pw.indexing.sorting.Key
class pw.indexing.sorting.LeftRight
class pw.indexing.sorting.Node
class pw.indexing.sorting.Parent
class pw.indexing.sorting.PrevNext
class pw.indexing.sorting.SortedIndex
clear() → None. Remove all items from D.
copy() → a shallow copy of D
fromkeys()
Create a new dictionary with keys from iterable and values set to value.
get()
Return the value for key if key is in the dictionary, else default.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
setdefault()
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
update([E, ]**F) → None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values() → an object providing a view on D's values
pw.indexing.sorting.retrieve_prev_next_values(ordered_table, value=None)
Retrieve, for each row, a pointer to the first row in the ordered_table that contains a non-None value, based on the orders defined by the prev and next columns.
- Parameters
  - ordered_table (pw.Table) – Table with three columns: value, prev, next. The prev and next columns contain pointers to other rows.
  - value (Optional[pw.ColumnReference]) – Column reference pointing to the column containing values. If not provided, assumes the column name is “value”.
- Returns
  pw.Table – Table with two columns: prev_value and next_value. The prev_value column contains the values of the first row, according to the order defined by the column next, with a value different from None. The next_value column contains the values of the first row, according to the order defined by the column prev, with a value different from None.
pathway.stdlib.indexing.typecheck_utils module
pathway.stdlib.indexing.vector_document_index module
pw.indexing.vector_document_index.default_brute_force_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)
Returns an instance of DataIndex, with an inner index (data structure) that is an instance of BruteForceKnn. This method chooses some parameters of BruteForceKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of BruteForceKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of BruteForceKnn to see the parameters that can be adjusted.
pw.indexing.vector_document_index.default_lsh_knn_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)
Returns an instance of DataIndex, with an inner index (data structure) that is an instance of LshKnn. This method chooses some parameters of LshKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of LshKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of LshKnn to see the parameters that can be adjusted.
pw.indexing.vector_document_index.default_usearch_knn_document_index(data_column, data_table, dimensions, *, embedder=None, metadata_column=None)
Returns an instance of DataIndex, with an inner index (data structure) that is an instance of USearchKnn. This method chooses some parameters of USearchKnn arbitrarily, and the choice does not necessarily work well in every scenario (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.
Remark: the arbitrarily chosen configuration of the index may change (whenever tests suggest better default values). To have a fixed configuration, you can use DataIndex with a parametrized instance of USearchKnn. Look up the DataIndex constructor to see how to make a data index parametrized by a custom data structure, and the constructor of USearchKnn to see the parameters that can be adjusted.
pw.indexing.vector_document_index.default_vector_document_index(data_column, data_table, *, dimensions, embedder=None, metadata_column=None)
Returns an instance of DataIndex, with an inner index (data structure) of our choosing. This method chooses an arbitrary implementation of InnerIndex (one that supports queries on vectors), but it is not necessarily the best choice of index and parameters (each use case may need a slightly different configuration). As such, it is meant to be used for development, demonstrations, as a starting point for a larger project, etc.