pw.xpacks.llm.vector_store
Pathway vector search server and client.
The server reads source documents, builds a vector index over them, and then starts serving HTTP requests.
The client queries the server and returns matching documents.
class SlidesVectorStoreServer(*docs, embedder, parser=None, splitter=None, doc_post_processors=None)
Accompanying vector index server for the slide-search demo.
Builds a document indexing pipeline and starts an HTTP REST server.
Modifies the VectorStoreServer’s pw_list_document endpoint to return the set of metadata after the parsing and document post-processing stages.
class InputsResultSchema
classmethod from_langchain_components(*docs, embedder, parser=None, splitter=None, **kwargs)
Initializes VectorStoreServer using LangChain components.
- Parameters- embedder (Embeddings) – LangChain component for embedding documents
- parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
- splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts
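As a sketch, a server built from LangChain components might look as follows. The directory path, embedding model, and splitter settings are illustrative placeholders, not library defaults, and `pathway`, `langchain-openai`, and `langchain-text-splitters` are assumed to be installed; everything is wrapped in a helper so nothing runs on import:

```python
def build_from_langchain(data_dir: str = "./documents"):
    # Illustrative sketch; data_dir and the splitter settings are placeholders.
    import pathway as pw
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import CharacterTextSplitter
    from pathway.xpacks.llm.vector_store import VectorStoreServer

    # Read raw file contents (plus metadata) from a local directory.
    docs = pw.io.fs.read(data_dir, format="binary", with_metadata=True)
    return VectorStoreServer.from_langchain_components(
        docs,
        embedder=OpenAIEmbeddings(),
        splitter=CharacterTextSplitter(chunk_size=400, chunk_overlap=0),
    )
```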
classmethod from_llamaindex_components(*docs, transformations, parser=None, **kwargs)
Initializes VectorStoreServer using LlamaIndex TransformComponents.
- Parameters- transformations (list[TransformComponent]) – list of LlamaIndex components. The last component in this list is required to inherit from LlamaIndex BaseEmbedding.
- parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
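A corresponding sketch with LlamaIndex transformations, where the last element of the list is the embedder. The chunk size and model choice are illustrative placeholders, and `llama-index` packages are assumed to be installed; imports are kept inside the helper so nothing runs on import:

```python
def build_from_llamaindex(data_dir: str = "./documents"):
    # Illustrative sketch; data_dir and chunk_size are placeholders.
    import pathway as pw
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.openai import OpenAIEmbedding
    from pathway.xpacks.llm.vector_store import VectorStoreServer

    docs = pw.io.fs.read(data_dir, format="binary", with_metadata=True)
    # The last transformation must inherit from BaseEmbedding.
    return VectorStoreServer.from_llamaindex_components(
        docs,
        transformations=[
            SentenceSplitter(chunk_size=400),
            OpenAIEmbedding(),
        ],
    )
```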
inputs_query(input_queries)
Query the DocumentStore for the list of input documents.
retrieve_query(retrieval_queries)
Query the DocumentStore for the list of closest texts to a given query.
run_server(host, port, threaded=False, with_cache=True, cache_backend=pw.persistence.Backend.filesystem('./Cache'), **kwargs)
Builds the document processing pipeline and runs it.
- Parameters- host – host to bind the HTTP listener
- port – port to bind the HTTP listener
- threaded (bool) – if True, run in a thread; otherwise block the computation
- with_cache (bool) – if True, embedding requests for the same contents are cached
- cache_backend (Backend | None) – the backend to use for caching if it is enabled. The default is the disk cache, hosted locally in the folder ./Cache. You can use the Backend class of the [persistence API](/developers/api-docs/persistence-api/#pathway.persistence.Backend) to override it.
- kwargs – optional parameters to be passed to run().

- Returns
 If threaded, returns the Thread object; otherwise, does not return.
statistics_query(info_queries)
Query the DocumentStore for statistics about the indexed documents: the number of indexed texts, the time of the last modification, and the time of the last indexing of the input documents.
class VectorStoreClient(*args, **kwargs)
A client you can use to query VectorStoreServer.
Provide either "url", or both "host" and "port".
- Parameters- host – host on which VectorStoreServer listens
- port – port on which VectorStoreServer listens
- url – url at which VectorStoreServer listens
- timeout – timeout for the post requests in seconds
 
get_input_files(metadata_filter=None, filepath_globpattern=None, return_status=False)
Fetch information on documents in the vector store.
- Parameters- metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
- filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
- return_status (bool) – flag telling whether _indexing_status should be returned for each document
 
get_vectorstore_statistics()
Fetch basic statistics about the vector store.
query(query, k=3, metadata_filter=None, filepath_globpattern=None)
Perform a query to the vector store and fetch results.
- Parameters- query (str) – text of the search query
- k (int) – number of documents to be returned
- metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
- filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
 
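A minimal client-side sketch combining the parameters above. The host, port, query text, metadata field, and glob pattern are illustrative placeholders, and a VectorStoreServer must already be listening at the given address; the snippet is wrapped in a helper so nothing runs on import:

```python
def search_contracts(host: str = "127.0.0.1", port: int = 8754):
    # Hypothetical host/port; a VectorStoreServer must already listen there.
    from pathway.xpacks.llm.vector_store import VectorStoreClient

    client = VectorStoreClient(host=host, port=port)
    # Return the 3 closest chunks among PDF files whose metadata field
    # `owner` equals 'albert' (JMESPath filter).
    return client.query(
        "What is the payment schedule?",
        k=3,
        metadata_filter="owner == 'albert'",
        filepath_globpattern="**/*.pdf",
    )
```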
class VectorStoreServer(*docs, embedder, parser=None, splitter=None, doc_post_processors=None)
Builds a document indexing pipeline and starts an HTTP REST server for nearest neighbors queries.
- Parameters- docs (Table) – Pathway tables, typically coming out of connectors, which contain source documents.
- embedder (UDF) – callable that embeds a single document
- parser (Callable[[bytes], list[tuple[str, dict]]] | UDF | None) – callable that parses file contents into a list of documents
- splitter (Callable[[str], list[tuple[str, dict]]] | UDF | None) – callable that splits long documents
- doc_post_processors (list[Callable[[str, dict], tuple[str, dict]]] | None) – optional list of callables that modify parsed files and metadata. Each callable takes two arguments (text: str, metadata: dict) and returns them as a tuple.
 
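Putting the constructor parameters together, a minimal pipeline sketch. The directory path, embedder model, and token limit are illustrative placeholders (not library defaults), and pathway with its LLM xpack is assumed to be installed; imports are kept inside the helper so nothing runs on import:

```python
def build_server(data_dir: str = "./documents"):
    # Illustrative values throughout; not library defaults.
    import pathway as pw
    from pathway.xpacks.llm.embedders import SentenceTransformerEmbedder
    from pathway.xpacks.llm.splitters import TokenCountSplitter
    from pathway.xpacks.llm.vector_store import VectorStoreServer

    # A connector table of raw documents with metadata.
    docs = pw.io.fs.read(data_dir, format="binary", with_metadata=True)
    return VectorStoreServer(
        docs,
        embedder=SentenceTransformerEmbedder(model="all-MiniLM-L6-v2"),
        splitter=TokenCountSplitter(max_tokens=400),
    )
```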
class InputsResultSchema
classmethod from_langchain_components(*docs, embedder, parser=None, splitter=None, **kwargs)
Initializes VectorStoreServer using LangChain components.
- Parameters- embedder (Embeddings) – LangChain component for embedding documents
- parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
- splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts
classmethod from_llamaindex_components(*docs, transformations, parser=None, **kwargs)
Initializes VectorStoreServer using LlamaIndex TransformComponents.
- Parameters- transformations (list[TransformComponent]) – list of LlamaIndex components. The last component in this list is required to inherit from LlamaIndex BaseEmbedding.
- parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
inputs_query(input_queries)
Query the DocumentStore for the list of input documents.
retrieve_query(retrieval_queries)
Query the DocumentStore for the list of closest texts to a given query.
run_server(host, port, threaded=False, with_cache=True, cache_backend=pw.persistence.Backend.filesystem('./Cache'), **kwargs)
Builds the document processing pipeline and runs it.
- Parameters- host – host to bind the HTTP listener
- port – port to bind the HTTP listener
- threaded (bool) – if True, run in a thread; otherwise block the computation
- with_cache (bool) – if True, embedding requests for the same contents are cached
- cache_backend (Backend | None) – the backend to use for caching if it is enabled. The default is the disk cache, hosted locally in the folder ./Cache. You can use the Backend class of the [persistence API](/developers/api-docs/persistence-api/#pathway.persistence.Backend) to override it.
- kwargs – optional parameters to be passed to run().

- Returns
 If threaded, returns the Thread object; otherwise, does not return.
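For example, to serve in the background while keeping the calling thread free (the host and port are illustrative placeholders, and `server` is a previously constructed VectorStoreServer):

```python
def serve_in_background(server):
    # With threaded=True the call returns a Thread object immediately;
    # embedding results are cached on disk (./Cache) since with_cache=True.
    thread = server.run_server(
        host="127.0.0.1",
        port=8754,
        threaded=True,
        with_cache=True,
    )
    return thread  # join() it, or keep serving while doing other work
```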
statistics_query(info_queries)
Query the DocumentStore for statistics about the indexed documents: the number of indexed texts, the time of the last modification, and the time of the last indexing of the input documents.