pw.xpacks.llm.document_store
Pathway Document Store for processing and indexing documents.
The document store reads source documents, builds a vector index over them, and exposes multiple methods for querying.
class DocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)
Builds a document indexing pipeline for processing documents and querying closest documents to a query according to a specified index.
- Parameters
  - docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, which contain source documents. The table needs to contain a data column of type bytes, usually obtained by setting the format of the connector to "raw". Optionally, it can contain a _metadata column containing a dictionary with metadata, which is then used for filters. Some connectors offer a with_metadata argument for returning the _metadata column.
  - retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
  - parser (Callable[[bytes], list[tuple[str, dict]]] | UDF | None) – callable that parses file contents into a list of documents.
  - splitter (Callable[[str], list[tuple[str, dict]]] | UDF | None) – callable that splits long documents.
  - doc_post_processors (list[Callable[[str, dict], tuple[str, dict]]] | None) – optional list of callables that modify parsed files and metadata. Each doc_post_processor is a Callable that takes two arguments (text: str, metadata: dict) and returns them as a tuple.
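The callable signatures for parser, splitter, and doc_post_processors can be illustrated with plain Python functions. The sketch below is only an illustration under stated assumptions: the function names, the metadata keys (num_bytes, num_chars, chunk_index), and the stage order (parse, then post-process, then split, with metadata merged per chunk) are hypothetical and are not taken from the library.

```python
def parse_utf8(contents: bytes) -> list[tuple[str, dict]]:
    """Parser shape: bytes in, list of (text, metadata) documents out."""
    text = contents.decode("utf-8", errors="replace")
    return [(text, {"num_bytes": len(contents)})]


def strip_and_measure(text: str, metadata: dict) -> tuple[str, dict]:
    """Post-processor shape: (text, metadata) in, (text, metadata) out."""
    cleaned = text.strip()
    return cleaned, {**metadata, "num_chars": len(cleaned)}


def split_paragraphs(text: str) -> list[tuple[str, dict]]:
    """Splitter shape: one long text in, list of (chunk, metadata) out."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, {"chunk_index": i}) for i, p in enumerate(paragraphs)]


# Run the three stages by hand (assumed order: parse, post-process, split).
raw = b"  First paragraph.\n\nSecond paragraph.\n"
parsed = parse_utf8(raw)
processed = [strip_and_measure(text, meta) for text, meta in parsed]
chunks = [
    (chunk, {**meta, **chunk_meta})  # merging metadata is an assumption
    for text, meta in processed
    for chunk, chunk_meta in split_paragraphs(text)
]
```

Functions of these shapes could then be passed as parser=parse_utf8, splitter=split_paragraphs, and doc_post_processors=[strip_and_measure] when constructing the store.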
classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)
Initializes DocumentStore by using LangChain components.
- Parameters
  - docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, which contain source documents.
  - retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
  - parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents.
  - splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts.
classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)
Initializes DocumentStore by using LlamaIndex TransformComponents.
- Parameters
  - docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, which contain source documents.
  - retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
  - transformations (list[TransformComponent]) – list of LlamaIndex components.
  - parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents.
inputs_query(input_queries)
Query DocumentStore for the list of input documents.
retrieve_query(retrieval_queries)
Query DocumentStore for the list of closest texts to a given query.
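To make "closest texts" concrete, here is a toy sketch of top-k retrieval by cosine similarity. It is not how DocumentStore or its retriever works internally: the bag-of-words embed function stands in for a real embedder, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_texts(query: str, texts: list[str], k: int = 3) -> list[str]:
    """Return the k texts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)
    return ranked[:k]


texts = [
    "pathway builds streaming pipelines",
    "vector index over documents",
    "cooking pasta at home",
]
top = closest_texts("index documents", texts, k=2)
```

A real index would replace the linear scan and the toy embedding, but the contract is the same: a query goes in and the k nearest texts come out.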
statistics_query(info_queries)
Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of last modification, and the time of last indexing of input documents.
class DocumentStoreClient(host=None, port=None, url=None, timeout=15, additional_headers=None)
A client you can use to query DocumentStore.
Please provide either the "url", or "host" and "port".
- Parameters
  - host (str | None) – host on which VectorStoreServer listens.
  - port (int | None) – port on which VectorStoreServer listens.
  - url (str | None) – url at which VectorStoreServer listens.
  - timeout (int | None) – timeout for the post requests in seconds.
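The "either url, or host and port" rule can be sketched as a small validation helper. This is a hypothetical illustration, not the client's actual code: resolve_base_url, the http scheme, the trailing-slash handling, and the error messages are all assumptions.

```python
from typing import Optional


def resolve_base_url(
    host: Optional[str] = None,
    port: Optional[int] = None,
    url: Optional[str] = None,
) -> str:
    """Turn (url) or (host, port) into one base URL, rejecting mixed input."""
    if url is not None:
        if host is not None or port is not None:
            raise ValueError("provide either url, or host and port, not both")
        return url.rstrip("/")
    if host is None or port is None:
        raise ValueError("provide either url, or both host and port")
    # Assumption: plain HTTP; the real client may choose the scheme differently.
    return f"http://{host}:{port}"


base_from_parts = resolve_base_url(host="localhost", port=8000)
base_from_url = resolve_base_url(url="http://localhost:8000/")
```

Both calls resolve to the same base address, which is the point of accepting the two alternative forms.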
get_input_files(metadata_filter=None, filepath_globpattern=None)
Fetch information on documents in the vector store.
- Parameters
  - metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
  - filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
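The two filter arguments are plain strings, so they can be prepared and sanity-checked locally before being sent. The sketch below uses only the standard library. The metadata fields (path, owner) are hypothetical, the local preview assumes both filters are applied together, and fnmatch semantics (where * also matches /) may differ from the server's glob matching.

```python
from fnmatch import fnmatchcase

# Hypothetical metadata records; field names "path" and "owner" are assumptions.
documents = [
    {"path": "reports/q1.pdf", "owner": "alice"},
    {"path": "notes/todo.txt", "owner": "alice"},
    {"path": "slides/deck.pdf", "owner": "bob"},
]

# A JMESPath comparison expression using a raw string literal.
metadata_filter = "owner == 'alice'"
# A glob pattern selecting PDF files.
filepath_globpattern = "*.pdf"


def local_preview(docs: list[dict], owner: str, pattern: str) -> list[str]:
    """Approximate locally which paths both filters would keep."""
    return [
        d["path"]
        for d in docs
        if d["owner"] == owner and fnmatchcase(d["path"], pattern)
    ]


selected = local_preview(documents, "alice", filepath_globpattern)
```

The strings metadata_filter and filepath_globpattern are what would be passed to get_input_files; the preview function only mimics their combined effect on sample data.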
get_vectorstore_statistics()
Fetch basic statistics about the vector store.
query(query, k=3, metadata_filter=None, filepath_globpattern=None)
Perform a query to the vector store and fetch results.
- Parameters
  - query (str)
  - k (int) – number of documents to be returned.
  - metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
  - filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
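As a rough sketch of how the arguments of query might be bundled into a request body, the helper below serializes them to JSON and omits unset optional filters. The helper name and the wire format (keys mirroring the parameter names) are assumptions for illustration and are not confirmed by this reference.

```python
import json
from typing import Optional


def build_query_payload(
    query: str,
    k: int = 3,
    metadata_filter: Optional[str] = None,
    filepath_globpattern: Optional[str] = None,
) -> str:
    """Serialize query arguments to JSON, skipping filters that are None."""
    payload: dict = {"query": query, "k": k}
    if metadata_filter is not None:
        payload["metadata_filter"] = metadata_filter
    if filepath_globpattern is not None:
        payload["filepath_globpattern"] = filepath_globpattern
    return json.dumps(payload)


body = build_query_payload("What is Pathway?", k=2, filepath_globpattern="*.pdf")
decoded = json.loads(body)
```

Leaving out unset filters keeps the request minimal; whether the server expects absent keys or explicit nulls is something to verify against the actual service.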
class SlidesDocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)
Document store for the slide-search application.
Builds a document indexing pipeline and starts an HTTP REST server.
Adds to the DocumentStore a new method, parsed_documents, which exposes the set of document metadata after the parsing and document post-processing stages.
classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)
Initializes DocumentStore by using LangChain components.
- Parameters
  - docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, which contain source documents.
  - retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
  - parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents.
  - splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts.
classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)
Initializes DocumentStore by using LlamaIndex TransformComponents.
- Parameters
  - docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, which contain source documents.
  - retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
  - transformations (list[TransformComponent]) – list of LlamaIndex components.
  - parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents.
inputs_query(input_queries)
Query DocumentStore for the list of input documents.
parsed_documents_query(parse_docs_queries)
Query the SlidesDocumentStore for the list of documents with the associated metadata after the parsing stage.
retrieve_query(retrieval_queries)
Query DocumentStore for the list of closest texts to a given query.
statistics_query(info_queries)
Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of last modification, and the time of last indexing of input documents.