pw.xpacks.llm.document_store

Pathway Document Store for processing and indexing documents.

The document store reads source documents, builds a vector index over them, and exposes multiple methods for querying.

class DocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)

Builds a document indexing pipeline for processing documents and querying closest documents to a query according to a specified index.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, that contain the source documents. Each table must contain a data column of type bytes, usually obtained by setting the connector's format to "raw". Optionally, it can contain a _metadata column holding a dictionary of metadata, which is then used for filtering. Some connectors offer a with_metadata argument that returns the _metadata column.
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | UDF | None) – callable that parses file contents into a list of documents.
    • splitter (Callable[[str], list[tuple[str, dict]]] | UDF | None) – callable that splits long documents.
    • doc_post_processors (list[Callable[[str, dict], tuple[str, dict]]] | None) – optional list of callables that modify parsed files and metadata. Each doc_post_processor is a Callable that takes two arguments (text: str, metadata: dict) and returns them as a tuple.
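As an illustration of the callable signatures listed above, here is a minimal sketch in plain Python. The function names parse_utf8, add_word_count, and split_paragraphs are hypothetical and not part of Pathway; they only show the shapes that parser, doc_post_processors, and splitter are expected to have. The metadata keys are likewise made up for the example.

```python
from __future__ import annotations


def parse_utf8(contents: bytes) -> list[tuple[str, dict]]:
    """Parser shape: raw bytes in, list of (text, metadata) documents out."""
    text = contents.decode("utf-8", errors="replace")
    return [(text, {})]


def add_word_count(text: str, metadata: dict) -> tuple[str, dict]:
    """Post-processor shape: takes (text, metadata) and returns them as a tuple."""
    # "word_count" is a hypothetical metadata key used only for illustration.
    return text, {**metadata, "word_count": len(text.split())}


def split_paragraphs(text: str) -> list[tuple[str, dict]]:
    """Splitter shape: one long text in, list of (chunk, metadata) out."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # "chunk_index" is a hypothetical metadata key used only for illustration.
    return [(p, {"chunk_index": i}) for i, p in enumerate(paragraphs)]


# Calling the three callables by hand on a small input.
parsed = parse_utf8(b"First paragraph.\n\nSecond paragraph.")
processed = [add_word_count(text, meta) for text, meta in parsed]
chunks = split_paragraphs(processed[0][0])
```

Under these assumptions, such functions could be passed as parser=parse_utf8, splitter=split_paragraphs, and doc_post_processors=[add_word_count].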

classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)

Initializes DocumentStore by using LangChain components.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
    • splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts

classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)

Initializes DocumentStore by using LlamaIndex TransformComponents.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • transformations (list[TransformComponent]) – list of LlamaIndex components.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents

inputs_query(input_queries)

Query DocumentStore for the list of input documents.

retrieve_query(retrieval_queries)

Query DocumentStore for the list of closest texts to a given query.

statistics_query(info_queries)

Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of the last modification, and the time of the last indexing of input documents.

class DocumentStoreClient(host=None, port=None, url=None, timeout=15, additional_headers=None)

A client you can use to query DocumentStore.

Please provide either the "url", or "host" and "port".
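The "either url, or host and port" rule can be sketched as a small validation helper. This is not the library's implementation: resolve_base_url is a hypothetical function, and the http scheme and the exact error conditions are assumptions made for the example.

```python
from __future__ import annotations


def resolve_base_url(
    host: str | None = None,
    port: int | None = None,
    url: str | None = None,
) -> str:
    """Return a base URL from either `url` or the (`host`, `port`) pair."""
    if url is not None:
        # Assumption: mixing `url` with `host`/`port` is treated as an error.
        if host is not None or port is not None:
            raise ValueError("Provide either url, or host and port, not both.")
        return url
    if host is None or port is None:
        raise ValueError("Provide either url, or both host and port.")
    # Assumption: plain HTTP is used when only host and port are given.
    return f"http://{host}:{port}"
```

For example, resolve_base_url(host="localhost", port=8000) and resolve_base_url(url="http://localhost:8000") yield the same address under these assumptions.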

get_input_files(metadata_filter=None, filepath_globpattern=None)

Fetch information on documents in the vector store.

  • Parameters
    • metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
    • filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
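To make the two filter arguments more concrete, the sketch below builds a JMESPath expression string and a glob pattern, then previews the glob locally with the standard library. The metadata field owner and the file paths are hypothetical, and fnmatchcase is only an approximation: its * also matches across directory separators, which may differ from how the server evaluates filepath_globpattern.

```python
from fnmatch import fnmatchcase

# JMESPath expression comparing a hypothetical `owner` metadata field
# against a raw string literal (single quotes in JMESPath).
metadata_filter = "owner == 'alice'"

# Glob pattern restricting the search to PDF files under reports/.
filepath_globpattern = "reports/*.pdf"

# Hypothetical document paths, used only to preview the pattern locally.
paths = [
    "reports/2024/q1.pdf",
    "reports/2024/notes.txt",
    "slides/intro.pdf",
]
matching = [p for p in paths if fnmatchcase(p, filepath_globpattern)]

# Keyword arguments in the shape accepted by get_input_files and query.
filter_kwargs = {
    "metadata_filter": metadata_filter,
    "filepath_globpattern": filepath_globpattern,
}
```

The resulting dictionary could then be unpacked into a client call, for example get_input_files(**filter_kwargs).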

get_vectorstore_statistics()

Fetch basic statistics about the vector store.

query(query, k=3, metadata_filter=None, filepath_globpattern=None)

Perform a query to the vector store and fetch results.

  • Parameters
    • query (str) – the query text to search for
    • k (int) – number of documents to be returned
    • metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. The search will happen only for documents satisfying this filtering.
    • filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
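The role of k can be illustrated with a brute-force nearest-neighbour search in plain Python. This is only a conceptual sketch: the actual ranking depends on the index built by retriever_factory, and the functions cosine_similarity and top_k, as well as the toy two-dimensional vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to query_vec.

    `indexed` is a list of (text, vector) pairs standing in for an index.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "index" with made-up vectors.
indexed = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
closest = top_k([1.0, 0.0], indexed, k=2)
```

A larger k returns more candidates at the cost of including less similar texts.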

class SlidesDocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)

Document store for the slide-search application. Builds a document indexing pipeline and starts an HTTP REST server.

Extends DocumentStore with a new method, parsed_documents, which returns the set of document metadata after the parsing and document post-processing stages.

classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)

Initializes DocumentStore by using LangChain components.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
    • splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts

classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)

Initializes DocumentStore by using LlamaIndex TransformComponents.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • transformations (list[TransformComponent]) – list of LlamaIndex components.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents

inputs_query(input_queries)

Query DocumentStore for the list of input documents.

parsed_documents_query(parse_docs_queries)

Query the SlidesDocumentStore for the list of documents with the associated metadata after the parsing stage.

retrieve_query(retrieval_queries)

Query DocumentStore for the list of closest texts to a given query.

statistics_query(info_queries)

Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of the last modification, and the time of the last indexing of input documents.