
Pathway Document Store for processing and indexing documents.

The document store reads source documents, builds a vector index over them, and exposes multiple methods for querying.

class DocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)


Builds a document indexing pipeline for processing documents and querying closest documents to a query according to a specified index.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – Pathway tables, typically coming out of connectors, that contain source documents. Each table needs a data column of type bytes, usually obtained by setting the format of the connector to "raw". Optionally, it can contain a _metadata column holding a dictionary with metadata, which is then used for filters. Some connectors offer a with_metadata argument for returning the _metadata column.
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | UDF | None) – callable that parses file contents into a list of documents.
    • splitter (Callable[[str], list[tuple[str, dict]]] | UDF | None) – callable that splits long documents.
    • doc_post_processors (list[Callable[[str, dict], tuple[str, dict]]] | None) – optional list of callables that modify parsed files and metadata. Each doc_post_processor is a Callable that takes two arguments (text: str, metadata: dict) and returns them as a tuple.
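
The callable signatures listed above can be sketched in plain Python. This is a minimal illustration, not Pathway's built-in implementation: the names utf8_parser, paragraph_splitter, strip_text, and run_pipeline are hypothetical. run_pipeline only imitates a parse, post-process, split order; running the splitter last is an assumption, and the actual wiring inside DocumentStore may differ.

```python
# Hypothetical parser: bytes -> list of (text, metadata) pairs.
def utf8_parser(contents: bytes) -> list[tuple[str, dict]]:
    text = contents.decode("utf-8", errors="replace")
    return [(text, {"encoding": "utf-8"})]


# Hypothetical splitter: one long text -> list of (chunk, metadata) pairs.
def paragraph_splitter(text: str) -> list[tuple[str, dict]]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, {"chunk_index": i}) for i, p in enumerate(paragraphs)]


# Hypothetical post-processor: takes (text, metadata), returns them as a tuple.
def strip_text(text: str, metadata: dict) -> tuple[str, dict]:
    stripped = text.strip()
    return stripped, {**metadata, "length": len(stripped)}


# Assumed stage order (parse, post-process, split); for illustration only.
def run_pipeline(contents, parser, post_processors, splitter):
    chunks = []
    for text, metadata in parser(contents):
        for post_process in post_processors:
            text, metadata = post_process(text, metadata)
        for chunk, chunk_metadata in splitter(text):
            # Chunk-level metadata is merged on top of document-level metadata.
            chunks.append((chunk, {**metadata, **chunk_metadata}))
    return chunks


chunks = run_pipeline(
    b"  First paragraph.\n\nSecond paragraph.\n",
    utf8_parser,
    [strip_text],
    paragraph_splitter,
)
```

Callables of these shapes are what the parser, splitter, and doc_post_processors arguments are documented to accept; the toy logic inside them is only a placeholder.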

classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)

Initializes DocumentStore by using LangChain components.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
    • splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts

classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)

Initializes DocumentStore by using LlamaIndex TransformComponents.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • transformations (list[TransformComponent]) – list of LlamaIndex components.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents


Query DocumentStore for the list of input documents.


Query DocumentStore for the list of closest texts to a given query.


Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of last modification, and the time of last indexing of input documents.

class DocumentStoreClient(host=None, port=None, url=None, timeout=15, additional_headers=None)


A client you can use to query DocumentStore.

Please provide either the "url", or "host" and "port".
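
The "either url, or host and port" rule can be sketched as a small validation helper. This is an assumption-laden illustration rather than the client's actual code: resolve_base_url is a hypothetical name, the http:// scheme used for the host/port form is assumed, and rejecting a call that passes both forms is a design choice made here, not something this page states.

```python
from typing import Optional


def resolve_base_url(
    host: Optional[str] = None,
    port: Optional[int] = None,
    url: Optional[str] = None,
) -> str:
    """Hypothetical helper: turn (host, port) or url into one base URL."""
    if url is not None:
        # Assumption: mixing both forms is treated as an error.
        if host is not None or port is not None:
            raise ValueError('Provide either "url", or "host" and "port", not both.')
        return url.rstrip("/")
    if host is None or port is None:
        raise ValueError('Provide either "url", or both "host" and "port".')
    # Assumption: plain HTTP when only host and port are given.
    return f"http://{host}:{port}"


local = resolve_base_url(host="localhost", port=8000)
remote = resolve_base_url(url="https://example.com/")
```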

  • Parameters

get_input_files(metadata_filter=None, filepath_globpattern=None)

Fetch information on documents in the vector store.

  • Parameters
    • metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. Only documents satisfying this filter are searched.
    • filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
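
To give a feel for what a glob pattern selects, the sketch below filters a list of paths locally with Python's fnmatch module. It is only an approximation: the paths and the select_paths helper are hypothetical, and the glob dialect applied by the server (for example, how it treats "/" or "**") may differ from fnmatch.

```python
from fnmatch import fnmatchcase

# Hypothetical document paths, for illustration only.
paths = [
    "reports/2023/summary.pdf",
    "reports/2024/summary.pdf",
    "notes/todo.md",
]


def select_paths(paths, filepath_globpattern=None):
    """Keep paths matching the pattern; None means no path filtering."""
    if filepath_globpattern is None:
        return list(paths)
    # Note: in fnmatch, "*" also matches "/" characters.
    return [p for p in paths if fnmatchcase(p, filepath_globpattern)]


pdfs = select_paths(paths, "reports/*/*.pdf")
markdown = select_paths(paths, "*.md")
```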


Fetch basic statistics about the vector store.

query(query, k=3, metadata_filter=None, filepath_globpattern=None)

Perform a query to the vector store and fetch results.

  • Parameters
    • query (str) – the query text
    • k (int) – number of documents to be returned
    • metadata_filter (str | None) – optional string representing the metadata filtering query in the JMESPath format. Only documents satisfying this filter are searched.
    • filepath_globpattern (str | None) – optional glob pattern specifying which documents will be searched for this query.
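
The parameters above can be assembled as keyword arguments before calling query. The sketch below is a hedged illustration: build_query_kwargs is a hypothetical helper, the metadata fields owner and path are assumed to exist in the documents' metadata, and the checks on query and k are extra validation added here, not documented behavior.

```python
from typing import Optional


def build_query_kwargs(
    query: str,
    k: int = 3,
    metadata_filter: Optional[str] = None,
    filepath_globpattern: Optional[str] = None,
) -> dict:
    """Hypothetical helper mirroring the documented query() parameters."""
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    if k < 1:
        raise ValueError("k must be a positive integer")
    kwargs = {"query": query, "k": k}
    # Optional filters are included only when they are set.
    if metadata_filter is not None:
        kwargs["metadata_filter"] = metadata_filter
    if filepath_globpattern is not None:
        kwargs["filepath_globpattern"] = filepath_globpattern
    return kwargs


kwargs = build_query_kwargs(
    "quarterly revenue",
    k=5,
    # JMESPath expression; "owner" and "path" are assumed metadata fields.
    metadata_filter="owner == 'alice' && contains(path, 'reports')",
    filepath_globpattern="reports/*.pdf",
)
```

With a DocumentStoreClient instance at hand, a dictionary like this could be unpacked as client.query(**kwargs), since its keys match the signature shown above.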

class SlidesDocumentStore(docs, retriever_factory, parser=None, splitter=None, doc_post_processors=None)


Document store for the slide-search application. Builds a document indexing pipeline and starts an HTTP REST server.

Adds to the DocumentStore a new method, parsed_documents, which returns the set of document metadata after the parsing and document post-processing stages.

classmethod from_langchain_components(docs, retriever_factory, parser=None, splitter=None, **kwargs)

Initializes DocumentStore by using LangChain components.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents
    • splitter (BaseDocumentTransformer | None) – LangChain component for splitting documents into parts

classmethod from_llamaindex_components(docs, retriever_factory, transformations, parser=None, **kwargs)

Initializes DocumentStore by using LlamaIndex TransformComponents.

  • Parameters
    • docs (Union[Table, Iterable[Table]]) – pathway tables typically coming out of connectors which contain source documents
    • retriever_factory (AbstractRetrieverFactory) – factory for building an index, which will be provided texts by the DocumentStore.
    • transformations (list[TransformComponent]) – list of LlamaIndex components.
    • parser (Callable[[bytes], list[tuple[str, dict]]] | None) – callable that parses file contents into a list of documents


Query DocumentStore for the list of input documents.


Query the SlidesDocumentStore for the list of documents with the associated metadata after the parsing stage.


Query DocumentStore for the list of closest texts to a given query.


Query DocumentStore for statistics about indexed documents. It returns the number of indexed texts, the time of last modification, and the time of last indexing of input documents.