Pathway uses Retrieval-Augmented Generation (RAG) to answer questions over your documents. Here's a breakdown of the pipeline:
1. Document Ingestion: Connect your document sources, such as Google Drive, SharePoint, local storage, or S3. Supported formats include text, PDF, DOCX, and HTML. Documents are indexed for efficient retrieval without being duplicated.
2. Retrieval Stage: When you ask a question, Pathway runs a similarity search over its document index and retrieves the passages most relevant to your query.
3. Generation Stage: The retrieved passages are fed into a GPT model of your choice, fine-tuned for document understanding. The model synthesizes the information and generates a clear, concise, and informative answer to your question.
4. Summarization (Optional): For multiple documents, Pathway can automatically summarize responses, presenting key points for quick review.
5. Continuous Learning: Pathway adapts as your documents change. Additions, edits, and deletions are propagated to an always-refreshed vector index, so retrieval and generation are always based on the latest information.
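The steps above can be illustrated with a minimal, self-contained sketch. It is not Pathway's implementation: the `ToyIndex` class, the word-count `embed` function, and `build_prompt` are hypothetical stand-ins for a real embedding model, vector index, and LLM call, and the file paths and texts are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical in-memory index keyed by file path."""

    def __init__(self):
        self._docs = {}

    def upsert(self, path, text):
        # Adding or changing a file replaces its entry, so the index stays current.
        self._docs[path] = (text, embed(text))

    def remove(self, path):
        # Deleting a file drops it from all later searches.
        self._docs.pop(path, None)

    def retrieve(self, query, k=2):
        # Score every document against the query and keep the k best non-zero matches.
        q = embed(query)
        scored = [(cosine(q, vec), path, text) for path, (text, vec) in self._docs.items()]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(path, text) for _, path, text in scored[:k]]


def build_prompt(question, passages):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"[{path}] {text}" for path, text in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


index = ToyIndex()
index.upsert("hr/leave.txt", "Employees receive 25 days of paid leave per year.")
index.upsert("it/vpn.txt", "Connect to the VPN before opening the internal wiki.")

question = "How many days of paid leave do employees get?"
passages = index.retrieve(question, k=1)
prompt = build_prompt(question, passages)

# After a deletion, the removed file can no longer be retrieved.
index.remove("hr/leave.txt")
after_removal = index.retrieve(question, k=1)
```

In a real deployment the prompt would be passed to the chosen GPT model; the sketch stops at prompt assembly so that it runs without network access or API keys.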
Benefits:
Accurate & Insightful Answers: Get straight to the point with answers sourced directly from your documents.
Effortless Maintenance: No separate data preprocessing needed. No need for a separate vector database. Pathway automatically keeps itself up-to-date.
Highly Customizable: Adapt the pipeline to your specific needs, including prompts, search queries, and more.
Seamless Integration: Works with various document sources and integrates smoothly with your existing workflows.
This is a basic service for a real-time document indexing pipeline powered by Pathway.
The capabilities of the service include:
Real-time document indexing from Microsoft 365 SharePoint
Real-time document indexing from Google Drive
Similarity search by user query
Filtering by metadata, using a condition written in JMESPath format
Basic stats on the indexer's health
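As a rough illustration of the similarity search and metadata filtering capabilities, the sketch below assembles a JSON request body that combines a user query with a JMESPath condition. The field names (`query`, `k`, `metadata_filter`) and the metadata keys (`path`, `modified_at`) are assumptions made for this example, not a confirmed description of the service's API; check the service documentation for the actual request schema.

```python
import json
from typing import Optional


def build_search_request(query: str, k: int = 3, metadata_filter: Optional[str] = None) -> str:
    """Serialize a similarity-search request (hypothetical schema) to JSON."""
    body = {"query": query, "k": k}
    if metadata_filter is not None:
        # The condition is sent as a plain string and would be evaluated by the service.
        body["metadata_filter"] = metadata_filter
    return json.dumps(body)


# JMESPath condition: the path contains "reports" and the file was modified
# at or after the given Unix timestamp. Backticks delimit a JSON literal in JMESPath.
condition = "contains(path, 'reports') && modified_at >= `1700000000`"

request_body = build_search_request("quarterly revenue", k=5, metadata_filter=condition)
```

Keeping the filter as a single JMESPath string means the client does not need to know how the service stores metadata; it only has to name the fields it wants to constrain.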
Supported document formats include plaintext, PDF, DOCX, and HTML. For the complete list, please refer to the supported formats of the unstructured library. The pipeline also handles data removals: when you delete a file, within a few seconds similarity search results no longer reflect its contents.
Please also keep in mind the following constraints and limitations:
The maximum supported file size is 4 MB, and at most 100 KB of plaintext may be obtained after parsing. Anything larger is ignored by the indexer
Files in the shared spaces are removed within 15 minutes of their addition
You are responsible for the contents of the files you upload
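The size limits above can be checked on the client side before uploading. The following is a minimal sketch under stated assumptions: the helper name `is_indexable` is hypothetical, and the limits are interpreted in binary units (1 MB = 1024 * 1024 bytes, 1 KB = 1024 bytes), which the service may or may not use.

```python
# Assumed interpretation of the documented limits, in bytes.
MAX_FILE_BYTES = 4 * 1024 * 1024   # 4 MB raw file size
MAX_PARSED_BYTES = 100 * 1024      # 100 KB of plaintext after parsing


def is_indexable(file_size_bytes: int, parsed_text: str) -> bool:
    """Return True if a file fits both the raw-size and parsed-text limits."""
    parsed_size = len(parsed_text.encode("utf-8"))
    return file_size_bytes <= MAX_FILE_BYTES and parsed_size <= MAX_PARSED_BYTES


small_ok = is_indexable(200_000, "A short report.")
too_big_file = is_indexable(5 * 1024 * 1024, "A short report.")
too_much_text = is_indexable(200_000, "x" * (MAX_PARSED_BYTES + 1))
```

Since oversized files are ignored by the indexer, a pre-upload check like this can help explain why a document never appears in search results.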