How it Works

This pipeline can use several Pathway connectors to read data from the local file system, Google Drive, and Microsoft SharePoint. The connectors poll these sources with low latency and track modifications, so whenever a tracked file changes, the change is reflected in the internal collections. The contents are read into a single Pathway Table as binary objects.

Those binary objects are then parsed with the “unstructured” library and split into chunks. The pipeline embeds the resulting chunks using the OpenAI API.

Finally, the embeddings are indexed using Pathway's machine-learning library. You can then query the resulting index with simple HTTP requests to the endpoints mentioned above.
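For instance, once the server is running you can query it over HTTP. Below is a minimal sketch, assuming the question-answering endpoint used in Pathway's public demos (/v1/pw_ai_answer) and the host and port from config.yaml; substitute whichever endpoints and port your deployment actually exposes.

import requests

# Ask the running RAG server a question over HTTP.
# The endpoint path and port are assumptions based on Pathway's public demos;
# adjust them to match your config.yaml and the endpoints listed above.
response = requests.post(
    "http://localhost:8000/v1/pw_ai_answer",
    json={"prompt": "What are the main findings in the indexed documents?"},
    timeout=60,
)
print(response.json())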

Understanding your RAG pipeline

This folder contains several objects:

  • app.py, the application code using Pathway and written in Python;
  • config.yaml, the file containing configuration stubs for the data sources, the OpenAI LLM model, and the web server. Customize it if you want to change the LLM model, use the Google Drive data source, or change the filesystem directories to be indexed (see the sketch after this list);
  • requirements.txt, the dependencies for your pipeline. It can be passed to pip install -r ... to install everything that is needed to launch the pipeline locally;
  • Dockerfile, the Docker configuration for running the pipeline in the container;
  • .env, a short environment variables configuration file where the OpenAI key must be stored;
  • data/, a folder with exemplary files that can be used for the test runs.
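To give a sense of its shape, here is a minimal illustrative sketch of config.yaml, inferred from the keys that app.py reads (sources, llm_config, host_config); the exact fields and values in your copy may differ.

# Illustrative config.yaml sketch (field names inferred from app.py; adjust to your setup).
sources:
  - kind: local
    config:
      path: data/            # folder with the files to index
llm_config:
  model: gpt-3.5-turbo       # any OpenAI chat model name
host_config:
  host: 0.0.0.0
  port: 8000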

Let's understand your application code in app.py

Your app.py file follows a sequence of steps. Before diving into the code, here is an overview:

  1. Set Up Your License Key: You ensure you have the necessary access to Pathway features.
  2. Configure Logging: Set up logging to monitor what’s happening in your application.
  3. Load Environment Variables: Manage sensitive data securely.
  4. Define Data Sources Function: Handle data from various sources seamlessly.
  5. Main Function with Click: Use command-line interaction to control your pipeline.
  6. Initialize Embedder: Convert text to embeddings for further processing.
  7. Initialize Chat Model: Set up your language model for generating responses.
  8. Set Up Vector Store: Manage and retrieve document embeddings efficiently.
  9. Set Up RAG Application: Combine retrieval and generation for effective question answering.
  10. Build and Run Server: Start your server to handle real-time requests.
app.py
import logging
import sys
import click
import pathway as pw
import yaml
from dotenv import load_dotenv
from pathway.udfs import DiskCache, ExponentialBackoffRetryStrategy
from pathway.xpacks.llm import embedders, llms, parsers, splitters
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer
from pathway.xpacks.llm.vector_store import VectorStoreServer

# Set your Pathway license key here to use advanced features.
# You can obtain a free license key from: https://pathway.com/get-license
# If you're using the Community version, you can comment this out.
pw.set_license_key("demo-license-key-with-telemetry")

# Set up basic logging to capture key events and errors.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Load environment variables (e.g., API keys) from the .env file.
load_dotenv()

# Function to handle data sources. The example uses a local folder first,
# but you can easily connect to Google Drive or SharePoint as alternative data sources.
def data_sources(source_configs) -> list[pw.Table]:
    sources = []
    for source_config in source_configs:
        if source_config["kind"] == "local":
            # Reading data from a local directory (default is the 'data' folder).
            # This is the first option used in this example, but it's flexible for other sources.
            source = pw.io.fs.read(
                **source_config["config"],
                format="binary",
                with_metadata=True,
            )
            sources.append(source)

        elif source_config["kind"] == "gdrive":
            # Reading data from a Google Drive folder.
            # Requires OAuth credentials specified in the config.yaml file.
            source = pw.io.gdrive.read(
                **source_config["config"],
                with_metadata=True,
            )
            sources.append(source)

        elif source_config["kind"] == "sharepoint":
            try:
                # Import the SharePoint connector for reading data from SharePoint.
                import pathway.xpacks.connectors.sharepoint as io_sp
                
                # Reading data from a SharePoint folder.
                # Note: The SharePoint connector is part of Pathway's commercial offering.
                source = io_sp.read(**source_config["config"], with_metadata=True)
                sources.append(source)
            except ImportError:
                # If SharePoint is configured but the connector isn't available, exit the program.
                print(
                    "The Pathway Sharepoint connector is part of the commercial offering. "
                    "Please contact us for a commercial license."
                )
                sys.exit(1)

    return sources

# Command-line interface (CLI) function to run the app with a specified config file.
@click.command()
@click.option("--config_file", default="config.yaml", help="Config file to be used.")
def run(config_file: str = "config.yaml"):
    # Load the configuration from the YAML file.
    with open(config_file) as config_f:
        configuration = yaml.safe_load(config_f)

    # Fetch the GPT model from the YAML configuration.
    GPT_MODEL = configuration["llm_config"]["model"]

    # Initialize the OpenAI Embedder to handle embeddings with caching enabled.
    embedder = embedders.OpenAIEmbedder(
        model="text-embedding-ada-002",
        cache_strategy=DiskCache(),
    )

    # Set up OpenAI's GPT model for answering questions with retry and caching strategies.
    chat = llms.OpenAIChat(
        model=GPT_MODEL,
        retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),
        cache_strategy=DiskCache(),
        temperature=0.05,  # Low temperature for less random responses.
    )

    # Host and port configuration for running the server.
    host_config = configuration["host_config"]
    host, port = host_config["host"], host_config["port"]

    # Initialize the vector store for storing document embeddings in memory.
    # This vector store updates the index dynamically whenever the data source changes
    # and can scale to handle over a million documents.
    doc_store = VectorStoreServer(
        *data_sources(configuration["sources"]),
        embedder=embedder,
        splitter=splitters.TokenCountSplitter(max_tokens=400),  # Split documents by token count.
        parser=parsers.ParseUnstructured(),  # Parse unstructured data for better handling.
    )

    # Create a RAG (Retrieve and Generate) question-answering application.
    rag_app = BaseRAGQuestionAnswerer(llm=chat, indexer=doc_store)

    # Build the server to handle requests at the specified host and port.
    rag_app.build_server(host=host, port=port)

    # Run the server with caching enabled, and handle errors without shutting down.
    rag_app.run_server(with_cache=True, terminate_on_error=False)

# Entry point to execute the app if the script is run directly.
if __name__ == "__main__":
    run()

Possible Modifications

  • Change Input Folders: Update paths to new data folders.
  • Modify LLM: Switch to a different language model.
  • Change Embedder: Use an alternative embedder from embedders (see the sketch after this list).
  • Update Index: Configure a different indexing method.
  • Host and Port: Adjust the host and port settings for different environments.
  • Run Options: Enable or disable caching and specify a new cache folder.
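As an illustration of the embedder swap, here is a minimal sketch of replacing the OpenAI embedder in app.py with a local one. SentenceTransformerEmbedder and the model name are assumptions about what is available in your Pathway version; keep the OpenAI embedder if they are not.

# Hypothetical alternative embedder (assumes SentenceTransformerEmbedder is
# available in your Pathway version; the model name is illustrative).
from pathway.xpacks.llm import embedders

embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2")

# The rest of the pipeline stays unchanged: pass this embedder to VectorStoreServer.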

You can also create new components by extending the pw.UDF class and implementing the __wrapped__ function.
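For example, here is a minimal sketch of a custom component; the class name and behavior are purely illustrative.

import pathway as pw

class UppercaseText(pw.UDF):
    # Illustrative custom component: upper-cases every input string.
    # Pathway calls __wrapped__ on each value of the column the UDF is applied to.
    def __wrapped__(self, text: str) -> str:
        return text.upper()

# Usage sketch: apply the component to a column of a Pathway table.
# uppercase = UppercaseText()
# table = table.select(shouting=uppercase(pw.this.text))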

Conclusion

This example demonstrates how to set up a powerful RAG pipeline whose knowledge base is always up to date. We've only scratched the surface, and there's more to explore:

  • Re-ranking: Prioritize the most relevant results for your specific query.
  • Knowledge Graphs: Leverage relationships between entities to improve understanding.
  • Hybrid Indexing: Combine different indexing strategies for optimal retrieval.
  • Adaptive Reranking: Iteratively enlarge the context for optimal accuracy; see our next tutorial on adaptive RAG.

Stay tuned for future examples exploring these RAG techniques with Pathway!

Enjoy building your RAG project! If you have any questions or need further assistance, feel free to reach out to the Pathway team or check with your peers from the bootcamp cohort.

What if you want to use a Multimodal LLM like GPT-4o

That's a great idea. Multimodal LLMs like GPT-4o excel at parsing images, charts, and tables, which can significantly improve accuracy even for text-based use cases.

For example, imagine you're building a RAG project with Google Drive as a data source, and that Drive folder contains financial documents full of charts and tables. Below is an example where Pathway parses the tables as images and uses GPT-4o to produce a much more accurate response.
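If you want to experiment with this on the pipeline above, the minimal first step is to point the chat model at GPT-4o, either in config.yaml or directly in app.py as sketched below. Note that this alone does not enable image or table parsing; the full multimodal setup lives in the linked example.

# Minimal change to the pipeline above: use GPT-4o as the chat model.
# This does not by itself parse tables as images; see the linked
# gpt_4o_multimodal_rag example for the complete multimodal setup.
chat = llms.OpenAIChat(
    model="gpt-4o",
    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),
    cache_strategy=DiskCache(),
    temperature=0.0,
)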

See the gpt_4o_multimodal_rag example on GitHub.