LlamaIndex and Pathway: RAG Apps with always-up-to-date knowledge
You can now use Pathway in your RAG applications: the LlamaIndex integration gives your LLMs always-up-to-date knowledge from your documents.
Pathway is now available on LlamaIndex, a data framework for LLM-based applications to ingest, structure, and access private or domain-specific data. You can now query Pathway and access up-to-date documents for your RAG applications from LlamaIndex using Pathway Reader and Retriever.
With this new integration, you can use the Pathway vector store natively in LlamaIndex, which opens up endless new possibilities! In this article, you will take a quick dive into Pathway + LlamaIndex and explore how to create a simple yet powerful RAG solution using PathwayRetriever.
Why Pathway?
Pathway offers an indexing solution that is always up to date, without the traditional ETL pipelines that regular vector DBs require. It can monitor several data sources (files, S3 folders, cloud storage) and provide the latest information to your LLM application.
Learning outcomes
You will learn how to create a simple RAG solution using Pathway and LlamaIndex.
This article covers:
- Creating data sources: defining the sources Pathway will read to keep the vector store updated.
- Creating a transformation pipeline (parsing, splitting, embedding) for loading documents into the vector store.
- Querying your data and getting answers from LlamaIndex.
Prerequisites
Install Pathway and the LlamaIndex packages:
pip install pathway
pip install llama-index
pip install llama-index-retrievers-pathway
pip install llama-index-embeddings-openai
Setting up a folder
To start, create a folder for Pathway to listen to. Feel free to skip this step if you already have a folder you want to build your RAG application on. You can also use Google Drive, SharePoint, or any other source from pathway-io.
mkdir -p 'data/'
Set up OpenAI API Key
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
Define data sources
Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream.
See pathway-io for more information.
You can easily connect to the data inside the folder with the Pathway file system connector. The data will automatically be updated by Pathway whenever the content of the folder changes.
import pathway as pw

data_sources = []
data_sources.append(
    pw.io.fs.read(
        "./data",
        format="binary",
        mode="streaming",
        with_metadata=True,
    )  # This creates a `pathway` connector that tracks
    # all the files in the ./data directory
)
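The same pattern works for the other pathway-io connectors. As a hedged sketch, here is what a Google Drive source could look like; the object_id and credentials file below are placeholders, so check the pathway-io documentation for the exact parameters:

# Sketch of a Google Drive source (object_id and credentials file are
# placeholders; parameter details may differ, see the pathway-io docs).
data_sources.append(
    pw.io.gdrive.read(
        object_id="<your-drive-folder-id>",
        service_user_credentials_file="credentials.json",
        mode="streaming",
        with_metadata=True,
    )
)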
Create the document indexing pipeline
Now that the data is ready, you must create the document indexing pipeline. The transformations should be a list of TransformComponents ending with an Embedding transformation. First, split the text using TokenTextSplitter, then embed it with OpenAIEmbedding. Finally, you can run the server with run_server.
from pathway.xpacks.llm.vector_store import VectorStoreServer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import TokenTextSplitter

embed_model = OpenAIEmbedding(embed_batch_size=10)

transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]

processing_pipeline = VectorStoreServer.from_llamaindex_components(
    *data_sources,
    transformations=transformations_example,
)

PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

processing_pipeline.run_server(
    host=PATHWAY_HOST, port=PATHWAY_PORT, with_cache=False, threaded=True
)
Awesome! The vector store is now active, and you're all set to start sending queries.
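Before moving on to LlamaIndex, you can sanity-check the server directly with Pathway's own client. A minimal sketch, assuming the VectorStoreClient companion class from the same xpack (check the Pathway docs for the exact signature):

from pathway.xpacks.llm.vector_store import VectorStoreClient

# Query the running server directly; returns the nearest chunks
# with their similarity scores (assumed API, see the Pathway docs).
client = VectorStoreClient(host=PATHWAY_HOST, port=PATHWAY_PORT)
print(client.query("what is pathway", k=3))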
Create the LlamaIndex Retriever and Query Engine
from llama_index.retrievers.pathway import PathwayRetriever
retriever = PathwayRetriever(host=PATHWAY_HOST, port=PATHWAY_PORT)
retriever.retrieve(str_or_query_bundle="what is pathway")
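retrieve returns a list of LlamaIndex NodeWithScore objects; since the data folder is still empty, there is nothing in it yet. A quick way to inspect whatever comes back:

# Each result pairs a retrieved chunk with a similarity score.
# With the data folder still empty, this loop prints nothing.
for node_with_score in retriever.retrieve("what is pathway"):
    print(node_with_score.score, node_with_score.node.get_content()[:80])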
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever,
)
response = query_engine.query("What is Pathway?")
print(str(response))
Out[]: Empty Response
As you can see, the LLM cannot respond clearly because it lacks the relevant knowledge, but this is where Pathway shines. Add new data to the folder Pathway is listening to, then ask the agent again to see how it responds.
To do that, download the Pathway repository README into the data folder:
wget 'https://raw.githubusercontent.com/pathwaycom/pathway/main/README.md' -O 'data/pathway_readme.md'
Try again to query with the new data:
response = query_engine.query("What is Pathway?")
print(str(response))
Out[]: Pathway is a Python framework that allows for high-throughput and low-latency real-time data processing...
As you can see, after downloading the document into the folder Pathway is listening to, the change is reflected in the query engine immediately. LLM responses stay up to date with the latest changes in your documents, which would require extra ETL steps in a regular vector DB.
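You can also watch the live indexing at work yourself. For example (a hypothetical edit, purely to exercise the pipeline), append a line to the monitored file and query again:

import time

# Append a new fact to a file in the monitored folder; Pathway detects
# the change and re-indexes it automatically, with no extra ETL step.
with open("data/pathway_readme.md", "a") as f:
    f.write("\n\nNote: this line was appended at runtime to test live indexing.\n")

time.sleep(5)  # give the streaming pipeline a moment to pick up the change
response = query_engine.query("What note was appended at runtime?")
print(str(response))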
Conclusion
With the Pathway integration in LlamaIndex, you can now access up-to-date documents for your RAG applications. You can use the Pathway Reader and Retriever to connect to your data sources and monitor them for changes, providing always up-to-date documents for your LlamaIndex application.
If you are interested in building RAG solutions with Pathway, don't hesitate to read how the vector store pipeline is built with Pathway. To learn more about the possibilities of combining the live indexing pipeline of Pathway and LLMs, check out real-time RAG alerting with Pathway and ingesting unstructured data to structured.
Join our Discord community and dive into discussions on tricks and tips for mastering Retrieval Augmented Generation.