Evaluating Pathway RAG applications with RAGAS
![](/assets/pictures/image_pathway_team.png)
Pathway streamlines the process of building RAG applications with always up-to-date knowledge. It empowers you to connect your LLM to live data sources and eliminates the need for separate ETL pipelines for knowledge management.
However, simply building and deploying a RAG app isn't enough, and evaluations shouldn't be treated as an afterthought. In Pathway, we rely on frequent evaluation runs to keep our offerings reliable. This also prevents us from introducing any silent bugs into the pipeline. This guide offers a simplified look at how we evaluate our RAG solutions at Pathway. For a detailed view of the full pipeline, including additional evaluation components and logging, check out the complete CI workflow.
You need to ensure that your RAG application delivers accurate and reliable results with YOUR data. This is where our blog post dives in. You will explore RAG evaluations, create synthetic test data if necessary, and learn how to optimize your Pathway RAG app.
Here's a sneak peek of what we'll cover:
- Essential evaluation metrics: We'll unpack key metrics used to assess different aspects of your RAG pipeline, including retrieval accuracy, generation quality, and overall system effectiveness.
- Creating a synthetic dataset: Build test data based on your files.
- Tweaking RAGAS to suit your needs: From adjusting metric calculations to customizing the LLM evaluator.
- Optimizing your Pathway RAG application: Discover how to fine-tune your RAG system for optimal performance, tailored to your specific use case and dataset.
Table of contents
- Evaluation Metrics
- Setup and Installation
- Dataset
- Synthetic Dataset Creation
- Launching the Pathway RAG App
- Evaluate with the Dataset
- Improving the Accuracy
- Summary & Findings
Evaluation Metrics
RAG evaluation metrics can be categorized into two groups: "retrieval" metrics and "generation" metrics. Retrieval metrics are usually found in the recommendation or information retrieval domains, whereas generation metrics cover LLM-related topics, including how the LLM makes use of the given context, hallucinations, truthfulness, and so on.
Some of the retrieval metrics are:
- Hit@k: Measures the proportion of times that the relevant item appears in the top-K retrieved results. This can also be referred to as "Context Recall", assuming there is only one relevant document.
- Context Recall: Focuses on the comprehensiveness of the retrieved context, measuring the proportion of all relevant documents in the corpus that are successfully retrieved. It is formally defined as (Number of Relevant Items Retrieved) / (Total Number of Relevant Items in Corpus). In simpler terms, recall tells you: "Of all the relevant documents that could have been retrieved, how many were actually retrieved?" High recall signifies that your retrieval system is good at finding most of the relevant context available.
- Context Precision: Focuses on the quality of the retrieved context by measuring the proportion of retrieved documents that are actually relevant to the query. Formally, it is calculated as (Number of Relevant Items Retrieved) / (Total Number of Items Retrieved). In contrast to Hit@k (or "Context Recall"), which emphasizes retrieving at least one relevant item within the top-K results, precision evaluates the relevance concentration within the retrieved set. Essentially, precision answers: "Of all documents retrieved, how many were relevant?"
- Mean Reciprocal Rank (MRR): Evaluates the ranking of retrieved documents by focusing on the position of the first relevant document in the ranked list. For each query, the Reciprocal Rank (RR) is calculated as 1 / rank, where rank is the position of the first relevant document. If there are no relevant documents in the retrieved list, RR is 0. MRR is then the mean of these reciprocal ranks across a set of queries. Generally, you shouldn't stress about this metric in your RAG application, largely because the benefit of having the most relevant context ranked at the top is less critical for LLMs.
- Normalized Discounted Cumulative Gain (NDCG): A ranking-based metric that evaluates the quality of retrieved results by considering both relevance and position. Unlike Hit@k and MRR, which primarily focus on whether relevant items appear at the top, NDCG assigns higher importance to highly relevant documents appearing earlier in the ranked list. This metric can be useful when you have more than one relevant item and relevance is graded with float labels instead of booleans.
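To make these definitions concrete, here is a minimal, self-contained sketch (illustrative only; later in this notebook RAGAS computes its own LLM-based versions of recall and precision) showing how Hit@k, context recall, context precision, and MRR could be computed from boolean relevance judgments:
def hit_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # 1.0 if any relevant document appears in the top-k results, else 0.0
    return float(any(doc in relevant for doc in retrieved[:k]))

def context_recall(relevant: set[str], retrieved: list[str]) -> float:
    # fraction of all relevant documents that were retrieved
    return len(relevant & set(retrieved)) / len(relevant)

def context_precision(relevant: set[str], retrieved: list[str]) -> float:
    # fraction of retrieved documents that are relevant
    return len(relevant & set(retrieved)) / len(retrieved)

def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
    # 1 / rank of the first relevant document, 0.0 if none was retrieved
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# toy example: one query with two relevant documents in the corpus
relevant_docs = {"doc_1", "doc_4"}
retrieved_docs = ["doc_3", "doc_1", "doc_2"]

print(hit_at_k(relevant_docs, retrieved_docs, k=3))      # 1.0
print(context_recall(relevant_docs, retrieved_docs))     # 0.5
print(context_precision(relevant_docs, retrieved_docs))  # ~0.33
print(reciprocal_rank(relevant_docs, retrieved_docs))    # 0.5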
As for the generation metrics:
- Faithfulness: Evaluates how grounded the LLM's answer is in the retrieved context. It measures whether the claims in the generated answer are supported by the provided context, and penalizes hallucinations.
- Answer Correctness: Measures the factual correctness of the response against the ground-truth answer. Even though this falls under the generation category, this metric generally covers the whole RAG pipeline.
These are only a small subset of all available metrics; however, we found this set to be a reliable indicator of overall RAG application performance. If you are curious about the full list of metrics available in RAGAS, check it out here!
Setup and Installation
The libmagic library is used for detecting file types in the UnstructuredParser module.
If you are running this notebook on macOS, you can install it with:
brew install libmagic
If you are running the notebook on Colab or any Linux environment, you can install it by running:
apt install libmagic1
Install the rest of the dependencies:
pip install "pathway[all]"
pip install ragas
pip install langchain-openai
Dataset
Having a representative dataset is crucial for effective evaluations. It is recommended to set aside dedicated time to create a gold-standard dataset that accurately reflects your use case.
To ensure robust evaluation, consider splitting your dataset into validation and test sets. The validation set helps fine-tune the retrieval and generation parameters, allowing for iterative improvements without overfitting to the final benchmark. The test set, kept separate from the tuning process, provides an unbiased measure of performance, ensuring that optimizations generalize beyond the development phase.
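If you want to apply such a split in practice, here is a minimal sketch of a simple random 70/30 split, assuming a dataset already saved to synthetic_dataset.jsonl as done later in this notebook:
import random
from ragas import EvaluationDataset

full_dataset = EvaluationDataset.from_jsonl("synthetic_dataset.jsonl")

samples = list(full_dataset.samples)
random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(samples)

split_point = int(len(samples) * 0.7)  # 70% validation, 30% test
validation_set = EvaluationDataset(samples=samples[:split_point])
test_set = EvaluationDataset(samples=samples[split_point:])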
Here are the steps we will follow:
- Prepare your docs to be in markdown format
- Create and save synthetic dataset with RAGAS
Synthetic Dataset Creation
Prepare the documents as markdown
Here, we will use Pathway parsers to parse our documents' content and save it as markdown. Then, we will create a synthetic dataset based on the file contents with gpt-4o. It is a good idea to create the synthetic data with a model that is different from the one in your application, because the LLM's bias will influence the wording, queries, and answers in your dataset, which may introduce unwanted bias in the metrics.
import os
import getpass
import pandas as pd
import pathway as pw
from pathway.xpacks.llm import parsers
Define a helper to save docs as markdown. It reads a file, parses it, and saves the result to the specified folder under the same filename.
async def document_to_markdown(
    input_path: str, output_folder: str, parser: pw.UDF = parsers.UnstructuredParser()
) -> None:
    os.makedirs(output_folder, exist_ok=True)
    # read the raw file bytes and parse them into (text, metadata) chunks
    with open(input_path, "rb") as f:
        file_bytes = f.read()
    content = await parser.func(file_bytes)
    # join the parsed chunks into a single markdown document
    file_md = "\n\n".join([split[0] for split in content])
    # save it under the same filename, with a .md extension
    md_file_name = os.path.splitext(os.path.basename(input_path))[0] + ".md"
    with open(os.path.join(output_folder, md_file_name), "w") as f:
        f.write(file_md)
MARKDOWN_FOLDER = "./markdown_docs"
INPUT_FOLDER = "./data"
Download the Alphabet 10-K report as an example PDF. Feel free to skip this step if you want to use your own documents; in that case, copy your documents to the INPUT_FOLDER.
!wget -P "$INPUT_FOLDER" "https://github.com/pathwaycom/llm-app/blob/main/examples/pipelines/gpt_4o_multimodal_rag/data/20230203_alphabet_10K.pdf"
await document_to_markdown(f"{INPUT_FOLDER}/20230203_alphabet_10K.pdf", MARKDOWN_FOLDER)
Configuring the Generations and Creating the Dataset
Now that we are done parsing the documents, let's create the synthetic dataset with RAGAS.
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(MARKDOWN_FOLDER, glob="**/*.md")
docs = loader.load()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass.getpass(
"Enter your OpenAI API key: "
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0.0))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
generator = TestsetGenerator(
llm=generator_llm,
embedding_model=generator_embeddings,
)
# generate the dataset
dataset = generator.generate_with_langchain_docs(
docs,
testset_size=20,
)
Save the dataset into a file:
dataset.to_jsonl("synthetic_dataset.jsonl")
from ragas import EvaluationDataset, SingleTurnSample
from ragas.testset.synthesizers.testset_schema import Testset
If you have a previously saved dataset, you can load it with from_jsonl:
dataset = EvaluationDataset.from_jsonl("synthetic_dataset.jsonl")
You may also download the synthetic dataset we created from the example file:
!wget -P "$MARKDOWN_FOLDER" "https://gist.githubusercontent.com/berkecanrizai/4b036863a57cd6c93c7ca497c93abe2b/raw/4569e19bfd95fad05885fee32046e0b0d5d9d2cb/synthetic_dataset.jsonl"
Have a peek at the dataset:
dataset.to_pandas().head()
Notes on repeatability: The scores presented in this notebook are averaged over three independent runs to ensure reliability. Two of these runs used the provided synthetic dataset generated with the UnstructuredParser, while the third run used the data from the PypdfParser.
We found that LLM-based evaluations can vary wildly between runs. Also note that variables such as the order of the documents, the wording of the question & answer pairs, and the choice of LLM can have a big impact on these scores. We also found that score variability and reliability are among the main weaknesses of RAGAS. We plan to repeat these experiments with deepeval in the future, stay tuned!
Launching the Pathway RAG App
Pathway's DocumentStore and BaseRAGQuestionAnswerer provide an end-to-end solution for RAG applications.
DocumentStore manages document ingestion from your data sources, as well as document processing that includes parsing, splitting, and indexing.
BaseRAGQuestionAnswerer creates a Pathway RAG application that:
- Indexes the documents (via DocumentStore)
- Exposes the question answering endpoints
Let's keep things simple and test a naive RAG solution with the following components:
- Unstructured Parser
- Token based splitter
- OpenAI embedder
- Hybrid index that combines semantic search and keyword based BM25 search
- A barebones RAG prompt
For more information, check out the documentation:
- Connectors: Use Pathway’s file reader to ingest the files.
- Parsers: Utilize the UnstructuredParser to parse the documents. This parser supports multiple file types, including PDF, DOCX, and PPTX.
- Text Splitters: Split the document content into chunks.
- Embedders: Use OpenAI API for embeddings.
- Vector/KNN Index (via BruteForceKnnFactory): Semantic index that is powered by an embedder.
- BM25 (via TantivyBM25Factory): Keyword based BM25 search.
- HybridIndexFactory: combines different indexes to build a hybrid index.
- Prompts: Prompt template for RAG.
from pathway.stdlib.indexing import BruteForceKnnFactory, HybridIndexFactory
from pathway.stdlib.indexing.bm25 import TantivyBM25Factory
from pathway.udfs import DiskCache
from pathway.xpacks.llm import embedders, llms, parsers, splitters
from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer, RAGClient
from pathway.xpacks.llm.servers import QASummaryRestServer
# read the text files under the data folder, we can also read from Google Drive, Sharepoint, etc.
# See connectors documentation: https://pathway.com/developers/user-guide/connect/pathway-connectors to learn more
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
# list of data sources to be indexed
sources = [folder]
# define the document processing steps
parser = parsers.UnstructuredParser()
text_splitter = splitters.TokenCountSplitter(min_tokens=150, max_tokens=450)
embedder = embedders.OpenAIEmbedder(cache_strategy=DiskCache())
index = BruteForceKnnFactory(embedder=embedder)
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources, parser=parser, splitter=text_splitter, retriever_factory=index
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
# create the RAG app that will power the index, and serve the agent endpoint
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
search_topk=8, # number of retrieved chunks for RAG
)
Build and Run the Pathway server
import multiprocessing
# host and port of the RAG app
pathway_host: str = "0.0.0.0"
pathway_port: int = 8000
Once the app starts, it will:
- Ingest your files
- Parse and chunk the documents
- Index the chunks
- Host the RAG endpoint for question answering
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
Start the process:
server_process.start()
RAGClient is the client that can query the Pathway RAG application.
Let's check that the files are indexed. This will list all the indexed documents in our Pathway server.
from pathway.xpacks.llm.question_answering import RAGClient
pathway_client = RAGClient(pathway_host, pathway_port)
pathway_client.pw_list_documents()
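Before running the full evaluation, you can sanity-check the endpoint with a single question. This uses the same pw_ai_answer call as the evaluation loop below; the question itself is just an example:
response = pathway_client.pw_ai_answer(
    prompt="What were Alphabet's total revenues in 2022?",
    return_context_docs=True,
)
print(response["response"])
print(len(response["context_docs"]), "context documents retrieved")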
Evaluate with the Dataset
Here, we will iterate over the samples and gather the RAG response and the context documents for each one of the test samples.
Keep in mind that LLM evaluation metrics can fluctuate between runs. Even minor details like context document order or wording can impact results. For more reliable testing, it's best to re-run tests multiple times and average the scores.
def predict_test_dataset(
    dataset: Testset | EvaluationDataset, verbose: bool = True
) -> EvaluationDataset:
    predicted_samples: list[SingleTurnSample] = []
    for sample in dataset.samples:
        single_sample = sample.eval_sample if isinstance(dataset, Testset) else sample
        if verbose:
            print(f"Predicting question: {single_sample.user_input}")
        pw_response: dict = pathway_client.pw_ai_answer(
            prompt=single_sample.user_input, return_context_docs=True
        )
        resp: str = pw_response["response"]
        context_docs: list[str] = [elem["text"] for elem in pw_response["context_docs"]]
        pred_sample = SingleTurnSample(
            response=resp, retrieved_contexts=context_docs, **single_sample.to_dict()
        )
        predicted_samples.append(pred_sample)
    return EvaluationDataset(samples=predicted_samples)
predicted_dataset = predict_test_dataset(dataset)
predicted_dataset.to_pandas().head()
from ragas import evaluate
from ragas.metrics import (
AnswerCorrectness,
Faithfulness,
context_recall,
context_precision,
)
Calculate the evaluation scores with our selected metrics.
We introduced a few modifications on top of the default RAGAS settings, namely:
- We completely ignored semantic similarity in the answer correctness; we found that it usually gives "false positives" and unnecessarily rewards bad predictions*.
- We modified the answer_correctness_metric's prompt to be more forgiving and not look for the exact same words.
- We increased the beta parameter of the correctness metric to favor recall rather than precision. We reward the LLM for having more of the relevant documents in the context, because the LLM can choose to ignore irrelevant documents (false positives in the context), which diminishes the importance of precision.
* This issue stems from the limitations of commonly used encoder models, such as those generating sentence embeddings. These models are primarily trained on tasks like document similarity and natural language inference, making them effective at identifying semantically related text but not at evaluating factual accuracy.
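For reference, the factuality part of answer correctness is an F-beta score over claims classified as true positives (TP), false positives (FP), and false negatives (FN). The tiny sketch below (not RAGAS code) illustrates why beta = 1.5 rewards recall-heavy answers more than precision-heavy ones:
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# with beta = 1 the two cases below score identically (~0.62);
# with beta = 1.5 the recall-heavy case scores higher
print(f_beta(tp=4, fp=4, fn=1, beta=1.5))  # high recall, low precision -> ~0.68
print(f_beta(tp=4, fp=1, fn=4, beta=1.5))  # high precision, low recall -> ~0.57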
def run_ragas_evaluations(dataset: EvaluationDataset):
    evaluator_llm = LangchainLLMWrapper(
        ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    )
    answer_correctness_metric = AnswerCorrectness(
        llm=evaluator_llm,
        weights=[
            1.0,
            0.0,
        ],  # ignore the semantic similarity; it is often misleading and prone to giving a high score to incorrect answers
        max_retries=3,
        beta=1.5,  # favor the recall a bit more
    )
    # adjust the evaluator LLM prompt to be more forgiving
    correctness_prompt = answer_correctness_metric.get_prompts()["correctness_prompt"]
    correctness_prompt.instruction += """ Answer may be less or more verbose than the ground truth, that is fine.
If the ground truth is 'Yes' and answer is 'Yes, [... some details]', consider it as true."""
    answer_correctness_metric.set_prompts(**{"correctness_prompt": correctness_prompt})
    metrics: list = [
        answer_correctness_metric,
        Faithfulness(llm=evaluator_llm),
        context_recall,
        context_precision,
    ]
    results = evaluate(dataset=dataset, metrics=metrics)
    return results
ragas_evals_dataset = run_ragas_evaluations(predicted_dataset)
ragas_evals_dataset
{'answer_correctness': 0.5249, 'faithfulness': 0.6275, 'context_recall': 0.9353, 'context_precision': 0.7761}
We will learn how to improve these below.
Let's inspect the evaluation metrics per question. We see that in some cases the LLM had a context recall of 1.0 but still failed to answer the question correctly. This may be an indicator of poor performance from the LLM, or of irrelevant context (if precision is low) that threw the LLM off.
pd.DataFrame(ragas_evals_dataset.scores)
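As noted earlier, LLM-based scores fluctuate between runs. A simple way to obtain more stable numbers is to repeat the prediction and evaluation a few times and average the per-metric scores, as sketched below (this reuses the helpers defined above and multiplies your LLM costs by the number of runs):
n_runs = 3
run_scores = []
for _ in range(n_runs):
    predicted = predict_test_dataset(dataset, verbose=False)
    results = run_ragas_evaluations(predicted)
    run_scores.append(pd.DataFrame(results.scores).mean())  # per-metric mean for this run

# average each metric across the independent runs
pd.concat(run_scores, axis=1).mean(axis=1)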
Terminate the app:
server_process.terminate()
server_process.join()
Clear the previous app from the Pathway engine
def clear_pathway_graph() -> None:
    from pathway.internals.parse_graph import G
    G.clear()
clear_pathway_graph()
Improving the Accuracy
A RAG application's performance is impacted by many variables. We can gather them under two broad categories that are linked together:
- Retrieval
- Generation
Retrieval performance mainly depends on the quality of the input data and on how it is parsed, chunked, and indexed.
Hybrid Index
A hybrid index combines semantic search and keyword-based BM25 search.
Pathway's HybridIndexFactory lets you combine different indexes to build a hybrid index:
- BM25 (via TantivyBM25Factory) → keyword-based BM25 search
- BruteForceKnn (via BruteForceKnnFactory) → vector-based semantic search
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
sources = [folder]
parser = parsers.UnstructuredParser()
text_splitter = splitters.TokenCountSplitter(min_tokens=150, max_tokens=450)
embedder = embedders.OpenAIEmbedder(cache_strategy=DiskCache())
hybrid_index = HybridIndexFactory(
[
TantivyBM25Factory(),
BruteForceKnnFactory(embedder=embedder),
]
)
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources, parser=parser, splitter=text_splitter, retriever_factory=hybrid_index
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
)
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
server_process.start()
pathway_client = RAGClient(pathway_host, pathway_port)
pathway_client.pw_list_documents()
predicted_dataset_hybrid_index = predict_test_dataset(dataset)
ragas_evals_dataset_hybrid_index = run_ragas_evaluations(predicted_dataset_hybrid_index)
ragas_evals_dataset_hybrid_index
{'answer_correctness': 0.5821, 'faithfulness': 0.5228, 'context_recall': 0.8966, 'context_precision': 0.8343}
We see that just by introducing hybrid retrieval, we improved the correctness metric by about 10%. Let's see if we can improve on that.
predicted_dataset_hybrid_index.to_pandas()
# terminate the Pathway app
server_process.terminate()
server_process.join()
clear_pathway_graph()
Using a Different Parser
Parsing is a crucial yet often overlooked component of RAG solutions. The quality of your retrieval depends heavily on how well your data is parsed—garbage in, garbage out. A robust parser can significantly enhance your solution, while a poor one can break it. Pathway provides several ready-to-use parsers out of the box, see the documentation. You also have the flexibility to develop and integrate custom parsers tailored to your specific needs.
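If none of the built-in parsers fit your data, you can also plug in your own. The snippet below is only a hypothetical sketch of what such a parser could look like, following the (text, metadata) tuple contract visible in the document_to_markdown helper earlier; check the Pathway parsers documentation for the exact interface DocumentStore expects:
import pathway as pw

@pw.udf
def plain_text_parser(contents: bytes) -> list[tuple[str, dict]]:
    # hypothetical minimal parser: treat the whole file as UTF-8 text and
    # return a single (text, metadata) pair, like the built-in parsers do
    text = contents.decode("utf-8", errors="ignore")
    return [(text, {})]

# such a UDF could then be passed as the parser argument of the DocumentStore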
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
sources = [folder]
parser = parsers.PypdfParser()
# pypdf parser splits documents by the page, so we don't need another splitter
text_splitter = None
embedder = embedders.OpenAIEmbedder(cache_strategy=DiskCache())
hybrid_index = HybridIndexFactory(
[
TantivyBM25Factory(),
BruteForceKnnFactory(embedder=embedder),
]
)
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources, parser=parser, splitter=text_splitter, retriever_factory=hybrid_index
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
)
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
server_process.start()
pathway_client = RAGClient(pathway_host, pathway_port)
pathway_client.pw_list_documents()
predicted_dataset_pypdf_parser = predict_test_dataset(dataset)
ragas_evals_dataset_pypdf_parser = run_ragas_evaluations(predicted_dataset_pypdf_parser)
ragas_evals_dataset_pypdf_parser
{'answer_correctness': 0.6896, 'faithfulness': 0.6609, 'context_recall': 0.9088, 'context_precision': 0.8035}
This had quite an impact! Answer correctness improved by roughly 18% over the previous best run, and by more than 30% over the initial baseline.
# terminate the Pathway app
server_process.terminate()
server_process.join()
clear_pathway_graph()
Let's Try the Same Parser with the Semantic Search Retriever
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
sources = [folder]
parser = parsers.PypdfParser()
# pypdf parser splits documents by the page, so we don't need another splitter
text_splitter = None
embedder = embedders.OpenAIEmbedder(cache_strategy=DiskCache())
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources,
parser=parser,
splitter=text_splitter,
retriever_factory=BruteForceKnnFactory(embedder=embedder),
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
)
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
server_process.start()
predicted_dataset_semantic = predict_test_dataset(dataset)
ragas_evals_dataset_semantic = run_ragas_evaluations(predicted_dataset_semantic)
ragas_evals_dataset_semantic
{'answer_correctness': 0.7026, 'faithfulness': 0.6485, 'context_recall': 0.9382, 'context_precision': 0.7915}
We see that correctness and recall both increased slightly.
# terminate the Pathway app
server_process.terminate()
server_process.join()
clear_pathway_graph()
Changing the Embedder
Up until now, we have been using OpenAI's text-embedding-ada-002 as the embedding model. You may also try the larger and better-performing text-embedding-3-large or the cheaper and smaller text-embedding-3-small models.
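Switching the OpenAI embedding model is a one-line change, assuming you pass the model name to OpenAIEmbedder as below (see the embedders documentation for the exact parameters):
from pathway.udfs import DiskCache
from pathway.xpacks.llm import embedders

# use the larger OpenAI embedding model instead of the default ada-002
embedder = embedders.OpenAIEmbedder(
    model="text-embedding-3-large",
    cache_strategy=DiskCache(),
)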
Although API-based embedders perform well and are easy to use, if you are concerned about data privacy you will need a locally hosted embedder. Pathway enables you to use local & open-source models through embedders.SentenceTransformerEmbedder.
Some of the popular open-source embedders are gte-large-en-v1.5, bge-m3, and nomic-embed-text-v1.5. We found that gte-large-en-v1.5 usually produces good results, so let's try swapping it in.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
sources = [folder]
parser = parsers.PypdfParser()
# pypdf parser splits documents by the page, so we don't need another splitter
text_splitter = None
embedder = embedders.SentenceTransformerEmbedder(
model="Alibaba-NLP/gte-large-en-v1.5",
call_kwargs={"show_progress_bar": False},
trust_remote_code=True,
)
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources,
parser=parser,
splitter=text_splitter,
retriever_factory=BruteForceKnnFactory(embedder=embedder),
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
)
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
server_process.start()
predicted_dataset_gtembedder = predict_test_dataset(dataset)
ragas_evals_dataset_gtembedder = run_ragas_evaluations(predicted_dataset_gtembedder)
ragas_evals_dataset_gtembedder
{'answer_correctness': 0.641, 'faithfulness': 0.5327, 'context_recall': 0.9357, 'context_precision': 0.8268}
Hmm, it seems this embedder doesn't work quite as well as the previous one. Interestingly, the faithfulness score dropped significantly; perhaps the ordering of the retrieved chunks is the reason.
Let's see if we can improve the performance with the prompt.
# terminate the Pathway app
server_process.terminate()
server_process.join()
clear_pathway_graph()
Changing the Prompt
The prompt is one of the most important aspects of a good RAG solution. Although we will aim for a prompt that works well in general, you may want to adapt your prompt to your users' expectations, business goals, domain knowledge, or other variables.
Now, let's keep the same embedder as above and change the prompt to be a bit more compute-intensive.
folder = pw.io.fs.read(
path=INPUT_FOLDER,
format="binary",
with_metadata=True,
)
sources = [folder]
parser = parsers.PypdfParser()
# pypdf parser splits documents by the page, so we don't need another splitter
text_splitter = None
embedder = embedders.SentenceTransformerEmbedder(
model="Alibaba-NLP/gte-large-en-v1.5",
call_kwargs={"show_progress_bar": False},
trust_remote_code=True,
)
llm = llms.OpenAIChat(model="gpt-4o", cache_strategy=DiskCache())
document_store = DocumentStore(
docs=sources,
parser=parser,
splitter=text_splitter,
retriever_factory=BruteForceKnnFactory(embedder=embedder),
)
prompt_template: str = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
Before answering the question, first think about and list the relevant parts from the given context. \
Then, answer the question based on the facts you have listed.
Always structure your responses in the following format:
Relevant contexts: [Write the relevant parts of the context for given question]
Answer: [Detailed response to the user's question that is grounded by the facts you listed]
If you don't know the answer, just say that you don't know.
Question: {query}
Context: {context}
Answer:"""
rag_app = BaseRAGQuestionAnswerer(
llm=llm,
indexer=document_store,
prompt_template=prompt_template,
)
server = QASummaryRestServer(pathway_host, pathway_port, rag_app)
server_process = multiprocessing.Process(target=server.run, kwargs=dict(threaded=False))
server_process.start()
predicted_dataset_gtembedder_semantic = predict_test_dataset(dataset)
ragas_evals_dataset_gtembedder_semantic = run_ragas_evaluations(
predicted_dataset_gtembedder_semantic
)
ragas_evals_dataset_gtembedder_semantic
{'answer_correctness': 0.7448, 'faithfulness': 0.8011, 'context_recall': 0.9471, 'context_precision': 0.8153}
We can see a clear improvement in terms of correctness. As expected, the retrieval metrics remained roughly unchanged compared to the previous run with the same embedder, since only the prompt changed.
# terminate the Pathway app
server_process.terminate()
server_process.join()
clear_pathway_graph()
Summary & Findings
There is no "one size fits all" logic when it comes to RAG. You need to find what suits you best and start working from there.
Pathway allows you to build RAG applications that are always live & up-to-date and available. Whether you are building a financial analysis tool for yourself or an internal application for lawyers, you need to think about how to update and refresh your knowledge base. With the help of dynamic connectors, you can focus on problems that matter.
We also learned that we can improve performance by using a better parsing strategy, increasing the number of retrieved chunks, or introducing hybrid retrieval rather than a purely semantic search strategy.
However, we have only explored a single RAG paradigm consisting of simple retrieval and generation. There are many more approaches left to explore! For instance, knowledge graphs can give the LLM more relevant context, and agent-driven architectures can unlock new search & retrieval capabilities that adapt, retry, or reason before taking action. Pathway can help you build such applications as well.
If you are interested in agents, you may explore our LangGraph RAG agents with Pathway cookbook and stay tuned for more content!
If you are interested in diving deeper into the topic, here are some good references to get started: