Customizing AI Pipelines with YAML configuration files

Pathway offers a number of ready-to-use and easy-to-deploy AI Pipelines. To adapt them to your needs without altering any Python code, you can configure them with YAML configuration files. Pathway uses a custom YAML parser to make configuring templates easier, and this guide explores the capabilities of this parser.

Mapping tags

The YAML format allows you to assign tags to key-value mappings by prefixing a chosen string with !. In Pathway configuration files, these tags are used to reference Python objects. In the case of callables, they are called with the arguments taken from the mapping. For example, this can be used to define a source Table using an input connector:

source: !pw.io.fs.read
  path: data
  format: binary
  with_metadata: true

Since classes are callables too, this syntax can also be used to initialize objects.

llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

While these examples refer to components from the Pathway package, you can use this syntax to import any object - in particular, your own functions and classes.
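
For instance, assuming you have a module my_components that is importable from the directory where you run the pipeline (the module and function names below are hypothetical, used only for illustration), you can reference your own callable in the same way:

# my_components is assumed to expose a function build_splitter(max_words: int)
# returning a splitter object - both names are hypothetical
splitter: !my_components.build_splitter
  max_words: 200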

Referencing Enums

If the Python object referred to with ! is not a callable, the associated mapping needs to be empty and the object is used as is. In LLM templates this is used to set the value of an argument that is an enum, e.g. BruteForceKnnFactory requires its metric argument to be a value from the pw.stdlib.indexing.BruteForceKnnMetricKind enum:

retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536

Defining Schemas

If you need to define a Schema in the YAML file, the easiest way to do it is to use one of the functions for inline schema definition. For example, defining a Schema with a field text of type str using pw.schema_from_types looks like this:

schema: !pw.schema_from_types
  text: str
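
A schema with more columns is defined in the same way. The sketch below assumes that other built-in type names, such as int and float, are resolved by the parser just like str in the example above:

schema: !pw.schema_from_types
  text: str
  # the fields below assume int and float resolve the same way as str above
  page: int
  score: float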

Variables

Identifiers starting with $ have a special meaning - they denote variables that can be referenced later in the configuration file.

$retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
  max_retries: 6

llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: $retry_strategy
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  retry_strategy: $retry_strategy
  cache_strategy: !pw.udfs.DiskCache
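
Variables are not limited to objects created with tags - they can also hold plain values, which is convenient when the same literal is needed in several places (the private RAG example later in this guide uses this for model names):

$model_name: "gpt-3.5-turbo"

llm: !pw.xpacks.llm.llms.OpenAIChat
  model: $model_name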

Example: Demo-Question-Answering

To see YAML configuration in practice, let's look at the demo-question-answering pipeline. Note that it differs from the adaptive RAG, multimodal RAG and private RAG pipelines only in its YAML configuration file - their Python code is the same.

Here is the content of app.yaml from demo-question-answering:

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

  # - !pw.xpacks.connectors.sharepoint.read 
  #   url: $SHAREPOINT_URL
  #   tenant: $SHAREPOINT_TENANT
  #   client_id: $SHAREPOINT_CLIENT_ID
  #   cert_path: sharepointcert.pem
  #   thumbprint: $SHAREPOINT_THUMBPRINT
  #   root_path: $SHAREPOINT_ROOT
  #   with_metadata: true
  #   refresh_interval: 30

  # - !pw.io.gdrive.read
  #   object_id: $DRIVE_ID
  #   service_user_credentials_file: gdrive_indexer.json
  #   name_pattern:
  #     - "*.pdf"
  #     - "*.pptx"
  #   object_size_limit: null
  #   with_metadata: true
  #   refresh_interval: 30

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

$parser: !pw.xpacks.llm.parsers.ParseUnstructured

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536
  

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store


# Change host and port by uncommenting these lines
# host: "0.0.0.0"
# port: 8000

# Cache configuration
# with_cache: true

# If `terminate_on_error` is true then the program will terminate whenever any error is encountered.
# Defaults to false, uncomment the following line if you want to set it to true
# terminate_on_error: true

This demo requires the question_answerer to be defined in the configuration file, and allows you to override the values of host, port, with_cache and terminate_on_error (a short example of these overrides is shown after the next listing). The first thing you can try is to add another input connector, e.g. one that reads files from Google Drive. The stub is already present in the file, so just uncomment it and fill in object_id and service_user_credentials_file.

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

  - !pw.io.gdrive.read
    object_id: FILL_YOUR_DRIVE_ID
    service_user_credentials_file: FILL_PATH_TO_CREDENTIALS_FILE
    name_pattern:
      - "*.pdf"
      - "*.pptx"
    object_size_limit: null
    with_metadata: true
    refresh_interval: 30

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

$parser: !pw.xpacks.llm.parsers.ParseUnstructured

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536
  

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
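
Independently of the sources, the serving options mentioned above can be overridden at the top level of the same file. Here is a minimal sketch mirroring the commented-out defaults from app.yaml:

# change host and port of the pipeline's web server
host: "0.0.0.0"
port: 8000
# cache configuration
with_cache: true
# terminate the program whenever any error is encountered (defaults to false)
terminate_on_error: true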

If you want to change the provider of the LLM models, you can change the values of $llm and $embedder. By changing $llm to LiteLLMChat with a local api_base, and $embedder to SentenceTransformerEmbedder, you obtain a local RAG pipeline that does not call any external services (this pipeline is now very similar to the private RAG from the llm-app).

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm_model: "ollama/mistral"

$llm: !pw.xpacks.llm.llms.LiteLLMChat
  model: $llm_model
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0
  top_p: 1
  format: "json"  # only available in Ollama local deploy, not usable in Mistral API
  api_base: "http://localhost:11434"

$embedding_model: "avsolatorio/GIST-small-Embedding-v0"

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
  model: $embedding_model
  call_kwargs: 
    show_progress_bar: False

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

$parser: !pw.xpacks.llm.parsers.ParseUnstructured

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536
  

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store

Alternatively, you may wish to improve the indexing capabilities by using a hybrid index. In this example you'll use a HybridIndexFactory that combines a vector-based index - BruteForceKnn - with a text-search-based index - TantivyBM25.

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

$parser: !pw.xpacks.llm.parsers.ParseUnstructured

$knn_index: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536

$bm25_index: !pw.stdlib.indexing.TantivyBM25Factory

$hybrid_index_factory: !pw.stdlib.indexing.HybridIndexFactory
  retriever_factories:
    - $knn_index
    - $bm25_index

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $hybrid_index_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store

These are just a few examples, but you can use any components from the LLM xpack to build a pipeline that fully meets your needs!