Full YAML Template Examples

This page gathers complete YAML configuration files for Pathway Live Data Framework RAG templates. For each template, you can find an example for each of the following data sources: file system, SharePoint, Google Drive, and S3. You can find more examples of how to configure the data sources on the dedicated page.

To go further, you can learn how to configure YAML templates and see our YAML examples for each component.

Adaptive RAG

Configuration of the Pathway Live Data Framework Adaptive RAG pipeline.

The Pathway Live Data Framework provides an advanced RAG technique called Adaptive RAG that lowers the cost of queries. You can find the template on GitHub. Here are the full configurations for each data source.

File System

$sources:
  # File System connector, reading data locally.
  - !pw.io.fs.read
    path: data                   # Path to the data directory
    format: binary               # Format of the data to be read
    with_metadata: true          # Include metadata in the data

# Configures the LLM model settings for generating responses.
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini" 
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6 
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  capacity: 8

# Specifies the embedder model for converting text into embeddings.
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache

# Defines the splitter settings for dividing text into smaller chunks.
$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

# Configures the parser for processing and extracting information from documents.
$parser: !pw.xpacks.llm.parsers.DoclingParser
  cache_strategy: !pw.udfs.DefaultCache

# Sets up the retriever factory for indexing and retrieving documents.
$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS

# Manages the storage and retrieval of documents for the RAG template.
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

# Configures the question-answering component using the RAG approach.
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  n_starting_documents: 2
  factor: 2
  max_iterations: 4
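
The last three settings control how much context Adaptive RAG sends to the LLM. The sketch below assumes that the question is first asked with `n_starting_documents` documents and that the count is multiplied by `factor` on each further attempt, up to `max_iterations` attempts. The helper name `adaptive_schedule` is hypothetical and is not part of Pathway; it only illustrates the arithmetic for the values used above.

```python
def adaptive_schedule(n_starting_documents: int, factor: int, max_iterations: int) -> list[int]:
    """Return the number of documents sent to the LLM at each attempt.

    Assumption: the count grows geometrically, starting from
    n_starting_documents and multiplied by factor after each attempt.
    """
    return [n_starting_documents * factor**i for i in range(max_iterations)]


# Values taken from the template above.
schedule = adaptive_schedule(n_starting_documents=2, factor=2, max_iterations=4)
print(schedule)  # [2, 4, 8, 16]
```

Under this assumption, a question that can be answered from the first two documents never pays for the larger prompts, which is where the cost reduction would come from.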

SharePoint

$sources:
  # Connect to your SharePoint data.
  - !pw.xpacks.connectors.sharepoint.read 
    url: $SHAREPOINT_URL       # URL of the SharePoint site
    tenant: $SHAREPOINT_TENANT # Tenant ID for SharePoint
    client_id: $SHAREPOINT_CLIENT_ID # Client ID for authentication
    cert_path: sharepointcert.pem # Path to the certificate file
    thumbprint: $SHAREPOINT_THUMBPRINT # Thumbprint of the certificate
    root_path: $SHAREPOINT_ROOT # Root path in SharePoint
    with_metadata: true        # Include metadata in the data
    refresh_interval: 30       # Interval to refresh data (in seconds)

# Configures the LLM model settings for generating responses.
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini" 
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6 
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  capacity: 8

# Specifies the embedder model for converting text into embeddings.
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache

# Defines the splitter settings for dividing text into smaller chunks.
$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

# Configures the parser for processing and extracting information from documents.
$parser: !pw.xpacks.llm.parsers.DoclingParser
  cache_strategy: !pw.udfs.DefaultCache

# Sets up the retriever factory for indexing and retrieving documents.
$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS

# Manages the storage and retrieval of documents for the RAG template.
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

# Configures the question-answering component using the RAG approach.
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  n_starting_documents: 2
  factor: 2
  max_iterations: 4
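
The `$retriever_factory` above selects a brute-force k-nearest-neighbors index with the cosine metric. As a rough illustration of that technique (not Pathway's implementation), the sketch below scores every stored vector against the query and keeps the best `k`. The function names and the toy two-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Compare the query with every vector and return the indices of the k most similar."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for document chunks.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = brute_force_knn([1.0, 0.1], vectors, k=2)
print(nearest)  # [0, 2]
```

Brute force compares the query with every stored embedding, so the result is exact, at a cost that grows linearly with the number of chunks.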

Google Drive

$sources:
  # Connect to your data in Google Drive
  - !pw.io.gdrive.read
    object_id: $DRIVE_ID
    service_user_credentials_file: gdrive_indexer.json
    file_name_pattern:
      - "*.pdf"
      - "*.pptx"
    object_size_limit: null
    with_metadata: true
    refresh_interval: 30

# Configures the LLM model settings for generating responses.
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini" 
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6 
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  capacity: 8

# Specifies the embedder model for converting text into embeddings.
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache

# Defines the splitter settings for dividing text into smaller chunks.
$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

# Configures the parser for processing and extracting information from documents.
$parser: !pw.xpacks.llm.parsers.DoclingParser
  cache_strategy: !pw.udfs.DefaultCache

# Sets up the retriever factory for indexing and retrieving documents.
$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS

# Manages the storage and retrieval of documents for the RAG template.
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

# Configures the question-answering component using the RAG approach.
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  n_starting_documents: 2
  factor: 2
  max_iterations: 4
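
The `$splitter` above caps each chunk at `max_tokens: 400`. The sketch below shows the general idea of splitting by token count. It is a simplification under stated assumptions: whitespace-separated words stand in for real tokenizer tokens, and the helper `split_by_token_count` is hypothetical, not the `TokenCountSplitter` implementation.

```python
def split_by_token_count(text: str, max_tokens: int) -> list[str]:
    """Cut text into consecutive chunks of at most max_tokens tokens.

    Assumption: a "token" is a whitespace-separated word. A real
    splitter would count tokens with the model's tokenizer instead.
    """
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


chunks = split_by_token_count("one two three four five", max_tokens=2)
print(chunks)  # ['one two', 'three four', 'five']
```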

S3

$sources:
  # Connect to your data in S3
  - !pw.io.s3.read
    path: $path
    format: "binary"
    aws_s3_settings: !pw.io.s3.AwsS3Settings
      bucket_name: $bucket
      region: "eu-west-3"
      access_key: $s3_access_key
      secret_access_key: $s3_secret_access_key

# Configures the LLM model settings for generating responses.
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini" 
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6 
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  capacity: 8

# Specifies the embedder model for converting text into embeddings.
$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache

# Defines the splitter settings for dividing text into smaller chunks.
$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  max_tokens: 400

# Configures the parser for processing and extracting information from documents.
$parser: !pw.xpacks.llm.parsers.DoclingParser
  cache_strategy: !pw.udfs.DefaultCache

# Sets up the retriever factory for indexing and retrieving documents.
$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS

# Manages the storage and retrieval of documents for the RAG template.
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

# Configures the question-answering component using the RAG approach.
question_answerer: !pw.xpacks.llm.question_answering.AdaptiveRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  n_starting_documents: 2
  factor: 2
  max_iterations: 4
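
Each template sets `max_retries: 6` on an exponential backoff retry strategy for LLM calls. The sketch below illustrates how the waiting time grows under exponential backoff in general. The initial delay of 1000 ms and the multiplier of 2 are illustrative assumptions, not values taken from the templates, and any random jitter is left out. The helper `backoff_delays` is hypothetical.

```python
def backoff_delays(max_retries: int, initial_delay_ms: int = 1000, backoff_factor: int = 2) -> list[int]:
    """Return the delay, in milliseconds, before each retry.

    Assumption: the delay starts at initial_delay_ms and is multiplied
    by backoff_factor after every failed attempt.
    """
    return [initial_delay_ms * backoff_factor**attempt for attempt in range(max_retries)]


delays = backoff_delays(max_retries=6)
print(delays)  # [1000, 2000, 4000, 8000, 16000, 32000]
```

With these assumed values, six retries would wait about 63 seconds in total before giving up, which leaves a rate-limited API time to recover.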