Introduction to Customizing AI Pipelines with YAML Configuration Files

In AI pipeline development, flexibility and ease of customization are crucial for innovation and learning. This module of the Free LLM Bootcamp introduces participants to YAML-based customization, showing how pipelines can be tailored to specific needs while the underlying Python code stays hidden behind a configuration file.

Why YAML for AI Pipelines?

YAML (a recursive acronym for "YAML Ain't Markup Language") is a human-readable data format designed for configuration. Pathway leverages YAML to simplify defining and customizing pipeline components. Whether you're specifying data sources, integrating LLMs, or fine-tuning retrievers, YAML keeps the process intuitive and efficient.

Key advantages of using YAML in Pathway's pipelines:

  • Clarity: YAML syntax is clean and readable, even for beginners.
  • Flexibility: Easily switch components like LLM models or embedding techniques without altering Python code.
  • Reusability: Use variables to reduce redundancy and streamline configuration.
  • Extensibility: Add or modify connectors, schemas, and strategies by simply updating the YAML file.
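The reusability point deserves a quick illustration: in Pathway's YAML dialect, any top-level entry prefixed with $ acts as a variable that can be referenced elsewhere in the file. A minimal sketch, where a single cache strategy is defined once and shared by two components:

```yaml
# Define one cache strategy...
$cache: !pw.udfs.DiskCache

# ...and reuse it for both the LLM and the embedder.
$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  cache_strategy: $cache

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: $cache
```

Changing the cache strategy in one place now updates every component that references $cache.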

How Pathway Uses YAML for Pipeline Customization

In Pathway's LLM-app, every pipeline ships with a preconfigured app.yaml file. This file abstracts away the underlying Python code and provides a customizable interface for:

  • Data Sources: Define where your data comes from, such as local files, Google Drive, or SharePoint.
  • LLM and Embedder Selection: Choose hosted models like OpenAI's GPT or locally served models (via wrappers such as LiteLLMChat) with simple YAML changes.
  • Document Processing: Configure document splitting, parsing, and indexing using strategies like token count or hybrid search.
  • Indexing and Retrieval: Use brute-force KNN, BM25, or hybrid indexing methods with a few lines of YAML.

Example: Configuring a Demo Question-Answering Pipeline

The following example shows a basic YAML configuration for a Question-Answering pipeline:

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-3.5-turbo"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DiskCache
  temperature: 0.05
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DiskCache

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store

This simple YAML file defines:

  • Data Sources: Files read in binary format from the local data directory.
  • LLM: OpenAI's GPT model with caching and retry strategies.
  • Embedder: OpenAI's text-embedding-ada-002.
  • Retriever: A brute-force KNN retriever with cosine similarity.
  • Pipeline Logic: A question-answering module built on the LLM and retriever.

Activity: Experimenting with YAML Customization

  • Customize the Data Source: Switch from a local file to Google Drive by uncommenting the provided example and updating placeholders such as object_id and service_user_credentials_file.
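For reference, a Google Drive source entry might look like the sketch below; the object_id and credentials path are placeholders you must replace with your own values:

```yaml
$sources:
  - !pw.io.gdrive.read
    object_id: "YOUR_FOLDER_OR_FILE_ID"
    service_user_credentials_file: "secrets/credentials.json"
    with_metadata: true
```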

  • Change the LLM and Embedder: Replace OpenAI GPT with a local LiteLLMChat model, and use SentenceTransformerEmbedder for embedding.
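As a sketch, the $llm and $embedder entries could become something like the following. The model names and the api_base endpoint shown here are illustrative assumptions, not fixed choices; substitute whichever models and serving setup you have available:

```yaml
$llm: !pw.xpacks.llm.llms.LiteLLMChat
  model: "ollama/mistral"               # illustrative; any LiteLLM-supported model id
  temperature: 0
  api_base: "http://localhost:11434"    # illustrative local Ollama endpoint

$embedder: !pw.xpacks.llm.embedders.SentenceTransformerEmbedder
  model: "all-MiniLM-L6-v2"             # illustrative sentence-transformers model
```

Note that swapping the embedder changes the embedding size, so the dimensions value in the retriever factory must be updated to match (all-MiniLM-L6-v2 produces 384-dimensional vectors, not 1536).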

  • Enhance Retrieval: Upgrade indexing by combining vector-based KNN with text-based BM25 using the HybridIndex.
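A sketch of what this could look like, assuming a hybrid index factory that wraps a list of retriever factories (check your Pathway version's indexing module for the exact class names and parameters):

```yaml
$knn_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.stdlib.indexing.BruteForceKnnMetricKind.COS
  dimensions: 1536

$bm25_factory: !pw.stdlib.indexing.TantivyBM25Factory

$retriever_factory: !pw.stdlib.indexing.HybridIndexFactory
  retriever_factories:
    - $knn_factory
    - $bm25_factory
```

The rest of the pipeline is unchanged: $document_store still receives $retriever_factory, so queries now benefit from both semantic and keyword matching.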

Additional Resources

For a comprehensive guide to YAML customization in Pathway's pipelines, check out the detailed article; it will deepen your understanding and provide more advanced examples to experiment with.