How it Works

This page explains, step by step, what the updated demo does under the hood. It matches the latest code from “Building with OpenAI.” There is no YAML config anymore; everything is set up directly in app.py.

High-level flow

  1. Ingest: Read files from ./data into a Pathway table (binary).
  2. Parse: Use DoclingParser to parse PDFs and other docs.
  3. Split: Chunk text with TokenCountSplitter(max_tokens=800).
  4. Embed: Create embeddings via OpenAI text-embedding-3-small, with caching + retry.
  5. Index: Build a fast vector index using USearch (cosine similarity).
  6. Generate: Use OpenAI Chat (gpt-4o) to answer questions.
  7. Serve: Expose a REST API via the built-in webserver.

Project layout (updated)

  • app.py — the only app entrypoint (no app.yaml).
  • requirements.txt — Python deps for local and Docker runs.
  • Dockerfile + docker-compose.yml — containerized run.
  • .env — holds secrets and ports:
OPENAI_API_KEY=<your_openai_api_key>
PATHWAY_PORT="8000"
UI_PORT="8501"
  • data/ — put your PDFs and files here.

Key components in app.py
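
The snippets below assume imports along these lines. The module paths follow recent Pathway releases and may differ slightly in your installed version:

import os

import pathway as pw

# USearchMetricKind is sometimes exposed via pathway.engine instead,
# depending on the Pathway version
from pathway.stdlib.indexing import UsearchKnnFactory, USearchMetricKind
from pathway.udfs import DefaultCache, ExponentialBackoffRetryStrategy
from pathway.xpacks.llm import embedders, llms, parsers, splitters
from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer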

# 1) Read documents
folder = pw.io.fs.read(path="./data", format="binary", with_metadata=True)
sources = [folder]

# 2) Parse
parser = parsers.DoclingParser(async_mode="fully_async", chunk=False)

# 3) Split
text_splitter = splitters.TokenCountSplitter(max_tokens=800)

# 4) Embed (with cache + retry)
embedder = embedders.OpenAIEmbedder(
    model="text-embedding-3-small",
    cache_strategy=DefaultCache(),
    retry_strategy=ExponentialBackoffRetryStrategy(),
)

# 5) Index (USearch, cosine)
index = UsearchKnnFactory(
    reserved_space=1000,
    embedder=embedder,
    metric=USearchMetricKind.COS,
)

# 6) LLM (answering)
llm = llms.OpenAIChat(
    model="gpt-4o",
    cache_strategy=DefaultCache(),
    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=2),
    temperature=0,
    capacity=8,
)

# 7) Server config
pathway_host = "0.0.0.0"
pathway_port = int(os.environ.get("PATHWAY_PORT", 8000))

# 8) Document Store
doc_store = DocumentStore(
    docs=sources,
    splitter=text_splitter,
    parser=parser,
    retriever_factory=index,
)

# 9) RAG app
rag_app = BaseRAGQuestionAnswerer(llm=llm, indexer=doc_store)

# 10) REST API
rag_app.build_server(host=pathway_host, port=pathway_port)
rag_app.run_server(with_cache=True, terminate_on_error=True)
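
Once the server is running you can query it over HTTP. Below is a minimal client sketch using the requests package; the /v2/answer route is an assumption based on recent llm-app demos (older releases expose /v1/pw_ai_answer instead), so check the demo's README if the request 404s.

# minimal client sketch; the endpoint path is an assumption, see the note above
import requests

resp = requests.post(
    "http://localhost:8000/v2/answer",
    json={"prompt": "What do the documents in ./data say about quarterly revenue?"},
    timeout=120,
)
print(resp.json())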

Why these choices?

  • DoclingParser handles complex PDFs (tables, multi-column) more reliably than the old “unstructured” flow.
  • USearch gives a fast, scalable vector index with cosine similarity.
  • TokenCountSplitter(800) balances chunk size and retrieval quality.
  • OpenAI gpt-4o as the chat model for strong reasoning over retrieved context.
  • DefaultCache + ExponentialBackoffRetryStrategy improve stability and reduce token costs.

Live updates (no restarts)

Drop a new file into data/ and the pipeline will auto-detect, parse, chunk, embed, and index it.
You don’t have to restart the server—your knowledge base stays fresh.
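
For example, copy a new PDF into data/ and, once the embedding logs quiet down, confirm it shows up in the index. A sketch (the file name is hypothetical, and the list-documents route name varies between llm-app versions):

# sketch: add a document and confirm it was indexed, with no restart
import shutil
import requests

shutil.copy("q3_report.pdf", "./data/q3_report.pdf")  # hypothetical new file
# give the pipeline a moment to parse, chunk, embed, and index it, then:
docs = requests.post("http://localhost:8000/v1/pw_list_documents", json={}).json()
print(docs)  # a list of metadata entries; exact fields depend on the connector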

Common tweaks

  • Change LLM: swap model="gpt-4o" for another OpenAI chat model (see the sketch below).
  • Change embedder: choose a different embedding model supported by Pathway.
  • Adjust chunking: tune max_tokens in TokenCountSplitter.
  • Index capacity: increase reserved_space in UsearchKnnFactory.
  • Ports: set PATHWAY_PORT in .env to avoid conflicts.
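
A few of these tweaks together, as a sketch (the values are illustrative, not recommendations):

# illustrative values only
llm = llms.OpenAIChat(model="gpt-4o-mini", temperature=0)     # cheaper, faster chat model
text_splitter = splitters.TokenCountSplitter(max_tokens=400)  # smaller chunks
index = UsearchKnnFactory(
    reserved_space=10_000,                                    # headroom for more chunks
    embedder=embedder,
    metric=USearchMetricKind.COS,
)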

Troubleshooting

  • Port in use: set a free PATHWAY_PORT in .env.
  • Waiting on embeddings: logs will show minibatch progress; once they stop, the app is ready.
  • Missing key: ensure OPENAI_API_KEY is present in .env and the process has it loaded.
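
If you run app.py directly (outside Docker), one way to be sure the process actually sees the values from .env is to load them explicitly at the top of app.py. A minimal sketch, assuming python-dotenv is installed:

# sketch: load .env and fail fast if the key is missing (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"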

Conclusion

This demonstrates how to set up a powerful RAG pipeline whose knowledge stays up to date. We've only scratched the surface; there's more to explore:

  • Re-ranking: Prioritize the most relevant results for your specific query.
  • Knowledge Graphs: Leverage relationships between entities to improve understanding.
  • Hybrid Indexing: Combine different indexing strategies for optimal retrieval.
  • Adaptive Reranking: Iteratively enlarge the context for optimal accuracy; see our next tutorial on adaptive RAG.

Stay tuned for future examples exploring these RAG techniques with Pathway!

Enjoy building your RAG project! If you have any questions or need further assistance, feel free to reach out to the Pathway team or check with your peers from the bootcamp cohort.

What if you want to use a multimodal LLM like GPT-4o?

That's a great idea. Multimodal LLMs like GPT-4o excel at parsing images, charts, and tables, which can significantly improve accuracy even in otherwise text-based use cases.

For example, imagine you're building a RAG project with Google Drive as a data source, and that Drive folder contains financial documents full of charts and tables. The example below shows how Pathway parses tables as images and uses GPT-4o to produce a much more accurate response.
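
Conceptually, the main change is that the parser hands images of tables and figures to a multimodal model instead of relying on plain-text extraction. Creating that model looks just like the chat LLM above; how it gets wired into the parser is specific to the linked example, so treat the snippet below as a pointer rather than the exact API:

# the multimodal model is just another OpenAIChat instance; see the linked
# gpt_4o_multimodal_rag example for how the parser is configured to send
# table and figure images to it
multimodal_llm = llms.OpenAIChat(model="gpt-4o", temperature=0)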

gpt_4o_multimodal_rag example (GitHub)