How it Works
This page explains, step by step, what the updated demo does under the hood. It matches the latest code from “Building with OpenAI.” There is no YAML config anymore; everything is set up directly in app.py.
High-level flow
- Ingest: Read files from ./data into a Pathway table (binary).
- Parse: Use DoclingParser to parse PDFs and other docs.
- Split: Chunk text with TokenCountSplitter(max_tokens=800).
- Embed: Create embeddings via OpenAI text-embedding-3-small, with caching + retry.
- Index: Build a fast vector index using USearch (cosine similarity).
- Generate: Use OpenAI Chat (gpt-4o) to answer questions.
- Serve: Expose a REST API via the built-in webserver.
Project layout (updated)
- app.py — the only app entrypoint (no app.yaml).
- requirements.txt — Python deps for local and Docker runs.
- Dockerfile + docker-compose.yml — containerized run.
- .env — holds secrets and ports (see the loading sketch after this list):
  OPENAI_API_KEY=<your_openai_api_key>
  PATHWAY_PORT="8000"
  UI_PORT="8501"
- data/ — put your PDFs and files here.
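For reference, here is a minimal sketch of how app.py can pull these values into the process. It assumes the python-dotenv package (an assumption on our part; if you don't use it, export the variables in your shell or via docker-compose instead):

# Minimal sketch: load .env values before building the pipeline
# (assumes the python-dotenv package; otherwise export the variables in your shell)
import os
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY, PATHWAY_PORT, UI_PORT from ./.env

openai_key = os.environ["OPENAI_API_KEY"]  # required by the OpenAI embedder and chat model
pathway_port = int(os.environ.get("PATHWAY_PORT", 8000))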
Key components in app.py
# Imports (module paths may vary slightly across Pathway versions)
import os

import pathway as pw
from pathway.udfs import DefaultCache, ExponentialBackoffRetryStrategy
from pathway.xpacks.llm import embedders, llms, parsers, splitters
from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer
from pathway.stdlib.indexing import UsearchKnnFactory
from pathway.engine import USearchMetricKind

# 1) Read documents from ./data as a binary table
folder = pw.io.fs.read(path="./data", format="binary", with_metadata=True)
sources = [folder]

# 2) Parse PDFs and other documents
parser = parsers.DoclingParser(async_mode="fully_async", chunk=False)

# 3) Split parsed text into chunks of at most 800 tokens
text_splitter = splitters.TokenCountSplitter(max_tokens=800)

# 4) Embed chunks (with cache + retry)
embedder = embedders.OpenAIEmbedder(
    model="text-embedding-3-small",
    cache_strategy=DefaultCache(),
    retry_strategy=ExponentialBackoffRetryStrategy(),
)

# 5) Vector index (USearch, cosine similarity)
index = UsearchKnnFactory(
    reserved_space=1000,
    embedder=embedder,
    metric=USearchMetricKind.COS,
)

# 6) LLM used for answering
llm = llms.OpenAIChat(
    model="gpt-4o",
    cache_strategy=DefaultCache(),
    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=2),
    temperature=0,
    capacity=8,
)

# 7) Server config
pathway_host = "0.0.0.0"
pathway_port = int(os.environ.get("PATHWAY_PORT", 8000))

# 8) Document store: parse -> split -> embed -> index
doc_store = DocumentStore(
    docs=sources,
    splitter=text_splitter,
    parser=parser,
    retriever_factory=index,
)

# 9) RAG question answerer on top of the document store
rag_app = BaseRAGQuestionAnswerer(llm=llm, indexer=doc_store)

# 10) REST API
rag_app.build_server(host=pathway_host, port=pathway_port)
rag_app.run_server(with_cache=True, terminate_on_error=True)
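Once run_server is up, you can query the API over HTTP. Below is a minimal client sketch; the answer endpoint path and payload shape vary between Pathway versions, so the /v1/pw_ai_answer path and the "prompt" field used here are assumptions (check the docs for your installed version).

# Minimal client sketch (endpoint path and payload are assumptions; adjust to your Pathway version)
import requests

resp = requests.post(
    "http://localhost:8000/v1/pw_ai_answer",
    json={"prompt": "What do the documents in ./data say?"},
    timeout=120,
)
print(resp.json())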
Why these choices?
- DoclingParser handles complex PDFs (tables, multi-column) more reliably than the old “unstructured” flow.
- USearch gives a fast, scalable vector index with cosine similarity.
- TokenCountSplitter(max_tokens=800) balances chunk size and retrieval quality.
- OpenAI gpt-4o is used as the chat model for strong reasoning over retrieved context.
- DefaultCache + ExponentialBackoffRetryStrategy improve stability and reduce token costs.
Live updates (no restarts)
Drop a new file into data/ and the pipeline will auto-detect, parse, chunk, embed, and index it. You don’t have to restart the server; your knowledge base stays fresh.
Common tweaks
- Change LLM: swap model="gpt-4o" to another OpenAI chat model (a few of these tweaks are sketched after this list).
- Change embedder: choose a different embedding model supported by Pathway.
- Adjust chunking: tune max_tokens in TokenCountSplitter.
- Index capacity: increase reserved_space in UsearchKnnFactory.
- Ports: set PATHWAY_PORT in .env to avoid conflicts.
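A minimal sketch of a few such tweaks, using the same constructors as in app.py (the model name and values below are illustrative, not recommendations):

# Example tweaks (illustrative values; keep the rest of app.py unchanged)
llm = llms.OpenAIChat(model="gpt-4o-mini", temperature=0)     # a cheaper chat model
text_splitter = splitters.TokenCountSplitter(max_tokens=400)  # smaller chunks
index = UsearchKnnFactory(
    reserved_space=10_000,                                    # room for more chunks
    embedder=embedder,
    metric=USearchMetricKind.COS,
)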
Troubleshooting
- Port in use: set a free PATHWAY_PORT in .env.
- Waiting on embeddings: logs will show minibatch progress; once they stop, the app is ready.
- Missing key: ensure OPENAI_API_KEY is present in .env and the process has it loaded (a quick preflight check is sketched after this list).
Conclusion
This demonstrates how to set up a powerful RAG pipeline whose knowledge base stays up to date automatically. While we've only scratched the surface, there's more to explore:
- Re-ranking: Prioritize the most relevant results for your specific query.
- Knowledge Graphs: Leverage relationships between entities to improve understanding.
- Hybrid Indexing: Combine different indexing strategies for optimal retrieval.
- Adaptive Reranking: Iteratively enlarge the retrieved context for optimal accuracy; see the next tutorial on adaptive RAG.
Stay tuned for future examples exploring these RAG techniques with Pathway!
Enjoy building your RAG project! If you have any questions or need further assistance, feel free to reach out to the Pathway team or check with your peers from the bootcamp cohort.
What if you want to use a Multimodal LLM like GPT-4o?
That's a great idea. Multimodal LLMs like GPT-4o excel at parsing images, charts, and tables, which can significantly improve accuracy even for text-based use cases.
For example, imagine you're building a RAG project with Google Drive as a data source, and that Drive folder contains financial documents full of charts and tables. Below is an example showing how Pathway parses tables as images and uses GPT-4o to produce a much more accurate response.