Gemini 2.0 for Document Ingestion and Analytics with Pathway

Saksham Goel
·Berke Can Rizai
·February 20, 2025

Most LLM showcases emphasize question-answering on existing data. Yet, ingesting raw documents is often a bigger challenge—especially for slides or PDFs with both text and visuals.

With Gemini 2.0, you can convert streaming slides or PDFs into structured, query-ready data. Its multimodal capabilities handle OCR-style extraction, while Pathway handles subsequent chunking, indexing, and retrieval to power real-time analytics. This approach leverages an existing open-source AI pipeline—Multimodal RAG with Pathway—which showcases how to parse and index unstructured financial documents, including tables and images, using a vision language model.

In this article, you'll learn how to plug Gemini 2.0 into a real-time RAG pipeline built with Pathway, enabling you to power accurate, context-aware decisions on constantly changing data. This setup allows you to seamlessly integrate a "write path" (ingestion) and "read path" (query), simplifying your entire data flow with Pathway.

Benefits of Using Pathway for Document Ingestion and Analytics with Gemini 2.0

1. Unified Pipeline (Write Path + Read Path)

  • Write Path: Acquire and prepare knowledge at ingestion time (parse PDFs, PPTX, etc.). Break them into meaningful chunks, embed them, and index them immediately.
  • Read Path: At query time, retrieve the most relevant chunks and let the LLM compose a final answer. Pathway orchestrates both steps in a single solution, as the sketch below shows.
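
In code, both paths are a handful of Pathway components wired together. Below is a minimal Python sketch using the same classes the YAML template later in this article configures; the parameter values are illustrative, not a definitive setup.

import pathway as pw
from pathway.stdlib.indexing import BruteForceKnnFactory
from pathway.xpacks.llm import embedders, llms, parsers, splitters
from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer

# Write path: ingest, parse, chunk, embed, and index documents as they arrive.
docs = pw.io.fs.read("data", format="binary", with_metadata=True)
parser = parsers.SlideParser()  # vision-based parsing of PDF/PPTX pages
splitter = splitters.TokenCountSplitter(min_tokens=200, max_tokens=750)
embedder = embedders.GeminiEmbedder(model="models/embedding-001")
retriever = BruteForceKnnFactory(
    reserved_space=1000,
    embedder=embedder,
    metric=pw.engine.BruteForceKnnMetricKind.COS,
)
store = DocumentStore(docs=docs, parser=parser, splitter=splitter, retriever_factory=retriever)

# Read path: retrieve the most relevant chunks and let the LLM compose the answer.
rag = BaseRAGQuestionAnswerer(llm=llms.OpenAIChat(model="gpt-4o"), indexer=store)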

2. Seamless Multimodal Parsing

  • Vision-Based Approach: Pathway's SlideParser converts each slide or PDF page into an image, letting the LLM analyze text, tables, charts, and other visual elements in a single step.
  • Simplified Architecture: No need for separate OCR or specialized table-parsing modules—it's all handled in the same pipeline.

3. Live Indexing

  • Continuous Updates: As new documents arrive, Pathway automatically updates the vector store, ensuring your RAG system always reflects the most current data.

4. Scalability and Flexibility

  • Streaming Architecture: Pathway orchestrates concurrency, error handling, and transformations, even with high-volume or constantly updating documents.
  • Customizable Templates: The entire pipeline is based on an existing YAML-defined template that you can adapt to use Gemini. You can configure data sources, switch from a vector to a hybrid index, or tweak other steps with just a one-line change, as illustrated below. This makes it easier to set up new projects or update multiple configurations at once.
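
For example, using the bindings defined in the app.yaml shown later in this article, reverting from the hybrid index to a pure vector index is a single line:

$retriever_factory: $knn_index  # pure vector index instead of the hybrid default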

5. Minimal Code Overhead

  • Concise Integration: A few lines of code handle ingestion, parsing, and embedding, and serve a retrieval-augmented generation (RAG) endpoint.
  • Focus on Accuracy: By reducing boilerplate, you can concentrate on improving knowledge retrieval instead of juggling multiple services or complex plumbing.

How to Use Pathway & Gemini 2.0 for OCR: Sample Code Walkthrough

Below is an example adapted from Pathway's gpt_4o_multimodal_rag pipeline, showing how you can parse PPTX/PDF files, generate textual chunks, embed them, and store them for retrieval in a single, integrated pipeline.

Important: SlideParser requires a Pathway license key. If you haven’t already, request your free license key to unlock SlideParser and other enterprise features. The application will be updated with this key in Step 5.

  • Pathway: Synchronizes and indexes data in real-time, orchestrating concurrency and streaming.
  • SlideParser: Converts PPTX/PDF files to images and uses a vision-capable LLM for parsing.
  • Gemini 2.0: Accessed through litellm and google.generativeai for OCR-like extraction and chunking.

Pathway acts as an end-to-end RAG orchestrator, wrapping data ingestion, streaming, and real-time indexing into one containerized pipeline that scales effortlessly from a laptop to enterprise deployments.
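
To make the OCR step concrete, here is a hedged, standalone sketch of the kind of request SlideParser issues for each page: a base64-encoded page image plus a parsing prompt, sent to Gemini 2.0 through litellm. The image path and prompt text are illustrative.

import base64

from litellm import completion

# Encode a rendered page image (illustrative file name).
with open("slide_page_1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = completion(
    model="gemini/gemini-2.0-flash",
    temperature=0,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Apply OCR to the following page and respond in markdown. Format tables as markdown tables."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # markdown transcription of the page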

Architecture diagram

Step 1: Clone the LLM App Templates Repository

Clone the llm-app repository from GitHub. This repository contains all the files you need.

git clone https://github.com/pathwaycom/llm-app.git

If you receive an error because an older version of the repository exists, navigate to the correct directory and update it using:

git pull

Step 2: Navigate to the Multimodal Project Directory

Change to the directory where the example is located:

cd examples/pipelines/gpt_4o_multimodal_rag

Step 3: Modify Dockerfile

Below is an updated Dockerfile that replaces the default dependencies (python3-opencv, tesseract-ocr, etc.) with just poppler-utils and libreoffice, reducing the container footprint while still covering multimodal parsing needs.

FROM pathwaycom/pathway:latest

WORKDIR /app

RUN apt-get update && apt-get install -y \
    poppler-utils \
    libreoffice \
    && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*

COPY . .

CMD ["python", "app.py"]

Step 4: Modify the app.yaml File

In the default YAML configuration, the pipeline uses GPT-4o for language tasks and DoclingParser for document parsing. The snippet below replaces these defaults to integrate Gemini 2.0 for OCR-like parsing, updates the prompt to better handle slide images, and switches the embedder to GeminiEmbedder.

By default, documents are read from a local data folder (see $sources in the YAML). If files need to be pulled from other sources—such as SharePoint, Google Drive, or S3—Pathway allows seamless connector switching by adding or replacing the relevant I/O block.
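
For instance, swapping the local filesystem source for a Google Drive folder means replacing the $sources block. A hedged sketch (the folder ID and credentials path are placeholders; check Pathway's connector documentation for the exact parameters):

$sources:
  - !pw.io.gdrive.read
    object_id: "your-drive-folder-id"
    service_user_credentials_file: "credentials.json"
    with_metadata: true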

app.yaml
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$parser_llm: !pw.xpacks.llm.llms.LiteLLMChat
  model: "gemini/gemini-2.0-flash"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 2
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0

$parse_prompt: |
  Apply OCR to the following page and respond in markdown.
  Tables should be formatted as markdown tables. Make sure to include table information such as title in a readable format.
  Spell out all the text that is on the page.

$embedder: !pw.xpacks.llm.embedders.GeminiEmbedder
  model: "models/embedding-001"
  cache_strategy: !pw.udfs.DefaultCache
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 3

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  min_tokens: 200
  max_tokens: 750

$parser: !pw.xpacks.llm.parsers.SlideParser
  llm: $parser_llm
  parse_prompt: $parse_prompt
  image_size: !!python/tuple [800, 1200]
  cache_strategy: !pw.udfs.DefaultCache

$knn_index: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.engine.BruteForceKnnMetricKind.COS

$bm25_index: !pw.stdlib.indexing.TantivyBM25Factory

$retriever_factory: !pw.stdlib.indexing.HybridIndexFactory
  retriever_factories:
    - $knn_index
    - $bm25_index
  
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 2
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  verbose: true

$prompt_template: |
  You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
  Question: {query} 

  Context: {context}

  Answer:

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
  prompt_template: $prompt_template
  # Optionally, you can adjust the number of documents included in the context
  # search_topk: 6

Key Updates

  • Parser & Prompt: Switch from the default DoclingParser to SlideParser for OCR-like handling of PPTX/PDF pages, along with a new parse_prompt that ensures table data is captured.
  • Embedder: Use GeminiEmbedder instead of an OpenAI-based embedder to integrate Gemini 2.0 capabilities.
  • LLM Choice: A new LiteLLMChat instance references gemini/gemini-2.0-flash for OCR tasks, while the QA step continues using gpt-4o.
  • Removed Unused Services: Comments for SharePoint or GDrive imports are removed for clarity, leaving a minimal setup focused on local files.

This configuration ensures your pipeline is optimized for vision-based parsing using Gemini 2.0 in tandem with Pathway’s real-time indexing and retrieval.

Step 5: Obtain and Update the Pathway License Key for SlideParser

Pathway is an open-source framework that provides core functionalities for free. However, to use advanced features like SlideParser, you need a Pathway license key. This key unlocks additional enterprise-grade capabilities such as enhanced RAM limits, enterprise connectors (e.g., SharePoint, Delta Table, Iceberg), full persistence, and monitoring.

To obtain your free license key, visit the Pathway License Key Page and follow the instructions.

Once you have the key, update it in app.py by replacing the existing demo key:

pw.set_license_key("your-license-key-here")

This ensures that SlideParser and other advanced Pathway features are enabled in your application.

Step 6: Update the .env File with Your API Keys

Rename the .env.example file in the project directory to .env and update it with your keys:

.env
GEMINI_API_KEY=
GOOGLE_API_KEY=
OPENAI_API_KEY=

Save the file after making the changes.

Step 7: Running the Project

Locally

If you are using Windows, refer to the Docker instructions in the next section. For a local run, first install the dependencies:

pip install -r requirements.txt

Then, start the app:

python app.py

With Docker

Build and run the Docker image. Note that this step might take a few minutes.

Build the Docker Image:
docker build -t rag .

Run the Docker Container:

Mount the data folder and expose port 8000.

  • For Windows:
    docker run -v "%cd%/data:/app/data" -p 8000:8000 rag
    
  • For Linux/Mac:
    docker run -v "$(pwd)/data:/app/data" -p 8000:8000 --env-file .env rag
    

This will start the pipeline and the UI for asking questions.

Step 8: Querying the Pipeline

Once your service is running on your chosen host and port (by default, 0.0.0.0:8000), you can test the service using curl.

Check the Indexed Files

Make a POST request to list the files currently indexed:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/pw_list_documents' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json'

You should receive a response similar to:

[{"modified_at": 1715765613, "owner": "saksham", "path": "data/20230203_alphabet_10K.pdf", "seen_at": 1715768762}]

If you add or remove files from the connected folder, repeat the request to see the updated index. The service logs will display the progress of indexing new and modified files.

Ask a Question

Test the retrieval-augmented generation (RAG) capability by asking a question about a table within a report. For example, run the following command:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/pw_ai_answer' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "How much was Operating lease cost in 2021?" 
}'

You should receive a correct response such as:

{"response": "$2,699 million"}

The initial LLM parsing step allows the system to include the relevant table data in the context, enabling accurate answers where other RAG applications might struggle.


Understanding Your RAG Pipeline

  1. Data Ingestion

Pathway reads files from ./data as binary streams, ready for live updates.

  2. Document Parsing

PDF pages and PPTX slides are converted to images, and the LLM is prompted to extract text, tables, and other elements.

  3. Chunking & Embedding

The parsed text is split into semantic chunks and embedded, with Pathway storing these embeddings in an integrated vector store.

  4. Indexing & Querying

For queries, Pathway retrieves relevant chunks, then an LLM composes the final answer. The entire flow is “live,” so newly ingested docs are instantly queryable.
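
Because the flow is live, the endpoints from Step 8 can also be called from code as documents change. For example, the earlier question, issued from Python:

import requests

resp = requests.post(
    "http://0.0.0.0:8000/v1/pw_ai_answer",
    json={"prompt": "How much was Operating lease cost in 2021?"},
    timeout=60,
)
print(resp.json())  # e.g. {"response": "$2,699 million"}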

SlideParser Parameters Overview

Here are the parameters for the SlideParser:

class SlideParser(
    llm=DEFAULT_VISION_LLM,
    parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT,
    detail_parse_schema=None,
    include_schema_in_text=False,
    intermediate_image_format='jpg',
    image_size=(1280, 720),
    run_mode='parallel',
    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),
    cache_strategy=None,
)

Parameters:

  • llm: The LLM used for parsing images (must support image inputs).
  • parse_prompt: Prompt fed to the LLM to guide parsing.
  • detail_parse_schema: An optional Pydantic schema for a deeper second-pass parse (if needed).
  • include_schema_in_text: If True, merges the schema parse into the text output—handy for search or referencing.
  • intermediate_image_format: Format for intermediate slides (“jpg” by default).
  • image_size: Tuple of (width, height) in pixels for generating images.
  • run_mode: "parallel" or "sequential." Parallel is faster, but sequential can reduce timeouts or memory issues.
  • retry_strategy: Recommended for robust calls to proprietary LLMs.
  • cache_strategy: Optional caching mechanism for speed-ups.

For more details, visit the SlideParser API documentation.
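
As an illustration, a SlideParser tuned for taller slide pages and more conservative execution might be configured like this (values are illustrative, not recommendations):

import pathway as pw
from pathway.xpacks.llm import llms, parsers

parser = parsers.SlideParser(
    llm=llms.LiteLLMChat(model="gemini/gemini-2.0-flash", temperature=0),
    image_size=(800, 1200),           # (width, height) in pixels
    intermediate_image_format="jpg",
    run_mode="sequential",            # slower, but reduces timeouts and memory pressure
    retry_strategy=pw.udfs.ExponentialBackoffRetryStrategy(max_retries=6),
    cache_strategy=pw.udfs.DefaultCache(),
)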

Ingesting Millions of PDFs: Why Gemini 2.0 is a Game Changer

A common pain point in real-world doc ingestion is handling non-trivial layouts—tables, images, multilingual text, etc. Many approaches require orchestrating multiple specialized models for layout detection and table parsing (e.g., Kubernetes clusters with multiple GPU services). This can get expensive and complicated, particularly when scaling to millions of documents.

Gemini 2.0 flips that equation: it merges near-perfect OCR and chunking performance with far better cost-efficiency than older solutions. Pathway then syncs these parsed documents continuously, keeping your retrieval pipeline accurate even with large or fast-changing data volumes.

More on Table Extraction

  • Table Extraction remains the toughest challenge. Real-world PDF table layouts are unpredictable. Gemini 2.0 often handles the content well, though some minor “structural” variations can appear.
  • Bounding Boxes: For exact positions within a PDF (e.g. highlighting an original location), bounding boxes aren't yet perfectly supported by Gemini's vision understanding. However, this remains a solvable gap as LLMs gain more robust layout training.

In short, Gemini 2.0 plus Pathway brings us closer to a future where document parsing is nearly effortless, bridging the gap between high accuracy, streamlined orchestration, and cost feasibility.


Key Takeaways & Conclusion

  • LLM-Based Vision Parsing: SlideParser plus Gemini 2.0 enables single-step handling of text, images, tables, and layout elements—without the hassle of multiple OCR models.
  • Integrated Pipeline: Pathway unifies “write path” ingestion (doc-to-chunks indexing) and “read path” querying (RAG), reducing complexity and overhead.
  • Scale & Affordability: Gemini 2.0's improved pricing and accuracy make large-scale PDF ingestion far more economical than older solutions. Future-Proofing: While bounding box accuracy is still evolving, the trend points toward more robust layout understanding from next-gen LLMs.

By leveraging Gemini 2.0 for OCR-like parsing and Pathway for real-time ingestion, live indexing, and dynamic retrieval, you can reduce complexity in your tech stack while powering accurate, context-aware decisions. Whether you're ingesting millions of pages or handling a steady trickle of updates, this pipeline strategy ensures data remains synchronized and analysis-ready—without costly stitching of multiple microservices.

Are you looking to build an enterprise-grade RAG app?

Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. If you'd like to explore how Pathway can support your RAG and Generative AI initiatives, we invite you to schedule a discovery session with our team.

Schedule a 15-minute demo with one of our experts to see how Pathway can be the right solution for your enterprise needs.


If you'd like to explore more, check out Pathway's documentation, or feel free to reach out about customizing this approach for your workflow. Happy building!

Saksham Goel

Developer Relations Engineer

Berke Can Rizai
