
Real-Time Multimodal Data Processing with Pathway and Docling

Saksham Goel
·Published May 30, 2025 ·Updated May 30, 2025

The Challenge of Multimodal Data Processing in Finance

In today’s landscape, data isn’t just text – it’s multimodal. Financial institutions grapple with information streaming in as PDFs, images (charts, scanned documents), audio (earnings calls, podcasts), video (webinars), and more. However, processing this multimodal data in real time is a daunting task. Each format speaks its own “language,” and traditional pipelines struggle to unify them fast enough for actionable insights.

In finance, market-moving insights might be buried in a CEO’s spoken words on an earnings call (audio), a complex table in a quarterly report (PDF), or even a chart from an explainer video.

Why is real-time multimodal processing so difficult? It boils down to complexity on multiple fronts:

  • Diverse Formats: Each modality requires different parsing techniques. For example, PDFs may contain embedded text and images, requiring text extraction and OCR; audio needs speech-to-text transcription; images (like charts or diagrams) need computer vision to interpret.
  • Volume & Velocity: Data streams in high volume and speed (e.g. live news feeds, sensor streams). Batching for offline analysis isn’t an option – you need streaming ingestion and processing.
  • Asynchronous & Unstructured: A breaking news alert might arrive as text while simultaneously a detailed report comes in PDF. These must be correlated and indexed together. Unstructured content (free text, visuals) lacks the schema that databases have, making real-time structuring a must.
  • Semantic Understanding: Even after extraction, making sense of the data (e.g. linking a chart image to its textual description or understanding an audio quote in context) is hard. Multimodal retrieval-augmented generation (RAG) demands semantic alignment across formats.

Simply put, multimodal data is complex to parse and integrate in real time. Many organizations resort to batch jobs or siloed tools – which is slow and error-prone. In high-stakes domains like finance, these delays and gaps mean lost opportunities and unmanaged risks.

How Do You Achieve Multimodal Data Processing with Pathway and Docling?

Pathway and Docling tackle real-time data challenges with a powerful combination: a high-performance streaming framework and an advanced multimodal parser. Designed with privacy and developers in mind, this duo delivers fast, reliable, and flexible data processing. Here are the main advantages:

  • Pathway is a Python framework for building a Live AI layer (combining real-time intelligence with AI) in enterprise applications. With Pathway, you can stream data changes the moment they happen, whether from Kafka topics or cloud storage events (see the sketch after this list). For RAG use cases, this means you can index documents and their changes on the fly—your indexes are always up-to-date with the latest filings or news. It is self-hostable, making it ideal for privacy-sensitive deployments.
  • Docling is an open-source document processing toolkit (from IBM) designed to convert unstructured files into structured data for AI applications. Docling parses each document as it flows through the stream, converting your PDFs, Word files, slides, images, and HTML into clean, developer-friendly formats such as JSON or Markdown. It relies on layout-aware AI models (DocLayNet, TableFormer) instead of brute-force OCR, preserving headings, tables, figures, and multi-column text while remaining lightweight enough for a laptop or modest server and flexible enough to swap in custom models.
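
To give a feel for the streaming side, here is a minimal Pathway ingestion sketch using the same filesystem connector the full pipeline configures later in this tutorial (the local data directory is an assumption; you could equally plug in a Kafka or cloud-storage connector):

import pathway as pw

# Watch a local directory as a live stream: new or modified files flow
# through the pipeline automatically, keeping downstream indexes fresh.
files = pw.io.fs.read("data", format="binary", with_metadata=True)

# Parsing, indexing, and serving would be wired up here before starting:
pw.run()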

Joint Value for Teams

  • Speed: Stream ingestion + incremental parsing keeps knowledge bases minutes—or seconds—fresh.
  • Privacy: Both tools run on-premises or in a private Virtual Private Cloud (VPC); no data is sent to third-party APIs.
  • Flexibility: Pipeline-as-code lets you slot in monitoring, alerts, or bespoke ML models anywhere in the flow.
  • Cost-efficiency: Commodity hardware is enough; no expensive, always-on OCR services or heavyweight databases.

Bottom line: Pathway handles the live firehose, Docling turns messy documents into structured gold, and you stay in complete control—ideal for enterprise-grade, real-time AI applications.

Hands-On Guide to Real-Time Multimodal Data Processing

You will learn how to set up the Docling parser, configure a real-time RAG system, and apply it to a sample financial document. This guide shows how to use the Docling parser effectively to improve the accuracy and relevance of responses generated by the Pathway RAG system.

Prerequisites

Environment: This guide is intended to be run in Google Colab.

Before you begin, ensure you have the following set up:

  1. Install all basic dependencies to run your Pathway pipelines, including the powerful Rust engine:
!pip install "pathway[all]"
  2. Ensure that you have a valid OpenAI API key:
import os
os.environ["OPENAI_API_KEY"] = "sk-...."

Enter Docling: Open-Source Multimodal Parsing by IBM

To address the parsing piece of the puzzle, IBM Research open-sourced Docling, a toolkit purpose-built for multimodal document parsing. Docling is an MIT-licensed library that can ingest a wide range of documents – PDFs, DOCX, PPTX, images, HTML, even AsciiDoc – and convert them into clean structured data (JSON or Markdown). Crucially, Docling handles not just plain text but also preserves layout, hierarchy, tables, and images.

Unlike visual-only parsers, Docling brings in linguistic context—like sentence structures and entity linking—for deeper document comprehension. This makes it especially effective for documents like contracts, financial statements, and research papers that rely on domain-specific terminology.
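
If you want a feel for Docling on its own before wiring it into Pathway, here is a minimal standalone sketch based on Docling's quickstart (the file path is a placeholder; this assumes pip install docling):

from docling.document_converter import DocumentConverter

# Convert a local PDF and export the structured result as Markdown.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder path
print(result.document.export_to_markdown())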

This section covers how the Docling parser works and how you can control its behavior via parameters. You will see its functionality, trade-offs, and best practices.

First, download a sample document containing multimodal data (text, tables, images, and equations) to use throughout this section. This guide uses Tesla's Q3 2023 financial report as the source document. You can download it with the wget command:

!mkdir data
!wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf -O data/TSLA-Q3-2023-Update-3.pdf

Now, let's load the document:

with open("data/TSLA-Q3-2023-Update-3.pdf", "rb") as f:
    content = f.read()

Parsing with Docling

Next, import and define a DoclingParser object:

from pathway.xpacks.llm.parsers import DoclingParser

parser = DoclingParser()

To parse a PDF file, simply run:

doc = await parser.parse(content)

If you are running outside a Colab (or Jupyter) notebook, top-level await is not available, so you will need to run the coroutine in an asyncio event loop:

import asyncio

doc = asyncio.run(parser.parse(content))

Understanding Parser Output

The parser returns a list of chunks. Each chunk is a tuple structured as (parsed_text, metadata_dict). Let's examine a specific chunk that contains a table:

print(doc[5][0])

Output:

HEADINGS:
# F I N A N C I A L S U M M A R Y (Unaudited)
CONTENT:

| Category                                      | Q3-2022 | Q4-2022 | Q1-2023 | Q2-2023 | Q3-2023 | YoY   |
|:---------------------------------------------|--------:|--------:|--------:|--------:|--------:|------:|
| Total automotive revenues                    |  18,692 |  21,307 |  19,963 |  21,268 |  19,625 |   5%  |
| Energy generation and storage revenue        |   1,117 |   1,310 |   1,529 |   1,509 |   1,559 |  40%  |
| Services and other revenue                   |   1,645 |   1,701 |   1,837 |   2,150 |   2,166 |  32%  |
| **Total revenues**                           |  21,454 |  24,318 |  23,329 |  24,927 |  23,350 |   9%  |
| Total gross profit                           |   5,382 |   5,777 |   4,511 |   4,533 |   4,178 | -22%  |
| Total GAAP gross margin                      |   25.1% |   23.8% |   19.3% |   18.2% |   17.9% | -719 bp |
| Operating expenses                           |   1,694 |   1,876 |   1,847 |   2,134 |   2,414 |  43%  |
| Income from operations                       |   3,688 |   3,901 |   2,664 |   2,399 |   1,764 | -52%  |
| Operating margin                             |   17.2% |   16.0% |   11.4% |    9.6% |    7.6% | -964 bp |
| Adjusted EBITDA                              |   4,968 |   5,404 |   4,267 |   4,653 |   3,758 | -24%  |
| Adjusted EBITDA margin                       |   23.2% |   22.2% |   18.3% |   18.7% |   16.1% | -706 bp |
| Net income attributable to stockholders (GAAP)     |   3,292 |   3,687 |   2,513 |   2,703 |   1,853 | -44%  |
| Net income attributable to stockholders (non-GAAP) |   3,654 |   4,106 |   2,931 |   3,148 |   2,318 | -37%  |
| EPS, diluted (GAAP)                          |   0.95  |   1.07  |   0.73  |   0.78  |   0.53  | -44%  |
| EPS, diluted (non-GAAP)                      |   1.05  |   1.19  |   0.85  |   0.91  |   0.66  | -37%  |
| Net cash provided by operating activities    |   5,100 |   3,278 |   2,513 |   3,065 |   3,308 | -35%  |
| Capital expenditures                         |  (1,803)|  (1,858)|  (2,072)|  (2,060)|  (2,460)|  36%  |
| Free cash flow                               |   3,297 |   1,420 |     441 |   1,005 |     848 | -74%  |
| Cash, cash equivalents, and investments      |  21,107 |  22,185 |  24,402 |  23,075 |  26,077 |  24%  |

Let's cross-check how the parsed table compares to the original:

Figure: the parsed table alongside the original image, taken from TSLA-Q3-2023-Update-3.pdf.

As you can see, Docling did a great job parsing this table!

Note: Exact financial figures or performance comparisons may vary depending on the model used, version of the library, etc.

Metadata

Each parsed chunk includes a metadata dictionary that provides additional context about the extracted content. This metadata can include information such as:

  • Section titles and headings
  • Footnotes and references
  • Source document structure details
  • Location of the elements within PDF (e.g. page and exact bounding box)

This metadata might be useful for downstream tasks such as document indexing, content retrieval, and contextual ranking in RAG applications.
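
Since each chunk is a (parsed_text, metadata_dict) tuple, inspecting this metadata is straightforward (shown here for the table chunk from earlier; the exact keys depend on your Docling and Pathway versions):

text, metadata = doc[5]
print(metadata)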

Image Parsing

By default, Pathway's Docling parser does not parse (describe) images. To enable image parsing, supply a vision-enabled LLM like gpt-4o-mini to the DoclingParser constructor:

from pathway.xpacks.llm.llms import OpenAIChat
from pathway.internals import udfs

llm = OpenAIChat(
    model="gpt-4o-mini",
    cache_strategy=udfs.DefaultCache(),
    retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=4),
)

parser = DoclingParser(
    image_parsing_strategy="llm",
    multimodal_llm=llm,
)

doc = await parser.parse(content)

As you can see, it is enough to set the image_parsing_strategy parameter to llm and provide a multimodal LLM of your choice to the parser constructor.

Let's see the new output!

print(doc[30][0])

Output:

HEADINGS:
# O U T L O O K
CONTENT:
The image is a map of the United States, highlighting different states with a focus on specific monetary values, likely representing a figure such as average income, salary, or a similar financial metric.

### Key Features of the Map:

- **Color Coding**: The states are shaded in a light blue color, with certain states marked in a darker blue. This darker shading indicates specific states associated with particular monetary values.
  
- **Highlighted States with Values**:
    - **New York**: $35,990
    - **Maine**: $35,490
    - **Pennsylvania**: $34,490
    - **Connecticut**: $34,240
    - **Illinois**: $32,490
    - **New Jersey**: $32,490
    - **Delaware**: $33,990
    - **Maryland**: $33,490
    - **Massachusetts**: $32,990
    - **Vermont**: $33,990

- **Additional Information**:
    - The text at the bottom left states: **"$36,490 in all other states"**, suggesting that the figure mentioned applies to states not specifically listed on the map.

### Overall Summary:
The map presents financial data across various states, differentiating between specific monetary amounts for selected states while providing a general figure for all others. It visually emphasizes how certain regions have varying values, likely related to income or living costs.

Below you can find the actual image that was described by the vision LLM:

Figure: the original map image from TSLA-Q3-2023-Update-3.pdf that the model described above.

Note how all text visible on the graphic has been transcribed and combined with a description of what the image shows.

Parsing tables using OpenAI

By default, the DoclingParser wrapper uses the Docling engine for parsing tables, which corresponds to setting table_parsing_strategy to docling. However, you can use a vision LLM for parsing tables as well: simply change table_parsing_strategy to llm.

In fact, you can choose whichever combination you like: for instance, parse images with a vision LLM and tables with Docling, or both with the LLM, as shown below. These settings are independent of each other, giving you modular control over the parsing process.
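
For example, here is a sketch that parses both images and tables with the multimodal LLM, reusing the llm object defined above:

parser = DoclingParser(
    image_parsing_strategy="llm",
    table_parsing_strategy="llm",
    multimodal_llm=llm,
)

doc = await parser.parse(content)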

Chunking

The Docling parser structures documents into coherent chunks, classifying elements such as equations, captions, footnotes, and tables. Pathway's integration modifies the default behavior to better support RAG systems, including:

  • Converting tables into Markdown format
  • Adding captions and headings for better retrieval

You can disable chunking by setting chunk=False in the constructor.

Keep in mind that this chunker does not split text based on its length. If you expect certain passages in your document to be long, consider adding a splitter such as TokenCountSplitter, as shown below.
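
For instance, a splitter with illustrative token bounds (the same values the pipeline configuration below uses) can be defined as follows:

from pathway.xpacks.llm.splitters import TokenCountSplitter

# Cap chunk sizes after Docling's structural chunking so that no
# retrieved passage exceeds the token budget.
splitter = TokenCountSplitter(min_tokens=400, max_tokens=750)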

Building a Multimodal RAG System

Environment: This part is best run locally, in a Python IDE such as VS Code or in JupyterLab.

Now, let’s walk through building a real-time RAG pipeline using Docling as a multimodal parser. You’ll use one of the templates from llm-app repository as the foundation.

Step 1: Clone the LLM App Templates Repository

Clone the llm-app repository from GitHub. This repository contains all the files you need.

git clone https://github.com/pathwaycom/llm-app.git

Step 2: Navigate to the Multimodal Project Directory

Change to the directory where the example is located:

cd llm-app/examples/pipelines/gpt_4o_multimodal_rag

Step 3: Modify the app.yaml File

In this example, you will customize the app.yaml file to enable Docling as the parser for handling multimodal inputs. Specifically, you configure:

  • DoclingParser to use llm for image parsing
  • docling strategy for tables (default, but made explicit)
  • gpt-4o as the LLM for parsing tasks

Below is the modified version:

$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache {}
  temperature: 0
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache {}

$parsing_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache {}

$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling"  # default

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  min_tokens: 400
  max_tokens: 750

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.engine.BruteForceKnnMetricKind.COS
  
$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store

Step 4: Running the Project

Locally

If you are using Windows, refer to the Docker instructions in the next section. For a local run, first install the dependencies:

pip install -r requirements.txt

Then, start the app:

python app.py

With Docker

Build the Docker image:

docker build -t rag .

Run the Docker container, mounting the data folder and exposing port 8000:

  • For Windows:
    docker run -v "%cd%/data:/app/data" -p 8000:8000 rag
  • For Linux/Mac:
    docker run -v "$(pwd)/data:/app/data" -p 8000:8000 rag

Step 5: Querying the Pipeline

Once your service is running on your chosen host and port (by default, 0.0.0.0:8000), you can test the service using curl.

Query the system to list indexed documents:

curl -X 'POST' 'http://localhost:8000/v2/list_documents' -H 'accept: */*' -H 'Content-Type: application/json'

It should return something like this:

[{"created_at": 1747767489, "modified_at": 1747767489, "owner": "root", "path": "data/20230203_alphabet_10K.pdf", "size": 897814, "seen_at": 1747837448}]%

Retrieve information from the indexed documents:

curl -X 'POST' 'http://0.0.0.0:8000/v2/answer' -H 'accept: */*' -H 'Content-Type: application/json' -d '{
  "prompt": "What is the operating income for the fiscal year of 2022?" }'

Response:

{"response": "$74,842 million"}

With this setup, you have now built a real-time RAG system that uses the Docling parser and Pathway to process multimodal documents efficiently!

Customizing your RAG pipeline using YAML

The Pathway framework lets you easily adjust the behavior of your RAG application through YAML-based templates so that it fits your use case as closely as possible.

Let's say you know your documents won't contain any tables but are rich in images. You can change the behavior of your RAG app by modifying the YAML config accordingly. For instance, adjust the parser element as follows:

$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling"  # default

With this configuration, all images will be parsed (described) by parsing_llm, while any tables that do appear will be handled by Docling.

Conclusion: Toward Real-Time Multimodal Intelligence

Processing real-time multimodal data no longer has to be a nightmare of separate tools and lagging batch jobs. With Docling’s powerful multimodal parsing and Pathway’s streaming RAG framework, developers can build systems that truly ingest everything (text, tables, audio, visuals) and turn it into actionable knowledge in seconds.

The pain point of fragmented, unstructured data is addressed by Docling’s powerful parsing – turning PDFs and images into structured text – while Pathway addresses the real-time orchestration – ensuring that from ingestion to answer, the flow is continuous and automated.

Developers and enterprises can finally take control of their data: instead of relying on black-box vendors or slow manual processes, they can deploy an in-house system that ingests all data types and serves up answers or alerts in seconds. They’re not locked into a single cloud provider or a one-size-fits-all API. As AI models evolve (e.g., new vision-language models or domain-specific LLMs), they can be integrated into this pipeline.

For more details on the Docling parser, please visit our user guide page on parsers. You can also check out Pathway's Multimodal RAG App Template, an open-source example for building a real-time Q&A system over PDFs and images. It includes ready-to-use code and a step-by-step tutorial for running it on your own data.


Are you looking to build an enterprise-grade RAG app?

Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. If you’d like to explore how Pathway can support your RAG and Generative AI initiatives, we invite you to schedule a discovery session with our team.

If you’re dealing with messy multimodal data and need answers in real time, give Pathway + Docling a try. Schedule a 15-minute demo with one of Pathway’s experts to discuss how a real-time multimodal pipeline could look for your enterprise data.


