Real-Time Multimodal Data Processing with Pathway and Docling

The Challenge of Multimodal Data Processing in Finance
In today’s landscape, data isn’t just text – it’s multimodal. Financial institutions grapple with information streaming in as PDFs, images (charts, scanned documents), audio (earnings calls, podcasts), video (webinars), and more. However, processing this multimodal data in real time is a daunting task. Each format speaks its own “language,” and traditional pipelines struggle to unify them fast enough for actionable insights.
In finance, market-moving insights might be buried in a CEO’s spoken words on an earnings call (audio), a complex table in a quarterly report (PDF), or even a chart from an explainer video.
Why is real-time multimodal processing so difficult? It boils down to complexity on multiple fronts:
- Diverse Formats: Each modality requires different parsing techniques. For example, PDFs may contain embedded text and images, requiring text extraction and OCR; audio needs speech-to-text transcription; images (like charts or diagrams) need computer vision to interpret.
- Volume & Velocity: Data streams in high volume and speed (e.g. live news feeds, sensor streams). Batching for offline analysis isn’t an option – you need streaming ingestion and processing.
- Asynchronous & Unstructured: A breaking news alert might arrive as text while simultaneously a detailed report comes in PDF. These must be correlated and indexed together. Unstructured content (free text, visuals) lacks the schema that databases have, making real-time structuring a must.
- Semantic Understanding: Even after extraction, making sense of the data (e.g. linking a chart image to its textual description or understanding an audio quote in context) is hard. Multimodal retrieval-augmented generation (RAG) demands semantic alignment across formats.
Simply put, multimodal data is complex to parse and integrate in real time. Many organizations resort to batch jobs or siloed tools – which is slow and error-prone. In high-stakes domains like finance, these delays and gaps mean lost opportunities and unmanaged risks.
How to Achieve Multimodal Data Processing with Pathway and Docling
Pathway and Docling tackle real-time data challenges with a powerful combination: a high-performance streaming framework and an advanced multimodal parser. Designed with privacy and developers in mind, this duo delivers fast, reliable, and flexible data processing. Here are the main advantages:
- Pathway is a Python framework for building a Live AI layer (real-time intelligence combined with AI) in enterprise applications. With Pathway, you can stream data changes the moment they happen, whether from Kafka topics or cloud storage events (see the sketch after this list). For RAG use cases, this means you can index documents and their changes on the fly—your indexes are always up-to-date with the latest filings or news. It is self-hostable, making it ideal for privacy-sensitive deployments.
- Docling is an open-source document processing toolkit (from IBM) designed to convert unstructured files into structured data for AI applications. Docling parses each document as it flows through the stream. It converts your PDFs, Word files, slides, images, and HTML into clean, developer-friendly formats, such as JSON or Markdown. It relies on layout-aware AI models (DocLayNet, TableFormer) instead of brute-force OCR, preserving headings, tables, figures, and multi-column text while remaining lightweight enough for a laptop or modest server and flexible enough to swap in custom models.
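To make the streaming idea concrete, here is a minimal sketch of a Pathway file source (the ./data directory is illustrative; the same connector arguments appear in the app.yaml later in this guide):

```python
import pathway as pw

# Watch a local directory as a live source: the resulting table updates
# whenever files are added, changed, or deleted.
documents = pw.io.fs.read("./data", format="binary", with_metadata=True)

# Downstream steps (parsing, indexing, serving) would be attached to
# `documents` here; pw.run() then starts the streaming engine.
pw.run()
```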
Joint Value for Teams
- Speed: Stream ingestion + incremental parsing keeps knowledge bases minutes—or seconds—fresh.
- Privacy: Both tools run on-premises or in a private Virtual Private Cloud (VPC); no data is sent to third-party APIs.
- Flexibility: Pipeline-as-code lets you slot in monitoring, alerts, or bespoke ML models anywhere in the flow.
- Cost-efficiency: Commodity hardware is enough; no expensive, always-on OCR services or heavyweight databases.
Bottom line: Pathway handles the live firehose, Docling turns messy documents into structured gold, and you stay in complete control—ideal for enterprise-grade, real-time AI applications.
Hands-On Guide to Real-Time Multimodal Data Processing
You will learn how to set up the Docling parser, configure a real-time RAG system, and test it on a sample financial document. This guide provides a comprehensive understanding of how to use the Docling parser effectively to improve the accuracy and relevance of responses generated by the Pathway RAG system.
Prerequisites
Environment: This guide is intended to be run in Google Colab.
Before you begin, ensure you have the following set up:
- Install all basic dependencies to run your Pathway pipelines, including the powerful Rust engine:
!pip install "pathway[all]"
- Ensure that you have a valid OpenAI API key:
import os
os.environ["OPENAI_API_KEY"] = "sk-...."
Enter Docling: Open-Source Multimodal Parsing by IBM
To address the parsing piece of the puzzle, IBM Research open-sourced Docling, a toolkit purpose-built for multimodal document parsing. Docling is an MIT-licensed library that can ingest a wide range of documents – PDFs, DOCX, PPTX, images, HTML, even AsciiDoc – and convert them into clean structured data (JSON or Markdown). Crucially, Docling handles not just plain text but also preserves layout, hierarchy, tables, and images.
Unlike visual-only parsers, Docling brings in linguistic context—like sentence structures and entity linking—for deeper document comprehension. This makes it especially effective for documents like contracts, financial statements, and research papers that rely on domain-specific terminology.
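Docling can also be used on its own, outside of Pathway. Here is a minimal sketch using Docling's converter API (the file path is illustrative):

```python
from docling.document_converter import DocumentConverter

# Convert a local PDF into Docling's structured document representation,
# then export it as Markdown (the path is illustrative).
converter = DocumentConverter()
result = converter.convert("data/TSLA-Q3-2023-Update-3.pdf")
print(result.document.export_to_markdown())
```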
This section covers how the Docling parser works and how you can control its behavior via parameters. You will see its functionality, trade-offs, and best practices.
First, download a sample document containing multimodal data, including text, tables, images, and equations, which will be used throughout this section.
This guide uses Tesla's Q3 2023 financial report as the source document. You can download it using the wget command:
!mkdir data
!wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf -O data/TSLA-Q3-2023-Update-3.pdf
Now, let's load the document:
with open("data/TSLA-Q3-2023-Update-3.pdf", "rb") as f:
    content = f.read()
Parsing with Docling
Next, import and define a DoclingParser object:
from pathway.xpacks.llm.parsers import DoclingParser
parser = DoclingParser()
To parse a PDF file, simply run:
doc = await parser.parse(content)
If running outside a Colab notebook, you will need to create and execute an asyncio event loop:
import asyncio
doc = asyncio.run(parser.parse(content))
Understanding Parser Output
The parser returns a list of chunks. Each chunk is a tuple structured as (parsed_text, metadata_dict). Let's examine a specific chunk that contains a table:
print(doc[5][0])
Output:
HEADINGS:
# F I N A N C I A L S U M M A R Y (Unaudited)
CONTENT:
| Category | Q3-2022 | Q4-2022 | Q1-2023 | Q2-2023 | Q3-2023 | YoY |
|:---------------------------------------------|--------:|--------:|--------:|--------:|--------:|------:|
| Total automotive revenues | 18,692 | 21,307 | 19,963 | 21,268 | 19,625 | 5% |
| Energy generation and storage revenue | 1,117 | 1,310 | 1,529 | 1,509 | 1,559 | 40% |
| Services and other revenue | 1,645 | 1,701 | 1,837 | 2,150 | 2,166 | 32% |
| **Total revenues** | 21,454 | 24,318 | 23,329 | 24,927 | 23,350 | 9% |
| Total gross profit | 5,382 | 5,777 | 4,511 | 4,533 | 4,178 | -22% |
| Total GAAP gross margin | 25.1% | 23.8% | 19.3% | 18.2% | 17.9% | -719 bp |
| Operating expenses | 1,694 | 1,876 | 1,847 | 2,134 | 2,414 | 43% |
| Income from operations | 3,688 | 3,901 | 2,664 | 2,399 | 1,764 | -52% |
| Operating margin | 17.2% | 16.0% | 11.4% | 9.6% | 7.6% | -964 bp |
| Adjusted EBITDA | 4,968 | 5,404 | 4,267 | 4,653 | 3,758 | -24% |
| Adjusted EBITDA margin | 23.2% | 22.2% | 18.3% | 18.7% | 16.1% | -706 bp |
| Net income attributable to stockholders (GAAP) | 3,292 | 3,687 | 2,513 | 2,703 | 1,853 | -44% |
| Net income attributable to stockholders (non-GAAP) | 3,654 | 4,106 | 2,931 | 3,148 | 2,318 | -37% |
| EPS, diluted (GAAP) | 0.95 | 1.07 | 0.73 | 0.78 | 0.53 | -44% |
| EPS, diluted (non-GAAP) | 1.05 | 1.19 | 0.85 | 0.91 | 0.66 | -37% |
| Net cash provided by operating activities | 5,100 | 3,278 | 2,513 | 3,065 | 3,308 | -35% |
| Capital expenditures | (1,803)| (1,858)| (2,072)| (2,060)| (2,460)| 36% |
| Free cash flow | 3,297 | 1,420 | 441 | 1,005 | 848 | -74% |
| Cash, cash equivalents, and investments | 21,107 | 22,185 | 24,402 | 23,075 | 26,077 | 24% |
Let's cross-check how the parsed table compares to the original:
Image taken from TSLA-Q3-2023-Update-3.
As you can see, Docling did a great job parsing this table!
Note: Exact financial figures or performance comparisons may vary depending on the model used, version of the library, etc.
Metadata
Each parsed chunk includes a metadata dictionary that provides additional context about the extracted content. This metadata can include information such as:
- Section titles and headings
- Footnotes and references
- Source document structure details
- Location of the elements within the PDF (e.g., page number and exact bounding box)
This metadata might be useful for downstream tasks such as document indexing, content retrieval, and contextual ranking in RAG applications.
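As a quick illustration, you can peek at the metadata of the chunks parsed above; the exact keys depend on the document and the parser version, so treat the output as an example:

```python
# Each chunk is a (parsed_text, metadata_dict) tuple, so the metadata
# can be inspected directly.
for parsed_text, metadata in doc[:3]:
    print(metadata)
```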
Image Parsing
By default, Pathway's Docling parser does not parse (describe) images. To enable image parsing, supply a vision-enabled LLM such as gpt-4o-mini to the DoclingParser constructor:
from pathway.xpacks.llm.llms import OpenAIChat
from pathway.internals import udfs
llm = OpenAIChat(
    model="gpt-4o-mini",
    cache_strategy=udfs.DefaultCache(),
    retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=4),
)
parser = DoclingParser(
    image_parsing_strategy="llm",
    multimodal_llm=llm,
)
doc = await parser.parse(content)
As you can see, it is enough to set the image_parsing_strategy parameter to llm and pass a multimodal LLM of your choice to the parser constructor.
Let's see the new output!
print(doc[30][0])
HEADINGS:
# O U T L O O K
CONTENT:
The image is a map of the United States, highlighting different states with a focus on specific monetary values, likely representing a figure such as average income, salary, or a similar financial metric.
### Key Features of the Map:
- **Color Coding**: The states are shaded in a light blue color, with certain states marked in a darker blue. This darker shading indicates specific states associated with particular monetary values.
- **Highlighted States with Values**:
- **New York**: $35,990
- **Maine**: $35,490
- **Pennsylvania**: $34,490
- **Connecticut**: $34,240
- **Illinois**: $32,490
- **New Jersey**: $32,490
- **Delaware**: $33,990
- **Maryland**: $33,490
- **Massachusetts**: $32,990
- **Vermont**: $33,990
- **Additional Information**:
- The text at the bottom left states: **"$36,490 in all other states"**, suggesting that the figure mentioned applies to states not specifically listed on the map.
### Overall Summary:
The map presents financial data across various states, differentiating between specific monetary amounts for selected states while providing a general figure for all others. It visually emphasizes how certain regions have varying values, likely related to income or living costs.
Below is the actual image that was described by the vision LLM:
Image taken from TSLA-Q3-2023-Update-3.
Note how each piece of text visible on the graphic has been transcribed and appended to the caption, along with a description of what the image shows.
Parsing Tables Using OpenAI
By default, the DoclingParser wrapper uses the Docling engine for parsing tables; this corresponds to setting table_parsing_strategy to docling. However, you can use a vision LLM for parsing tables as well: simply change table_parsing_strategy to llm.
In fact, you can choose whichever combination you like; for instance, you can parse images with the vision LLM and tables with Docling, as shown in the sketch below. These features are independent of each other, giving you modular control over the parsing process.
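For example, the following sketch (reusing the llm object defined earlier) describes images with the vision LLM while keeping Docling's native table engine:

```python
# Mix strategies: vision LLM for images, Docling's native engine for tables.
parser = DoclingParser(
    image_parsing_strategy="llm",
    table_parsing_strategy="docling",  # the default, made explicit
    multimodal_llm=llm,
)
doc = await parser.parse(content)
```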
Chunking
The Docling parser structures documents into coherent chunks, classifying elements such as equations, captions, footnotes, and tables. Pathway's integration modifies the default behavior to better support RAG systems, including:
- Converting tables into Markdown format
- Adding captions and headings for better retrieval
You can disable chunking by setting chunk=False in the constructor.
Keep in mind that this chunker does not split text based on its length. If you expect certain passages in your documents to be long, consider adding a splitter such as TokenCountSplitter, as in the sketch below.
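For instance, here is a sketch pairing an unchunked DoclingParser with a TokenCountSplitter (the token limits are illustrative; in the pipeline below, both components are wired into the document store):

```python
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Disable Docling's structural chunking and rely on token-based
# splitting instead (token limits here are illustrative).
parser = DoclingParser(chunk=False)
splitter = TokenCountSplitter(min_tokens=400, max_tokens=750)
```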
Building a Multimodal RAG System
IDE: This part is best run in a Python IDE such as VS Code or JupyterLab.
Now, let’s walk through building a real-time RAG pipeline using Docling as a multimodal parser. You’ll use one of the templates from the llm-app repository as the foundation.
Step 1: Clone the LLM App Templates Repository
Clone the llm-app repository from GitHub. This repository contains all the files you need.
git clone https://github.com/pathwaycom/llm-app.git
Step 2: Navigate to the Multimodal Project Directory
Change to the directory where the example is located:
cd llm-app/examples/pipelines/gpt_4o_multimodal_rag
Step 3: Modify the app.yaml File
In this example, you customize the app.yaml file to enable Docling as the parser for handling multimodal inputs. Specifically, you configure:
- DoclingParser with the llm strategy for image parsing
- the docling strategy for tables (the default, made explicit)
- gpt-4o as the LLM for parsing tasks
Below is the modified version:
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache {}
  temperature: 0
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache {}

$parsing_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache {}

$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling" # default

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  min_tokens: 400
  max_tokens: 750

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.engine.BruteForceKnnMetricKind.COS

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
Step 4: Running the Project
Locally
If you are using Windows, refer to the Docker instructions in the next section. For a local run, first install the dependencies:
pip install -r requirements.txt
Then, start the app:
python app.py
With Docker
Build the Docker Image:
docker build -t rag .
Run the Docker Container:
Mount the data folder and expose port 8000.
- For Windows:
docker run -v "%cd%/data:/app/data" -p 8000:8000 rag
- For Linux/Mac:
docker run -v "$(pwd)/data:/app/data" -p 8000:8000 rag
Step 5: Querying the Pipeline
Once your service is running on your chosen host and port (by default, 0.0.0.0:8000), you can test the service using curl.
Query the system to list indexed documents:
curl -X 'POST' 'http://localhost:8000/v2/list_documents' -H 'accept: */*' -H 'Content-Type: application/json'
It should return something like this:
[{"created_at": 1747767489, "modified_at": 1747767489, "owner": "root", "path": "data/20230203_alphabet_10K.pdf", "size": 897814, "seen_at": 1747837448}]%
Retrieve information from the indexed documents:
curl -X 'POST' 'http://0.0.0.0:8000/v2/answer' -H 'accept: */*' -H 'Content-Type: application/json' -d '{
"prompt": "What is the operating income for the fiscal year of 2022?" }'
Response:
{"response": "$74,842 million"}
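If you prefer Python over curl, the same query can be sent with the requests library (host and port as configured above):

```python
import requests

# Send the same question to the running pipeline's /v2/answer endpoint.
response = requests.post(
    "http://localhost:8000/v2/answer",
    json={"prompt": "What is the operating income for the fiscal year of 2022?"},
)
print(response.json())
```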
With this setup, you have now built a real-time RAG system that uses the Docling parser with Pathway to process multimodal documents efficiently!
Customizing your RAG pipeline using YAML
The Pathway framework allows you to easily adjust the behavior of your RAG application using YAML-based templates so that it fits your use case as closely as possible.
Say you know that your documents won't contain any tables but are rich in images. You can easily change the behavior of your RAG app by modifying the YAML config accordingly. For instance, you can adjust the parser by changing the parser element as follows:
$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling" # default
With this configuration, all images will be parsed (described) by parsing_llm, while tables will be parsed by Docling.
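If any tables that do appear should also be described by the LLM, only one value needs to change (a variation using the same parameters as above):

```yaml
$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "llm"
```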
Conclusion: Toward Real-Time Multimodal Intelligence
Processing real-time multimodal data no longer has to be a nightmare of separate tools and lagging batch jobs. With Docling’s powerful multimodal parsing and Pathway’s streaming RAG framework, developers can build systems that truly ingest everything (text, tables, audio, visuals) and turn it into actionable knowledge in seconds.
The pain point of fragmented, unstructured data is addressed by Docling’s powerful parsing – turning PDFs and images into structured text – while Pathway addresses the real-time orchestration – ensuring that from ingestion to answer, the flow is continuous and automated.
Developers and enterprises can finally take control of their data: instead of relying on black-box vendors or slow manual processes, they can deploy an in-house system that ingests all data types and serves up answers or alerts in seconds. They’re not locked into a single cloud provider or a one-size-fits-all API. As AI models evolve (e.g., new vision-language models or domain-specific LLMs), they can be integrated into this pipeline.
For more details on the Docling parser, please visit our user guide page on parsers. You can also check out Pathway’s Multimodal RAG App Template, an open-source example for building a real-time Q&A system over PDFs and images. It includes ready-to-use code and a step-by-step tutorial to run it on your own data.
If you are interested in diving deeper into the topic, here are some good references to get started with Pathway:
- Pathway Developer Documentation
- Multimodal RAG App Template
- Discord Community of Pathway
- End-to-end Real-time RAG app with Pathway
Are you looking to build an enterprise-grade RAG app?
Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. If you’d like to explore how Pathway can support your RAG and Generative AI initiatives, we invite you to schedule a discovery session with our team.
If you’re dealing with messy multimodal data and need answers in real time, give Pathway + Docling a try. Schedule a 15-minute demo with one of Pathway’s experts to discuss how a real-time multimodal pipeline could look for your enterprise data.