Docling Document Parser: A Free and Open-Source Alternative to SaaS Parsing APIs

How to Use the Docling Document Parser in a Real-Time RAG System
In this article, you will explore how to integrate the multimodal Docling parser into a real-time RAG system. The Docling document parser is a powerful tool designed to process varied data types, including text, tables, images, and equations, which makes it an ideal choice for handling complex documents. By leveraging its capabilities, you can enhance the performance and versatility of the Pathway RAG system, enabling it to better understand and generate responses from diverse and rich data sources.
As a cloud-agnostic tool, Docling frees you from vendor lock-in and recurring API costs, while offering full control over data privacy and processing infrastructure.
You will learn how to set up the Docling parser, configure the Pathway RAG system, and see it applied to a sample document. This guide provides a practical understanding of how to use the Docling parser effectively to improve the accuracy and relevance of the responses generated by the Pathway RAG system.
Prerequisites
Before you begin, ensure you have the following set up:
- Install the Pathway engine, which allows you to build a RAG system:
pip install "pathway[all]"
- Ensure that the OPENAI_API_KEY is available as an environment variable:
export OPENAI_API_KEY="sk-...."
Docling Overview
Unlike visual-only parsers, Docling brings in linguistic context, such as sentence structure and entity linking, for deeper document comprehension. This makes it especially effective for documents such as contracts, financial statements, and research papers that rely on domain-specific terminology.
This section covers how the Docling parser works and how you can control its behavior via parameters. You will see its functionality, trade-offs, and best practices.
First, download a sample document containing multimodal data, including text, tables, images, and equations, which will be used throughout this section. The paper "Mixtral of Experts" by Jiang et al., which introduces the Mixtral 8x7B model, can be downloaded from arXiv:
mkdir data
wget https://arxiv.org/pdf/2401.04088 -O data/mixtral.pdf
Now, let's load the document:
with open("mixtral.pdf", "rb") as f:
content = f.read()
Parsing with DocLing
Next, import and define a DoclingParser object:
from pathway.xpacks.llm.parsers import DoclingParser
parser = DoclingParser()
To parse a PDF file, simply run:
doc = await parser.parse(content)
If running outside a Jupyter notebook, you will need to create and execute an asyncio event loop:
import asyncio
doc = asyncio.run(parser.parse(content))
Understanding parser output
The parser returns a list of chunks. Each chunk is a tuple structured as (parsed_text, metadata_dict). Let's examine a specific chunk that contains a table:
print(doc[15][0])
Output:
HEADINGS:
# 3 Results
CONTENT:
| | LLaMA 2 70B | GPT-3.5 | Mixtral 8x7B |
|:------------------------------|:--------------|:----------|:---------------|
| MMLU (MCQ in 57 subjects) | 69.9% | 70.0% | 70.6% |
| HellaSwag (10-shot) | 87.1 % | 85.5% | 86.7% |
| ARC Challenge (25-shot) | 85.1% | 85.2% | 85.8% |
| WinoGrande (5-shot) | 83.2% | 81.6% | 81.2% |
| MBPP (pass@1) | 49.8% | 52.2% | 60.7% |
| GSM-8K (5-shot) | 53.6% | 57.1% | 58.4% |
| MTBench (for Instruct Models) | 6.86 | 8.32 | 8.30 |
CAPTION:
Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5. Mixtral outperforms or matches Llama 2 70B and GPT-3.5 performance on most metrics.
Let's cross-check how the parsed table compares to the original:
Image taken from "Mixtral of Experts", Jiang et al.
As you can see, Docling did a great job parsing this table!
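By the way, if you do not want to hardcode the chunk index, here is a minimal sketch for locating table chunks by scanning each chunk's parsed text for the Markdown alignment row visible in the output above (a simple heuristic, not part of the DoclingParser API):
# Heuristic: Markdown tables emitted by the parser contain a "|:--" alignment row
table_chunk_indices = [i for i, (text, metadata) in enumerate(doc) if "|:--" in text]
print(table_chunk_indices)  # the table shown above lives at index 15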
Metadata
Each parsed chunk includes a metadata dictionary that provides additional context about the extracted content. This metadata can include information such as:
- Section titles and headings
- Footnotes and references
- Source document structure details
- Location of elements within the PDF (e.g., page number and exact bounding box)
This metadata might be useful for downstream tasks such as document indexing, content retrieval, and contextual ranking in RAG applications.
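To see what was extracted for the table chunk examined earlier, print the second element of its tuple (the exact keys depend on the chunk type and the Docling version, so treat the comment below as illustrative):
print(doc[15][1])  # metadata dict, e.g. headings and page/bounding-box information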
Image Parsing
By default, Pathway's Docling parser does not parse (describe) images. To enable image parsing, supply a vision-enabled LLM such as gpt-4o-mini to the DoclingParser constructor:
from pathway.xpacks.llm.llms import OpenAIChat
from pathway.internals import udfs
llm = OpenAIChat(
    model="gpt-4o-mini",
    cache_strategy=udfs.DefaultCache(),
    retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=4),
)
parser = DoclingParser(
    image_parsing_strategy="llm",
    multimodal_llm=llm,
)
doc = await parser.parse(content)
As you can see, it is enough to set the image_parsing_strategy parameter to "llm" and provide a multimodal LLM of your choice to the parser constructor.
Let's see the new output!
print(doc[4][0])
Output:
HEADINGS:
# 1 Introduction
CONTENT:
CAPTION:
Figure 1: Mixture of Experts Layer. Each input vector is assigned to 2 of the 8 experts by a router. The layer's output is the weighted sum of the outputs of the two selected experts. In Mixtral, an expert is a standard feedforward block as in a vanilla transformer architecture.
The image illustrates a "Mixture of Experts Layer," which is a concept often used in machine learning, particularly for enhancing model performance by allowing different sub-models (experts) to specialize in different types of inputs.
### Detailed Explanation:
1. **Title**: The top of the image includes the title "Mixture of Experts Layer."
2. **Inputs**: On the left side, there is a box labeled "inputs." This represents the data that will be processed by the model.
3. **Router**: Next to the inputs is a box labeled "router." The router is responsible for directing the incoming data to the appropriate experts based on certain criteria. It determines which expert(s) should be used for a given input.
...
Below you can find the actual image that was described by the vision language model:
Image taken from "Mixtral of Experts", Jiang et al.
Note how all text visible in the figure has been transcribed and appended to the caption, along with a description of what happens in the image.
Parsing tables using OpenAI
By default, the DoclingParser wrapper uses the Docling engine for parsing tables, which corresponds to setting table_parsing_strategy to "docling". However, you can use a vLM for parsing tables as well: it is enough to change table_parsing_strategy to "llm". In fact, you can choose whichever combination you like; for instance, you can decide to parse images with a vLM and tables with Docling. These features are independent of each other, giving you modular control over your parsing process.
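For instance, a parser that describes both images and tables with a multimodal LLM could be configured as follows (a minimal sketch reusing the llm object defined in the image-parsing example above):
parser = DoclingParser(
    image_parsing_strategy="llm",  # describe images with the multimodal LLM
    table_parsing_strategy="llm",  # parse tables with the multimodal LLM as well
    multimodal_llm=llm,
)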
Chunking
The Docling parser structures documents into coherent chunks, classifying elements such as equations, captions, footnotes, and tables. Pathway's integration modifies the default behavior to better support RAG systems, including:
- Converting tables into Markdown format
- Adding captions and headings for better retrieval
You can disable chunking by setting chunk=False in the constructor. Keep in mind that this chunker does not split text based on its length, so if you expect certain passages in your documents to be long, consider combining it with an additional splitter such as TokenCountSplitter, as sketched below.
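For example, here is a minimal sketch that disables Docling's built-in chunking and delegates splitting to a token-based splitter (the token limits are illustrative and mirror the YAML configuration used later in this guide):
from pathway.xpacks.llm.parsers import DoclingParser
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Return the parsed document without Docling's structural chunking
parser = DoclingParser(chunk=False)
# Split the parsed text into chunks of roughly 400-750 tokens instead
splitter = TokenCountSplitter(min_tokens=400, max_tokens=750)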
Building the RAG System
Now, let’s walk through building a RAG system using Docling as a multimodal parser. You’ll use the llm-app template as the foundation.
Clone the repository:
git clone https://github.com/pathwaycom/llm-app.git
Navigate to the pipeline directory:
cd llm-app/examples/pipelines/gpt_4o_multimodal_rag
Modify the app.yaml configuration file to suit your needs. Below is a simplified version:
$sources:
  - !pw.io.fs.read
    path: data
    format: binary
    with_metadata: true

$llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache
  temperature: 0
  capacity: 8

$embedder: !pw.xpacks.llm.embedders.OpenAIEmbedder
  model: "text-embedding-ada-002"
  cache_strategy: !pw.udfs.DefaultCache

$parsing_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o"
  retry_strategy: !pw.udfs.ExponentialBackoffRetryStrategy
    max_retries: 6
  cache_strategy: !pw.udfs.DefaultCache

$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling"  # default

$splitter: !pw.xpacks.llm.splitters.TokenCountSplitter
  min_tokens: 400
  max_tokens: 750

$retriever_factory: !pw.stdlib.indexing.BruteForceKnnFactory
  reserved_space: 1000
  embedder: $embedder
  metric: !pw.engine.BruteForceKnnMetricKind.COS

$document_store: !pw.xpacks.llm.document_store.DocumentStore
  docs: $sources
  parser: $parser
  splitter: $splitter
  retriever_factory: $retriever_factory

question_answerer: !pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer
  llm: $llm
  indexer: $document_store
Launch the application:
pip install -r requirements.txt
python app.py
Query the system to list indexed documents:
curl -X 'POST' 'http://0.0.0.0:8000/v1/pw_list_documents' -H 'accept: */*' -H 'Content-Type: application/json'
It should return something like this:
[{"created_at": null, "modified_at": 1704774357, "owner": "albert", "path": "data/mixtral.pdf", "size": 2475990, "seen_at": 1742294707}]
Retrieve information from the indexed documents:
curl -X 'POST' 'http://0.0.0.0:8000/v1/pw_ai_answer' -H 'accept: */*' -H 'Content-Type: application/json' -d '{"prompt": "What score does Mixtral achieve on the MMLU benchmark?"}'
Response:
{"response": "Mixtral achieves a score of 70.6% on the MMLU benchmark."}
With this setup, you can now use the Pathway RAG system with the Docling parser to process multimodal documents efficiently!
Customizing your RAG pipeline
The Pathway framework allows you to easily adjust the behavior of your RAG application using YAML-based templates so that it addresses your use case as closely as possible.
Let's say you know that your documents won't contain any tables but are abundant in images. You can easily change the behavior of your RAG app by modifying the YAML config accordingly. For instance, you can adjust the parser by changing the parser element as follows:
$parser: !pw.xpacks.llm.parsers.DoclingParser
  multimodal_llm: $parsing_llm
  image_parsing_strategy: "llm"
  table_parsing_strategy: "docling"  # default
With this configuration, all images will be parsed (described) by $parsing_llm, while tables will be parsed by Docling.
Conclusion
Whether you're parsing PDFs, invoices, or academic research, the Docling document parser provides an open-source, private, and cost-effective alternative to commercial SaaS tools.
For more information about the Docling parser, please visit our user guide page on parsers.
If you are interested in diving deeper into the topic, here are some good references to get started with Pathway:
- Pathway Developer Documentation
- In-depth article on Multimodal RAG
- Discord Community of Pathway
- Multimodal RAG App Template
- Power and Deploy RAG Agent Tools with Pathway
- End-to-end Real-time RAG app with Pathway
Are you looking to build an enterprise-grade RAG app?
Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. If you’d like to explore how Pathway can support your RAG and Generative AI initiatives, we invite you to schedule a discovery session with our team.
Schedule a 15-minute demo with one of our experts to see how Pathway can be the right solution for your enterprise.

Albert Roethel
AI Engineer