Parsers

Parsers play a crucial role in the Retrieval-Augmented Generation (RAG) pipeline by transforming raw, unstructured data into structured formats that can be effectively indexed, retrieved, and processed by language models. In a RAG system, data often comes from diverse sources such as documents, web pages, APIs, and databases, each with its own structure and format. Parsers help extract relevant content, normalize it into a consistent structure, and enhance the retrieval process by making information more accessible and usable.

This article explores the different types of parsers that can be used in our pipelines, highlighting their specific functions and how they differ from one another. Understanding these parsers is key to optimizing data ingestion, improving retrieval accuracy, and ultimately enhancing the quality of generated responses.

Here is a table listing all available parsers and some details about them:

Name	Data	Description
DoclingParser	PDF + tables + images	Uses the docling library to extract structured content from PDFs, including tables and images.
ImageParser	Image	Transforms images into textual descriptions and extracts structured information.
PaddleOCRParser	PDF + tables + images	Uses the PaddleOCR library to extract structured content from PDFs and images.
PypdfParser	PDF	Uses the pypdf library to extract text from PDFs, with optional text cleanup.
SlideParser	Slide	Extracts information from PPTX and PDF slide decks using vision-based LLMs.
UnstructuredParser	Text + tables	Uses the Unstructured library to parse various document types.
Utf8Parser	Text	Decodes text encoded in UTF-8.

Utf8Parser

Utf8Parser is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
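
As a rough illustration of the decoding step, here is a minimal sketch in plain Python. It is not Pathway's implementation, and the helper name decode_utf8 is hypothetical; it only shows what converting raw UTF-8 bytes into a readable string amounts to.

```python
def decode_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes into a string.

    Decoding is strict here: bytes that are not valid UTF-8 raise
    UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode("utf-8")


# Raw bytes as they might arrive from a file or a connector.
raw_bytes = "Zażółć gęślą jaźń".encode("utf-8")
text = decode_utf8(raw_bytes)
print(text)
```

Whether invalid bytes should raise, be replaced, or be ignored is a design choice; the strict behavior shown here is only one option.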

PaddleOCRParser

PaddleOCRParser is a parser that relies on the PaddleOCR library. It requires the paddlepaddle package, and the version to install depends on your hardware. To run OCR on CPU, install it with the following pip command (the quotes keep the shell from interpreting >= as a redirection):

pip install "paddlepaddle>=3.2.0"

For GPU support, follow the instructions on the official site.

The PaddleOCRParser uses a Paddle pipeline object to perform the parsing/OCR. Currently, PaddleOCR and PPStructureV3 pipelines are supported. By default, it uses a PPStructureV3 pipeline.

For more details on how to use the PaddleOCRParser, see the associated blog post.

UnstructuredParser

UnstructuredParser leverages the parsing capabilities of Unstructured. It supports various document types, including PDFs, HTML, Word documents, and more, making it a robust out-of-the-box solution for most use cases. Additionally, it offers good performance in terms of speed.

However, there are some limitations associated with the open-source library, such as reduced performance in document and table extraction and reliance on older, less sophisticated vision transformer models. Moreover, Unstructured does not support image extraction.

Chunking modes

Many parsers include chunking functionality, allowing them to use a document's structure to split content into smaller, semantically consistent chunks. Pathway's UnstructuredParser supports five chunking modes:

basic - Uses Unstructured's basic chunking strategy, which splits text into chunks shorter than the specified max_characters length (set via the chunking_kwargs argument). It also supports a soft threshold for chunk length using new_after_n_chars.
by_title - Uses Unstructured's chunk-by-title strategy, similar to basic chunking but with additional constraints to split chunks at section or page breaks, resulting in more structured chunks. Like basic chunking, it can be configured via chunking_kwargs.
elements - Breaks down a document into homogeneous Unstructured elements such as Title, NarrativeText, Footer, ListItem, etc. Not recommended for PDFs or other complex data sources. Best suited for simple input data where individual elements need to be separated.
paged - Collects all elements found on a single page into one chunk. Useful for documents where content is well-separated across pages.
single - Aggregates all Unstructured elements into a single large chunk. Use this mode when applying other chunking strategies available in Pathway or when using a custom chunking approach.
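
To make the roles of max_characters and new_after_n_chars in the basic mode more concrete, here is a simplified sketch in plain Python. It is not Unstructured's actual algorithm: the function basic_chunks and its greedy packing logic are assumptions made for illustration. The hard limit caps every chunk, while the soft limit closes a chunk as soon as it has grown long enough.

```python
def basic_chunks(elements, max_characters=500, new_after_n_chars=None):
    """Greedily pack text elements into chunks (illustrative only).

    max_characters:    hard limit, no chunk is longer than this
    new_after_n_chars: soft limit, a chunk is closed once it reaches this length
    """
    if new_after_n_chars is None:
        soft = max_characters
    else:
        soft = min(new_after_n_chars, max_characters)

    chunks, current = [], ""
    for text in elements:
        # Split oversized elements so every piece respects the hard limit.
        pieces = [text[i:i + max_characters]
                  for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if current and len(candidate) > max_characters:
                # Adding the piece would break the hard limit: start a new chunk.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            if len(current) >= soft:
                # Soft limit reached: close the chunk early.
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


elements = ["Intro paragraph.", "A" * 30, "Short note.", "B" * 70]
chunks = basic_chunks(elements, max_characters=50, new_after_n_chars=40)
print([len(chunk) for chunk in chunks])
```

Note that all limits here are measured in characters, not tokens, which matches the character-based behavior of Unstructured chunking.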

Unstructured chunking is character-based rather than token-based, meaning you do not have precise control over the maximum number of tokens each chunk will occupy in the context window.

DoclingParser

DoclingParser is a PDF parser that utilizes the docling library to extract structured content from PDFs. It extends docling's DocumentConverter with additional functionality to parse images from PDFs using vision-enabled language models. This allows for a more comprehensive extraction of content, including tables and embedded images.

It is recommended to use this parser when extracting text, tables, and images from PDFs.

DoclingParser offers structure-aware chunking functionality. It separates tables and images into distinct chunks, merges all list items into a single chunk, and ensures each chunk is wrapped with markdown headings at the top and appropriate captions at the bottom when available (e.g., for tables and images).
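
The chunking behavior described above can be approximated with a short sketch in plain Python. This is a toy model, not DoclingParser's implementation: the function structure_aware_chunks, the item dictionaries, and their keys ("kind", "text", "caption") are all hypothetical. It shows list items being merged, tables kept as separate chunks, and each chunk wrapped with a heading on top and a caption at the bottom when one is available.

```python
def structure_aware_chunks(items):
    """Group parsed items into chunks (illustrative only).

    items: list of dicts with "kind" ("heading", "text", "list_item",
    "table", or "image"), "text", and an optional "caption".
    """
    chunks = []
    heading = None     # most recent heading seen so far
    list_buffer = []   # consecutive list items waiting to be merged

    def emit(body, caption=None):
        # Wrap the body with the current heading and an optional caption.
        parts = [f"# {heading}"] if heading else []
        parts.append(body)
        if caption:
            parts.append(caption)
        chunks.append("\n\n".join(parts))

    def flush_list():
        # Merge all buffered list items into a single chunk.
        if list_buffer:
            emit("\n".join(f"- {entry}" for entry in list_buffer))
            list_buffer.clear()

    for item in items:
        kind = item["kind"]
        if kind == "list_item":
            list_buffer.append(item["text"])
            continue
        flush_list()  # any non-list item ends the current list
        if kind == "heading":
            heading = item["text"]
        else:
            # Text, tables, and images each become a separate chunk.
            emit(item["text"], item.get("caption"))
    flush_list()
    return chunks


items = [
    {"kind": "heading", "text": "Results"},
    {"kind": "text", "text": "We compare three models."},
    {"kind": "list_item", "text": "Model A"},
    {"kind": "list_item", "text": "Model B"},
    {"kind": "table", "text": "| model | score |\n|---|---|\n| A | 0.9 |",
     "caption": "Table 1: Scores."},
]
for chunk in structure_aware_chunks(items):
    print(chunk)
    print("-----")
```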

Table parsing

There are two main approaches to parsing tables: (1) using the Docling engine, or (2) using a multimodal LLM. The first runs Docling OCR on the table in the PDF and converts it to Markdown. The second renders the table as an image, sends it to a multimodal LLM, and asks the model to parse it. Currently, only LLMs that expose the same API interface as OpenAI are supported.

To choose between the two, set table_parsing_strategy to either "llm" or "docling". If you don't want to parse tables, set this argument to None.

Image parsing

If image_parsing_strategy="llm", the parser detects images within the document, processes them with a multimodal LLM (such as OpenAI's GPT-4o), and embeds their descriptions in the Markdown output. If image parsing is disabled, images are replaced with placeholders.
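
The two strategy arguments can be collected in one place, as in the configuration sketch below. The argument names and values come from the text above. The import path and the way the arguments are passed to the constructor are assumptions, so that part is left commented out, and the multimodal LLM client needed for the "llm" strategies is not shown.

```python
# Keyword arguments named in the text above; the values are one possible choice.
docling_kwargs = {
    # "llm", "docling", or None to skip table parsing
    "table_parsing_strategy": "llm",
    # "llm" to describe images with a multimodal LLM
    "image_parsing_strategy": "llm",
}

# Assumed usage (requires Pathway and a configured multimodal LLM, not shown):
# from pathway.xpacks.llm.parsers import DoclingParser
# parser = DoclingParser(**docling_kwargs)
```

Using "docling" for tables and leaving image parsing disabled avoids LLM calls entirely, at the cost of images being replaced with placeholders.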

See PdfPipelineOptions for a reference of the available configuration options, such as OCR options, picture classification, code OCR, and scientific formula enrichment.

Want to learn more? Have a look at our dedicated blog post about DoclingParser.