Parsers

Parsers play a crucial role in the Retrieval-Augmented Generation (RAG) pipeline by transforming raw, unstructured data into structured formats that can be effectively indexed, retrieved, and processed by language models. In a RAG system, data often comes from diverse sources such as documents, web pages, APIs, and databases, each with its own structure and format. Parsers help extract relevant content, normalize it into a consistent structure, and enhance the retrieval process by making information more accessible and usable.

This article explores the different types of parsers that can be used in our pipelines, highlighting their specific functions and how they differ from one another. Understanding these parsers is key to optimizing data ingestion, improving retrieval accuracy, and ultimately enhancing the quality of generated responses.

Here is a table listing all available parsers and some details about them:

| Name | Data | Description |
| --- | --- | --- |
| Utf8Parser | Text | Decodes text encoded in UTF-8. |
| UnstructuredParser | Text + tables | Leverages the Unstructured library to parse various document types. |
| DoclingParser | PDF + tables + images | Uses the docling library to extract structured content from PDFs, including images. |
| PypdfParser | PDF | Uses the pypdf library to extract text from PDFs, with optional text cleanup. |
| ImageParser | Image | Transforms images into textual descriptions and extracts structured information. |
| SlideParser | Slide | Extracts information from PPTX and PDF slide decks using vision-based LLMs. |

Utf8Parser

Utf8Parser is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
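For example, a minimal usage sketch (assuming, as in the examples below, that data_sources is a Pathway table with a binary data column):

from pathway.xpacks.llm.parsers import Utf8Parser

parser = Utf8Parser()
# decode the raw bytes in the `data` column into UTF-8 strings
result = data_sources.select(parsed=parser(data_sources.data))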

UnstructuredParser

UnstructuredParser leverages the parsing capabilities of Unstructured. It supports various document types, including PDFs, HTML, Word documents, and more, making it a robust out-of-the-box solution for most use cases. It is also fast.

However, the open-source library has some limitations, such as weaker document and table extraction quality and reliance on older, less sophisticated vision transformer models. Moreover, Unstructured does not support image extraction.

Chunking modes

Many parsers include chunking functionality, allowing them to use a document's structure to split content into smaller, semantically consistent chunks. Pathway's UnstructuredParser supports five chunking modes:

  • basic - Uses Unstructured's basic chunking strategy, which splits text into chunks shorter than the specified max_characters length (set via the chunking_kwargs argument). It also supports a soft threshold for chunk length using new_after_n_chars.
  • by_title - Uses Unstructured's chunk-by-title strategy, similar to basic chunking but with additional constraints that split chunks at section or page breaks, resulting in more structured chunks. Like basic chunking, it can be configured via chunking_kwargs.
  • elements - Breaks down a document into homogeneous Unstructured elements such as Title, NarrativeText, Footer, ListItem, etc. Not recommended for PDFs or other complex data sources; best suited for simple input data where individual elements need to be separated.
  • paged - Collects all elements found on a single page into one chunk. Useful for documents where content is well-separated across pages.
  • single - Aggregates all Unstructured elements into a single large chunk. Use this mode when applying other chunking strategies available in Pathway or when using a custom chunking approach.

Example of usage:

from pathway.xpacks.llm.parsers import UnstructuredParser

parser = UnstructuredParser(
    chunking_mode="by_title",
    chunking_kwargs={
        "max_characters": 3000,       # hard limit on the number of characters in each chunk
        "new_after_n_chars": 2000,    # soft limit on the number of characters in each chunk
    },
)

# `data_sources` is a Pathway table with a binary `data` column,
# e.g. read with pw.io.fs.read(..., format="binary") as in the ImageParser example below
result = data_sources.with_columns(parsed=parser(data_sources.data))

Unstructured chunking is character-based rather than token-based, so you don't have precise control over how many tokens each chunk occupies in the context window. If you need token-level control, you can parse with chunking_mode="single" and apply a token-based splitter afterwards, as sketched below.
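A minimal sketch of that approach, assuming Pathway's TokenCountSplitter from pathway.xpacks.llm.splitters (its exact parameters may differ between versions):

from pathway.xpacks.llm.parsers import UnstructuredParser
from pathway.xpacks.llm.splitters import TokenCountSplitter

# parse each document into one large chunk, skipping Unstructured chunking
parser = UnstructuredParser(chunking_mode="single")
parsed = data_sources.with_columns(parsed=parser(data_sources.data))

# split on token counts instead of characters; min_tokens/max_tokens are
# assumed parameter names, verify against your installed version
splitter = TokenCountSplitter(min_tokens=100, max_tokens=500)
# the splitter is then applied to a text column in the same way
# a parser is applied to a binary column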

DoclingParser

DoclingParser is a PDF parser that utilizes the docling library to extract structured content from PDFs. It extends docling's DocumentConverter with additional functionality to parse images from PDFs using vision-enabled language models. This allows for a more comprehensive extraction of content, including tables and embedded images.

It is recommended to use this parser when extracting text, tables, and images from PDFs.

Image parsing

If parse_images=True, the parser detects images within the document, processes them with a multimodal LLM (such as OpenAI's GPT-4o), and embeds their descriptions in the Markdown output. If disabled, images are replaced with placeholders.

Example:

from pathway.xpacks.llm.parsers import DoclingParser
from pathway.xpacks.llm.llms import OpenAIChat

multimodal_llm = OpenAIChat(model="gpt-4o-mini")

parser = DoclingParser(
    parse_images=True,
    multimodal_llm=multimodal_llm,
    pdf_pipeline_options={  # overrides the default options for parsing PDFs with docling
        "do_formula_enrichment": True,
        "image_scale": 1.5,
    }
)
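The configured parser is then applied to a binary data column in the same way as the other parsers:

result = data_sources.with_columns(parsed=parser(data_sources.data))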

See PdfPipelineOptions for a reference of the possible configuration options, such as OCR settings, picture classification, code OCR, and scientific formula enrichment.

PypdfParser

PypdfParser is a lightweight PDF parser that utilizes the pypdf library to extract text from PDF documents. It also includes an optional text cleanup feature to enhance readability by removing unnecessary line breaks and spaces.

Keep in mind that it may not be adequate for table extraction, and image extraction is not supported.
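A minimal usage sketch, constructed with defaults (the optional text cleanup is toggled by a constructor flag; see the API reference for its exact name):

from pathway.xpacks.llm.parsers import PypdfParser

parser = PypdfParser()
result = data_sources.select(parsed=parser(data_sources.data))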

ImageParser

This parser transforms an image (e.g. in .png or .jpg format) into a textual description generated by a multimodal LLM. On top of that, it can extract structured information from the image via a predefined schema.

Example:

Imagine that you have an application for detecting the breed of a dog from a picture. You also want to know the color and surroundings of the dog.

Let's put an image of a corgi into the ./dogs directory:

wget https://media.os.fressnapf.com/cms/2020/07/ratgeber_hund_rasse_portraits_welsh-corgi-pembroke_1200x527.jpg?t=seoimgsqr_527 -P ./dogs

Now, let's build a simple Pathway pipeline that parses the image and extracts the structured information defined in a pydantic schema.

import pathway as pw
from pydantic import BaseModel

from pathway.xpacks.llm.llms import OpenAIChat
from pathway.xpacks.llm.parsers import ImageParser

data_sources = pw.io.fs.read(
    "./dogs",
    format="binary",
    mode="static",
)

chat = OpenAIChat(model="gpt-4o-mini")

# schema defining information we want to extract from the image
class DogDetails(BaseModel):
    breed: str
    surroundings: str
    color: str

prompt = "Please provide a description of the image."

parser = ImageParser(
    llm=chat,
    parse_prompt=prompt,
    detail_parse_schema=DogDetails,
)

result = data_sources.select(parsed=parser(data_sources.data))

The result (after writing to JSON) will be:

{
    "parsed": [
        [
            "The image shows a happy Corgi dog running in a grassy area. The Corgi has a reddish-brown and white coat, a fluffy tail, and its tongue is out, giving it a cheerful expression. Its ears are perked up, and it appears to be wearing a red collar. The background is slightly blurred, emphasizing the dog in motion.",
            {
                "breed": "Pembroke Welsh Corgi",
                "surroundings": "outdoors in a grassy area",
                "color": "tan and white"
            }
        ]
    ]
}

Under the hood, there are two requests to the LLM: the first generates a basic description using the provided prompt, while the second uses instructor to extract information from the image and organize it according to the provided detail_parse_schema.

The second step is optional: if you don't specify the detail_parse_schema parameter, the instructor-based LLM call is skipped.

SlideParser

SlideParser is a powerful parser designed to extract information from PowerPoint (PPTX) and PDF slide decks using vision-based LLMs. It converts each slide into an image before passing it to a vision LLM that describes the slide's content.

As with ImageParser, you can also extract information specified in a pydantic schema, as sketched below.
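A minimal sketch, assuming SlideParser accepts the same llm and detail_parse_schema parameters as ImageParser; the SlideDetails schema here is hypothetical, and the full parameter list is in the API reference:

from pydantic import BaseModel

from pathway.xpacks.llm.llms import OpenAIChat
from pathway.xpacks.llm.parsers import SlideParser

# hypothetical schema describing what to extract from each slide
class SlideDetails(BaseModel):
    title: str
    key_points: list[str]

chat = OpenAIChat(model="gpt-4o-mini")

parser = SlideParser(
    llm=chat,  # vision-capable LLM; parameter name assumed, as in ImageParser
    detail_parse_schema=SlideDetails,  # optional, as with ImageParser
)

result = data_sources.select(parsed=parser(data_sources.data))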