pw.xpacks.llm.parsers
A library for document parsers: functions that take raw bytes and return a list of text chunks along with their metadata.
class DoclingParser(image_parsing_strategy=None, table_parsing_strategy='docling', multimodal_llm=None, cache_strategy=None, pdf_pipeline_options={}, chunk=True, *, async_mode='batch_async')
Parse PDFs using the docling library. This class is a wrapper around the DocumentConverter from the docling library, with extra functionality to parse images from the PDFs using vision LLMs.
- Parameters
  - image_parsing_strategy (Literal["llm"] | None) – Strategy for parsing images. If set to "llm", images will be replaced with descriptions generated by the vision LLM. In that case you have to provide a vision LLM in the multimodal_llm argument. Defaults to None.
  - table_parsing_strategy (Literal["docling", "llm"]) – Strategy for parsing tables. If set to "docling", tables will be parsed using the docling library. If set to "llm", tables will be replaced with descriptions generated by the vision LLM; these descriptions contain the table data as parsed by the LLM. Defaults to "docling".
  - multimodal_llm (llms.OpenAIChat | llms.LiteLLMChat | None) – LLM for parsing images. The provided LLM should support image inputs in the same API format as OpenAI does. Required if image_parsing_strategy is set to "llm".
  - cache_strategy (udfs.CacheStrategy | None) – Defines the caching mechanism.
  - pdf_pipeline_options (dict) – Additional options for the DocumentConverter from docling. These options are passed to the PdfPipelineOptions object and override the defaults that are created dynamically from the other arguments of this constructor. See the original code for reference: https://github.com/DS4SD/docling/blob/main/docling/datamodel/pipeline_options.py#L288 Keep in mind that you can also change lower-level configurations like TableStructureOptions, e.g. pdf_pipeline_options={"table_structure_options": {"mode": "accurate"}}.
  - chunk (bool) – Whether to chunk the parsed document into smaller, structurally coherent parts. Under the hood it uses a modified HybridChunker from the docling library. The modification handles the additional functionality of this class, which allows parsing images and tables with vision LLMs. It also changes how tables are transformed into text: instead of creating triplets of row, column and value, the table is transformed directly into its markdown format. All images and tables are split into separate chunks, each of which contains a caption. Chunks that have similar metadata are merged together. As of now, this chunker does not take chunk length into account (whether measured in characters or tokens); it chunks the document based only on its structure. If set to False, the entire document is returned as a single chunk. Defaults to True.
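A minimal construction sketch for the parameters above, assuming Pathway is installed with its docling dependencies and that the class is importable from pathway.xpacks.llm.parsers. The helper name build_docling_parser and the constant ACCURATE_TABLE_OPTIONS are illustrative and not part of the library. The import is deferred into the function so the snippet loads even without the optional dependencies.

```python
# Illustrative constant: the lower-level override quoted in the parameter docs.
ACCURATE_TABLE_OPTIONS = {"table_structure_options": {"mode": "accurate"}}


def build_docling_parser():
    """Hypothetical helper: a DoclingParser that needs no vision LLM."""
    # Deferred import: requires Pathway with the docling dependencies installed.
    from pathway.xpacks.llm.parsers import DoclingParser

    return DoclingParser(
        image_parsing_strategy=None,       # no LLM-generated image descriptions
        table_parsing_strategy="docling",  # tables are parsed by docling itself
        pdf_pipeline_options=ACCURATE_TABLE_OPTIONS,
        chunk=True,                        # structure-based chunks, not one big chunk
    )
```

Switching image_parsing_strategy to "llm" would additionally require passing a vision-capable model as multimodal_llm.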
__call__(*args, **kwargs)
Call self as a function.
async parse_visual_data(b64_imgs, prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT)
Perform OCR using the vision LLM on the given images. In this context, an image can be an actual picture (e.g. of a corgi) or an image of a table. Each image must be base64-encoded and carry the prefix "data:image/jpeg;base64,".
- Parameters
  - b64_imgs (list[str] | str) – List of base64 encoded images.
  - prompt (str) – The prompt used by the language model for parsing.
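The expected input format can be produced with the standard library alone. The sketch below shows one way to build such a string from raw JPEG bytes; the helper name to_data_uri is hypothetical and not part of Pathway.

```python
import base64

# Prefix required by parse_visual_data, as stated in its description.
JPEG_PREFIX = "data:image/jpeg;base64,"


def to_data_uri(image_bytes: bytes) -> str:
    """Hypothetical helper: base64-encode raw JPEG bytes and add the prefix."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return JPEG_PREFIX + encoded
```

A single string or a list of strings built this way matches the documented type of b64_imgs.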
class ImageParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, downsize_horizontal_width=1280, max_image_size=15 * 1024 * 1024, run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None, *, async_mode='batch_async')
A class to parse images using vision LLMs.
- Parameters
  - llm (pw.UDF) – LLM for parsing the image. The provided LLM should support image inputs.
  - parse_prompt (str) – The prompt used by the language model for parsing.
  - detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema makes the parser call the LLM a second time to extract the necessary information; leaving it as None skips this step.
  - downsize_horizontal_width (int) – Width to which images are downsized if necessary. Default is 1280.
  - include_schema_in_text (bool) – Whether the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
  - max_image_size (int) – Maximum allowed size of the images in bytes. Default is 15 MB.
  - run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage in local LLMs are a concern, "sequential" may be better.
  - retry_strategy (AsyncRetryStrategy | None) – Retrying strategy for the LLM calls. Defining a retrying strategy with proprietary LLMs is strongly suggested.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
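To make the two size-related defaults concrete, here is a small arithmetic sketch. It assumes that downsizing keeps the aspect ratio and leaves narrower images unchanged; the documentation does not state either detail, and the function name downsized_dimensions is hypothetical.

```python
# Default max_image_size from the signature, expressed in bytes.
MAX_IMAGE_SIZE = 15 * 1024 * 1024


def downsized_dimensions(width: int, height: int, target_width: int = 1280) -> tuple[int, int]:
    """Assumed behaviour: scale down to target_width, keeping the aspect ratio.

    Images that are already at or below target_width are returned unchanged.
    """
    if width <= target_width:
        return width, height
    scale = target_width / width
    return target_width, round(height * scale)
```

Under these assumptions a 2560x1440 image becomes 1280x720, while an 800x600 image is left as it is.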
__call__(*args, **kwargs)
Call self as a function.
class PypdfParser(apply_text_cleanup=True, cache_strategy=None)
Parse a PDF document using the pypdf library.
Optionally, applies additional text cleanups for readability.
- Parameters
  - apply_text_cleanup (bool) – Apply text cleanup for line breaks and repeated spaces.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
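The kind of cleanup that apply_text_cleanup refers to can be illustrated with two regular expressions. This is a sketch of the general idea (joining hard line breaks and collapsing repeated spaces), not PypdfParser's actual implementation; the function name cleanup_text is hypothetical.

```python
import re


def cleanup_text(text: str) -> str:
    """Illustrative cleanup: join single line breaks, collapse repeated spaces."""
    # A lone newline is treated as a wrapped line; blank lines (paragraph
    # breaks) are kept because the lookarounds skip consecutive newlines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```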
__call__(*args, **kwargs)
Call self as a function.
class SlideParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, intermediate_image_format='jpg', image_size=(1280, 720), run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None, *, async_mode='batch_async')
A class to parse PPTX and PDF slides using vision LLMs.
Use of this class requires a Pathway Scale account. Get your license here to gain access.
- Parameters
  - llm (UDF) – LLM for parsing the image. The provided LLM should support image inputs.
  - parse_prompt (str) – The prompt used by the language model for parsing.
  - detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema makes the parser call the LLM a second time to extract the necessary information; leaving it as None skips this step.
  - include_schema_in_text (bool) – Whether the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
  - intermediate_image_format (str) – Intermediate image format used when converting PDFs to images. Defaults to "jpg" for speed and memory use.
  - image_size (tuple[int, int], optional) – The target size of the images. Default is (1280, 720). Note that setting a higher resolution will increase cost and latency. Since vision LLMs resize the given image to a certain resolution, setting a high resolution may not improve accuracy.
  - run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage in local LLMs are a concern, "sequential" may be better.
  - retry_strategy (AsyncRetryStrategy | None) – Retrying strategy for the LLM calls. Defining a retrying strategy with proprietary LLMs is strongly suggested.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
__call__(*args, **kwargs)
Call self as a function.
class UnstructuredParser(chunking_mode='single', partition_kwargs={}, post_processors=None, chunking_kwargs={}, cache_strategy=None)
Parse document using https://unstructured.io/.
All arguments can be overridden during UDF application.
- Parameters
  - chunking_mode (Literal['single', 'elements', 'paged', 'basic', 'by_title']) – Mode used to chunk the document. When "basic", it uses Unstructured’s default chunking strategy. When "by_title", it behaves like "basic" but chunks the document while preserving section boundaries. When "single", each document is parsed as one long text string. When "elements", each document is split into Unstructured’s elements. When "paged", each page’s text is extracted separately. Defaults to "single".
  - post_processors (list[Callable] | None) – list of callables that will be applied to all extracted texts.
  - partition_kwargs (dict) – extra kwargs to be passed to unstructured.io’s partition function
  - chunking_kwargs (dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function
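A short sketch of how these constructor arguments might be combined, assuming Pathway is installed with the Unstructured dependencies and that a post-processor is a plain function from str to str (the documentation only says "callables applied to all extracted texts"). The names collapse_whitespace and build_unstructured_parser are hypothetical.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Example post-processor: squeeze any whitespace run into one space."""
    return re.sub(r"\s+", " ", text).strip()


def build_unstructured_parser():
    """Hypothetical helper: section-aware chunking plus one post-processor."""
    # Deferred import: requires Pathway with the Unstructured dependencies.
    from pathway.xpacks.llm.parsers import UnstructuredParser

    return UnstructuredParser(
        chunking_mode="by_title",  # chunk while preserving section boundaries
        post_processors=[collapse_whitespace],
    )
```

Because all arguments can be overridden during UDF application, the values chosen here act as defaults for later calls.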
__call__(contents, chunking_mode=None, partition_kwargs={}, post_processors=None, chunking_kwargs={})
Parse the given document. Providing chunking_mode, partition_kwargs, post_processors, or chunking_kwargs overrides the values set during initialization.
- Parameters
  - contents (ColumnExpression) – document contents
  - chunking_mode (Union[ColumnExpression, Literal['single', 'elements', 'paged', 'basic', 'by_title'], None]) – Mode used to chunk the document.
  - partition_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s partition function
  - post_processors (ColumnExpression | list[Callable] | None) – list of callables that will be applied to all extracted texts.
  - chunking_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function
- Returns
  A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is obtained from Unstructured; you can check possible values in the Unstructured documentation https://unstructured-io.github.io/unstructured/metadata.html Note that when chunking_mode is set to "single" or "paged", some of these fields are removed if they are specific to a single element, e.g. category_depth.
class Utf8Parser(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)
Decode text encoded as UTF-8.
__call__(contents, **kwargs)
Parse the given document.
- Parameters
  - contents (ColumnExpression) – document contents
- Returns
  A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is an empty dictionary.
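The shape of the returned value can be pictured in plain Python. The sketch below mirrors the description above for a single document (one text chunk paired with empty metadata); it is an illustration of the output shape, not the Pathway UDF itself, and the name decode_utf8 is hypothetical.

```python
def decode_utf8(contents: bytes) -> list[tuple[str, dict]]:
    """Illustrative stand-in: one (text, metadata) pair with empty metadata."""
    return [(contents.decode("utf-8"), {})]
```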