pw.xpacks.llm.parsers
A library for document parsers: functions that take raw bytes and return a list of text chunks along with their metadata.
class DoclingParser(parse_images=False, multimodal_llm=None, cache_strategy=None, pdf_pipeline_options={})
Parse PDFs using the docling library. This class is a wrapper around the DocumentConverter from the docling library, with extra functionality to also parse images from the PDFs using vision LLMs.
- Parameters
  - parse_images (bool) – whether to parse the detected images from the PDF. Detected images will be cropped and described by the vision LLM and embedded in the markdown output. If set to True, multimodal_llm should be provided. If set to False, images will be replaced with placeholders in the markdown output.
  - multimodal_llm (llms.OpenAIChat | llms.LiteLLMChat | None) – LLM for parsing the image. Provided LLM should support image inputs in the same API format as OpenAI does. Required if parse_images is set to True.
  - cache_strategy (udfs.CacheStrategy | None) – Defines the caching mechanism.
  - pdf_pipeline_options (dict) – Additional options for the DocumentConverter from docling.
__call__(*args, **kwargs)
Call self as a function.
class ImageParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, downsize_horizontal_width=1280, max_image_size=15 * 1024 * 1024, run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)
A class to parse images using vision LLMs.
- Parameters
  - llm (pw.UDF) – LLM for parsing the image. Provided LLM should support image inputs.
  - parse_prompt (str) – The prompt used by the language model for parsing.
  - detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema will call the LLM a second time to parse the necessary information; leaving it as None will skip this step.
  - downsize_horizontal_width (int) – Width to which images are downsized if necessary. Default is 1280.
  - include_schema_in_text (bool) – Whether the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
  - max_image_size (int) – Maximum allowed size of the images in bytes. Default is 15 MB.
  - run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage in local LLMs are a concern, "sequential" may be better.
  - retry_strategy (AsyncRetryStrategy | None) – Retrying strategy for the LLM calls. Defining a retrying strategy with proprietary LLMs is strongly suggested.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
__call__(*args, **kwargs)
Call self as a function.
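The two size-related parameters above can be illustrated with a small, library-free sketch. It is not the ImageParser implementation: the helper names `plan_downsize` and `within_size_limit` are hypothetical, and the sketch assumes that downsizing preserves the aspect ratio and that images already narrower than the target width are left untouched.

```python
from typing import Tuple


def plan_downsize(width: int, height: int, target_width: int = 1280) -> Tuple[int, int]:
    """Return the (width, height) after an aspect-preserving downsize.

    Assumption: images no wider than target_width are not resized.
    """
    if width <= target_width:
        return width, height
    scale = target_width / width
    # Keep at least one pixel of height for extremely wide images.
    return target_width, max(1, round(height * scale))


def within_size_limit(image_bytes: bytes, max_image_size: int = 15 * 1024 * 1024) -> bool:
    """Check the encoded image against a byte limit (15 MB by default)."""
    return len(image_bytes) <= max_image_size


print(plan_downsize(3840, 2160))  # (1280, 720)
print(plan_downsize(800, 600))    # (800, 600)
```

A smaller target width reduces the payload sent to the vision LLM, which is the usual reason to cap it.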
class PypdfParser(apply_text_cleanup=True, cache_strategy=None)
Parse a PDF document using the pypdf library.
Optionally applies additional text cleanups for readability.
- Parameters
  - apply_text_cleanup (bool) – Apply text cleanup for line breaks and repeated spaces.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
__call__(*args, **kwargs)
Call self as a function.
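To give a feel for what a cleanup of "line breaks and repeated spaces" might involve, here is a minimal sketch using only the standard library. The exact rules applied by PypdfParser are not described here, so `cleanup_text` is a hypothetical helper with assumed rules: single line breaks inside a paragraph become spaces, blank lines are kept as paragraph breaks, and runs of spaces are collapsed.

```python
import re


def cleanup_text(text: str) -> str:
    """Illustrative cleanup; the rules are assumptions, not pypdf or Pathway behavior."""
    text = text.replace("\r\n", "\n")
    # A lone newline (not part of a blank line) is treated as a soft wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Normalize paragraph breaks and trim spaces around them.
    text = re.sub(r" *\n\n+ *", "\n\n", text)
    return text.strip()


raw = "Hello   world\nthis is\n\n  a   test "
print(repr(cleanup_text(raw)))  # 'Hello world this is\n\na test'
```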
class SlideParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, intermediate_image_format='jpg', image_size=(1280, 720), run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)
A class to parse PPTX and PDF slides using vision LLMs.
Use of this class requires a Pathway Scale account. A license is needed to gain access.
- Parameters
  - llm (UDF) – LLM for parsing the image. Provided LLM should support image inputs.
  - parse_prompt (str) – The prompt used by the language model for parsing.
  - detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema will call the LLM a second time to parse the necessary information; leaving it as None will skip this step.
  - include_schema_in_text (bool) – Whether the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
  - intermediate_image_format (str) – Intermediate image format used when converting PDFs to images. Defaults to "jpg" for speed and memory use.
  - image_size (tuple[int, int], optional) – The target size of the images. Default is (1280, 720). Note that setting a higher resolution will increase cost and latency. Since vision LLMs resize the given image to a certain resolution, setting high resolutions may not help with accuracy.
  - run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage in local LLMs are a concern, "sequential" may be better.
  - retry_strategy (AsyncRetryStrategy | None) – Retrying strategy for the LLM calls. Defining a retrying strategy with proprietary LLMs is strongly suggested.
  - cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
__call__(*args, **kwargs)
Call self as a function.
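The difference between the two run_mode values can be sketched with plain asyncio. This is only an illustration of the trade-off, not the SlideParser internals: `describe_slide` is a hypothetical stand-in for a vision LLM call, and `parse_slides` is a hypothetical driver.

```python
import asyncio
from typing import List


async def describe_slide(slide_id: int) -> str:
    # Hypothetical stand-in for one vision LLM request.
    await asyncio.sleep(0.01)
    return f"description of slide {slide_id}"


async def parse_slides(slide_ids: List[int], run_mode: str = "parallel") -> List[str]:
    if run_mode == "parallel":
        # All requests are in flight at once; results keep the input order.
        return list(await asyncio.gather(*(describe_slide(i) for i in slide_ids)))
    if run_mode == "sequential":
        # One request at a time: slower, but bounded load on a local model.
        return [await describe_slide(i) for i in slide_ids]
    raise ValueError(f"unknown run_mode: {run_mode!r}")


print(asyncio.run(parse_slides([1, 2, 3], run_mode="sequential")))
```

With many slides, the parallel branch finishes in roughly the time of the slowest request, while the sequential branch takes the sum of all request times.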
class UnstructuredParser(chunking_mode='single', partition_kwargs={}, post_processors=None, chunking_kwargs={}, cache_strategy=None)
Parse a document using Unstructured (https://unstructured.io/).
All arguments can be overridden during UDF application.
- Parameters
  - chunking_mode (Literal['single', 'elements', 'paged', 'basic', 'by_title']) – Mode used to chunk the document. When "basic", it uses Unstructured’s default chunking strategy. When "by_title", it behaves like "basic" but chunks the document while preserving section boundaries. When "single", each document is parsed as one long text string. When "elements", each document is split into Unstructured’s elements. When "paged", each page’s text is extracted separately. Defaults to "single".
  - post_processors (list[Callable] | None) – list of callables that will be applied to all extracted texts.
  - partition_kwargs (dict) – extra kwargs to be passed to unstructured.io’s partition function.
  - chunking_kwargs (dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function.
__call__(contents, chunking_mode=None, partition_kwargs={}, post_processors=None, chunking_kwargs={})
Parse the given document. Providing chunking_mode, partition_kwargs, post_processors or chunking_kwargs overrides the values set during initialization.
- Parameters
  - contents (ColumnExpression) – document contents
  - chunking_mode (Union[ColumnExpression, Literal['single', 'elements', 'paged', 'basic', 'by_title'], None]) – Mode used to chunk the document.
  - partition_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s partition function.
  - post_processors (ColumnExpression | list[Callable] | None) – list of callables that will be applied to all extracted texts.
  - chunking_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function.
- Returns
  A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is obtained from Unstructured; the possible values are listed in the Unstructured documentation: https://unstructured-io.github.io/unstructured/metadata.html. Note that when chunking_mode is set to "single" or "paged", some of these fields are removed if they are specific to a single element, e.g. category_depth.
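The "single", "paged" and "elements" modes can be pictured with a library-free sketch that operates on already-partitioned elements. This does not call Unstructured or Pathway. The element dictionaries, the `chunk` and `shared_metadata` helpers, and the rule of keeping only metadata fields shared by every merged element are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

Element = Dict[str, Any]
Chunk = Tuple[str, Dict[str, Any]]


def shared_metadata(elements: List[Element]) -> Dict[str, Any]:
    """Keep only the fields whose value is identical across all elements (assumed rule)."""
    if not elements:
        return {}
    first = elements[0]["metadata"]
    return {k: v for k, v in first.items()
            if all(e["metadata"].get(k) == v for e in elements)}


def chunk(elements: List[Element], chunking_mode: str = "single") -> List[Chunk]:
    if chunking_mode == "elements":
        # One chunk per element, with the element's full metadata.
        return [(e["text"], dict(e["metadata"])) for e in elements]
    if chunking_mode == "paged":
        pages: Dict[int, List[Element]] = {}
        for e in elements:
            pages.setdefault(e["metadata"]["page_number"], []).append(e)
        return [("\n\n".join(e["text"] for e in group), shared_metadata(group))
                for _, group in sorted(pages.items())]
    if chunking_mode == "single":
        # The whole document becomes one long text string.
        return [("\n\n".join(e["text"] for e in elements), shared_metadata(elements))]
    raise ValueError(f"unsupported chunking_mode in this sketch: {chunking_mode!r}")


elements = [
    {"text": "Title", "metadata": {"filename": "a.pdf", "page_number": 1, "category_depth": 0}},
    {"text": "Body one", "metadata": {"filename": "a.pdf", "page_number": 1}},
    {"text": "Body two", "metadata": {"filename": "a.pdf", "page_number": 2}},
]
for mode in ("single", "paged", "elements"):
    print(mode, chunk(elements, mode))
```

In this sketch, category_depth survives only in "elements" mode because it belongs to a single element, which mirrors the note in the Returns description.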
class Utf8Parser(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)
Decode text encoded as UTF-8.
__call__(contents, **kwargs)
Parse the given document.
- Parameters
  - contents (ColumnExpression) – document contents
- Returns
  A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is an empty dictionary.
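The return shape described above (a list of pairs, each a text chunk with its metadata) can be shown for a single document in plain Python. This is a sketch of the contract only, not the Utf8Parser implementation, which operates on Pathway columns; `parse_utf8` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def parse_utf8(contents: bytes) -> List[Tuple[str, Dict[str, Any]]]:
    """Decode raw bytes as UTF-8 and return one (text, metadata) pair."""
    # The metadata is an empty dictionary, matching the description above.
    return [(contents.decode("utf-8"), {})]


print(parse_utf8("héllo".encode("utf-8")))  # [('héllo', {})]
```

The other parsers in this module follow the same shape, with more chunks per document and richer metadata.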