pw.xpacks.llm.parsers

A library for document parsers: functions that take raw bytes and return a list of text chunks along with their metadata.

class DoclingParser(parse_images=False, multimodal_llm=None, cache_strategy=None, pdf_pipeline_options={})

Parse PDFs using the docling library. This class wraps docling's DocumentConverter and adds the ability to parse images found in the PDFs using vision LLMs.

  • Parameters
    • parse_images (bool) – whether to parse the detected images from the PDF. Detected images will be cropped and described by the vision LLM and embedded in the markdown output. If set to True, multimodal_llm should be provided. If set to False, images will be replaced with placeholders in the markdown output.
    • multimodal_llm (llms.OpenAIChat | llms.LiteLLMChat | None) – LLM for parsing the images. The provided LLM should support image inputs in the same API format as OpenAI. Required if parse_images is set to True.
    • cache_strategy (udfs.CacheStrategy | None) – Defines the caching mechanism.
    • pdf_pipeline_options (dict) – Additional options for the DocumentConverter from docling.

__call__(*args, **kwargs)

Call self as a function.

class ImageParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, downsize_horizontal_width=1280, max_image_size=15 * 1024 * 1024, run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)

A class to parse images using vision LLMs.

  • Parameters
    • llm (pw.UDF) – LLM for parsing the image. Provided LLM should support image inputs.
    • parse_prompt (str) – The prompt used by the language model for parsing.
    • detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema makes the parser call the LLM a second time to extract the necessary information; leaving it as None skips this step.
    • downsize_horizontal_width (int) – Width to which images are downsized if necessary. Default is 1280.
    • include_schema_in_text (bool) – If the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
    • max_image_size (int) – Maximum allowed size of the images in bytes. Default is 15 MB.
    • run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage with local LLMs are a concern, "sequential" may be better.
    • retry_strategy (AsyncRetryStrategy | None) – Retry strategy for the LLM calls. Defining a retry strategy is strongly suggested when using proprietary LLMs.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
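
The downsize_horizontal_width and max_image_size parameters above can be illustrated with a small plain-Python sketch. This is not ImageParser's implementation: the helper names are hypothetical, and the sketch assumes that downsizing preserves the aspect ratio and that the size limit is compared against the raw byte length.

```python
def downsized_dimensions(
    width: int, height: int, downsize_horizontal_width: int = 1280
) -> tuple[int, int]:
    """Return the target (width, height) for an image.

    Hypothetical helper: images no wider than the limit are left unchanged;
    wider images are scaled to the limit, keeping the aspect ratio (assumption).
    """
    if width <= downsize_horizontal_width:
        return width, height
    scale = downsize_horizontal_width / width
    return downsize_horizontal_width, max(1, round(height * scale))


def within_size_limit(image_bytes: bytes, max_image_size: int = 15 * 1024 * 1024) -> bool:
    """Check the raw byte length against the limit (default 15 MB)."""
    return len(image_bytes) <= max_image_size
```

Under these assumptions a 2560x1440 image would be reduced to 1280x720, while an 800x600 image would pass through unchanged.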

__call__(*args, **kwargs)

Call self as a function.

class PypdfParser(apply_text_cleanup=True, cache_strategy=None)

Parse PDF documents using the pypdf library. Optionally applies additional text cleanup for readability.

  • Parameters
    • apply_text_cleanup (bool) – Apply text cleanup for line breaks and repeated spaces.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
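
To give a feel for what "text cleanup for line breaks and repeated spaces" can mean, here is a minimal plain-Python sketch. The function cleanup_text is hypothetical and the exact rules PypdfParser applies may differ; the sketch assumes single line breaks inside a paragraph are joined and blank lines between paragraphs are kept.

```python
import re


def cleanup_text(text: str) -> str:
    """Illustrative cleanup, not the library's actual rules."""
    # Join single line breaks inside a paragraph, keep blank-line separators.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

PDF text extraction often breaks lines at the page layout's column width, which is why joining line breaks tends to improve readability.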

__call__(*args, **kwargs)

Call self as a function.

class SlideParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, intermediate_image_format='jpg', image_size=(1280, 720), run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)

A class to parse PPTX and PDF slides using vision LLMs.

Use of this class requires a Pathway Scale account. Get your license here to gain access.

  • Parameters
    • llm (UDF) – LLM for parsing the image. Provided LLM should support image inputs.
    • parse_prompt (str) – The prompt used by the language model for parsing.
    • detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema makes the parser call the LLM a second time to extract the necessary information; leaving it as None skips this step.
    • include_schema_in_text (bool) – If the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
    • intermediate_image_format (str) – Intermediate image format used when converting PDFs to images. Defaults to "jpg" for speed and memory use.
    • image_size (tuple[int, int], optional) – The target size of the images. Default is (1280, 720). Note that a higher resolution increases cost and latency. Since vision LLMs resize the given image to a certain resolution anyway, a high resolution may not improve accuracy.
    • run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage with local LLMs are a concern, "sequential" may be better.
    • retry_strategy (AsyncRetryStrategy | None) – Retry strategy for the LLM calls. Defining a retry strategy is strongly suggested when using proprietary LLMs.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
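
The default retry_strategy shown in the signature is an exponential backoff with max_retries=6. The general idea can be sketched in plain Python as follows. The names backoff_delays and call_with_retries, the initial delay of one second, and the factor of two are assumptions made for illustration; they are not taken from Pathway's ExponentialBackoffRetryStrategy.

```python
import time


def backoff_delays(
    max_retries: int = 6, initial_delay: float = 1.0, backoff_factor: float = 2.0
) -> list[float]:
    """Delays (in seconds) to wait before each retry; values are illustrative."""
    return [initial_delay * backoff_factor**attempt for attempt in range(max_retries)]


def call_with_retries(func, delays, sleep=time.sleep):
    """Call func(); after each failure wait for the next delay, then retry.

    The final attempt is made without a safety net, so its exception propagates.
    """
    for delay in delays:
        try:
            return func()
        except Exception:
            sleep(delay)
    return func()
```

With six retries the call is attempted up to seven times in total, and the waiting time grows with each failure so that a rate-limited or overloaded LLM endpoint gets time to recover.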

__call__(*args, **kwargs)

Call self as a function.

class UnstructuredParser(chunking_mode='single', partition_kwargs={}, post_processors=None, chunking_kwargs={}, cache_strategy=None)

Parse documents using Unstructured (https://unstructured.io/).

All arguments can be overridden during UDF application.

  • Parameters
    • chunking_mode (Literal['single', 'elements', 'paged', 'basic', 'by_title']) – Mode used to chunk the document. With "basic", Unstructured’s default chunking strategy is used. "by_title" works like "basic" but preserves section boundaries when chunking. With "single", each document is parsed as one long text string. With "elements", each document is split into Unstructured’s elements. With "paged", the text of each page is extracted separately. Defaults to "single".
    • post_processors (list[Callable] | None) – list of callables that will be applied to all extracted texts.
    • partition_kwargs (dict) – extra kwargs to be passed to unstructured.io’s partition function
    • chunking_kwargs (dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function
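
The difference between the "single", "elements" and "paged" modes can be sketched on a toy list of (text, metadata) elements. This is a plain-Python illustration, not UnstructuredParser's code: the function name, the "\n\n" separator and the simplified metadata handling are assumptions, and "basic" and "by_title" are left out because they delegate to Unstructured's own chunking functions.

```python
def chunk_elements_sketch(elements, chunking_mode="single"):
    """Group (text, metadata) elements into chunks; illustrative only."""
    if chunking_mode == "elements":
        # One chunk per element, metadata kept as-is.
        return [(text, dict(meta)) for text, meta in elements]
    if chunking_mode == "paged":
        # One chunk per page; element-specific metadata is dropped.
        pages = {}
        for text, meta in elements:
            pages.setdefault(meta.get("page_number"), []).append(text)
        return [("\n\n".join(texts), {"page_number": page}) for page, texts in pages.items()]
    if chunking_mode == "single":
        # The whole document as one chunk.
        return [("\n\n".join(text for text, _ in elements), {})]
    raise ValueError(f"mode not covered by this sketch: {chunking_mode}")


elements = [
    ("Title", {"page_number": 1, "category_depth": 0}),
    ("Body one", {"page_number": 1}),
    ("Body two", {"page_number": 2}),
]
paged = chunk_elements_sketch(elements, "paged")
```

In this toy example "paged" yields two chunks, one per page, while "single" yields one chunk containing all three texts.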

__call__(contents, chunking_mode=None, partition_kwargs={}, post_processors=None, chunking_kwargs={})

Parse the given document. Passing chunking_mode, partition_kwargs, post_processors or chunking_kwargs overrides the values set during initialization.

  • Parameters
    • contents (ColumnExpression) – document contents
    • chunking_mode (Union[ColumnExpression, Literal['single', 'elements', 'paged', 'basic', 'by_title'], None]) – Mode used to chunk the document.
    • partition_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s partition function
    • post_processors (ColumnExpression | list[Callable] | None) – list of callables that will be applied to all extracted texts.
    • chunking_kwargs (ColumnExpression | dict) – extra kwargs to be passed to unstructured.io’s chunk_elements or chunk_by_title function
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and its associated metadata. The metadata is obtained from Unstructured; the possible values are listed in the Unstructured documentation (https://unstructured-io.github.io/unstructured/metadata.html). Note that when chunking_mode is set to "single" or "paged", fields that are specific to a single element, e.g. category_depth, are removed.
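
A possible way to apply the parser to a table of documents is sketched below. The directory path and the function name build_parsing_pipeline are placeholders, and the sketch assumes the files are read with pw.io.fs.read in binary format, which exposes the raw bytes in a column named data. Imports are placed inside the function so that the snippet can be loaded even where Pathway and Unstructured are not installed.

```python
def build_parsing_pipeline(input_dir: str = "./documents"):
    """Build a table with one list of (chunk, metadata) pairs per input file."""
    import pathway as pw
    from pathway.xpacks.llm.parsers import UnstructuredParser

    # Defaults set at construction time.
    parser = UnstructuredParser(chunking_mode="elements")

    # Read every file in the directory as raw bytes.
    documents = pw.io.fs.read(input_dir, format="binary", mode="static", with_metadata=True)

    # chunking_mode passed here overrides the value given to the constructor.
    return documents.select(chunks=parser(pw.this.data, chunking_mode="paged"))
```

The returned table only describes the computation; an output connector and a call to pw.run() would still be needed to execute it.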

class Utf8Parser(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

Decode text encoded as UTF-8.

__call__(contents, **kwargs)

Parse the given document.

  • Parameters
    • contents (ColumnExpression) – document contents
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is an empty dictionary.
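
The shape of the value produced for each document can be shown with a one-line plain-Python equivalent. The function utf8_parse_sketch is hypothetical and only mirrors the return description above: a single text chunk paired with an empty metadata dictionary.

```python
def utf8_parse_sketch(contents: bytes) -> list[tuple[str, dict]]:
    """Decode UTF-8 bytes into a single (text, metadata) pair."""
    return [(contents.decode("utf-8"), {})]


result = utf8_parse_sketch("naïve café".encode("utf-8"))
```

Other parsers in this module return the same list-of-pairs structure, usually with more than one chunk and with non-empty metadata.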