pw.xpacks.llm.parsers

A library for document parsers: functions that take raw bytes and return a list of text chunks along with their metadata.

class pw.xpacks.llm.parsers.ImageParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, downsize_horizontal_width=1280, max_image_size=15 * 1024 * 1024, run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)

A class to parse images using vision LLMs.

  • Parameters
    • llm (pw.UDF) – LLM for parsing the image. The provided LLM must support image inputs.
    • parse_prompt (str) – The prompt used by the language model for parsing.
    • detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema triggers a second LLM call that extracts the fields defined in the schema; leaving it as None skips this step.
    • downsize_horizontal_width (int) – Width to which images are downsized if necessary. Default is 1280.
    • include_schema_in_text (bool) – If the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
    • max_image_size (int) – Maximum allowed size of the images in bytes. Default is 15 MB.
    • run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage with local LLMs are a concern, "sequential" may be better.
    • retry_strategy (AsyncRetryStrategy | None) – Retry strategy for the LLM calls. Defining a retry strategy is strongly suggested when using proprietary LLMs.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
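
The two size-related parameters can be pictured with a small, library-independent sketch. It assumes that downsizing preserves the aspect ratio and that max_image_size acts as a simple byte budget; the helper names downsized_dimensions and within_size_limit are hypothetical and are not part of pw.xpacks.llm.parsers.

```python
def downsized_dimensions(width: int, height: int, target_width: int = 1280):
    """Scale (width, height) so that width <= target_width.

    Assumption: the aspect ratio is preserved and images that are
    already narrow enough are left untouched.
    """
    if width <= target_width:
        return width, height
    scale = target_width / width
    return target_width, max(1, round(height * scale))


def within_size_limit(image_bytes: bytes, max_image_size: int = 15 * 1024 * 1024) -> bool:
    """Return True when the encoded image fits in the byte budget (15 MB by default)."""
    return len(image_bytes) <= max_image_size
```

Under these assumptions, a 2560x1440 image would be reduced to 1280x720, while an 800x600 image would pass through unchanged.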

__call__(*args, **kwargs)

Call self as a function.

class pw.xpacks.llm.parsers.OpenParse(table_args=None, image_args=None, parse_images=False, processing_pipeline=None, cache_strategy=None)

Parse PDFs using the open-parse library.

When used in the VectorStoreServer, splitter can be set to None as OpenParse already chunks the documents.

  • Parameters
    • table_args (dict | None) – Dictionary containing the table parser arguments. It must contain the key parsing_algorithm, whose value is one of "llm", "unitable", "pymupdf", "table-transformers". The "llm" key can be set to change the vision LLM used for parsing. Defaults to OpenAI gpt-4o with the Markdown table parsing prompt. The default configuration requires the OPENAI_API_KEY environment variable to be set. For other parsing algorithms and their supported arguments, see the OpenParse documentation.
    • image_args (dict | None) – Dictionary containing the image parser arguments. It must contain the keys parsing_algorithm, llm and prompt. Currently, the only supported parsing_algorithm is "llm". The "llm" key can be set to change the vision LLM used for parsing. Defaults to OpenAI gpt-4o with the Markdown image parsing prompt. The default configuration requires the OPENAI_API_KEY environment variable to be set.
    • parse_images (bool) – Whether to parse the images from the PDF. Detected images are indexed by the description produced by the parsing algorithm. Note that images are parsed with a separate OCR model, so parsing may take a while.
    • processing_pipeline (IngestionPipeline | str | None) – Specifies the pipeline used for post-processing the extracted elements. Defaults to SimpleIngestionPipeline.
      • "pathway_pdf_default": Uses SimpleIngestionPipeline from Pathway, a simple processor that merges nearby elements, joins headers with the text body, and removes very small or oddly formatted elements. Set it either with the string "pathway_pdf_default" or by using the class (from pathway.xpacks.llm.openparse_utils import SimpleIngestionPipeline).
      • "merge_same_page": Uses SamePageIngestionPipeline to chunk based on pages.
      • Any other pipeline from openparse.processing can also be used.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.

Example:

import pathway as pw
from pathway.xpacks.llm import llms, parsers, prompts

# Vision-capable chat model shared by the table and image parsers.
chat = llms.OpenAIChat(model="gpt-4o")

table_args = {
    "parsing_algorithm": "llm",
    "llm": chat,
    "prompt": prompts.DEFAULT_MD_TABLE_PARSE_PROMPT,
}
image_args = {
    "parsing_algorithm": "llm",
    "llm": chat,
    "prompt": prompts.DEFAULT_IMAGE_PARSE_PROMPT,
}

# parse_images defaults to False; pass parse_images=True to also parse images.
parser = parsers.OpenParse(table_args=table_args, image_args=image_args)

__call__(contents)

Parse the given PDFs.

  • Parameters
    contents (ColumnExpression[bytes]) – A column with PDFs to be parsed, passed as bytes.
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and its metadata, which in the case of OpenParse is an empty dictionary.

class pw.xpacks.llm.parsers.ParseUnstructured(mode='single', post_processors=None, cache_strategy=None, **unstructured_kwargs)

Parse documents using Unstructured (https://unstructured.io/).

All arguments can be overridden during UDF application.

  • Parameters
    • mode (Literal['single', 'elements', 'paged']) – One of "single", "elements" or "paged". With "single", each document is parsed as one long text string. With "elements", each document is split into Unstructured's elements. With "paged", the text of each page is extracted separately.
    • post_processors (list[Callable] | None) – List of callables that will be applied to all extracted texts.
    • **unstructured_kwargs (Any) – Extra keyword arguments passed to unstructured.io's partition function.

__call__(contents, **kwargs)

Parse the given document.

  • Parameters
    • contents (ColumnExpression) – document contents
    • **kwargs – override for defaults set in the constructor
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and its associated metadata. The metadata is obtained from Unstructured; the possible values are listed in the Unstructured documentation (https://unstructured-io.github.io/unstructured/metadata.html). Note that when mode is set to single or paged, fields that are specific to a single element, e.g. category_depth, are removed.
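
How the three modes shape the output can be illustrated without the library. The sketch below is an assumption-laden model, not Pathway's or Unstructured's implementation: elements are represented as plain (text, metadata) pairs, texts are joined with blank lines, page_number is assumed to be the grouping key, and ELEMENT_ONLY_FIELDS holds only the one field the description above names. The function combine_elements is hypothetical.

```python
from collections import defaultdict

# Assumed set of element-specific fields; the text above names category_depth.
ELEMENT_ONLY_FIELDS = {"category_depth"}


def _merge(group):
    """Join the texts of a group and merge metadata, dropping element-only fields."""
    text = "\n\n".join(t for t, _ in group)
    metadata = {}
    for _, meta in group:
        metadata.update(
            {k: v for k, v in meta.items() if k not in ELEMENT_ONLY_FIELDS}
        )
    return text, metadata


def combine_elements(elements, mode="single"):
    """Turn a list of (text, metadata) elements into output chunks."""
    if mode == "elements":
        # One chunk per element, metadata kept as-is.
        return [(text, dict(meta)) for text, meta in elements]
    if mode == "single":
        # One chunk for the whole document.
        return [_merge(elements)]
    if mode == "paged":
        # One chunk per page, in page order.
        pages = defaultdict(list)
        for text, meta in elements:
            pages[meta.get("page_number", 1)].append((text, meta))
        return [_merge(pages[page]) for page in sorted(pages)]
    raise ValueError(f"unknown mode: {mode!r}")
```

With three elements spread over two pages, this model yields three chunks in "elements" mode, two in "paged" mode and one in "single" mode.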

class pw.xpacks.llm.parsers.ParseUtf8(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

Decode text encoded as UTF-8.

__call__(contents, **kwargs)

Parse the given document.

  • Parameters
    contents (ColumnExpression) – document contents
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is an empty dictionary.
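
The return shape described above can be mimicked in plain Python. This is only a sketch of the contract (bytes in, a list with one (text, metadata) pair out); the function name parse_utf8_sketch is hypothetical and the real class operates on Pathway columns rather than on single values.

```python
def parse_utf8_sketch(contents: bytes):
    """Decode UTF-8 bytes into a single text chunk with empty metadata."""
    return [(contents.decode("utf-8"), {})]
```

Bytes that are not valid UTF-8 raise UnicodeDecodeError in this sketch, since bytes.decode is strict by default.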

class pw.xpacks.llm.parsers.PypdfParser(apply_text_cleanup=True, cache_strategy=None)

Parse PDF documents using the pypdf library. Optionally apply additional text cleanup for readability.

  • Parameters
    • apply_text_cleanup (bool) – Apply text cleanup for line breaks and repeated spaces.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid :py:class:~pathway.udfs.CacheStrategy should be provided. Defaults to None.
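
As a rough illustration of what cleaning up line breaks and repeated spaces can mean for extracted PDF text, consider the sketch below. The exact rules used by PypdfParser are not stated here, so the two regular expressions and the name clean_text are assumptions: single line breaks are treated as soft wraps, blank lines as paragraph boundaries.

```python
import re


def clean_text(text: str) -> str:
    """Illustrative cleanup for text extracted from a PDF."""
    # Replace a lone line break (not part of a blank line) with a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

For example, "Hello   world\nthis is\n\nnew  para" would become "Hello world this is\n\nnew para" under these rules.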

__call__(*args, **kwargs)

Call self as a function.

class pw.xpacks.llm.parsers.SlideParser(llm=DEFAULT_VISION_LLM, parse_prompt=prompts.DEFAULT_IMAGE_PARSE_PROMPT, detail_parse_schema=None, include_schema_in_text=False, intermediate_image_format='jpg', image_size=(1280, 720), run_mode='parallel', retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None)

A class to parse PPTX and PDF slides using vision LLMs.

Using this class requires a Pathway Scale account. Obtain a license to gain access.

  • Parameters
    • llm (UDF) – LLM for parsing the image. The provided LLM must support image inputs.
    • parse_prompt (str) – The prompt used by the language model for parsing.
    • detail_parse_schema (type[BaseModel] | None) – A schema for detailed parsing, if applicable. Providing a Pydantic schema triggers a second LLM call that extracts the fields defined in the schema; leaving it as None skips this step.
    • include_schema_in_text (bool) – If the parsed schema should be included in the text description. May help with search and retrieval. Defaults to False. Only usable if detail_parse_schema is provided.
    • intermediate_image_format (str) – Intermediate image format used when converting PDFs to images. Defaults to "jpg" for speed and memory use.
    • image_size (tuple[int, int], optional) – The target size of the images. Default is (1280, 720). Note that a higher resolution increases cost and latency. Since vision LLMs resize the given image to a certain resolution anyway, a higher resolution may not improve accuracy.
    • run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage with local LLMs are a concern, "sequential" may be better.
    • retry_strategy (AsyncRetryStrategy | None) – Retry strategy for the LLM calls. Defining a retry strategy is strongly suggested when using proprietary LLMs.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid pathway.udfs.CacheStrategy should be provided. Defaults to None.
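
The idea behind an exponential backoff retry strategy can be sketched with asyncio alone. This is not the implementation of udfs.ExponentialBackoffRetryStrategy; the function call_with_backoff and the parameters initial_delay and backoff_factor are hypothetical, and the sketch assumes that max_retries counts the retries made after the first attempt.

```python
import asyncio


async def call_with_backoff(fn, *, max_retries=6, initial_delay=0.5, backoff_factor=2.0):
    """Await fn(), retrying on failure with exponentially growing delays."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == max_retries:
                raise
            await asyncio.sleep(delay)
            delay *= backoff_factor
```

A call such as asyncio.run(call_with_backoff(some_llm_call)) would then tolerate transient failures like rate limits, where some_llm_call stands for any zero-argument coroutine function.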

__call__(*args, **kwargs)

Call self as a function.

async pw.xpacks.llm.parsers.parse_images(images, llm, parse_prompt, *, run_mode='parallel', parse_details=False, detail_parse_schema=None, parse_fn, parse_image_details_fn)

Parse images, and optionally a Pydantic model, with a multi-modal LLM. parse_prompt is used only for the regular parsing.

  • Parameters
    • images (list[Image]) – Image list to be parsed. Images are expected to be PIL.Image.Image.
    • llm (UDF) – LLM to be used for parsing. It must support image input.
    • parse_details (bool) – Whether to make a second LLM call to parse a specific Pydantic model (schema) from the image.
    • run_mode (Literal['sequential', 'parallel']) – Mode of execution, either "sequential" or "parallel". Default is "parallel". "parallel" mode is suggested for speed, but if timeouts or memory usage with local LLMs are a concern, "sequential" may be better.
    • detail_parse_schema (type[BaseModel] | None) – Pydantic model for schema to be parsed.
    • parse_fn (Callable) – Awaitable image parsing function.
    • parse_image_details_fn (Callable | None) – Awaitable image schema parsing function.
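
The difference between the two run modes can be sketched with asyncio. The helper run_all below is hypothetical and only models the scheduling: it assumes that "parallel" means all awaitable parse calls are started together and that "sequential" means each call finishes before the next one starts.

```python
import asyncio


async def run_all(items, parse_fn, run_mode="parallel"):
    """Apply an awaitable parse_fn to every item, in order."""
    if run_mode == "sequential":
        # One call at a time: slower, but at most one request in flight.
        return [await parse_fn(item) for item in items]
    if run_mode == "parallel":
        # All calls start together; gather returns results in input order.
        return list(await asyncio.gather(*(parse_fn(item) for item in items)))
    raise ValueError(f"unknown run_mode: {run_mode!r}")
```

In this model, "parallel" keeps as many requests in flight as there are images, which is why "sequential" can be the safer choice for a local LLM with limited memory.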