pw.xpacks.llm.parsers

A library for document parsers: functions that take raw bytes and return a list of text chunks along with their metadata.

class pw.xpacks.llm.parsers.OpenParse(table_args={'parsing_algorithm': 'llm'}, cache_strategy=None)

Parse PDFs using [open-parse library](https://github.com/Filimoa/open-parse).

When used in the VectorStoreServer, splitter can be set to None as OpenParse already chunks the documents.

  • Parameters
    • table_args (-) – dict containing the table parser arguments. Must contain the key parsing_algorithm, whose value is one of llm, unitable, pymupdf, table-transformers. If llm is chosen, gpt-4o is used for parsing and the OPENAI_API_KEY environment variable must be set. For information on the other parsing algorithms and their supported arguments, check the OpenParse documentation.
    • cache_strategy (-) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. Defaults to None.
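
The constraints on table_args described above can be checked before the parser is constructed. The following is a minimal sketch in plain Python, not part of the library: the helper name check_table_args and its env argument are hypothetical and exist only to illustrate the rules (a required parsing_algorithm key with one of four values, and OPENAI_API_KEY needed for llm).

```python
import os

# Values listed in the table_args parameter description.
ALLOWED_ALGORITHMS = {"llm", "unitable", "pymupdf", "table-transformers"}


def check_table_args(table_args, env=None):
    """Hypothetical helper: validate table_args before building the parser."""
    # Allow a custom mapping for testing; default to the real environment.
    env = os.environ if env is None else env
    algorithm = table_args.get("parsing_algorithm")
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported parsing_algorithm: {algorithm!r}")
    # The llm algorithm needs an OpenAI key in the environment.
    if algorithm == "llm" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY must be set when parsing_algorithm is 'llm'")
    return table_args
```

Running such a check early turns a missing key or an unset environment variable into an immediate error instead of a failure during parsing.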

__call__(contents)

Parse the given PDFs.

  • Parameters
    contents (-) – A column with PDFs to be parsed, passed as bytes.
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and its metadata, which in the case of OpenParse is an empty dictionary.

class pw.xpacks.llm.parsers.ParseUnstructured(mode='single', post_processors=None, **unstructured_kwargs)

Parse document using [https://unstructured.io/](https://unstructured.io/).

All arguments can be overridden during UDF application.

  • Parameters
    • mode (-) – single, elements or paged. When single, each document is parsed as one long text string. When elements, each document is split into unstructured’s elements. When paged, each page’s text is extracted separately.
    • post_processors (-) – list of callables that will be applied to all extracted texts.
    • **unstructured_kwargs (-) – extra kwargs to be passed to unstructured.io’s partition function.
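
The difference between the three modes can be illustrated with a small plain-Python sketch. It does not call unstructured or Pathway and is not the library's implementation: the function combine_elements, the "\n\n" separator, and the choice to keep only page_number in paged mode are assumptions made for illustration.

```python
from collections import defaultdict


def combine_elements(elements, mode="single", post_processors=None):
    """Hypothetical sketch: elements is a list of (text, metadata) pairs."""
    if mode == "elements":
        # One chunk per element, metadata kept as-is.
        chunks = [(text, dict(meta)) for text, meta in elements]
    elif mode == "paged":
        # Group element texts by page number, one chunk per page.
        pages = defaultdict(list)
        for text, meta in elements:
            pages[meta.get("page_number", 1)].append(text)
        chunks = [
            ("\n\n".join(texts), {"page_number": page})
            for page, texts in sorted(pages.items())
        ]
    elif mode == "single":
        # The whole document becomes one long string.
        chunks = [("\n\n".join(text for text, _ in elements), {})]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Post-processors are applied to every extracted text, in order.
    for processor in post_processors or []:
        chunks = [(processor(text), meta) for text, meta in chunks]
    return chunks


elements = [
    ("Title", {"page_number": 1, "category_depth": 0}),
    ("First paragraph.", {"page_number": 1}),
    ("Second page text.", {"page_number": 2}),
]
print(len(combine_elements(elements, mode="single")))
print(len(combine_elements(elements, mode="paged")))
print(len(combine_elements(elements, mode="elements")))
```

With this sample input, single yields one chunk, paged yields one chunk per page, and elements yields one chunk per element.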

__call__(contents, **kwargs)

Parse the given document.

  • Parameters
    • contents (-) – document contents
    • **kwargs (-) – override for defaults set in the constructor
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is obtained from Unstructured; possible values are listed in the Unstructured documentation (https://unstructured-io.github.io/unstructured/metadata.html). Note that when mode is set to single or paged, some of these fields are removed because they are specific to a single element, e.g. category_depth.

class pw.xpacks.llm.parsers.ParseUtf8(*, return_type=Ellipsis, deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

Decode text encoded as UTF-8.

__call__(contents, **kwargs)

Parse the given document.

  • Parameters
    contents (-) – document contents
  • Returns
    A column with a list of pairs for each query. Each pair is a text chunk and associated metadata. The metadata is an empty dictionary.
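
The shape of the returned value can be sketched in plain Python. The function parse_utf8 below is a hypothetical stand-in that mirrors the description above for a single document (one decoded chunk paired with an empty metadata dictionary); it is not the Pathway UDF itself and assumes the input is valid UTF-8.

```python
def parse_utf8(contents: bytes) -> list[tuple[str, dict]]:
    """Hypothetical stand-in: one chunk per document, empty metadata."""
    # Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return [(contents.decode("utf-8"), {})]


chunks = parse_utf8("héllo wörld".encode("utf-8"))
print(chunks)
```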