Parsers

Parsers play a crucial role in the Retrieval-Augmented Generation (RAG) pipeline by transforming raw, unstructured data into structured formats that can be effectively indexed, retrieved, and processed by language models. In a RAG system, data often comes from diverse sources such as documents, web pages, APIs, and databases, each with its own structure and format. Parsers help extract relevant content, normalize it into a consistent structure, and enhance the retrieval process by making information more accessible and usable.

This article explores the different types of parsers that can be used in our pipelines, highlighting their specific functions and how they differ from one another. Understanding these parsers is key to optimizing data ingestion, improving retrieval accuracy, and ultimately enhancing the quality of generated responses.

Here is a table listing all available parsers and some details about them:

| Name | Data | Description |
| --- | --- | --- |
| Utf8Parser | Text | Decodes text encoded in UTF-8. |
| UnstructuredParser | Text + tables | Leverages Unstructured library to parse various document types. |
| DoclingParser | PDF + tables + images | Utilizes docling library to extract structured content from PDFs, including images. |
| PypdfParser | PDF | Uses pypdf library to extract text from PDFs with optional text cleanup. |
| ImageParser | Image | Transforms images into textual descriptions and extracts structured information. |
| SlideParser | Slide | Extracts information from PPTX and PDF slide decks using vision-based LLMs. |

Utf8Parser

Utf8Parser is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
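Conceptually, the parser's job is plain UTF-8 decoding of a byte payload; a minimal sketch of that operation in Python:

```python
# Raw bytes as they might arrive from an upstream connector.
raw = "café au lait".encode("utf-8")

# The core operation: decode the byte payload into a readable string.
text = raw.decode("utf-8")
print(text)  # café au lait
```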

UnstructuredParser

UnstructuredParser leverages the parsing capabilities of Unstructured. It supports various document types, including PDFs, HTML, Word documents, and more, making it a robust out-of-the-box solution for most use cases. Additionally, it offers good performance in terms of speed.

However, there are some limitations associated with the open-source library, such as reduced performance in document and table extraction and reliance on older, less sophisticated vision transformer models. Moreover, Unstructured does not support image extraction.

Chunking modes

Many parsers include chunking functionality, allowing them to use a document's structure to split content into smaller, semantically consistent chunks. Pathway's UnstructuredParser supports five chunking modes:

  • basic - Uses Unstructured's basic chunking strategy, which splits text into chunks shorter than the specified max_characters length (set via the chunking_kwargs argument). It also supports a soft threshold for chunk length using new_after_n_chars.
  • by_title - Uses Unstructured's chunk-by-title strategy, similar to basic chunking but with additional constraints to split chunks at section or page breaks, resulting in more structured chunks. Like basic chunking, it can be configured via chunking_kwargs.
  • elements - Breaks down a document into homogeneous Unstructured elements such as Title, NarrativeText, Footer, ListItem, etc. Not recommended for PDFs or other complex data sources. Best suited for simple input data where individual elements need to be separated.
  • paged - Collects all elements found on a single page into one chunk. Useful for documents where content is well-separated across pages.
  • single - Aggregates all Unstructured elements into a single large chunk. Use this mode when applying other chunking strategies available in Pathway or when using a custom chunking approach.
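The basic and by_title modes rest on the same idea: greedily pack elements into a chunk, close the chunk once the soft limit new_after_n_chars is passed, and never exceed the hard limit max_characters. The sketch below illustrates these assumed semantics in plain Python; it is a simplification, not Unstructured's actual implementation:

```python
def basic_chunk(elements, max_characters=3000, new_after_n_chars=2000):
    """Greedy character-based chunking sketch (assumed semantics)."""
    chunks, current = [], ""
    for el in elements:
        candidate = current + "\n" + el if current else el
        # Close the current chunk at the soft limit, or when adding the
        # next element would break the hard limit.
        if current and (len(current) >= new_after_n_chars
                        or len(candidate) > max_characters):
            chunks.append(current)
            current = el
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = basic_chunk(["a" * 40, "b" * 40, "c" * 40],
                     max_characters=100, new_after_n_chars=60)
print([len(c) for c in chunks])  # [81, 40]
```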

Example of YAML configuration

```yaml
$parser: !pw.xpacks.llm.parsers.UnstructuredParser
  chunking_mode: "by_title"
  chunking_kwargs:
    max_characters: 3000        # hard limit on number of characters in each chunk
    new_after_n_chars: 2000     # soft limit on number of characters in each chunk
```

Unstructured chunking is character-based rather than token-based, meaning you do not have precise control over the maximum number of tokens each chunk will occupy in the context window.

DoclingParser

DoclingParser is a PDF parser that utilizes the docling library to extract structured content from PDFs. It extends docling's DocumentConverter with additional functionality to parse images from PDFs using vision-enabled language models. This allows for a more comprehensive extraction of content, including tables and embedded images.

It is recommended to use this parser when extracting text, tables, and images from PDFs.

Image parsing

If parse_images=True, the parser detects images within the document, processes them with a multimodal LLM (such as OpenAI's GPT-4o), and embeds their descriptions in the Markdown output. If disabled, images are replaced with placeholders.

Example of YAML configuration

```yaml
$multimodal_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"

$parser: !pw.xpacks.llm.parsers.DoclingParser
  parse_images: True
  multimodal_llm: $multimodal_llm
  pdf_pipeline_options:
    do_formula_enrichment: True
    image_scale: 1.5
```

See PdfPipelineOptions for a reference of the available configuration options, such as OCR settings, picture classification, code OCR, and scientific formula enrichment.

PypdfParser

PypdfParser is a lightweight PDF parser that utilizes the pypdf library to extract text from PDF documents. It also includes an optional text cleanup feature to enhance readability by removing unnecessary line breaks and spaces.

Keep in mind that it is not well suited for table extraction, and image extraction is not supported.
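The cleanup step amounts to whitespace normalization. A rough sketch of what such a cleanup can look like (a heuristic approximation, not pypdf's or Pathway's exact behavior):

```python
import re

def clean_pdf_text(text: str) -> str:
    # Join words hyphenated across line breaks (heuristic).
    text = re.sub(r"-\n(\w)", r"\1", text)
    # Turn single line breaks inside paragraphs into spaces,
    # keeping blank lines that separate paragraphs.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_pdf_text("extrac-\ntion of\ntext"))  # extraction of text
```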

ImageParser

This parser transforms images (e.g. in .png or .jpg format) into textual descriptions generated by a multimodal LLM. It can also extract structured information from an image according to a predefined schema.
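Following the pattern of the YAML examples above, an ImageParser could be wired to a multimodal LLM like this; the parameter name llm is an assumption, so check the API reference for the exact signature:

```yaml
$multimodal_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"

$parser: !pw.xpacks.llm.parsers.ImageParser
  llm: $multimodal_llm
```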

SlideParser

SlideParser is a powerful parser designed to extract information from PowerPoint (PPTX) and PDF slide decks using vision-based LLMs. It converts each slide into an image before passing it to a vision LLM that describes the slide's content.

As with ImageParser, you can also extract structured information specified in a Pydantic schema.
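For illustration, such a schema might look like the sketch below; the schema itself is standard Pydantic, while the wiring shown in the comment (parser and parameter names) is an assumption to verify against the API reference:

```python
from pydantic import BaseModel

class SlideFacts(BaseModel):
    """Fields the vision LLM should fill in for each slide."""
    title: str
    key_points: list[str]

# Hypothetical wiring; the parameter name is an assumption:
# parser = SlideParser(llm=vision_llm, detail_parse_schema=SlideFacts)

facts = SlideFacts(title="Q3 results", key_points=["revenue up", "costs flat"])
print(facts.title)  # Q3 results
```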