Parsers
Parsers play a crucial role in the Retrieval-Augmented Generation (RAG) pipeline by transforming raw, unstructured data into structured formats that can be effectively indexed, retrieved, and processed by language models. In a RAG system, data often comes from diverse sources such as documents, web pages, APIs, and databases, each with its own structure and format. Parsers help extract relevant content, normalize it into a consistent structure, and enhance the retrieval process by making information more accessible and usable.
This article explores the different types of parsers that can be used in our pipelines, highlighting their specific functions and how they differ from one another. Understanding these parsers is key to optimizing data ingestion, improving retrieval accuracy, and ultimately enhancing the quality of generated responses.
Here is a table listing all available parsers and some details about them:
Name | Data | Description |
---|---|---|
Utf8Parser | Text | Decodes text encoded in UTF-8. |
UnstructuredParser | Text + tables | Leverages the Unstructured library to parse various document types. |
DoclingParser | PDF + tables + images | Utilizes the docling library to extract structured content from PDFs, including images. |
PypdfParser | PDF | Uses the pypdf library to extract text from PDFs with optional text cleanup. |
ImageParser | Image | Transforms images into textual descriptions and extracts structured information. |
SlideParser | Slide | Extracts information from PPTX and PDF slide decks using vision-based LLMs. |
Utf8Parser
`Utf8Parser` is a simple parser designed to decode text encoded in UTF-8. It ensures that raw byte-encoded content is converted into a readable string format for further processing in a RAG pipeline.
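For reference, a minimal YAML configuration in the same style as the examples below might look like this; it assumes `Utf8Parser` can be constructed without additional arguments.

```yaml
# Minimal sketch: assumes Utf8Parser needs no constructor arguments.
$parser: !pw.xpacks.llm.parsers.Utf8Parser {}
```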
UnstructuredParser
`UnstructuredParser` leverages the parsing capabilities of Unstructured. It supports various document types, including PDFs, HTML, Word documents, and more, making it a robust out-of-the-box solution for most use cases. It also offers good parsing speed.
However, the open-source library has some limitations, such as weaker document and table extraction and reliance on older, less sophisticated vision transformer models. Moreover, Unstructured does not support image extraction.
Chunking modes
Many parsers include chunking functionality, allowing them to use a document's structure to split content into smaller, semantically consistent chunks.
Pathway's `UnstructuredParser` supports five chunking modes:
- `basic` - Uses Unstructured's basic chunking strategy, which splits text into chunks shorter than the specified `max_characters` length (set via the `chunking_kwargs` argument). It also supports a soft threshold for chunk length using `new_after_n_chars`.
- `by_title` - Uses Unstructured's chunk-by-title strategy, similar to basic chunking but with additional constraints to split chunks at section or page breaks, resulting in more structured chunks. Like basic chunking, it can be configured via `chunking_kwargs`.
- `elements` - Breaks down a document into homogeneous Unstructured elements such as `Title`, `NarrativeText`, `Footer`, `ListItem`, etc. Not recommended for PDFs or other complex data sources. Best suited for simple input data where individual elements need to be separated.
- `paged` - Collects all elements found on a single page into one chunk. Useful for documents where content is well-separated across pages.
- `single` - Aggregates all Unstructured elements into a single large chunk. Use this mode when applying other chunking strategies available in Pathway or when using a custom chunking approach.
Example of YAML configuration
```yaml
$parser: !pw.xpacks.llm.parsers.UnstructuredParser
  chunking_mode: "by_title"
  chunking_kwargs:
    max_characters: 3000        # hard limit on the number of characters in each chunk
    new_after_n_chars: 2000     # soft limit on the number of characters in each chunk
```
Unstructured chunking is character-based rather than token-based, meaning you do not have precise control over the maximum number of tokens each chunk will occupy in the context window.
DoclingParser
`DoclingParser` is a PDF parser that utilizes the docling library to extract structured content from PDFs. It extends docling's `DocumentConverter` with additional functionality for parsing images from PDFs using vision-enabled language models. This allows for a more comprehensive extraction of content, including tables and embedded images.
It is recommended to use this parser when extracting text, tables, and images from PDFs.
Image parsing
If `parse_images=True`, the parser detects images within the document, processes them with a multimodal LLM (such as OpenAI's GPT-4o), and embeds their descriptions in the Markdown output. If disabled, images are replaced with placeholders.
Example of YAML configuration
```yaml
$multimodal_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"

$parser: !pw.xpacks.llm.parsers.DoclingParser
  parse_images: True
  multimodal_llm: $multimodal_llm
  pdf_pipeline_options:
    do_formula_enrichment: True
    image_scale: 1.5
```
See `PdfPipelineOptions` for a reference of the available configuration options, such as OCR settings, picture classification, code OCR, and scientific formula enrichment.
PypdfParser
`PypdfParser` is a lightweight PDF parser that utilizes the pypdf library to extract text from PDF documents. It also includes an optional text cleanup feature to enhance readability by removing unnecessary line breaks and spaces.
Keep in mind that it might not be adequate for table extraction, and it does not support image extraction.
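A YAML configuration sketch in the same style as the other examples; the `apply_text_cleanup` flag shown here is an assumed name for the optional cleanup feature, so check the API reference for the exact parameter.

```yaml
# Sketch only: the cleanup flag name is an assumption, not a confirmed parameter.
$parser: !pw.xpacks.llm.parsers.PypdfParser
  apply_text_cleanup: True   # remove unnecessary line breaks and spaces from extracted text
```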
ImageParser
This parser transforms images (e.g., in `.png` or `.jpg` format) into textual descriptions generated by a multimodal LLM. It can also extract structured information from an image via a predefined schema.
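As an illustration, the sketch below follows the same pattern as the `DoclingParser` example; the `llm` parameter name and the way the multimodal model is passed are assumptions rather than a confirmed signature, and the schema-based extraction options are omitted here.

```yaml
# Sketch only: the llm parameter name is an assumption borrowed from the other examples.
$multimodal_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"

$parser: !pw.xpacks.llm.parsers.ImageParser
  llm: $multimodal_llm   # multimodal model that produces the textual description of each image
```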
SlideParser
`SlideParser` is a powerful parser designed to extract information from PowerPoint (PPTX) and PDF slide decks using vision-based LLMs.
It converts slides into images before passing them to a vision LLM, which describes the content of each slide.
As with `ImageParser`, you can also extract information specified in a Pydantic schema.
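A configuration sketch in the same spirit; here too the `llm` parameter name is an assumption carried over from the other examples, and the Pydantic schema option is left out.

```yaml
# Sketch only: the llm parameter name is an assumption.
$multimodal_llm: !pw.xpacks.llm.llms.OpenAIChat
  model: "gpt-4o-mini"

$parser: !pw.xpacks.llm.parsers.SlideParser
  llm: $multimodal_llm   # vision LLM used to describe each slide image
```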