Document Indexing

Document indexing organizes and categorizes documents to enable efficient search and retrieval. By creating an index—a structured representation of the document's content—you can quickly access information based on search queries. In the context of large language models (LLMs) like GPT, indexing enhances their ability to generate relevant responses by organizing a knowledge repository.

Connecting to Documents

Use the Pathway connector to connect to your documents:

import pathway as pw

data_sources = pw.io.fs.read(
    "./sample_docs",
    format="binary",
    with_metadata=True,
)

We use the binary format because parsers require bytes as input, and the document store will use the metadata for filtering.

Chunking

Large documents, like books, can be inefficient to process in a single prompt due to their size. Instead, break the document into smaller, more manageable chunks that are easier to search and require fewer tokens in API calls.

A simple approach might involve slicing the text every n characters. However, this can split sentences or phrases awkwardly, resulting in incomplete or distorted chunks. Additionally, token counts vary (a token might be a character, word, or punctuation), making it hard to manage consistent chunk sizes with character-based splitting.

A better method is to chunk the text by tokens, ensuring each chunk makes sense and aligns with sentence or paragraph boundaries. Token-based chunking is typically done at logical breakpoints, such as periods, commas, or newlines.

Pathway offers a TokenCountSplitter for token-based chunking. Here’s how to use it:

from pathway.xpacks.llm.splitters import TokenCountSplitter

text_splitter = TokenCountSplitter(
    min_tokens=100,
    max_tokens=500,
    encoding_name="cl100k_base"
)

This configuration creates chunks of 100–500 tokens using the cl100k_base tokenizer, compatible with OpenAI’s text-embedding-ada-002 model.

For more on token encodings, refer to OpenAI's tiktoken guide.

Embedding

Embedding transforms text into fixed-size vectors for indexing and retrieval. It is required only when using vector indices, such as approximate nearest neighbor (ANN) search. Traditional indices like BM25 (e.g., TantivyBM25) do not require embedding. Pathway provides several embedding models, including:

Example:

from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])

The default model for OpenAIEmbedder is text-embedding-ada-002.

Retriever

The retriever creates an index to find relevant documents for a query. You can use the BruteForceKnnFactory to set up the retriever:

from pathway.stdlib.indexing.nearest_neighbors import BruteForceKnnFactory

retriever_factory = BruteForceKnnFactory(
    embedder=embedder,
)

Pathway comes with several indexing engine factories:

Assembling the Document Store

With all components ready, construct the DocumentStore. This object will handle processing of documents (which include parsing, post-processing and splitting) and then building an index (retriever) out of them.

from pathway.xpacks.llm.document_store import DocumentStore

store = DocumentStore(
    docs=data_sources,
    retriever_factory=retriever_factory,
    splitter=text_splitter,
)

Preparing Queries

Save queries in a CSV file with the following columns:

  1. query: Your question
  2. k: Number of documents to retrieve
  3. metadata_filter (optional): Filter files by metadata
  4. filepath_globpattern (optional): Narrow files by path pattern

Example:

printf "query,k,metadata_filter,filepath_globpattern\n\"Who is Regina Phalange?\",3,,\n" > queries.csv

Let's connect to the CSV:

query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    # predefined schema for query table
    schema=DocumentStore.RetrieveQuerySchema
)

Retrieval

Now you can simply run retrieve_query function on your store object and see which document chunks might contain useful information for answering your query.

result = store.retrieve_query(query)

Advanced topics

Interacting with Document Store via REST Server

Pathway's REST server allows you to expose a DocumentStore as a service that can be accessed via API requests. This is useful when integrating the DocumentStore into a larger system, especially if it needs to be accessed from an external process.

from pathway.xpacks.llm.servers import DocumentStoreServer

PATHWAY_PORT = 8765
server = DocumentStoreServer(
    host="127.0.0.1",
    port=PATHWAY_PORT,
    document_store=store,
)
server.run(threaded=True, with_cache=False)

Once the server is running you can send a request to the API:

curl -X POST http://localhost:8765/v1/retrieve \
     -H "Content-Type: application/json" \
     -d '{
           "query": "Who is Regina Phalange?",
           "k": 2
         }'

Filtering files

DocumentStore allows you to narrow down search of relevant documents based on files metadata or their paths. There are two fields in query that one can use in order to facilitate this functionality (these were also mentioned above in Preparing queries subsection):

  • metadata_filter (optional): Filter files by jmespath metadata such as modified_at, owner, contains
  • filepath_globpattern (optional): Narrow files by glob path pattern

Example:

printf 'query,k,metadata_filter,filepath_globpattern\n"Who is Regina Phalange?",3,owner==`albert`,**/phoebe*\n' > queries.csv
querykmetadata_filterfilepath_globpattern
"Who is Regina Phalange?"3owner==`albert`**/phoebe*
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    schema=DocumentStore.RetrieveQuerySchema
)

result = store.retrieve_query(query)

The available metadata fields depend on the type of connector you are using. You can find the extracted metadata fields by referring to the API documentation of the connector's read function, specifically the with_metadata parameter. For example, in CSV connector if you set with_metadata=True you will have access to created_at, modified_at, owner, size, path, seen_at metadata fields that you can use for filtering.

Finding documents

You can also use inputs_query function to search your documents based only on glob pattern and metadata without involving retrieval. You only need to provide a Pathway table with only two columns (metadata_filter and filepath_globpattern). It should follow the DocumentStore.InputsQuerySchema.