Document Indexing

Document indexing organizes and categorizes documents to enable efficient search and retrieval. By creating an index—a structured representation of the document's content—you can quickly access information based on search queries. In the context of large language models (LLMs) like GPT, indexing enhances their ability to generate relevant responses by organizing a knowledge repository.

Indexing Methods

Document indexing can be categorized into two primary methods:

  • Vector-based Indexing: Uses embeddings to represent documents as numerical vectors for similarity search.
  • Non-Vector Indexing: Relies on traditional text-based retrieval methods that do not require embeddings.

Embedding

Embedding transforms text into fixed-size vectors for indexing and retrieval. It is required only when using vector indices, such as approximate nearest neighbor (ANN) search. Pathway provides several embedding models, including:

More information can be find on the Embedders page

Non-Vector Indexing

Non-vector indexing is based on traditional text search methods, such as BM25, which do not require embeddings. This approach is well-suited for exact keyword matching and full-text search. Pathway supports the following non-vector indexing method: TantivyBM25Factory.

Retrievers

Retrievers are responsible for creating and managing indices to locate relevant documents efficiently. Pathway provides several retrieval methods:

Here is an example on how to build BruteForceKnnFactory that is a key component to DocumentStore:

from pathway.stdlib.indexing.nearest_neighbors import BruteForceKnnFactory
from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
retriever_factory = BruteForceKnnFactory(
    embedder=embedder,
)

How to query retriever?

To interact with the index and retrieve relevant documents, we need to create a DocumentStore. This object will handle processing of documents (which include parsing, post-processing and splitting) and then building an index (retriever) out of them. The DocumentStore acts as the interface to query the index, allowing for document retrieval using the selected retriever.

Here is some minimal example:

from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw

data_sources = pw.io.fs.read(
    "./sample_docs",
    format="binary",
    with_metadata=True,
)
text_splitter = TokenCountSplitter()
store = DocumentStore(
    docs=data_sources,
    retriever_factory=retriever_factory,
    splitter=text_splitter,
)

As you can see, in order to build DocumentStore object you need to prepare splitter and define data source. You can read more about splitters here.

Preparing Queries

Save queries in a CSV file with the following columns:

  1. query: Your question
  2. k: Number of documents to retrieve
  3. metadata_filter (optional): Filter files by metadata
  4. filepath_globpattern (optional): Narrow files by path pattern

Example:

printf "query,k,metadata_filter,filepath_globpattern\n\"Who is Regina Phalange?\",3,,\n" > queries.csv

Let's connect to the CSV:

query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    # predefined schema for query table
    schema=DocumentStore.RetrieveQuerySchema
)

Retrieval

Now you can simply run retrieve_query function on your store object and see which document chunks might contain useful information for answering your query.

result = store.retrieve_query(query)

Interacting with Document Store via REST Server

Pathway's REST server allows you to expose a DocumentStore as a service that can be accessed via API requests. This is useful when integrating the DocumentStore into a larger system, especially if it needs to be accessed from an external process.

from pathway.xpacks.llm.servers import DocumentStoreServer

PATHWAY_PORT = 8765
server = DocumentStoreServer(
    host="127.0.0.1",
    port=PATHWAY_PORT,
    document_store=store,
)
server.run(threaded=True, with_cache=False)

Once the server is running you can send a request to the API:

curl -X POST http://localhost:8765/v1/retrieve \
     -H "Content-Type: application/json" \
     -d '{
           "query": "Who is Regina Phalange?",
           "k": 2
         }'

Filtering files

DocumentStore allows you to narrow down search of relevant documents based on files metadata or their paths. There are two fields in query that one can use in order to facilitate this functionality (these were also mentioned above in Preparing queries subsection):

  • metadata_filter (optional): Filter files by jmespath metadata such as modified_at, owner, contains
  • filepath_globpattern (optional): Narrow files by glob path pattern

Example:

printf 'query,k,metadata_filter,filepath_globpattern\n"Who is Regina Phalange?",3,owner==`albert`,**/phoebe*\n' > queries.csv
querykmetadata_filterfilepath_globpattern
"Who is Regina Phalange?"3owner==`albert`**/phoebe*
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    schema=DocumentStore.RetrieveQuerySchema
)

result = store.retrieve_query(query)

The available metadata fields depend on the type of connector you are using. You can find the extracted metadata fields by referring to the API documentation of the connector's read function, specifically the with_metadata parameter. For example, in CSV connector if you set with_metadata=True you will have access to created_at, modified_at, owner, size, path, seen_at metadata fields that you can use for filtering.

Finding documents

You can also use inputs_query function to search your documents based only on glob pattern and metadata without involving retrieval. You only need to provide a Pathway table with only two columns (metadata_filter and filepath_globpattern). It should follow the DocumentStore.InputsQuerySchema.