# Document Indexing
Document indexing organizes and categorizes documents to enable efficient search and retrieval. By creating an index—a structured representation of the document's content—you can quickly access information based on search queries. In the context of large language models (LLMs) like GPT, indexing enhances their ability to generate relevant responses by organizing a knowledge repository.
## Connecting to Documents
Use the Pathway connector to read your documents:

```python
import pathway as pw

data_sources = pw.io.fs.read(
    "./sample_docs",
    format="binary",
    with_metadata=True,
)
```
We use the binary format because parsers require bytes as input, and the document store will use the metadata for filtering.
## Chunking
Large documents, like books, can be inefficient to process in a single prompt due to their size. Instead, break the document into smaller, more manageable chunks that are easier to search and require fewer tokens in API calls.
A simple approach might involve slicing the text every n characters. However, this can split sentences or phrases awkwardly, resulting in incomplete or distorted chunks. Additionally, token counts vary (a token might be a character, word, or punctuation), making it hard to manage consistent chunk sizes with character-based splitting.
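To make the failure mode concrete, here is a toy sketch (plain Python, not part of Pathway) of fixed-width slicing; the boundaries fall wherever the character count dictates, often mid-word:

```python
text = "Document indexing organizes and categorizes documents."
n = 20  # fixed chunk width in characters

# Slice every n characters, ignoring word and sentence boundaries.
chunks = [text[i : i + n] for i in range(0, len(text), n)]
print(chunks)
# ['Document indexing or', 'ganizes and categori', 'zes documents.']
```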
A better method is to chunk the text by tokens, ensuring each chunk makes sense and aligns with sentence or paragraph boundaries. Token-based chunking is typically done at logical breakpoints, such as periods, commas, or newlines.
Pathway offers a `TokenCountSplitter` for token-based chunking. Here's how to use it:
```python
from pathway.xpacks.llm.splitters import TokenCountSplitter

text_splitter = TokenCountSplitter(
    min_tokens=100,
    max_tokens=500,
    encoding_name="cl100k_base",
)
```
This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's `text-embedding-ada-002` model.
For more on token encodings, refer to OpenAI's tiktoken guide.
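If you want to see how `cl100k_base` tokenizes your own text, you can count tokens directly with the `tiktoken` package (a small sketch; assumes `tiktoken` is installed):

```python
import tiktoken

# Count tokens the same way the splitter's cl100k_base encoding does.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Chunking keeps API calls small and searchable.")
print(len(tokens))
```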
## Embedding
Embedding transforms text into fixed-size vectors for indexing and retrieval. It is required only when using vector indices, such as approximate nearest neighbor (ANN) search; traditional lexical indices like BM25 (e.g., `TantivyBM25`) do not require embedding. Pathway provides several embedder wrappers, including the `OpenAIEmbedder` used below.
Example:
```python
import os

from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
```
The default model for `OpenAIEmbedder` is `text-embedding-ada-002`.
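Under the hood, the embedder calls OpenAI's embeddings endpoint, and each text comes back as a fixed-size vector. As a sketch of the same operation using the `openai` client directly (assuming `openai>=1.0` is installed and `OPENAI_API_KEY` is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Document indexing organizes and categorizes documents.",
)
vector = response.data[0].embedding
print(len(vector))  # text-embedding-ada-002 returns 1536-dimensional vectors
```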
## Retriever
The retriever creates an index to find relevant documents for a query. You can use the `BruteForceKnnFactory` to set up the retriever:
```python
from pathway.stdlib.indexing.nearest_neighbors import BruteForceKnnFactory

retriever_factory = BruteForceKnnFactory(
    embedder=embedder,
)
```
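Conceptually, a brute-force kNN index scores the query vector against every stored chunk vector and keeps the k best matches. Here is a NumPy sketch of that idea (an illustration only, not Pathway's actual implementation):

```python
import numpy as np

def brute_force_knn(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> np.ndarray:
    # Dot product on unit-norm vectors = cosine similarity; keep the top k.
    scores = doc_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]

# Five random unit vectors standing in for embedded chunks.
doc_vecs = np.random.default_rng(0).normal(size=(5, 8))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
print(brute_force_knn(doc_vecs[0], doc_vecs, k=2))  # chunk 0 matches itself first
```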
Pathway comes with several other indexing engine factories as well; the BM25-based `TantivyBM25` mentioned above is one example.
## Assembling the Document Store
With all components ready, construct the `DocumentStore`. This object handles the processing of documents (parsing, post-processing, and splitting) and then builds an index (the retriever) out of them.
```python
from pathway.xpacks.llm.document_store import DocumentStore

store = DocumentStore(
    docs=data_sources,
    retriever_factory=retriever_factory,
    splitter=text_splitter,
)
```
## Preparing Queries
Save queries in a CSV file with the following columns:

- `query`: your question
- `k`: the number of documents to retrieve
- `metadata_filter` (optional): filter files by metadata
- `filepath_globpattern` (optional): narrow files by path pattern
Example:
printf "query,k,metadata_filter,filepath_globpattern\n\"Who is Regina Phalange?\",3,,\n" > queries.csv
Let's connect to the CSV:
```python
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    # predefined schema for the query table
    schema=DocumentStore.RetrieveQuerySchema,
)
```
## Retrieval
Now you can simply run the `retrieve_query` function on your store object and see which document chunks might contain useful information for answering your query.

```python
result = store.retrieve_query(query)
```
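Keep in mind that Pathway builds its dataflow lazily, so nothing is computed yet at this point. During development you can materialize and inspect the result with Pathway's debug helper (suitable for small, static inputs):

```python
# Materialize the lazy dataflow and print the retrieved chunks.
pw.debug.compute_and_print(result)
```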
## Advanced topics
### Interacting with the Document Store via a REST Server
Pathway's REST server allows you to expose a `DocumentStore` as a service that can be accessed via API requests. This is useful when integrating the `DocumentStore` into a larger system, especially if it needs to be accessed from an external process.
```python
from pathway.xpacks.llm.servers import DocumentStoreServer

PATHWAY_PORT = 8765

server = DocumentStoreServer(
    host="127.0.0.1",
    port=PATHWAY_PORT,
    document_store=store,
)
server.run(threaded=True, with_cache=False)
```
Once the server is running, you can send a request to the API:

```bash
curl -X POST http://localhost:8765/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who is Regina Phalange?",
    "k": 2
  }'
```
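The same request can be sent from Python with the `requests` package (mirroring the curl call above):

```python
import requests

response = requests.post(
    "http://localhost:8765/v1/retrieve",
    json={"query": "Who is Regina Phalange?", "k": 2},
)
print(response.json())
```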
### Filtering files
`DocumentStore` allows you to narrow down the search for relevant documents based on file metadata or file paths. Two query fields enable this (both were also mentioned in the Preparing Queries subsection above):

- `metadata_filter` (optional): filter files with a JMESPath expression over metadata fields such as `modified_at`, `owner`, `contains`
- `filepath_globpattern` (optional): narrow files by a glob path pattern
Example:
```bash
printf 'query,k,metadata_filter,filepath_globpattern\n"Who is Regina Phalange?",3,owner==`albert`,**/phoebe*\n' > queries.csv
```
| query | k | metadata_filter | filepath_globpattern |
|---|---|---|---|
| "Who is Regina Phalange?" | 3 | owner==`albert` | **/phoebe* |
```python
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    schema=DocumentStore.RetrieveQuerySchema,
)
result = store.retrieve_query(query)
```
The available metadata fields depend on the type of connector you are using. You can find the extracted metadata fields in the API documentation of the connector's `read` function, specifically the `with_metadata` parameter.
For example, if you set `with_metadata=True` in the CSV connector, you will have access to the `created_at`, `modified_at`, `owner`, `size`, `path`, and `seen_at` metadata fields, which you can use for filtering.
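If you want to sanity-check a filter expression before putting it in a query, you can evaluate it against a sample metadata record with the `jmespath` package (the record below is made up; `'albert'` is the standard JMESPath string literal, equivalent to the backtick literal used in the CSV above):

```python
import jmespath

# Hypothetical metadata record for a single indexed file.
metadata = {"owner": "albert", "path": "./sample_docs/phoebe_notes.txt", "size": 1024}

print(jmespath.search("owner == 'albert'", metadata))  # True
```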
### Finding documents
You can also use the `inputs_query` function to search your documents based only on glob patterns and metadata, without involving retrieval. You only need to provide a Pathway table with two columns (`metadata_filter` and `filepath_globpattern`), following `DocumentStore.InputsQuerySchema`.
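Following the same CSV pattern as the retrieval queries, a minimal sketch might look like this (the `inputs_queries.csv` name and its contents are just an example):

```python
# Hypothetical CSV with the two required columns, e.g. created with:
#   printf 'metadata_filter,filepath_globpattern\n,**/phoebe*\n' > inputs_queries.csv
inputs = pw.io.fs.read(
    "inputs_queries.csv",
    format="csv",
    schema=DocumentStore.InputsQuerySchema,
)
inputs_result = store.inputs_query(inputs)
```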