Document Indexing
Document indexing organizes and categorizes documents to enable efficient search and retrieval. By creating an index—a structured representation of the document's content—you can quickly access information based on search queries. In the context of large language models (LLMs) like GPT, indexing enhances their ability to generate relevant responses by organizing a knowledge repository.
Indexing Methods
Document indexing can be categorized into two primary methods:
- Vector-based Indexing: Uses embeddings to represent documents as numerical vectors for similarity search.
- Non-Vector Indexing: Relies on traditional text-based retrieval methods that do not require embeddings.
Embedding
Embedding transforms text into fixed-size vectors for indexing and retrieval. It is required only when using vector indices, such as approximate nearest neighbor (ANN) search. Pathway provides several embedding models; more information can be found on the Embedders page.
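For illustration, here is a minimal sketch of applying an embedder to a text column. It assumes embedders act as Pathway UDFs that can be called on column expressions, and uses pw.debug.table_from_markdown only to build a toy table:
import os

import pathway as pw
from pathway.xpacks.llm.embedders import OpenAIEmbedder

# A toy table with a single text column.
texts = pw.debug.table_from_markdown(
    """
    txt
    hello
    world
    """
)

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])

# Embedders are applied column-wise like any other Pathway UDF.
vectors = texts.select(embedding=embedder(pw.this.txt))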
Non-Vector Indexing
Non-vector indexing is based on traditional text search methods, such as BM25, which do not require embeddings. This approach is well-suited for exact keyword matching and full-text search. Pathway supports the following non-vector indexing method: TantivyBM25Factory.
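A minimal sketch of building it (assuming the factory lives in pathway.stdlib.indexing.bm25, mirroring the nearest-neighbor import used later on this page):
from pathway.stdlib.indexing.bm25 import TantivyBM25Factory

# BM25 needs no embedder; the factory can be handed to a DocumentStore as-is.
retriever_factory = TantivyBM25Factory()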
Retrievers
Retrievers are responsible for creating and managing indices to locate relevant documents efficiently. Pathway provides several retrieval methods:
- Vector-Based Retrieval: embedding-based similarity search, e.g. BruteForceKnnFactory.
- Non-Vector Retrieval: text-based search, e.g. TantivyBM25Factory.
- Hybrid Retrieval: combines vector and non-vector retrievers and merges their results (see the sketch after this list).
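As a sketch of the hybrid variant, several retriever factories can be combined into one. Treat HybridIndexFactory and its retriever_factories parameter as assumptions about Pathway's stdlib indexing module rather than confirmed API:
import os

from pathway.stdlib.indexing import HybridIndexFactory  # assumed export
from pathway.stdlib.indexing.bm25 import TantivyBM25Factory
from pathway.stdlib.indexing.nearest_neighbors import BruteForceKnnFactory
from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])

# Combine a vector index and a BM25 index into a single retriever.
hybrid_factory = HybridIndexFactory(
    retriever_factories=[
        BruteForceKnnFactory(embedder=embedder),
        TantivyBM25Factory(),
    ],
)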
Here is an example of how to build a BruteForceKnnFactory, which is a key component of the DocumentStore:
import os

from pathway.stdlib.indexing.nearest_neighbors import BruteForceKnnFactory
from pathway.xpacks.llm.embedders import OpenAIEmbedder

# The embedder turns document chunks and queries into vectors.
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
retriever_factory = BruteForceKnnFactory(
    embedder=embedder,
)
How to Query the Retriever?
To interact with the index and retrieve relevant documents, you need to create a DocumentStore.
This object handles the processing of documents (parsing, post-processing, and splitting) and then builds an index (retriever) out of them.
The DocumentStore acts as the interface to query the index, allowing for document retrieval using the selected retriever.
Here is a minimal example:
import pathway as pw
from pathway.xpacks.llm.document_store import DocumentStore
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Read raw files together with their metadata.
data_sources = pw.io.fs.read(
    "./sample_docs",
    format="binary",
    with_metadata=True,
)

text_splitter = TokenCountSplitter()

store = DocumentStore(
    docs=data_sources,
    retriever_factory=retriever_factory,
    splitter=text_splitter,
)
As you can see, in order to build a DocumentStore object you need to prepare a splitter and define a data source. You can read more about splitters here.
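For instance, TokenCountSplitter accepts token bounds for the produced chunks; the concrete values here are illustrative:
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Produce chunks between 100 and 500 tokens long.
text_splitter = TokenCountSplitter(min_tokens=100, max_tokens=500)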
Preparing Queries
Save queries in a CSV file with the following columns:
- query: your question
- k: number of documents to retrieve
- metadata_filter (optional): filter files by metadata
- filepath_globpattern (optional): narrow files by path pattern
Example:
printf "query,k,metadata_filter,filepath_globpattern\n\"Who is Regina Phalange?\",3,,\n" > queries.csv
Let's connect to the CSV:
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    # predefined schema for the query table
    schema=DocumentStore.RetrieveQuerySchema,
)
Retrieval
Now you can simply run the retrieve_query function on your store object and see which document chunks might contain useful information for answering your query.
result = store.retrieve_query(query)
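Keep in mind that Pathway builds its dataflow lazily, so nothing is computed until the pipeline is started. A minimal sketch of materializing the results (the output file name is illustrative):
# Write the retrieved chunks to a JSON Lines file and start the computation.
pw.io.jsonlines.write(result, "retrieval_results.jsonl")
pw.run()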
Interacting with Document Store via REST Server
Pathway's REST server allows you to expose a DocumentStore as a service that can be accessed via API requests. This is useful when integrating the DocumentStore into a larger system, especially if it needs to be accessed from an external process.
from pathway.xpacks.llm.servers import DocumentStoreServer

PATHWAY_PORT = 8765

server = DocumentStoreServer(
    host="127.0.0.1",
    port=PATHWAY_PORT,
    document_store=store,
)
# threaded=True keeps the server from blocking the main thread.
server.run(threaded=True, with_cache=False)
Once the server is running, you can send a request to the API:
curl -X POST http://localhost:8765/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who is Regina Phalange?",
    "k": 2
  }'
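Equivalently, you can query the endpoint from Python; this sketch uses the third-party requests library:
import requests

response = requests.post(
    "http://localhost:8765/v1/retrieve",
    json={"query": "Who is Regina Phalange?", "k": 2},
)
print(response.json())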
Filtering Files
The DocumentStore allows you to narrow down the search for relevant documents based on file metadata or file paths.
There are two fields in the query that you can use for this (both were mentioned above in the Preparing Queries subsection):
- metadata_filter (optional): filter files by JMESPath metadata expressions, e.g. using modified_at, owner, contains
- filepath_globpattern (optional): narrow files by a glob path pattern
Example:
printf 'query,k,metadata_filter,filepath_globpattern\n"Who is Regina Phalange?",3,owner==`albert`,**/phoebe*\n' > queries.csv
| query | k | metadata_filter | filepath_globpattern |
|---|---|---|---|
| "Who is Regina Phalange?" | 3 | owner==`albert` | **/phoebe* |
query = pw.io.fs.read(
    "queries.csv",
    format="csv",
    schema=DocumentStore.RetrieveQuerySchema,
)

result = store.retrieve_query(query)
The available metadata fields depend on the type of connector you are using. You can find the extracted metadata fields by referring to the API documentation of the connector's read function, specifically the with_metadata parameter.
For example, in the CSV connector, if you set with_metadata=True, you will have access to the created_at, modified_at, owner, size, path, and seen_at metadata fields, which you can use for filtering.
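For example, several conditions can be combined in a single JMESPath expression; the concrete values below are illustrative:
# Restrict to files owned by "albert" that were modified after a given Unix timestamp.
metadata_filter = "owner == `albert` && modified_at >= `1702000000`"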
Finding Documents
You can also use the inputs_query function to search your documents based only on a glob pattern and metadata, without involving retrieval. You only need to provide a Pathway table with only two columns (metadata_filter and filepath_globpattern). It should follow DocumentStore.InputsQuerySchema.
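A minimal sketch, reusing the CSV connector pattern from above (the file name is illustrative):
input_queries = pw.io.fs.read(
    "input_queries.csv",
    format="csv",
    schema=DocumentStore.InputsQuerySchema,
)

# Lists matching source files and their metadata, without running retrieval.
found_docs = store.inputs_query(input_queries)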