Realtime Document Indexing with Pathway

This is a basic service for a real-time document indexing pipeline powered by Pathway.
The capabilities of the service include:
- Real-time document indexing from Microsoft 365 SharePoint
- Real-time document indexing from Google Drive
- Similarity search by user query
- Filtering by metadata using a condition written in JMESPath syntax
- Basic stats on the indexer's health
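
As a rough illustration of how a similarity search with a metadata condition might be expressed, the sketch below builds a JSON request body in Python. The field names `query`, `k`, and `metadata_filter`, as well as the metadata key `path`, are assumptions made for this example and are not confirmed by this document; check the service's actual API before relying on them. The filter string itself is ordinary JMESPath.

```python
import json
from typing import Optional


def build_search_request(query: str, k: int = 3,
                         metadata_filter: Optional[str] = None) -> str:
    """Serialize a similarity-search request body.

    All field names here are hypothetical; adapt them to the real API.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")

    payload = {"query": query, "k": k}
    # Only include the filter when the caller supplies a JMESPath condition.
    if metadata_filter is not None:
        payload["metadata_filter"] = metadata_filter
    return json.dumps(payload)


# Example: restrict results to documents whose (assumed) `path` metadata
# field contains the substring "reports".
body = build_search_request(
    "quarterly revenue summary",
    k=5,
    metadata_filter="contains(path, 'reports')",
)
print(body)
```

The helper only prepares the request body; sending it (for example with an HTTP client) is left out because the endpoint address is not given here.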
Supported document formats include plaintext, PDF, DOCX, and HTML. For the complete list, refer to the formats supported by the unstructured library. The pipeline also handles removals: when you delete a file, similarity search results stop reflecting its contents within a few seconds.
Please also keep in mind the following constraints and limitations:
- The maximum supported file size is 4 MB, and at most 100 KB of plaintext may be produced by parsing. Anything larger is ignored by the indexer
- Files in the shared spaces are removed within 15 minutes of being added
- You hold responsibility for the contents of the files you upload
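
To avoid uploading files that the indexer would silently skip, a client could check the size limits above before uploading. The following is a minimal sketch, assuming that MB and KB are binary units (1024-based) and that the plaintext size is measured in UTF-8 bytes; neither assumption is stated in this document, and the function name is hypothetical.

```python
# Limits taken from the constraints list; binary units are an assumption.
MAX_FILE_BYTES = 4 * 1024 * 1024      # 4 MB raw file size
MAX_PLAINTEXT_BYTES = 100 * 1024      # 100 KB of parsed plaintext


def within_index_limits(file_bytes: int, plaintext: str) -> bool:
    """Return True if both the raw file and its parsed text fit the limits."""
    if file_bytes < 0:
        raise ValueError("file_bytes must be non-negative")
    # Measure the parsed text in UTF-8 bytes (assumed encoding).
    plaintext_bytes = len(plaintext.encode("utf-8"))
    return file_bytes <= MAX_FILE_BYTES and plaintext_bytes <= MAX_PLAINTEXT_BYTES


# A 1 MB file that parses to a short text passes the check.
print(within_index_limits(1024 * 1024, "short document text"))
# A 5 MB file is rejected regardless of its parsed text.
print(within_index_limits(5 * 1024 * 1024, "short document text"))
```

The parsed plaintext size is only known after parsing, so in practice the second limit can be checked up front only if the client runs a comparable parser locally.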