RAG pipeline with Pathway + Unstructured: getting answers based on PDFs
This example implements a RAG pipeline, similar to the contextful pipeline. However, it uses the Unstructured library to parse documents, e.g. PDFs, which are then split into smaller chunks.
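The chunk-splitting step can be sketched in plain Python. This is an illustrative sketch only, not the pipeline's actual splitter; the real pipeline operates on the elements produced by Unstructured's parsers:

```python
def chunk_text(text: str, max_len: int = 500, overlap: int = 50) -> list[str]:
    """Split parsed document text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_len])
        if start + max_len >= len(text):
            break
        # Step forward, keeping `overlap` characters shared between chunks
        # so context is not lost at chunk boundaries.
        start += max_len - overlap
    return chunks

# A 1200-character document yields three overlapping chunks.
doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```

Overlapping chunks help the retriever find passages whose answer spans a chunk boundary.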
How to run the project
Set up the environment:
Set your environment variables in the .env file placed in this directory.
OPENAI_API_KEY=sk-...
PATHWAY_DATA_DIR= # If unset, defaults to ./data/. If you change this variable when running with Docker, you may also need to update the volume mount.
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching.
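A sketch of how the application might read these variables at startup. The variable names and defaults come from the .env comments above; `load_config` itself is a hypothetical helper, not part of the example's code:

```python
import os

def load_config() -> dict:
    # Hypothetical helper: read the variables described in .env above.
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY"),
        # Defaults to ./data/ when PATHWAY_DATA_DIR is unset.
        "data_dir": os.environ.get("PATHWAY_DATA_DIR", "./data/"),
        # Caching is enabled only when PATHWAY_PERSISTENT_STORAGE is set.
        "cache_dir": os.environ.get("PATHWAY_PERSISTENT_STORAGE"),
    }

os.environ.pop("PATHWAY_DATA_DIR", None)  # demonstrate the default
config = load_config()
print(config["data_dir"])  # ./data/
```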
Run with Docker
To run the pipeline together with a simple UI, execute:
docker compose up --build
The UI then runs at http://0.0.0.0:8501 by default; open this URL in your web browser to access it.
The docker-compose.yml file declares a volume bind mount that makes changes to files under data/ made on your host computer visible inside the Docker container. The files in data/ are indexed by the pipeline - you can paste new files there and they will impact the computations.
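The bind mount in docker-compose.yml looks roughly like this (a sketch; the actual service name and container path may differ from `app` and `/app/data`):

```yaml
services:
  app:
    volumes:
      # Host ./data is mirrored inside the container, so new files
      # dropped there are picked up by the pipeline without a restart.
      - ./data:/app/data
```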
Run manually
Alternatively, you can run each service separately.
Make sure you have installed the poetry dependencies with the unstructured extra:
poetry install --with examples --extras unstructured
Then run:
poetry run python app.py
If you manage all dependencies manually rather than with poetry, you can instead use:
python app.py
To start the Streamlit UI, run:
streamlit run ui/server.py --server.port 8501 --server.address 0.0.0.0
Querying the pipeline
To query the pipeline, you can call the REST API:
curl --data '{
"user": "user",
"query": "What are the trends of coal imports?"
}' http://localhost:8080/ | jq
or access the Streamlit UI at http://0.0.0.0:8501.
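The same query can also be sent from Python. This sketch uses only the standard library and assumes the pipeline is listening on localhost:8080, as in the curl call above:

```python
import json
from urllib import error, request

# Same payload as the curl example above.
payload = {"user": "user", "query": "What are the trends of coal imports?"}

req = request.Request(
    "http://localhost:8080/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode("utf-8"))
except error.URLError as exc:
    # The pipeline is not running; start it as described above.
    print(f"Request failed: {exc}")
```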