RAG pipeline with Pathway + Unstructured: getting answers based on PDFs
This example implements a RAG pipeline, similar to the contextful pipeline. However, it uses the Unstructured library to parse documents, e.g. PDFs, which are then split into smaller chunks.
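The chunk-splitting step can be sketched in plain Python. This is an illustrative sketch only, not the pipeline's actual splitter; the real pipeline operates on the elements produced by Unstructured's parsers:

```python
def chunk_text(text: str, max_len: int = 500, overlap: int = 50) -> list[str]:
    """Split parsed document text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_len])
        if start + max_len >= len(text):
            break
        # Step forward, keeping `overlap` characters shared between chunks
        # so context is not lost at chunk boundaries.
        start += max_len - overlap
    return chunks

# A 1200-character document yields three overlapping chunks.
doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```

Overlapping chunks help the retriever find passages whose answer spans a chunk boundary.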
How to run the project
Set up the environment:
Set your environment variables in the .env file placed in this directory.
OPENAI_API_KEY=sk-...
PATHWAY_DATA_DIR= # If unset, defaults to ./data/. If you change this variable when running with Docker, you may also need to update the volume mount.
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching.
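A sketch of how the application might read these variables at startup. The variable names and defaults come from the .env comments above; `load_config` itself is a hypothetical helper, not part of the example's code:

```python
import os

def load_config() -> dict:
    # Hypothetical helper: read the variables described in .env above.
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY"),
        # Defaults to ./data/ when PATHWAY_DATA_DIR is unset.
        "data_dir": os.environ.get("PATHWAY_DATA_DIR", "./data/"),
        # Caching is enabled only when PATHWAY_PERSISTENT_STORAGE is set.
        "cache_dir": os.environ.get("PATHWAY_PERSISTENT_STORAGE"),
    }

os.environ.pop("PATHWAY_DATA_DIR", None)  # demonstrate the default
config = load_config()
print(config["data_dir"])  # ./data/
```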
Run with Docker
To run the pipeline together with a simple UI, execute:
docker compose up --build
The UI then runs at http://0.0.0.0:8501 by default; open this URL in your web browser to access it.
The docker-compose.yml file declares a volume bind mount that makes changes to files under data/ made on your host computer visible inside the Docker container. The files in data/ are indexed by the pipeline - you can paste new files there and they will impact the computations.
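The bind mount in docker-compose.yml looks roughly like this (a sketch; the actual service name and container path may differ from `app` and `/app/data`):

```yaml
services:
  app:
    volumes:
      # Host ./data is mirrored inside the container, so new files
      # dropped there are picked up by the pipeline without a restart.
      - ./data:/app/data
```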
Run manually
Alternatively, you can run each service separately.
Make sure you have installed the poetry dependencies with the unstructured extra:
poetry install --with examples --extras unstructured
Then run:
poetry run python app.py
If you manage all dependencies manually rather than with poetry, you can instead use:
python app.py
To start the Streamlit UI, run:
streamlit run ui/server.py --server.port 8501 --server.address 0.0.0.0
Querying the pipeline
To query the pipeline, you can call the REST API:
curl --data '{
"user": "user",
"query": "What are the trends of coal imports?"
}' http://localhost:8080/ | jq
or access the Streamlit UI at http://0.0.0.0:8501.
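The same query can also be sent from Python. This sketch uses only the standard library and assumes the pipeline is listening on localhost:8080, as in the curl call above:

```python
import json
from urllib import error, request

# Same payload as the curl example above.
payload = {"user": "user", "query": "What are the trends of coal imports?"}

req = request.Request(
    "http://localhost:8080/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode("utf-8"))
except error.URLError as exc:
    # The pipeline is not running; start it as described above.
    print(f"Request failed: {exc}")
```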