PDF_RAG Workflow Demo

This project demonstrates a document retrieval and vectorization workflow based on Haystack, FlagEmbedding, HNSWLib, and BM25.

Directory Structure

  • RAG_workflow/parse.py: Parses and splits PDF documents, and generates the content in JSON format.
  • RAG_workflow/embedding.py: Vectorizes the document content and produces the embedding vector base.
  • RAG_workflow/HNSW_retrieve.py: Performs hybrid retrieval and recall using HNSW and BM25.

Environment Setup

Install dependencies with:

pip install -r requirements.txt

Workflow Steps

  1. PDF Parsing

    • Modify PATH_TO_YOUR_PDF_DIRECTORY in parse.py to your PDF folder path.
    • Update the output JSON path (PATH_TO_YOUR_JSON).
    • Run:
      python PDF-RAG/parse.py
      
    • The generated JSON file will be used in the next embedding step.
  2. JSON Vectorization

    • In embedding.py, update the input JSON path (PATH_TO_YOUR_JSON.json) and the output embedding file path (PATH_TO_YOUR_EMBEDDING.npy).
    • Run:
      python PDF-RAG/embedding.py
      
  3. Retrieval and Recall

    • In HNSW_retrieve.py, update the paths to the embedding file and the JSON file (PATH_TO_YOUR_JSON.json).
    • Run:
      python PDF-RAG/HNSW_retrieve.py
      
    • The script will output the construction and retrieval times for both HNSW and BM25, along with the merged retrieval results.
    • Adjust the HNSW and BM25 parameters according to their descriptions to get the desired results.
    • In hnswlib.Index(), use space='l2' for Squared L2, 'ip' for Inner Product, and 'cosine' for Cosine Similarity.
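To make the space options and the merging step more concrete, here is a minimal pure-Python sketch. The three distance functions follow the definitions given in the hnswlib documentation (note that 'ip' and 'cosine' are returned as distances, i.e. one minus the similarity). The merging function uses reciprocal rank fusion, which is only an assumption for illustration: this README does not state how HNSW_retrieve.py combines the HNSW and BM25 results, and the function names and the constant k=60 are hypothetical.

```python
import math


def squared_l2(a, b):
    """Distance used by space='l2': sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """Distance used by space='ip': 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Distance used by space='cosine': 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Assumption: this is one common fusion scheme, not necessarily the one
    used in HNSW_retrieve.py. Each document scores 1 / (k + rank) per list
    (ranks start at 1); higher total score ranks first, ties broken by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical top-3 document ids from each retriever.
    hnsw_top = [3, 1, 2]
    bm25_top = [1, 4, 3]
    print(reciprocal_rank_fusion([hnsw_top, bm25_top]))  # [1, 3, 4, 2]
```

Documents that appear in both lists (here 1 and 3) rise to the top, which is the usual motivation for combining a dense retriever with a lexical one.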

Dependencies

  • numpy
  • scikit-learn
  • hnswlib
  • rank_bm25
  • FlagEmbedding
  • haystack
  • haystack-integrations