mirror of https://github.com/hpcaitech/ColossalAI.git synced 2025-07-27 13:30:14 +00:00

PDF_RAG Workflow Demo

This project demonstrates a document retrieval and vectorization workflow based on Haystack, FlagEmbedding, HNSWLib, and BM25.

Directory Structure

RAG_workflow/parse.py: Parses and splits PDF documents and writes the content to a JSON file.
RAG_workflow/embedding.py: Vectorizes the document content and saves the resulting embedding vectors.
RAG_workflow/HNSW_retrieve.py: Performs hybrid retrieval (recall) using HNSW and BM25.

Install dependencies with:

pip install -r requirements.txt

PDF Parsing
- In parse.py, set PATH_TO_YOUR_PDF_DIRECTORY to the path of your PDF folder.
- Set the output JSON path (PATH_TO_YOUR_JSON).
- Run:
```
python PDF-RAG/parse.py
```
- The generated JSON file will be used in the next embedding step.
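
The splitting step can be pictured with a small sketch. It is not the code in parse.py (which is built on Haystack). It assumes the PDF text has already been extracted, and the function names, the word-window sizes, and the JSON keys `source`, `chunk_id`, and `content` are all hypothetical choices made for illustration.

```python
import json
from pathlib import Path


def split_text(text: str, chunk_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `chunk_words` words; neighbours share `overlap` words."""
    if chunk_words <= 0 or not 0 <= overlap < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap < chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def write_chunks(pages: dict[str, str], json_path: str) -> int:
    """Write one JSON record per chunk and return the number of records.

    `pages` maps a source file name to its extracted text (hypothetical input;
    the real script reads the PDFs itself).
    """
    records = [
        {"source": name, "chunk_id": i, "content": chunk}  # assumed record layout
        for name, text in pages.items()
        for i, chunk in enumerate(split_text(text))
    ]
    Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

Overlapping windows are one simple way to keep a sentence that straddles a chunk boundary retrievable from either side.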

JSON Vectorization
- In embedding.py, update the input JSON path (PATH_TO_YOUR_JSON.json) and the output embedding file path (PATH_TO_YOUR_EMBEDDING.npy).
- Run:
```
python PDF-RAG/embedding.py
```
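As a rough picture of this step, the sketch below loads the JSON, encodes each text, and saves the vectors as a `.npy` file. It is an illustration under assumptions, not the contents of embedding.py: the `embed_json` helper, the `content` key, and the idea of passing the encoder in as a callable are all hypothetical.

```python
import json
from typing import Callable, Sequence

import numpy as np


def embed_json(
    json_path: str,
    npy_path: str,
    encode: Callable[[Sequence[str]], object],
    text_key: str = "content",  # assumed key; adjust to the actual JSON layout
) -> np.ndarray:
    """Encode every record's text and save the result as a 2-D float32 array."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    texts = [record[text_key] for record in records]

    # `encode` is any function mapping a list of strings to one vector per string.
    # One possible choice (an assumption, not confirmed by this README) is the
    # `encode` method of a FlagEmbedding `FlagModel` instance.
    vectors = np.asarray(encode(texts), dtype=np.float32)
    if vectors.ndim != 2 or vectors.shape[0] != len(texts):
        raise ValueError("encoder must return one fixed-length vector per text")

    np.save(npy_path, vectors)  # row i corresponds to record i in the JSON file
    return vectors
```

Keeping row order identical to the JSON record order matters, because the retrieval step has to map a vector index back to its text.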

Retrieval and Recall
- In HNSW_retrieve.py, set the embedding file path (the .npy file produced by embedding.py) and the JSON path (PATH_TO_YOUR_JSON.json).
- Run:
```
python PDF-RAG/HNSW_retrieve.py
```
- The script prints the index construction time and the retrieval time for both HNSW and BM25, along with the merged retrieval results.
- Tune the HNSW and BM25 parameters, following the descriptions provided with them, until the results meet your needs.
- In hnswlib.Index(), set space='l2' for squared L2 distance, 'ip' for inner product, or 'cosine' for cosine similarity.
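
This README does not say how the HNSW and BM25 results are merged. One common option is reciprocal rank fusion (RRF), shown here purely as an illustration and not as a description of what HNSW_retrieve.py does. The function name, the constant `k=60`, and the example document ids are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) in every list that contains it
    (rank starts at 1); the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical ids: with hnswlib, the first list would come from the labels
# returned by index.knn_query(...); the second from a BM25 ranking.
hnsw_hits = [3, 1, 7]
bm25_hits = [1, 9, 3]
merged = reciprocal_rank_fusion([hnsw_hits, bm25_hits])  # [1, 3, 9, 7]
```

RRF uses only rank positions, so it avoids having to put vector distances and BM25 scores on a common scale.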