mirror of
https://github.com/csunny/DB-GPT.git
synced 2025-08-01 16:18:27 +00:00
Store connector (#84)
1. Add Milvus vector store; 2. Add vector store connector.
This commit is contained in:
commit 0ae64175ef
@@ -81,3 +81,14 @@ DENYLISTED_PLUGINS=
#*******************************************************************#

# CHAT_MESSAGES_ENABLED - Enable chat messages (Default: False)
# CHAT_MESSAGES_ENABLED=False


#*******************************************************************#
#** VECTOR STORE SETTINGS **#
#*******************************************************************#
VECTOR_STORE_TYPE=Chroma
#MILVUS_URL=127.0.0.1
#MILVUS_PORT=19530
#MILVUS_USERNAME
#MILVUS_PASSWORD
#MILVUS_SECURE=
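For orientation, a minimal sketch of how these settings can be read at startup; it assumes the app loads .env via python-dotenv (which this commit's requirements include), and the defaults shown are taken from the template above:

```python
# Minimal sketch: reading the new vector store settings from .env.
# Assumes python-dotenv; variable names and defaults follow the template above.
import os

from dotenv import load_dotenv

load_dotenv()  # read .env from the current working directory

VECTOR_STORE_TYPE = os.getenv("VECTOR_STORE_TYPE", "Chroma")
MILVUS_URL = os.getenv("MILVUS_URL", "127.0.0.1")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")

print(f"store={VECTOR_STORE_TYPE} milvus={MILVUS_URL}:{MILVUS_PORT}")
```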
19 README.md
@@ -107,6 +107,7 @@ As the knowledge base is currently the most significant user demand scenario, we
2. Custom addition of knowledge bases
3. Various usage scenarios such as constructing knowledge bases through plugin capabilities and web crawling. Users only need to organize the knowledge documents, and they can use our existing capabilities to build the knowledge base required for the large model.

### LLMs Management

In the underlying large model integration, we have designed an open interface that supports integration with various large models. At the same time, we have a very strict control and evaluation mechanism for the effectiveness of the integrated models. In terms of accuracy, the integrated models need to align with the capability of ChatGPT at a level of 85% or higher. We use higher standards to select models, hoping to save users the cumbersome testing and evaluation process.
@@ -176,6 +177,7 @@ $ python pilot/server/webserver.py

Notice: the webserver needs to connect to the llmserver, so you must edit the .env file and change MODEL_SERVER = "http://127.0.0.1:8000" to your own address. It's very important.

## Usage Instructions

We provide a user interface for Gradio, which allows you to use DB-GPT through our user interface. Additionally, we have prepared several reference articles (written in Chinese) that introduce the code and principles related to our project.
- [LLM Practical In Action Series (1) — Combined Langchain-Vicuna Application Practical](https://medium.com/@cfqcsunny/llm-practical-in-action-series-1-combined-langchain-vicuna-application-practical-701cd0413c9f)
@@ -183,6 +185,23 @@ We provide a user interface for Gradio, which allows you to use DB-GPT through o

To use multiple models, modify the LLM_MODEL parameter in the .env configuration file to switch between the models.

#### Create your own knowledge repository:

1. Place personal knowledge files or folders in the pilot/datasets directory.

2. Run the knowledge repository script in the tools directory.

```
python tools/knowledge_init.py

--vector_name : your vector store name (default: "default")
--append      : append mode; True to append, False to rebuild (default: False)
```

3. Add the knowledge repository in the interface by entering the name of your knowledge repository (if not specified, enter "default") so you can use it for Q&A based on your knowledge base.

Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer configuration is not enough, it is recommended to use text2vec-base-chinese). Therefore, ensure that you download the model and place it in the models directory.
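For example, a minimal run against the bundled datasets (a hedged illustration; both flags are optional and documented above):

```
python tools/knowledge_init.py --vector_name default
```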
## Acknowledgement

The achievements of this project are thanks to the technical community, especially the following projects:
20 README.zh.md
@@ -114,6 +114,8 @@ DB-GPT builds its large model runtime on [FastChat](https://github.com/lm-sys/FastChat)

Users only need to organize their knowledge documents, and they can use our existing capabilities to build the knowledge base required by the large model.


### LLM Management Capability

In the underlying large model integration, an open interface is designed to support integration with various large models. At the same time, we have a very strict control and review mechanism for the effectiveness of the integrated models. In terms of accuracy, the integrated models need to align with ChatGPT at a level of 85% or higher. We use higher standards to select models, hoping to save users the cumbersome testing and evaluation process.
@@ -186,6 +188,22 @@ $ python webserver.py

### Using Multiple Models
In the .env configuration file, modify the LLM_MODEL parameter to switch between models.

#### Build your own knowledge base:

1. Place personal knowledge files or folders in the pilot/datasets directory.

2. Run the knowledge ingestion script in the tools directory.

```
python tools/knowledge_init.py

--vector_name : your vector store name (default: "default")
--append      : append mode; True to append, False to rebuild (default: False)
```

3. Add a knowledge base in the interface by entering your knowledge base name (enter "default" if none was specified), and you can then ask questions against your own knowledge base.

Note that the default vector model is text2vec-large-chinese (a fairly large model; if your personal computer is not powerful enough, text2vec-base-chinese is recommended), so make sure to download the model and place it in the models directory.
## Acknowledgements

The achievements of this project are thanks to the technical community, especially the following projects.
@@ -203,12 +221,14 @@ $ python webserver.py

<!-- GITCONTRIBUTOR_START -->

## 贡献者
## Contributors

|[<img src="https://avatars.githubusercontent.com/u/17919400?v=4" width="100px;"/><br/><sub><b>csunny</b></sub>](https://github.com/csunny)<br/>|[<img src="https://avatars.githubusercontent.com/u/1011681?v=4" width="100px;"/><br/><sub><b>xudafeng</b></sub>](https://github.com/xudafeng)<br/>|[<img src="https://avatars.githubusercontent.com/u/7636723?s=96&v=4" width="100px;"/><br/><sub><b>明天</b></sub>](https://github.com/yhjun1026)<br/> | [<img src="https://avatars.githubusercontent.com/u/13723926?v=4" width="100px;"/><br/><sub><b>Aries-ckt</b></sub>](https://github.com/Aries-ckt)<br/>|[<img src="https://avatars.githubusercontent.com/u/95130644?v=4" width="100px;"/><br/><sub><b>thebigbone</b></sub>](https://github.com/thebigbone)<br/>|
| :---: | :---: | :---: | :---: | :---: |

[git-contributor spec](https://github.com/xudafeng/git-contributor), auto-generated at `Fri May 19 2023 00:24:18 GMT+0800`.
This project follows the git-contributor [spec](https://github.com/xudafeng/git-contributor), auto updated at `Sun May 14 2023 23:02:43 GMT+0800`.

<!-- GITCONTRIBUTOR_END -->
@@ -61,11 +61,12 @@ def generate(query):
        if chunk:
            data = json.loads(chunk.decode())
            if data["error_code"] == 0:

                if "vicuna" in CFG.LLM_MODEL:
                    output = data["text"][skip_echo_len:].strip()
                else:
                    output = data["text"].strip()

                state.messages[-1][-1] = output + "▌"
                yield(output)
@@ -109,6 +109,14 @@ class Config(metaclass=Singleton):
        self.MODEL_SERVER = os.getenv("MODEL_SERVER", "http://127.0.0.1" + ":" + str(self.MODEL_PORT))
        self.ISLOAD_8BIT = os.getenv("ISLOAD_8BIT", "True") == "True"

        ### Vector Store Configuration
        self.VECTOR_STORE_TYPE = os.getenv("VECTOR_STORE_TYPE", "Chroma")
        self.MILVUS_URL = os.getenv("MILVUS_URL", "127.0.0.1")
        self.MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
        self.MILVUS_USERNAME = os.getenv("MILVUS_USERNAME", None)
        self.MILVUS_PASSWORD = os.getenv("MILVUS_PASSWORD", None)

    def set_debug_mode(self, value: bool) -> None:
        """Set the debug mode value"""
        self.debug_mode = value
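A hedged usage sketch of the new fields (Config uses a Singleton metaclass per the class signature above, so every module sees the same values):

```python
# Hedged sketch: reading the new vector store settings through Config.
from pilot.configs.config import Config

cfg = Config()  # Singleton: same instance everywhere
print(cfg.VECTOR_STORE_TYPE)            # "Chroma" unless overridden in .env
print(cfg.MILVUS_URL, cfg.MILVUS_PORT)  # consulted when VECTOR_STORE_TYPE=Milvus
```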
@@ -27,8 +27,18 @@ LLM_MODEL_CONFIG = {
    "codet5p-2b": os.path.join(MODEL_PATH, "codet5p-2b"),
    "chatglm-6b-int4": os.path.join(MODEL_PATH, "chatglm-6b-int4"),
    "chatglm-6b": os.path.join(MODEL_PATH, "chatglm-6b"),
    "text2vec-base": os.path.join(MODEL_PATH, "text2vec-base-chinese"),
    "sentence-transforms": os.path.join(MODEL_PATH, "all-MiniLM-L6-v2")
}

VECTOR_SEARCH_TOP_K = 20
LLM_MODEL = "vicuna-13b"
LIMIT_MODEL_CONCURRENCY = 5
MAX_POSITION_EMBEDDINGS = 4096
# VICUNA_MODEL_SERVER = "http://121.41.227.141:8000"
VICUNA_MODEL_SERVER = "http://120.79.27.110:8000"

# Load model config
ISLOAD_8BIT = True
ISDEBUG = False
@@ -36,4 +46,5 @@ ISDEBUG = False

VECTOR_SEARCH_TOP_K = 10
VS_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "vs_store")
KNOWLEDGE_UPLOAD_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data")
KNOWLEDGE_CHUNK_SPLIT_SIZE = 100
@@ -267,12 +267,12 @@ def http_bot(state, mode, sql_mode, db_selector, temperature, max_new_tokens, re
    skip_echo_len = len(prompt.replace("</s>", " ")) + 1

    if mode == conversation_types["custome"] and not db_selector:
        persist_dir = os.path.join(KNOWLEDGE_UPLOAD_ROOT_PATH, vector_store_name["vs_name"] + ".vectordb")
        print("vector store path: ", persist_dir)
        print("vector store name: ", vector_store_name["vs_name"])
        vector_store_config = {"vector_store_name": vector_store_name["vs_name"], "text_field": "content",
                               "vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH}
        knowledge_embedding_client = KnowledgeEmbedding(file_path="", model_name=LLM_MODEL_CONFIG["text2vec"],
                                                        local_persist=False,
                                                        vector_store_config={"vector_store_name": vector_store_name["vs_name"],
                                                                             "vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH})
                                                        vector_store_config=vector_store_config)
        query = state.messages[-2][1]
        docs = knowledge_embedding_client.similar_search(query, VECTOR_SEARCH_TOP_K)
        context = [d.page_content for d in docs]
@@ -364,14 +364,14 @@ def http_bot(state, mode, sql_mode, db_selector, temperature, max_new_tokens, re
    for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode())

            """ TODO Multi mode output handler, rewrite this for multi model, use adapter mode.
            """
            if data["error_code"] == 0:

                if "vicuna" in CFG.LLM_MODEL:
                    output = data["text"][skip_echo_len:].strip()
                else:
                    output = data["text"].strip()

                output = post_process_code(output)
@@ -507,6 +507,7 @@ def build_single_model_ui():
        files = gr.File(label="添加文件",
                        file_types=[".txt", ".md", ".docx", ".pdf"],
                        file_count="multiple",
                        allow_flagged_uploads=True,
                        show_label=False
                        )
@@ -9,33 +9,17 @@ class CHNDocumentSplitter(CharacterTextSplitter):
        self.pdf = pdf
        self.sentence_size = sentence_size

    # def split_text_version2(self, text: str) -> List[str]:
    #     if self.pdf:
    #         text = re.sub(r"\n{3,}", "\n", text)
    #         text = re.sub('\s', ' ', text)
    #         text = text.replace("\n\n", "")
    #     sent_sep_pattern = re.compile('([﹒﹔﹖﹗.。!?]["’”」』]{0,2}|(?=["‘“「『]{1,2}|$))')  # del :;
    #     sent_list = []
    #     for ele in sent_sep_pattern.split(text):
    #         if sent_sep_pattern.match(ele) and sent_list:
    #             sent_list[-1] += ele
    #         elif ele:
    #             sent_list.append(ele)
    #     return sent_list

    def split_text(self, text: str) -> List[str]:
        if self.pdf:
            text = re.sub(r"\n{3,}", r"\n", text)
            text = re.sub('\s', " ", text)
            text = re.sub("\n\n", "", text)

        text = re.sub(r'([;;.!?。!?\?])([^”’])', r"\1\n\2", text)  # single-character sentence terminators
        text = re.sub(r'(\.{6})([^"’”」』])', r"\1\n\2", text)  # English ellipsis
        text = re.sub(r'(\…{2})([^"’”」』])', r"\1\n\2", text)  # Chinese ellipsis
        text = re.sub(r'([;;.!?。!?\?])([^”’])', r"\1\n\2", text)
        text = re.sub(r'(\.{6})([^"’”」』])', r"\1\n\2", text)
        text = re.sub(r'(\…{2})([^"’”」』])', r"\1\n\2", text)
        text = re.sub(r'([;;!?。!?\?]["’”」』]{0,2})([^;;!?,。!?\?])', r'\1\n\2', text)
        # if a terminator precedes a closing double quote, the quote is the true end of the
        # sentence, so the "\n" separator goes after the quote (the rules above carefully
        # preserve the quotes themselves)
        text = text.rstrip()  # drop any extra trailing "\n" at the end of the paragraph
        # many rule sets also consider the semicolon; it is ignored here, as are dashes and
        # English double quotes; simple adjustments can add them back if needed
        text = text.rstrip()
        ls = [i for i in text.split("\n") if i]
        for ele in ls:
            if len(ele) > self.sentence_size:
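To make the splitter's behavior concrete, a hedged example using the same keyword arguments the embedding classes in this diff pass:

```python
# Hedged example: CHNDocumentSplitter breaks Chinese text on sentence
# terminators (。!? etc.), then further splits chunks longer than sentence_size.
from pilot.source_embedding.chn_document_splitter import CHNDocumentSplitter

splitter = CHNDocumentSplitter(pdf=True, sentence_size=100)
for sentence in splitter.split_text("今天天气很好。我们去公园散步!好不好?"):
    print(sentence)
# -> 今天天气很好。 / 我们去公园散步! / 好不好?
```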
@@ -1,16 +1,20 @@
import os

from bs4 import BeautifulSoup
from langchain.document_loaders import PyPDFLoader, TextLoader, markdown
from langchain.document_loaders import TextLoader, markdown, PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from pilot.configs.model_config import DATASETS_DIR

from pilot.configs.config import Config
from pilot.configs.model_config import DATASETS_DIR, KNOWLEDGE_CHUNK_SPLIT_SIZE
from pilot.source_embedding.chn_document_splitter import CHNDocumentSplitter
from pilot.source_embedding.csv_embedding import CSVEmbedding
from pilot.source_embedding.markdown_embedding import MarkdownEmbedding
from pilot.source_embedding.pdf_embedding import PDFEmbedding
import markdown

from pilot.vector_store.connector import VectorStoreConnector

CFG = Config()

class KnowledgeEmbedding:
    def __init__(self, file_path, model_name, vector_store_config, local_persist=True):
@@ -18,8 +22,9 @@ class KnowledgeEmbedding:
        self.file_path = file_path
        self.model_name = model_name
        self.vector_store_config = vector_store_config
        self.vector_store_type = "default"
        self.file_type = "default"
        self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)
        self.vector_store_config["embeddings"] = self.embeddings
        self.local_persist = local_persist
        if not self.local_persist:
            self.knowledge_embedding_client = self.init_knowledge_embedding()
@@ -40,7 +45,7 @@ class KnowledgeEmbedding:
        elif self.file_path.endswith(".csv"):
            embedding = CSVEmbedding(file_path=self.file_path, model_name=self.model_name,
                                     vector_store_config=self.vector_store_config)
        elif self.vector_store_type == "default":
        elif self.file_type == "default":
            embedding = MarkdownEmbedding(file_path=self.file_path, model_name=self.model_name, vector_store_config=self.vector_store_config)

        return embedding
@@ -49,27 +54,10 @@ class KnowledgeEmbedding:
        return self.knowledge_embedding_client.similar_search(text, topk)

    def knowledge_persist_initialization(self, append_mode):
        vector_name = self.vector_store_config["vector_store_name"]
        persist_dir = os.path.join(self.vector_store_config["vector_store_path"], vector_name + ".vectordb")
        print("vector db path: ", persist_dir)
        if os.path.exists(persist_dir):
            if append_mode:
                print("append knowledge return vector store")
                new_documents = self._load_knownlege(self.file_path)
                vector_store = Chroma.from_documents(documents=new_documents,
                                                     embedding=self.embeddings,
                                                     persist_directory=persist_dir)
            else:
                print("directly return vector store")
                vector_store = Chroma(persist_directory=persist_dir, embedding_function=self.embeddings)
        else:
            print(vector_name + " is new vector store, knowledge begin load...")
            documents = self._load_knownlege(self.file_path)
            vector_store = Chroma.from_documents(documents=documents,
                                                 embedding=self.embeddings,
                                                 persist_directory=persist_dir)
            vector_store.persist()
        return vector_store
        documents = self._load_knownlege(self.file_path)
        self.vector_client = VectorStoreConnector(CFG.VECTOR_STORE_TYPE, self.vector_store_config)
        self.vector_client.load_document(documents)
        return self.vector_client

    def _load_knownlege(self, path):
        docments = []
@@ -88,7 +76,7 @@ class KnowledgeEmbedding:
    def _load_file(self, filename):
        if filename.lower().endswith(".md"):
            loader = TextLoader(filename)
            text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=100)
            text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE)
            docs = loader.load_and_split(text_splitter)
            i = 0
            for d in docs:
@@ -101,10 +89,14 @@ class KnowledgeEmbedding:
                i += 1
        elif filename.lower().endswith(".pdf"):
            loader = PyPDFLoader(filename)
            textsplitter = CHNDocumentSplitter(pdf=True, sentence_size=100)
            textsplitter = CHNDocumentSplitter(pdf=True, sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE)
            docs = loader.load_and_split(textsplitter)
            i = 0
            for d in docs:
                docs[i].page_content = d.page_content.replace("\n", " ").replace("�", "")
                i += 1
        else:
            loader = TextLoader(filename)
            text_splitor = CHNDocumentSplitter(sentence_size=100)
            text_splitor = CHNDocumentSplitter(sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE)
            docs = loader.load_and_split(text_splitor)
        return docs
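A hedged sketch of the refactored persistence path: knowledge_persist_initialization now loads the documents and hands them to a VectorStoreConnector chosen by CFG.VECTOR_STORE_TYPE, instead of talking to Chroma directly. The vector_store_name value is illustrative:

```python
# Hedged sketch: persisting the bundled datasets through the new connector path.
from pilot.configs.model_config import DATASETS_DIR, LLM_MODEL_CONFIG
from pilot.source_embedding.knowledge_embedding import KnowledgeEmbedding

kv = KnowledgeEmbedding(
    file_path=DATASETS_DIR,
    model_name=LLM_MODEL_CONFIG["text2vec"],
    vector_store_config={"vector_store_name": "default"},  # illustrative name
)
client = kv.knowledge_persist_initialization(False)  # append_mode=False
```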
@@ -7,6 +7,7 @@ from bs4 import BeautifulSoup
from langchain.document_loaders import TextLoader
from langchain.schema import Document
import markdown
from pilot.configs.model_config import KNOWLEDGE_CHUNK_SPLIT_SIZE

from pilot.source_embedding import SourceEmbedding, register
from pilot.source_embedding.chn_document_splitter import CHNDocumentSplitter
@@ -26,7 +27,7 @@ class MarkdownEmbedding(SourceEmbedding):
    def read(self):
        """Load from markdown path."""
        loader = TextLoader(self.file_path)
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=100)
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE)
        return loader.load_and_split(text_splitter)

    @register
@@ -4,6 +4,7 @@ from typing import List

from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from pilot.configs.model_config import KNOWLEDGE_CHUNK_SPLIT_SIZE

from pilot.source_embedding import SourceEmbedding, register
from pilot.source_embedding.chn_document_splitter import CHNDocumentSplitter
@@ -22,8 +23,9 @@ class PDFEmbedding(SourceEmbedding):
    @register
    def read(self):
        """Load from pdf path."""
        # loader = UnstructuredPaddlePDFLoader(self.file_path)
        loader = PyPDFLoader(self.file_path)
        textsplitter = CHNDocumentSplitter(pdf=True, sentence_size=100)
        textsplitter = CHNDocumentSplitter(pdf=True, sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE)
        return loader.load_and_split(textsplitter)

    @register
@@ -50,7 +50,7 @@
#
# # text_embeddings = Text2Vectors()
# mivuls = MilvusStore(cfg={"url": "127.0.0.1", "port": "19530", "alias": "default", "table_name": "test_k"})
#
#
# mivuls.insert(["textc", "tezt2"])
# print("success")
# ct
@@ -1,14 +1,15 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from abc import ABC, abstractmethod

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from typing import List, Optional, Dict

from pilot.configs.config import Config
from pilot.vector_store.connector import VectorStoreConnector

registered_methods = []
CFG = Config()


def register(method):
@@ -29,9 +30,9 @@ class SourceEmbedding(ABC):
        self.vector_store_config = vector_store_config
        self.embedding_args = embedding_args
        self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)
        persist_dir = os.path.join(self.vector_store_config["vector_store_path"],
                                   self.vector_store_config["vector_store_name"] + ".vectordb")
        self.vector_store_client = Chroma(persist_directory=persist_dir, embedding_function=self.embeddings)

        vector_store_config["embeddings"] = self.embeddings
        self.vector_client = VectorStoreConnector(CFG.VECTOR_STORE_TYPE, vector_store_config)

    @abstractmethod
    @register
@@ -54,16 +55,12 @@ class SourceEmbedding(ABC):
    @register
    def index_to_store(self, docs):
        """index to vector store"""
        persist_dir = os.path.join(self.vector_store_config["vector_store_path"],
                                   self.vector_store_config["vector_store_name"] + ".vectordb")
        self.vector_store = Chroma.from_documents(docs, self.embeddings, persist_directory=persist_dir)
        self.vector_store.persist()
        self.vector_client.load_document(docs)

    @register
    def similar_search(self, doc, topk):
        """vector store similarity_search"""

        return self.vector_store_client.similarity_search(doc, topk)
        return self.vector_client.similar_search(doc, topk)

    def source_embedding(self):
        if 'read' in registered_methods:
30 pilot/vector_store/chroma_store.py Normal file
@@ -0,0 +1,30 @@
import os

from langchain.vectorstores import Chroma

from pilot.configs.model_config import KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.logs import logger
from pilot.vector_store.vector_store_base import VectorStoreBase


class ChromaStore(VectorStoreBase):
    """chroma database"""

    def __init__(self, ctx: {}) -> None:
        self.ctx = ctx
        self.embeddings = ctx["embeddings"]
        self.persist_dir = os.path.join(KNOWLEDGE_UPLOAD_ROOT_PATH,
                                        ctx["vector_store_name"] + ".vectordb")
        self.vector_store_client = Chroma(persist_directory=self.persist_dir, embedding_function=self.embeddings)

    def similar_search(self, text, topk) -> None:
        logger.info("ChromaStore similar search")
        return self.vector_store_client.similarity_search(text, topk)

    def load_document(self, documents):
        logger.info("ChromaStore load document")
        texts = [doc.page_content for doc in documents]
        metadatas = [doc.metadata for doc in documents]
        self.vector_store_client.add_texts(texts=texts, metadatas=metadatas)
        self.vector_store_client.persist()
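A hedged usage sketch for the new store; the ctx keys follow __init__ above, and the embedding model path is illustrative:

```python
# Hedged sketch: ChromaStore persists under KNOWLEDGE_UPLOAD_ROOT_PATH/<name>.vectordb.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from pilot.vector_store.chroma_store import ChromaStore

ctx = {
    "vector_store_name": "default",
    "embeddings": HuggingFaceEmbeddings(model_name="models/text2vec-large-chinese"),  # illustrative path
}
store = ChromaStore(ctx)
store.load_document([Document(page_content="DB-GPT 知识库测试", metadata={"source": "demo"})])
print(store.similar_search("知识库", 1))
```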
22 pilot/vector_store/connector.py Normal file
@@ -0,0 +1,22 @@
from pilot.vector_store.chroma_store import ChromaStore
from pilot.vector_store.milvus_store import MilvusStore

connector = {
    "Chroma": ChromaStore,
    "Milvus": MilvusStore
}


class VectorStoreConnector:
    """Vector store connector: connects to different vector databases behind a
    common load-document API and similar-search API.
    """
    def __init__(self, vector_store_type, ctx: {}) -> None:
        self.ctx = ctx
        self.connector_class = connector[vector_store_type]
        self.client = self.connector_class(ctx)

    def load_document(self, docs):
        self.client.load_document(docs)

    def similar_search(self, docs, topk):
        return self.client.similar_search(docs, topk)
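And the same calls routed through the connector, so switching backends is a configuration change rather than a code change (a hedged sketch; ctx as in the ChromaStore example above):

```python
# Hedged sketch: the registry maps "Chroma"/"Milvus" to a store class;
# both expose load_document() and similar_search().
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from pilot.vector_store.connector import VectorStoreConnector

ctx = {
    "vector_store_name": "default",
    "embeddings": HuggingFaceEmbeddings(model_name="models/text2vec-large-chinese"),  # illustrative
}
client = VectorStoreConnector("Chroma", ctx)  # or "Milvus"
client.load_document([Document(page_content="connector demo", metadata={})])
print(client.similar_search("demo", 1))
```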
@@ -1,33 +1,53 @@
from typing import List, Optional, Iterable, Tuple, Any

from pymilvus import DataType, FieldSchema, CollectionSchema, connections, Collection
from pymilvus import connections, Collection, DataType

from langchain.docstore.document import Document

from pilot.configs.config import Config
from pilot.vector_store.vector_store_base import VectorStoreBase


CFG = Config()

class MilvusStore(VectorStoreBase):
    def __init__(self, cfg: {}) -> None:
        """Construct a milvus memory storage connection.
    """Milvus database"""
    def __init__(self, ctx: {}) -> None:
        """init a milvus storage connection.

        Args:
            cfg (Config): Auto-GPT global config.
            ctx ({}): MilvusStore global config.
        """
        # self.configure(cfg)

        connect_kwargs = {}
        self.uri = None
        self.uri = cfg["url"]
        self.port = cfg["port"]
        self.username = cfg.get("username", None)
        self.password = cfg.get("password", None)
        self.collection_name = cfg["table_name"]
        self.password = cfg.get("secure", None)
        self.uri = CFG.MILVUS_URL
        self.port = CFG.MILVUS_PORT
        self.username = CFG.MILVUS_USERNAME
        self.password = CFG.MILVUS_PASSWORD
        self.collection_name = ctx.get("vector_store_name", None)
        self.secure = ctx.get("secure", None)
        self.embedding = ctx.get("embeddings", None)
        self.fields = []

        # use HNSW by default.
        self.index_params = {
            "metric_type": "IP",
            "metric_type": "L2",
            "index_type": "HNSW",
            "params": {"M": 8, "efConstruction": 64},
        }
        # use HNSW by default.
        self.index_params_map = {
            "IVF_FLAT": {"params": {"nprobe": 10}},
            "IVF_SQ8": {"params": {"nprobe": 10}},
            "IVF_PQ": {"params": {"nprobe": 10}},
            "HNSW": {"params": {"ef": 10}},
            "RHNSW_FLAT": {"params": {"ef": 10}},
            "RHNSW_SQ": {"params": {"ef": 10}},
            "RHNSW_PQ": {"params": {"ef": 10}},
            "IVF_HNSW": {"params": {"nprobe": 10, "ef": 10}},
            "ANNOY": {"params": {"search_k": 10}},
        }

        self.text_field = "content"

        if (self.username is None) != (self.password is None):
            raise ValueError(
@@ -38,54 +58,274 @@ class MilvusStore(VectorStoreBase):
            connect_kwargs["password"] = self.password

        connections.connect(
            **connect_kwargs,
            host=self.uri or "127.0.0.1",
            port=self.port or "19530",
            alias="default"
            # secure=self.secure,
        )

        self.init_schema()

    def init_schema(self) -> None:
        """Initialize collection in milvus database."""
        fields = [
            FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
            FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
            FieldSchema(name="raw_text", dtype=DataType.VARCHAR, max_length=65535),
        ]

        # create collection if not exist and load it.
        self.schema = CollectionSchema(fields, "db-gpt memory storage")
        self.collection = Collection(self.collection_name, self.schema)
        self.index_params = {
            "metric_type": "IP",
            "index_type": "HNSW",
            "params": {"M": 8, "efConstruction": 64},
        }
        # create index if not exist.
        if not self.collection.has_index():
            self.collection.release()
            self.collection.create_index(
                "vector",
                self.index_params,
                index_name="vector",
    def init_schema_and_load(self, vector_name, documents):
        """Create a Milvus collection, indexes it with HNSW, load document.
        Args:
            vector_name (Embeddings): your collection name.
            documents (List[str]): Text to insert.
        Returns:
            VectorStore: The MilvusStore vector store.
        """
        try:
            from pymilvus import (
                Collection,
                CollectionSchema,
                DataType,
                FieldSchema,
                connections,
            )
        self.collection.load()
            from pymilvus.orm.types import infer_dtype_bydata
        except ImportError:
            raise ValueError(
                "Could not import pymilvus python package. "
                "Please install it with `pip install pymilvus`."
            )
        if not connections.has_connection("default"):
            connections.connect(
                host=self.uri or "127.0.0.1",
                port=self.port or "19530",
                alias="default"
                # secure=self.secure,
            )
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        embeddings = self.embedding.embed_query(texts[0])
        dim = len(embeddings)
        # Generate unique names
        primary_field = "pk_id"
        vector_field = "vector"
        text_field = "content"
        self.text_field = text_field
        collection_name = vector_name
        fields = []
        # Determine metadata schema
        # if metadatas:
        #     # Check if all metadata keys line up
        #     key = metadatas[0].keys()
        #     for x in metadatas:
        #         if key != x.keys():
        #             raise ValueError(
        #                 "Mismatched metadata. "
        #                 "Make sure all metadata has the same keys and datatype."
        #             )
        #     # Create FieldSchema for each entry in singular metadata.
        #     for key, value in metadatas[0].items():
        #         # Infer the corresponding datatype of the metadata
        #         dtype = infer_dtype_bydata(value)
        #         if dtype == DataType.UNKNOWN:
        #             raise ValueError(f"Unrecognized datatype for {key}.")
        #         elif dtype == DataType.VARCHAR:
        #             # Find out max length text based metadata
        #             max_length = 0
        #             for subvalues in metadatas:
        #                 max_length = max(max_length, len(subvalues[key]))
        #             fields.append(
        #                 FieldSchema(key, DataType.VARCHAR, max_length=max_length + 1)
        #             )
        #         else:
        #             fields.append(FieldSchema(key, dtype))

    # def add(self, data) -> str:
        # Find out max length of texts
        max_length = 0
        for y in texts:
            max_length = max(max_length, len(y))
        # Create the text field
        fields.append(
            FieldSchema(text_field, DataType.VARCHAR, max_length=max_length + 1)
        )
        # create the primary key field
        fields.append(
            FieldSchema(primary_field, DataType.INT64, is_primary=True, auto_id=True)
        )
        # create the vector field
        fields.append(FieldSchema(vector_field, DataType.FLOAT_VECTOR, dim=dim))
        # Create the schema for the collection
        schema = CollectionSchema(fields)
        # Create the collection
        collection = Collection(collection_name, schema)
        self.col = collection
        # Index parameters for the collection
        index = self.index_params
        # Create the index
        collection.create_index(vector_field, index)
        # Create the VectorStore
        # milvus = cls(
        #     embedding,
        #     kwargs.get("connection_args", {"port": 19530}),
        #     collection_name,
        #     text_field,
        # )
        # Add the texts.
        schema = collection.schema
        for x in schema.fields:
            self.fields.append(x.name)
            if x.auto_id:
                self.fields.remove(x.name)
            if x.is_primary:
                self.primary_field = x.name
            if x.dtype == DataType.FLOAT_VECTOR or x.dtype == DataType.BINARY_VECTOR:
                self.vector_field = x.name
        self._add_texts(texts, metadatas)

        return self.collection_name

    # def init_schema(self) -> None:
    #     """Initialize collection in milvus database."""
    #     fields = [
    #         FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    #         FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=self.model_config["dim"]),
    #         FieldSchema(name="raw_text", dtype=DataType.VARCHAR, max_length=65535),
    #     ]
    #
    #     # create collection if not exist and load it.
    #     self.schema = CollectionSchema(fields, "db-gpt memory storage")
    #     self.collection = Collection(self.collection_name, self.schema)
    #     self.index_params_map = {
    #         "IVF_FLAT": {"params": {"nprobe": 10}},
    #         "IVF_SQ8": {"params": {"nprobe": 10}},
    #         "IVF_PQ": {"params": {"nprobe": 10}},
    #         "HNSW": {"params": {"ef": 10}},
    #         "RHNSW_FLAT": {"params": {"ef": 10}},
    #         "RHNSW_SQ": {"params": {"ef": 10}},
    #         "RHNSW_PQ": {"params": {"ef": 10}},
    #         "IVF_HNSW": {"params": {"nprobe": 10, "ef": 10}},
    #         "ANNOY": {"params": {"search_k": 10}},
    #     }
    #
    #     self.index_params = {
    #         "metric_type": "IP",
    #         "index_type": "HNSW",
    #         "params": {"M": 8, "efConstruction": 64},
    #     }
    #     # create index if not exist.
    #     if not self.collection.has_index():
    #         self.collection.release()
    #         self.collection.create_index(
    #             "vector",
    #             self.index_params,
    #             index_name="vector",
    #         )
    #     info = self.collection.describe()
    #     self.collection.load()

    # def insert(self, text, model_config) -> str:
    #     """Add an embedding of data into milvus.
    #
    #     Args:
    #         data (str): The raw text to construct embedding index.
    #
    #         text (str): The raw text to construct embedding index.
    #     Returns:
    #         str: log.
    #     """
    #     embedding = get_ada_embedding(data)
    #     result = self.collection.insert([[embedding], [data]])
    #     # embedding = get_ada_embedding(data)
    #     embeddings = HuggingFaceEmbeddings(model_name=self.model_config["model_name"])
    #     result = self.collection.insert([embeddings.embed_documents(text), text])
    #     _text = (
    #         "Inserting data into memory at primary key: "
    #         f"{result.primary_keys[0]}:\n data: {data}"
    #         f"{result.primary_keys[0]}:\n data: {text}"
    #     )
    #     return _text

    def _add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        partition_name: Optional[str] = None,
        timeout: Optional[int] = None,
    ) -> List[str]:
        """add text data into Milvus.
        """
        insert_dict: Any = {self.text_field: list(texts)}
        try:
            insert_dict[self.vector_field] = self.embedding.embed_documents(
                list(texts)
            )
        except NotImplementedError:
            insert_dict[self.vector_field] = [
                self.embedding.embed_query(x) for x in texts
            ]
        # Collect the metadata into the insert dict.
        if len(self.fields) > 2 and metadatas is not None:
            for d in metadatas:
                for key, value in d.items():
                    if key in self.fields:
                        insert_dict.setdefault(key, []).append(value)
        # Convert dict to list of lists for insertion
        insert_list = [insert_dict[x] for x in self.fields]
        # Insert into the collection.
        res = self.col.insert(
            insert_list, partition_name=partition_name, timeout=timeout
        )
        # make sure data is searchable.
        self.col.flush()
        return res.primary_keys

    def load_document(self, documents) -> None:
        """load document in vector database."""
        self.init_schema_and_load(self.collection_name, documents)

    def similar_search(self, text, topk) -> None:
        """similar_search in vector database."""
        self.col = Collection(self.collection_name)
        schema = self.col.schema
        for x in schema.fields:
            self.fields.append(x.name)
            if x.auto_id:
                self.fields.remove(x.name)
            if x.is_primary:
                self.primary_field = x.name
            if x.dtype == DataType.FLOAT_VECTOR or x.dtype == DataType.BINARY_VECTOR:
                self.vector_field = x.name
        _, docs_and_scores = self._search(text, topk)
        return [doc for doc, _, _ in docs_and_scores]

    def _search(
        self,
        query: str,
        k: int = 4,
        param: Optional[dict] = None,
        expr: Optional[str] = None,
        partition_names: Optional[List[str]] = None,
        round_decimal: int = -1,
        timeout: Optional[int] = None,
        **kwargs: Any,
    ) -> Tuple[List[float], List[Tuple[Document, Any, Any]]]:
        self.col.load()
        # use default index params.
        if param is None:
            index_type = self.col.indexes[0].params["index_type"]
            param = self.index_params_map[index_type]
        # query text embedding.
        data = [self.embedding.embed_query(query)]
        # Determine result metadata fields.
        output_fields = self.fields[:]
        output_fields.remove(self.vector_field)
        # milvus search.
        res = self.col.search(
            data,
            self.vector_field,
            param,
            k,
            expr=expr,
            output_fields=output_fields,
            partition_names=partition_names,
            round_decimal=round_decimal,
            timeout=timeout,
            **kwargs,
        )
        ret = []
        for result in res[0]:
            meta = {x: result.entity.get(x) for x in output_fields}
            ret.append(
                (
                    Document(page_content=meta.pop(self.text_field), metadata=meta),
                    result.distance,
                    result.id,
                )
            )

        return data[0], ret
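A hedged end-to-end sketch for the Milvus side; host and port come from Config (MILVUS_URL/MILVUS_PORT in .env), and a reachable Milvus server is assumed:

```python
# Hedged sketch: MilvusStore creates the collection plus HNSW index on first
# load_document(), then searches with the L2 metric configured above.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from pilot.vector_store.milvus_store import MilvusStore

ctx = {
    "vector_store_name": "doc_test",  # illustrative collection name
    "embeddings": HuggingFaceEmbeddings(model_name="models/text2vec-large-chinese"),  # illustrative
}
store = MilvusStore(ctx)  # connects to MILVUS_URL:MILVUS_PORT
store.load_document([Document(page_content="Milvus 连接测试", metadata={})])
print(store.similar_search("测试", 1))
```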
@@ -2,8 +2,14 @@ from abc import ABC, abstractmethod


class VectorStoreBase(ABC):
    """base class for vector store database"""

    @abstractmethod
    def init_schema(self) -> None:
    def load_document(self, documents) -> None:
        """load document in vector database."""
        pass

    @abstractmethod
    def similar_search(self, text, topk) -> None:
        """Initialize schema in vector database."""
        pass
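With the base class reduced to these two methods, adding a third backend is mostly boilerplate. A hedged sketch (InMemoryStore and its naive ranking are hypothetical, not part of this commit):

```python
# Hedged sketch: a hypothetical backend; registering it would be one more
# entry in the connector dict in pilot/vector_store/connector.py.
from pilot.vector_store.vector_store_base import VectorStoreBase


class InMemoryStore(VectorStoreBase):  # hypothetical name
    def __init__(self, ctx: dict) -> None:
        self.docs = []

    def load_document(self, documents) -> None:
        self.docs.extend(documents)

    def similar_search(self, text, topk):
        # placeholder ranking; a real store would embed `text` and compare vectors
        return self.docs[:topk]
```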
@@ -61,6 +61,7 @@ gTTS==2.3.1
langchain
nltk
python-dotenv==1.0.0
pymilvus==2.2.1
vcrpy
chromadb
markdown2
@@ -2,8 +2,8 @@
# -*- coding: utf-8 -*-
import argparse

from pilot.configs.model_config import DATASETS_DIR, LLM_MODEL_CONFIG, VECTOR_SEARCH_TOP_K, \
    KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.configs.model_config import DATASETS_DIR, LLM_MODEL_CONFIG, VECTOR_SEARCH_TOP_K, VECTOR_STORE_CONFIG, \
    VECTOR_STORE_TYPE
from pilot.source_embedding.knowledge_embedding import KnowledgeEmbedding


@@ -12,15 +12,15 @@ class LocalKnowledgeInit:
    model_name = LLM_MODEL_CONFIG["text2vec"]
    top_k: int = VECTOR_SEARCH_TOP_K

    def __init__(self) -> None:
        pass
    def __init__(self, vector_store_config) -> None:
        self.vector_store_config = vector_store_config

    def knowledge_persist(self, file_path, vector_name, append_mode):
    def knowledge_persist(self, file_path, append_mode):
        """ knowledge persist """
        kv = KnowledgeEmbedding(
            file_path=file_path,
            model_name=LLM_MODEL_CONFIG["text2vec"],
            vector_store_config={"vector_store_name": vector_name, "vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH})
            vector_store_config=self.vector_store_config)
        vector_store = kv.knowledge_persist_initialization(append_mode)
        return vector_store

@@ -36,10 +36,13 @@ if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--vector_name", type=str, default="default")
    parser.add_argument("--append", type=bool, default=False)
    parser.add_argument("--store_type", type=str, default="Chroma")
    args = parser.parse_args()
    vector_name = args.vector_name
    append_mode = args.append
    kv = LocalKnowledgeInit()
    vector_store = kv.knowledge_persist(file_path=DATASETS_DIR, vector_name=vector_name, append_mode=append_mode)
    docs = vector_store.similarity_search("小明", 1)
    store_type = VECTOR_STORE_TYPE
    vector_store_config = {"url": VECTOR_STORE_CONFIG["url"], "port": VECTOR_STORE_CONFIG["port"], "vector_store_name": vector_name}
    print(vector_store_config)
    kv = LocalKnowledgeInit(vector_store_config=vector_store_config)
    vector_store = kv.knowledge_persist(file_path=DATASETS_DIR, append_mode=append_mode)
    print("your knowledge embedding success...")
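A hedged invocation of the updated script (the --store_type flag is new in this commit). One caveat worth noting: argparse's type=bool treats any non-empty string as True, so passing `--append False` would actually enable append mode; omit the flag to keep the default.

```
python tools/knowledge_init.py --vector_name default --store_type Chroma
```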