feat:embedding_engine add text_splitter param
@@ -19,6 +19,7 @@ you will prepare embedding models from huggingface
Notice: make sure you have installed git-lfs.
```{tip}
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
```
version:
@@ -72,6 +72,24 @@ eg: git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
vector_store_config=vector_store_config)

embedding_engine.knowledge_embedding()

If you want to use your own text_splitter, do this:

::

    url = "https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html"

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100, chunk_overlap=50
    )
    embedding_engine = EmbeddingEngine(
        knowledge_source=url,
        knowledge_type=KnowledgeType.URL.value,
        model_name=embedding_model,
        vector_store_config=vector_store_config,
        text_splitter=text_splitter
    )
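The snippet above only constructs the engine and assumes its imports are already in place. A minimal follow-up sketch; the module paths below are an assumption about DB-GPT's package layout, while the splitter itself comes from LangChain:

::

    # Imports assumed by the snippet above (the pilot paths are an assumption
    # about DB-GPT's layout; RecursiveCharacterTextSplitter is LangChain's).
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from pilot.embedding_engine import EmbeddingEngine, KnowledgeType

    # Run ingestion: the custom text_splitter controls how the URL content
    # is chunked before it is embedded into the vector store.
    embedding_engine.knowledge_embedding()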
4. Initialize a Document type EmbeddingEngine API and embed your document into the vector store in your code.
Document type can be .txt, .pdf, .md, .doc, or .ppt, as in the sketch below.
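A hedged sketch of step 4, reusing the `embedding_model` and `vector_store_config` variables from the snippets above (the import paths and the `KnowledgeType.DOCUMENT` member are assumptions about the package, not confirmed API):

::

    # Sketch only: import paths and the DOCUMENT enum member are assumptions.
    from pilot.embedding_engine import EmbeddingEngine, KnowledgeType

    document_path = "your_document.md"  # .txt, .pdf, .md, .doc or .ppt

    embedding_engine = EmbeddingEngine(
        knowledge_source=document_path,
        knowledge_type=KnowledgeType.DOCUMENT.value,
        model_name=embedding_model,
        vector_store_config=vector_store_config,
    )
    embedding_engine.knowledge_embedding()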
@@ -1,49 +0,0 @@
# Knowledge based QA

Chatting with your own knowledge is a very interesting capability. In this chapter we introduce how to build your own knowledge base through the knowledge base API. First, a knowledge store can currently be initialized by executing "python tool/knowledge_init.py", which loads the content of your own knowledge base and was introduced in the previous knowledge base module. Of course, you can also call the knowledge embedding API we provide to store knowledge.

We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.
```
vector_store_config = {
    "vector_store_name": name
}

file_path = "your file path"

embedding_engine = EmbeddingEngine(
    file_path=file_path,
    model_name=LLM_MODEL_CONFIG["text2vec"],
    vector_store_config=vector_store_config,
)

embedding_engine.knowledge_embedding()
```
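Here `LLM_MODEL_CONFIG` is DB-GPT's mapping from a model key to its local model path (defined in `pilot.configs.model_config`). A hedged sketch of the entry this example relies on, with a placeholder path:

```
# Illustrative only: the actual mapping ships with DB-GPT's configs.
LLM_MODEL_CONFIG = {
    "text2vec": "your_model_path/text2vec-large-chinese",
}
```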
We currently support two vector databases: Chroma (the default) and Milvus. You can switch between them by modifying the "VECTOR_STORE_TYPE" field in the .env file.
```
#*******************************************************************#
#**                  VECTOR STORE SETTINGS                        **#
#*******************************************************************#
VECTOR_STORE_TYPE=Chroma
#MILVUS_URL=127.0.0.1
#MILVUS_PORT=19530
```
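If you switch to Milvus, the connection details from the .env file above should have Python-side counterparts. A sketch of what a Milvus-backed `vector_store_config` might look like; the `milvus_url`/`milvus_port` keys are illustrative assumptions mirroring the env vars, not confirmed API:

```
# Sketch only: keys beyond "vector_store_name"/"vector_store_type"
# are assumptions mirroring the MILVUS_* env settings above.
vector_store_config = {
    "vector_store_name": "my_knowledge",
    "vector_store_type": "Milvus",
    "milvus_url": "127.0.0.1",
    "milvus_port": 19530,
}
```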
Below is an example of using the knowledge base API to query knowledge:
```
vector_store_config = {
    "vector_store_name": your_name,
    "vector_store_type": "Chroma",
    "chroma_persist_path": "your_persist_dir",
}

url = "your knowledge source url"

query = "your query"

embedding_model = "your_model_path/all-MiniLM-L6-v2"

embedding_engine = EmbeddingEngine(
    knowledge_source=url,
    knowledge_type=KnowledgeType.URL.value,
    model_name=embedding_model,
    vector_store_config=vector_store_config,
)

embedding_engine.similar_search(query, 10)
```
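`similar_search` returns the stored chunks most similar to the query. A minimal usage sketch; treating the results as LangChain-style Document objects with a `page_content` field is an assumption here, not confirmed API:

```
# Sketch: print the top-10 most similar chunks.
docs = embedding_engine.similar_search(query, 10)
for doc in docs:
    print(doc.page_content)  # assumes LangChain-style Document objects
```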