mirror of
https://github.com/csunny/DB-GPT.git
synced 2025-09-02 17:45:31 +00:00
doc:db-gpt doc
@@ -1,4 +0,0 @@
# Connections

In order to interact more conveniently with users' private environments, the project has designed a connection module, which can support connection to databases, Excel, knowledge bases, and other environments to achieve information and data exchange.

16
docs/modules/connections.rst
Normal file
@@ -0,0 +1,16 @@
Connections
-----------
**To interact more conveniently with users' private environments, the project provides a connection module that supports connecting to databases, Excel files, knowledge bases, and other environments to exchange information and data.**

DB-GPT provides the base class ``BaseConnect``. To add a new connection type, inherit from it and implement ``get_session()``, ``get_table_names()``, ``get_index_info()``, ``get_database_list()``, and ``run()``.
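
As a rough illustration of the five methods listed above, the sketch below implements them against an in-memory SQLite database using only the standard library. ``BaseConnectSketch`` is a hypothetical stand-in for ``BaseConnect``, and every method signature shown is an assumption, since this page does not state them.

```python
import sqlite3
from abc import ABC, abstractmethod


class BaseConnectSketch(ABC):
    """Hypothetical stand-in for BaseConnect; the signatures are assumed."""

    @abstractmethod
    def get_session(self, db_name=None): ...

    @abstractmethod
    def get_table_names(self): ...

    @abstractmethod
    def get_index_info(self, table_names): ...

    @abstractmethod
    def get_database_list(self): ...

    @abstractmethod
    def run(self, session, command): ...


class SQLiteConnectSketch(BaseConnectSketch):
    """Illustrative subclass backed by sqlite3 (not part of DB-GPT)."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)

    def get_session(self, db_name=None):
        # A cursor plays the role of a session in this sketch.
        return self._conn.cursor()

    def get_table_names(self):
        rows = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [row[0] for row in rows]

    def get_index_info(self, table_names):
        # Map each table to the names of its indexes.
        info = {}
        for table in table_names:
            rows = self._conn.execute(f'PRAGMA index_list("{table}")').fetchall()
            info[table] = [row[1] for row in rows]
        return info

    def get_database_list(self):
        return [row[1] for row in self._conn.execute("PRAGMA database_list")]

    def run(self, session, command):
        session.execute(command)
        return session.fetchall()
```

A caller would obtain a session with ``get_session()`` and pass it to ``run()`` together with a SQL string. A real subclass would replace the SQLite calls with those of its own driver.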

- `mysql_connection <./connections/mysql_connection.html>`_: MySQL connection support.


.. toctree::
   :maxdepth: 2
   :caption: Connections
   :name: mysql_connection
   :hidden:

   ./connections/mysql/mysql_connection.md
18
docs/modules/connections/mysql/mysql_connection.md
Normal file
@@ -0,0 +1,18 @@
MySQL Connection
==================================
MySQLConnect connects to a MySQL server.

It inherits from RDBMSDatabase:
```
class MySQLConnect(RDBMSDatabase):
    """Connect to a MySQL database and fetch its metadata.

    Args:
    Usage:
    """

    type: str = "MySQL"
    dialect: str = "mysql"
    driver: str = "pymysql"

    default_db = ["information_schema", "performance_schema", "sys", "mysql"]
```
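
One plausible use of these attributes, assuming RDBMSDatabase builds SQLAlchemy-style connection URLs, is to combine `dialect` and `driver` into the `dialect+driver://user:password@host:port/database` form, and to use `default_db` to hide MySQL's system schemas. Both uses are assumptions, and the helper names `build_db_url` and `user_databases` are hypothetical, not DB-GPT functions.

```python
from urllib.parse import quote


def build_db_url(dialect, driver, user, password, host, port, db_name):
    """Compose dialect+driver://user:password@host:port/db_name."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URL.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"{dialect}+{driver}://{user_enc}:{password_enc}@{host}:{port}/{db_name}"


def user_databases(all_databases, default_db):
    """Drop system schemas (the default_db list) from a database listing."""
    return [name for name in all_databases if name not in default_db]


# Sample values below are placeholders for illustration only.
url = build_db_url("mysql", "pymysql", "root", "p@ss/word", "127.0.0.1", 3306, "demo")
print(url)
```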
40
docs/modules/knowledge.rst
Normal file
@@ -0,0 +1,40 @@
Knowledge
---------

| As the knowledge base is currently the most significant user demand scenario, we natively support building and processing knowledge bases. This project also provides multiple knowledge base management strategies, such as PDF, Markdown, TXT, Word, and PPT knowledge:


**Create your own knowledge repository**

1. Place your personal knowledge files or folders in the pilot/datasets directory.

   We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.

   Before running the script, download the spaCy model: ``python -m spacy download zh_core_web_sm``

2. Update your .env and set your vector store type, for example ``VECTOR_STORE_TYPE=Chroma``.
   Only Chroma and Milvus are currently supported; if you set Milvus, also set MILVUS_URL and MILVUS_PORT.

3. Run the knowledge repository script in the tools directory:

   ``python tools/knowledge_init.py``

   Note: ``--vector_name`` sets your vector store name (default value: default).

4. Add the knowledge repository in the interface by entering its name (if you did not specify one, enter "default") so that you can use it for Q&A based on your knowledge base.

Note that the default vector model is text2vec-large-chinese. This is a large model, so if your machine cannot handle it, text2vec-base-chinese is recommended instead. Whichever you choose, download the model and place it in the models directory.
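
The vector store settings described above can be checked before running the script. The helper below is a minimal sketch: its name is hypothetical, and it assumes the values are spelled exactly ``Chroma`` and ``Milvus``, which this page does not confirm.

```python
import os


def missing_vector_store_settings(env):
    """Return the names of required settings that are absent or empty."""
    store_type = env.get("VECTOR_STORE_TYPE", "")
    if store_type not in ("Chroma", "Milvus"):
        # Unknown or unset store type: report the type itself as missing.
        return ["VECTOR_STORE_TYPE"]
    # Only Milvus needs the extra connection settings.
    required = ["MILVUS_URL", "MILVUS_PORT"] if store_type == "Milvus" else []
    return [key for key in required if not env.get(key)]


# Check the current process environment (the .env file must already be loaded).
print(missing_vector_store_settings(os.environ))
```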

- `pdf_embedding <./knowledge/pdf_embedding.html>`_: PDF embedding support.


.. toctree::
   :maxdepth: 2
   :caption: Knowledge
   :name: pdf_embedding
   :hidden:

   ./knowledge/pdf/pdf_embedding.md
   ./knowledge/markdown/markdown_embedding.md
   ./knowledge/word/word_embedding.md
   ./knowledge/url/url_embedding.md
   ./knowledge/ppt/ppt_embedding.md
42
docs/modules/knowledge/markdown/markdown_embedding.md
Normal file
@@ -0,0 +1,42 @@
MarkdownEmbedding
==================================
MarkdownEmbedding imports Markdown text into a vector knowledge base. The embedding process consists of three methods: read (load the data), data_process (process the data), and index_to_store (embed the data into the vector database).

MarkdownEmbedding inherits from SourceEmbedding:

```
class MarkdownEmbedding(SourceEmbedding):
    """Markdown embedding for reading Markdown documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with markdown path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().

The read() method loads the data and splits it into chunks:

```
@register
def read(self):
    """Load from markdown path."""
    loader = EncodeTextLoader(self.file_path)
    # Split with the Chinese spaCy pipeline; consecutive chunks overlap by 100.
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
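
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal character-based sketch. It is not how SpacyTextSplitter works internally (that splitter groups sentences detected by spaCy); the function name `split_with_overlap` and the sample values are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a sentence cut by the boundary is still retrievable with some of its context.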

The data_process() method lets you preprocess the loaded chunks in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # Strip newline characters from every chunk in place.
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
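
The effect of data_process() can be tried in isolation with the sketch below. The `Document` dataclass is a hypothetical stand-in for the real Document type, and `strip_newlines` is an illustrative name for the same loop written without the index counter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for the Document type used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def strip_newlines(documents: List[Document]) -> List[Document]:
    """Remove newline characters from each chunk in place."""
    for doc in documents:
        doc.page_content = doc.page_content.replace("\n", "")
    return documents


docs = strip_newlines([Document("line one\nline two"), Document("no breaks")])
print([d.page_content for d in docs])
# prints ['line oneline two', 'no breaks']
```

Replacing `"\n"` with an empty string joins adjacent lines without a space. That suits Chinese text, but it merges words in English text, where replacing with a space may be preferable.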
43
docs/modules/knowledge/pdf/pdf_embedding.md
Normal file
@@ -0,0 +1,43 @@
PDFEmbedding
==================================
PDFEmbedding imports PDF text into a vector knowledge base. The embedding process consists of three methods: read (load the data), data_process (process the data), and index_to_store (embed the data into the vector database).

PDFEmbedding inherits from SourceEmbedding:
```
class PDFEmbedding(SourceEmbedding):
    """PDF embedding for reading PDF documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with pdf path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
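
The three methods run in sequence: read, then data_process, then index_to_store. The sketch below shows only that call order. `SourceEmbeddingSketch`, `run_pipeline`, and `InMemoryTextEmbedding` are hypothetical names, and a plain list stands in for the vector database, so this is not the real SourceEmbedding implementation.

```python
from typing import List


class SourceEmbeddingSketch:
    """Hypothetical stand-in for SourceEmbedding; only the call order is illustrated."""

    def __init__(self, file_path, vector_store_config):
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.store: List[str] = []  # stands in for the vector database

    def read(self) -> List[str]:
        raise NotImplementedError  # each subclass loads and splits its own format

    def data_process(self, chunks: List[str]) -> List[str]:
        return chunks  # default: no preprocessing

    def index_to_store(self, chunks: List[str]) -> None:
        self.store.extend(chunks)  # a real implementation would embed and persist

    def run_pipeline(self) -> List[str]:
        chunks = self.read()
        chunks = self.data_process(chunks)
        self.index_to_store(chunks)
        return self.store


class InMemoryTextEmbedding(SourceEmbeddingSketch):
    def read(self):
        # file_path holds the raw text here to keep the sketch self-contained.
        return self.file_path.split("\n\n")

    def data_process(self, chunks):
        return [chunk.replace("\n", "") for chunk in chunks]


emb = InMemoryTextEmbedding("first\nchunk\n\nsecond chunk", vector_store_config={})
print(emb.run_pipeline())
# prints ['firstchunk', 'second chunk']
```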

Implement read() and data_process().

The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from pdf path."""
    loader = PyPDFLoader(self.file_path)
    # textsplitter = CHNDocumentSplitter(
    #     pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    # )
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you preprocess the loaded chunks in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # Strip newline characters from every chunk in place.
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
40
docs/modules/knowledge/ppt/ppt_embedding.md
Normal file
@@ -0,0 +1,40 @@
PPTEmbedding
==================================
PPTEmbedding imports PPT text into a vector knowledge base. The embedding process consists of three methods: read (load the data), data_process (process the data), and index_to_store (embed the data into the vector database).

PPTEmbedding inherits from SourceEmbedding:
```
class PPTEmbedding(SourceEmbedding):
    """PPT embedding for reading PPT documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with ppt path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from ppt path."""
    loader = UnstructuredPowerPointLoader(self.file_path)
    # Split with the Chinese spaCy pipeline; consecutive chunks overlap by 200.
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=200,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you preprocess the loaded chunks in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # Strip newline characters from every chunk in place.
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
47
docs/modules/knowledge/url/url_embedding.md
Normal file
@@ -0,0 +1,47 @@
URL Embedding
==================================
URLEmbedding imports the text of a web page, given its URL, into a vector knowledge base. The embedding process consists of three methods: read (load the data), data_process (process the data), and index_to_store (embed the data into the vector database).

URLEmbedding inherits from SourceEmbedding:
```
class URLEmbedding(SourceEmbedding):
    """URL embedding for reading URL documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with url path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from url path."""
    loader = WebBaseLoader(web_path=self.file_path)
    # Choose the splitter according to the configured language.
    if CFG.LANGUAGE == "en":
        text_splitter = CharacterTextSplitter(
            chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
            chunk_overlap=20,
            length_function=len,
        )
    else:
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=1000)
    return loader.load_and_split(text_splitter)
```
The data_process() method lets you preprocess the loaded chunks in your own way:
```
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        content = d.page_content.replace("\n", "")
        # Parse the chunk as HTML and keep only its visible text.
        soup = BeautifulSoup(content, "html.parser")
        for tag in soup(["!doctype", "meta"]):
            tag.extract()
        documents[i].page_content = soup.get_text()
        i += 1
    return documents
```
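
The method above relies on BeautifulSoup to reduce HTML to text. For readers who want to experiment without that dependency, the sketch below uses a different technique: the standard library's `html.parser`. The names `_TextExtractor` and `html_to_text` are hypothetical, and skipping `<script>` and `<style>` contents is a choice made for this sketch, not something the original method does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing data
    return parser.text()


page = (
    "<!doctype html><html><head><meta charset='utf-8'></head>"
    "<body><p>Hello</p> <b>world</b></body></html>"
)
print(html_to_text(page))
# prints Hello world
```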
38
docs/modules/knowledge/word/word_embedding.md
Normal file
@@ -0,0 +1,38 @@
WordEmbedding
==================================
WordEmbedding imports text from Word documents (doc/docx) into a vector knowledge base. The embedding process consists of three methods: read (load the data), data_process (process the data), and index_to_store (embed the data into the vector database).

WordEmbedding inherits from SourceEmbedding:
```
class WordEmbedding(SourceEmbedding):
    """Word embedding for reading Word documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with word path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from word path."""
    loader = UnstructuredWordDocumentLoader(self.file_path)
    textsplitter = CHNDocumentSplitter(
        pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you preprocess the loaded chunks in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # Strip newline characters from every chunk in place.
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```