doc:db-gpt doc

This commit is contained in:
aries-ckt
2023-06-14 15:31:11 +08:00
parent 333aad7bc4
commit 0c8b424b04
27 changed files with 843 additions and 130 deletions


@@ -1,4 +0,0 @@
# Connections
To interact more conveniently with users' private environments, the project provides a connection module that supports connections to databases, Excel files, knowledge bases, and other environments for information and data exchange.


@@ -0,0 +1,16 @@
Connections
-----------
**To interact more conveniently with users' private environments, the project provides a connection module that supports connections to databases, Excel files, knowledge bases, and other environments for information and data exchange.**
DB-GPT provides the base class BaseConnect; you can inherit from it and implement get_session(), get_table_names(), get_index_info(), get_database_list(), and run(). A minimal subclass sketch follows the link list below.
- `mysql_connection <./connections/mysql_connection.html>`_: supported MySQL connection.
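
A minimal subclass sketch, assuming the import path and method signatures shown (both are assumptions; check the source tree for the actual BaseConnect definition):

```
# a hedged sketch -- import path and signatures are assumptions, not confirmed by this doc
from pilot.connections.base import BaseConnect


class ExcelConnect(BaseConnect):
    """Hypothetical connector for Excel files."""

    def get_session(self, db_name: str):
        """Open and return a session/handle for db_name."""
        raise NotImplementedError

    def get_table_names(self):
        """Return the names of the available tables (sheets)."""
        raise NotImplementedError

    def run(self, session, command: str):
        """Execute command through session and return the result rows."""
        raise NotImplementedError
```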
.. toctree::
   :maxdepth: 2
   :caption: Connections
   :name: mysql_connection
   :hidden:

   ./connections/mysql/mysql_connection.md


@@ -0,0 +1,18 @@
MySQL Connection
==================================
MySQLConnect connects to a MySQL server. It inherits from RDBMSDatabase:
```
class MySQLConnect(RDBMSDatabase):
    """Connect to a MySQL database and fetch metadata."""

    type: str = "MySQL"
    dialect: str = "mysql"
    driver: str = "pymysql"

    default_db = ["information_schema", "performance_schema", "sys", "mysql"]
```
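
A usage sketch; the from_uri factory below is hypothetical (check RDBMSDatabase for the actual construction API), but the URI scheme follows the dialect and driver fields above:

```
# hypothetical usage -- the factory name is an assumption, not confirmed by this doc
uri = "mysql+pymysql://user:password@127.0.0.1:3306/mydb"
conn = MySQLConnect.from_uri(uri)
print(conn.get_table_names())  # one of the BaseConnect methods listed on the Connections page
```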


@@ -0,0 +1,40 @@
Knowledge
---------
| As the knowledge base is currently the most significant user demand scenario, we natively support the construction and management of knowledge bases. This project also provides several knowledge base management strategies, such as PDF, Markdown, txt, Word, and PPT knowledge:
**Create your own knowledge repository**
1. Place your personal knowledge files or folders in the pilot/datasets directory.
We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.
Before execution, run: python -m spacy download zh_core_web_sm
2. Update your .env file and set your vector store type, e.g. VECTOR_STORE_TYPE=Chroma
(only Chroma and Milvus are currently supported; if you set Milvus, also set MILVUS_URL and MILVUS_PORT; a sample .env sketch appears below).
3. Run the knowledge repository script in the tools directory:
python tools/knowledge_init.py
Note: --vector_name sets your vector store name (default value: default).
4. Add the knowledge repository in the interface by entering its name (if not specified, enter "default"); you can then use it for Q&A based on your knowledge base.
Note that the default vector model is text2vec-large-chinese (a large model; if your machine's configuration is limited, text2vec-base-chinese is recommended). Make sure you download the model and place it in the models directory.
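
For reference, a minimal .env sketch for the Milvus case (host and port values are placeholders; 19530 is Milvus's default port):

```
VECTOR_STORE_TYPE=Milvus
MILVUS_URL=127.0.0.1
MILVUS_PORT=19530
```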
- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported PDF embedding.
.. toctree::
   :maxdepth: 2
   :caption: Knowledge
   :name: pdf_embedding
   :hidden:

   ./knowledge/pdf/pdf_embedding.md
   ./knowledge/markdown/markdown_embedding.md
   ./knowledge/word/word_embedding.md
   ./knowledge/url/url_embedding.md
   ./knowledge/ppt/ppt_embedding.md


@@ -0,0 +1,42 @@
MarkdownEmbedding
==================================
MarkdownEmbedding can import Markdown text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class MarkdownEmbedding(SourceEmbedding):
    """Markdown embedding, for reading markdown documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with markdown path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from markdown path."""
    loader = EncodeTextLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
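
An end-to-end usage sketch; the vector_store_config keys and the source_embedding() entry point are assumptions about how the SourceEmbedding pipeline is driven, not confirmed by this doc:

```
# all names except MarkdownEmbedding itself are assumptions
embedding = MarkdownEmbedding(
    file_path="./pilot/datasets/example.md",
    vector_store_config={"vector_store_name": "default"},
)
embedding.source_embedding()  # assumed pipeline: read -> data_process -> index_to_store
```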


@@ -0,0 +1,43 @@
PDFEmbedding
==================================
PDFEmbedding can import PDF text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class PDFEmbedding(SourceEmbedding):
    """PDF embedding, for reading pdf documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with pdf path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from pdf path."""
    loader = PyPDFLoader(self.file_path)
    # textsplitter = CHNDocumentSplitter(
    #     pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    # )
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
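
For completeness, the imports these excerpts rely on (module paths match the langchain release used at the time; treat this as a sketch):

```
from typing import List

from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter
```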


@@ -0,0 +1,40 @@
PPTEmbedding
==================================
PPTEmbedding can import PPT text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class PPTEmbedding(SourceEmbedding):
    """PPT embedding, for reading ppt documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with ppt path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from ppt path."""
    loader = UnstructuredPowerPointLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=200,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
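
The SpacyTextSplitter used above needs the zh_core_web_sm spaCy pipeline installed locally; a quick check:

```
# verify the Chinese spaCy pipeline used by SpacyTextSplitter is available
import spacy

try:
    spacy.load("zh_core_web_sm")
except OSError:
    print("Missing pipeline; run: python -m spacy download zh_core_web_sm")
```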


@@ -0,0 +1,47 @@
URL Embedding
==================================
URLEmbedding can import text fetched from a URL into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class URLEmbedding(SourceEmbedding):
    """URL embedding, for reading url documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with url path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from url path."""
    loader = WebBaseLoader(web_path=self.file_path)
    if CFG.LANGUAGE == "en":
        text_splitter = CharacterTextSplitter(
            chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
            chunk_overlap=20,
            length_function=len,
        )
    else:
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=1000)
    return loader.load_and_split(text_splitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines, then drop non-content HTML tags before embedding
    for i, d in enumerate(documents):
        content = d.page_content.replace("\n", "")
        soup = BeautifulSoup(content, "html.parser")
        for tag in soup(["!doctype", "meta"]):
            tag.extract()
        documents[i].page_content = soup.get_text()
    return documents
```
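
To make the clean-up step concrete, a standalone illustration of the tag stripping performed in data_process():

```
from bs4 import BeautifulSoup

html = "<meta charset='utf-8'><p>Hello <b>DB-GPT</b></p>"
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["meta"]):  # drop non-content tags, as data_process() does
    tag.extract()
print(soup.get_text())  # -> "Hello DB-GPT"
```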


@@ -0,0 +1,38 @@
WordEmbedding
==================================
WordEmbedding can import Word doc/docx text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class WordEmbedding(SourceEmbedding):
    """Word embedding, for reading word documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with word path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from word path."""
    loader = UnstructuredWordDocumentLoader(self.file_path)
    textsplitter = CHNDocumentSplitter(
        pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
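
Usage mirrors the Markdown example; a minimal sketch (the config keys and the source_embedding() entry point are assumptions):

```
# all names except WordEmbedding itself are assumptions
embedding = WordEmbedding(
    file_path="./pilot/datasets/example.docx",
    vector_store_config={"vector_store_name": "default"},
)
embedding.source_embedding()  # assumed pipeline: read -> data_process -> index_to_store
```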