doc:db-gpt doc

This commit is contained in:
aries-ckt
2023-06-14 15:31:11 +08:00
parent 333aad7bc4
commit 0c8b424b04
27 changed files with 843 additions and 130 deletions


@@ -1,4 +0,0 @@
# Connections
To interact more conveniently with users' private environments, the project provides a connection module that supports connections to databases, Excel files, knowledge bases, and other environments for information and data exchange.


@@ -0,0 +1,16 @@
Connections
-----------
**To interact more conveniently with users' private environments, the project provides a connection module that supports connections to databases, Excel files, knowledge bases, and other environments for information and data exchange.**
DB-GPT provides the base class BaseConnect; you can inherit from it and implement get_session(), get_table_names(), get_index_info(), get_database_list(), and run(). A minimal subclass sketch follows the link list below.
- `mysql_connection <./connections/mysql_connection.html>`_: supported MySQL connection.
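
A minimal subclass sketch, assuming the import path and method signatures shown (both are assumptions; check the source tree for the actual BaseConnect definition):

```
# a hedged sketch -- import path and signatures are assumptions, not confirmed by this doc
from pilot.connections.base import BaseConnect


class ExcelConnect(BaseConnect):
    """Hypothetical connector for Excel files."""

    def get_session(self, db_name: str):
        """Open and return a session/handle for db_name."""
        raise NotImplementedError

    def get_table_names(self):
        """Return the names of the available tables (sheets)."""
        raise NotImplementedError

    def run(self, session, command: str):
        """Execute command through session and return the result rows."""
        raise NotImplementedError
```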
.. toctree::
   :maxdepth: 2
   :caption: Connections
   :name: mysql_connection
   :hidden:

   ./connections/mysql/mysql_connection.md


@@ -0,0 +1,18 @@
MySQL Connection
==================================
MySQLConnect connects to a MySQL server. It inherits from RDBMSDatabase:
```
class MySQLConnect(RDBMSDatabase):
    """Connect to a MySQL database and fetch metadata."""

    type: str = "MySQL"
    dialect: str = "mysql"
    driver: str = "pymysql"

    default_db = ["information_schema", "performance_schema", "sys", "mysql"]
```
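
A usage sketch; the from_uri factory below is hypothetical (check RDBMSDatabase for the actual construction API), but the URI scheme follows the dialect and driver fields above:

```
# hypothetical usage -- the factory name is an assumption, not confirmed by this doc
uri = "mysql+pymysql://user:password@127.0.0.1:3306/mydb"
conn = MySQLConnect.from_uri(uri)
print(conn.get_table_names())  # one of the BaseConnect methods listed on the Connections page
```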


@@ -0,0 +1,40 @@
Knowledge
---------
| As the knowledge base is currently the most significant user demand scenario, we natively support the construction and management of knowledge bases. This project also provides several knowledge base management strategies, such as PDF, Markdown, txt, Word, and PPT knowledge:
**Create your own knowledge repository**
1. Place your personal knowledge files or folders in the pilot/datasets directory.
We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.
Before execution, run: python -m spacy download zh_core_web_sm
2. Update your .env file and set your vector store type, e.g. VECTOR_STORE_TYPE=Chroma
(only Chroma and Milvus are currently supported; if you set Milvus, also set MILVUS_URL and MILVUS_PORT; a sample .env sketch appears below).
3. Run the knowledge repository script in the tools directory:
python tools/knowledge_init.py
Note: --vector_name sets your vector store name (default value: default).
4. Add the knowledge repository in the interface by entering its name (if not specified, enter "default"); you can then use it for Q&A based on your knowledge base.
Note that the default vector model is text2vec-large-chinese (a large model; if your machine's configuration is limited, text2vec-base-chinese is recommended). Make sure you download the model and place it in the models directory.
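
For reference, a minimal .env sketch for the Milvus case (host and port values are placeholders; 19530 is Milvus's default port):

```
VECTOR_STORE_TYPE=Milvus
MILVUS_URL=127.0.0.1
MILVUS_PORT=19530
```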
- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported PDF embedding.
.. toctree::
   :maxdepth: 2
   :caption: Knowledge
   :name: pdf_embedding
   :hidden:

   ./knowledge/pdf/pdf_embedding.md
   ./knowledge/markdown/markdown_embedding.md
   ./knowledge/word/word_embedding.md
   ./knowledge/url/url_embedding.md
   ./knowledge/ppt/ppt_embedding.md


@@ -0,0 +1,42 @@
MarkdownEmbedding
==================================
MarkdownEmbedding can import Markdown text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class MarkdownEmbedding(SourceEmbedding):
    """Markdown embedding, for reading markdown documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with markdown path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from markdown path."""
    loader = EncodeTextLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
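
An end-to-end usage sketch; the vector_store_config keys and the source_embedding() entry point are assumptions about how the SourceEmbedding pipeline is driven, not confirmed by this doc:

```
# all names except MarkdownEmbedding itself are assumptions
embedding = MarkdownEmbedding(
    file_path="./pilot/datasets/example.md",
    vector_store_config={"vector_store_name": "default"},
)
embedding.source_embedding()  # assumed pipeline: read -> data_process -> index_to_store
```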


@@ -0,0 +1,43 @@
PDFEmbedding
==================================
PDFEmbedding can import PDF text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class PDFEmbedding(SourceEmbedding):
    """PDF embedding, for reading pdf documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with pdf path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from pdf path."""
    loader = PyPDFLoader(self.file_path)
    # textsplitter = CHNDocumentSplitter(
    #     pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    # )
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
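
For completeness, the imports these excerpts rely on (module paths match the langchain release used at the time; treat this as a sketch):

```
from typing import List

from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter
```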


@@ -0,0 +1,40 @@
PPTEmbedding
==================================
PPTEmbedding can import PPT text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class PPTEmbedding(SourceEmbedding):
    """PPT embedding, for reading ppt documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with ppt path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from ppt path."""
    loader = UnstructuredPowerPointLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=200,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
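
The SpacyTextSplitter used above needs the zh_core_web_sm spaCy pipeline installed locally; a quick check:

```
# verify the Chinese spaCy pipeline used by SpacyTextSplitter is available
import spacy

try:
    spacy.load("zh_core_web_sm")
except OSError:
    print("Missing pipeline; run: python -m spacy download zh_core_web_sm")
```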


@@ -0,0 +1,47 @@
URL Embedding
==================================
URLEmbedding can import text fetched from a URL into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class URLEmbedding(SourceEmbedding):
    """URL embedding, for reading url documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with url path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from url path."""
    loader = WebBaseLoader(web_path=self.file_path)
    if CFG.LANGUAGE == "en":
        text_splitter = CharacterTextSplitter(
            chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
            chunk_overlap=20,
            length_function=len,
        )
    else:
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=1000)
    return loader.load_and_split(text_splitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines, then drop non-content HTML tags before embedding
    for i, d in enumerate(documents):
        content = d.page_content.replace("\n", "")
        soup = BeautifulSoup(content, "html.parser")
        for tag in soup(["!doctype", "meta"]):
            tag.extract()
        documents[i].page_content = soup.get_text()
    return documents
```
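
To make the clean-up step concrete, a standalone illustration of the tag stripping performed in data_process():

```
from bs4 import BeautifulSoup

html = "<meta charset='utf-8'><p>Hello <b>DB-GPT</b></p>"
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["meta"]):  # drop non-content tags, as data_process() does
    tag.extract()
print(soup.get_text())  # -> "Hello DB-GPT"
```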


@@ -0,0 +1,38 @@
WordEmbedding
==================================
WordEmbedding can import Word doc/docx text into a vector knowledge base. The whole embedding process consists of the read() (load data), data_process() (process data), and index_to_store() (embed into the vector database) methods.
It inherits from SourceEmbedding:
```
class WordEmbedding(SourceEmbedding):
    """Word embedding, for reading word documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with word path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().
The read() method loads the data and splits it into chunks:
```
@register
def read(self):
    """Load from word path."""
    loader = UnstructuredWordDocumentLoader(self.file_path)
    textsplitter = CHNDocumentSplitter(
        pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    # strip newlines from each chunk before embedding
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```
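
Usage mirrors the Markdown example; a minimal sketch (the config keys and the source_embedding() entry point are assumptions):

```
# all names except WordEmbedding itself are assumptions
embedding = WordEmbedding(
    file_path="./pilot/datasets/example.docx",
    vector_store_config={"vector_store_name": "default"},
)
embedding.source_embedding()  # assumed pipeline: read -> data_process -> index_to_store
```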