mirror of https://github.com/csunny/DB-GPT.git (synced 2025-09-08 20:39:44 +00:00)

Merge branch 'dbgpt_doc' into ty_test

# Conflicts:
#	pilot/common/plugins.py
@@ -1,4 +0,0 @@
# Connections

In order to interact more conveniently with users' private environments, the project has designed a connection module, which can support connections to databases, Excel files, knowledge bases, and other environments to achieve information and data exchange.
16	docs/modules/connections.rst	Normal file
@@ -0,0 +1,16 @@
Connections
-----------

**In order to interact more conveniently with users' private environments, the project has designed a connection module, which can support connections to databases, Excel files, knowledge bases, and other environments to achieve information and data exchange.**

DB-GPT provides the base class BaseConnect; you can inherit from it and implement get_session(), get_table_names(), get_index_info(), get_database_list() and run().
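
A minimal sketch of a custom connector is shown below. The five method names come from BaseConnect above; the import path, the example class, and the return values are illustrative assumptions only.

.. code-block:: python

    from pilot.connections.base import BaseConnect  # assumed import path


    class CSVConnect(BaseConnect):
        """Hypothetical connector exposing a single CSV file as one table."""

        def get_session(self):
            # Return the handle your backend uses to execute queries.
            ...

        def get_table_names(self):
            return ["my_csv_table"]

        def get_index_info(self):
            return ""  # a flat CSV file has no indexes

        def get_database_list(self):
            return ["default"]

        def run(self, sql: str):
            # Execute the query against the CSV source and return rows.
            ...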

- `mysql_connection <./connections/mysql_connection.html>`_: supported mysql_connection.

.. toctree::
   :maxdepth: 2
   :caption: Connections
   :name: mysql_connection
   :hidden:

   ./connections/mysql/mysql_connection.md
18	docs/modules/connections/mysql/mysql_connection.md	Normal file
@@ -0,0 +1,18 @@
MySQL Connection
==================================

MySQLConnect connects to a MySQL server and fetches its metadata.

It inherits from RDBMSDatabase:

```python
class MySQLConnect(RDBMSDatabase):
    """Connect to a MySQL database and fetch metadata.

    Args:
    Usage:
    """

    type: str = "MySQL"
    dialect: str = "mysql"
    driver: str = "pymysql"

    default_db = ["information_schema", "performance_schema", "sys", "mysql"]
```
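
The dialect and driver fields follow SQLAlchemy naming conventions, so the connection URI that RDBMSDatabase presumably assembles for this class would take the standard form below (the credential values are placeholders):

```python
# Standard SQLAlchemy URI format for dialect "mysql" with driver "pymysql";
# the credentials below are placeholders, not project defaults.
user, password, host, port, database = "mock", "mock", "127.0.0.1", 3306, "mock"
uri = f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"
```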
40	docs/modules/knowledge.rst	Normal file
@@ -0,0 +1,40 @@
Knowledge
---------

| As the knowledge base is currently the most significant user demand scenario, we natively support the construction and processing of knowledge bases. We also provide multiple knowledge base management strategies in this project, such as pdf knowledge, md knowledge, txt knowledge, word knowledge, and ppt knowledge.

**Create your own knowledge repository**

1. Place personal knowledge files or folders in the pilot/datasets directory.

   We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.

   Before execution, run: python -m spacy download zh_core_web_sm

2. Update your .env file and set your vector store type, e.g. VECTOR_STORE_TYPE=Chroma (currently only Chroma and Milvus are supported; if you set Milvus, also set MILVUS_URL and MILVUS_PORT). A minimal .env fragment is shown at the end of this section.

3. Run the knowledge repository script in the tools directory: python tools/knowledge_init.py

   Note: --vector_name sets your vector store name (default value: default).

4. Add the knowledge repository in the interface by entering the name of your knowledge repository (if not specified, enter "default"); you can then use it for Q&A based on your knowledge base.

Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer's configuration is not sufficient, text2vec-base-chinese is recommended). Either way, make sure you download the model and place it in the models directory.
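
A minimal .env fragment for step 2 (the values are illustrative; the Milvus lines are only needed when VECTOR_STORE_TYPE=Milvus, and 19530 is Milvus's default port)::

    VECTOR_STORE_TYPE=Chroma
    # MILVUS_URL=127.0.0.1
    # MILVUS_PORT=19530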

- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding.

.. toctree::
   :maxdepth: 2
   :caption: Knowledge
   :name: pdf_embedding
   :hidden:

   ./knowledge/pdf/pdf_embedding.md
   ./knowledge/markdown/markdown_embedding.md
   ./knowledge/word/word_embedding.md
   ./knowledge/url/url_embedding.md
   ./knowledge/ppt/ppt_embedding.md
42	docs/modules/knowledge/markdown/markdown_embedding.md	Normal file
@@ -0,0 +1,42 @@
MarkdownEmbedding
==================================

Markdown embedding can import md text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

It inherits from SourceEmbedding:

```python
class MarkdownEmbedding(SourceEmbedding):
    """markdown embedding for reading markdown documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with markdown path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```
Implement read() and data_process().

The read() method reads the data and splits it into chunks:

```python
@register
def read(self):
    """Load from markdown path."""
    loader = EncodeTextLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```

The data_process() method lets you pre-process the documents in your own way; here it strips newlines from every chunk:

```python
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
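
A hypothetical end-to-end usage sketch follows. The vector_store_config keys and the direct chaining of the pipeline steps are assumptions; check the SourceEmbedding base class for the actual entry point.

```python
# Illustrative only: config keys and the call sequence are assumed.
embedding = MarkdownEmbedding(
    file_path="./pilot/datasets/example.md",
    vector_store_config={"vector_store_name": "default"},
)
docs = embedding.read()              # load the file and split it into chunks
docs = embedding.data_process(docs)  # strip newlines from each chunk
# index_to_store(docs) would then write the chunks to the vector database.
```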
43	docs/modules/knowledge/pdf/pdf_embedding.md	Normal file
@@ -0,0 +1,43 @@
PDFEmbedding
==================================

PDF embedding can import PDF text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

It inherits from SourceEmbedding:

```python
class PDFEmbedding(SourceEmbedding):
    """pdf embedding for reading pdf documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with pdf path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method reads the data and splits it into chunks:

```python
@register
def read(self):
    """Load from pdf path."""
    loader = PyPDFLoader(self.file_path)
    # textsplitter = CHNDocumentSplitter(
    #     pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    # )
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=100,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way; here it strips newlines from every chunk:

```python
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
40	docs/modules/knowledge/ppt/ppt_embedding.md	Normal file
@@ -0,0 +1,40 @@
PPTEmbedding
==================================

PPT embedding can import ppt text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

It inherits from SourceEmbedding:

```python
class PPTEmbedding(SourceEmbedding):
    """ppt embedding for reading ppt documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with ppt path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method reads the data and splits it into chunks:

```python
@register
def read(self):
    """Load from ppt path."""
    loader = UnstructuredPowerPointLoader(self.file_path)
    textsplitter = SpacyTextSplitter(
        pipeline="zh_core_web_sm",
        chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
        chunk_overlap=200,
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way; here it strips newlines from every chunk:

```python
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
47	docs/modules/knowledge/url/url_embedding.md	Normal file
@@ -0,0 +1,47 @@
URL Embedding
==================================

URL embedding can import the text of a web page into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

It inherits from SourceEmbedding:

```python
class URLEmbedding(SourceEmbedding):
    """url embedding for reading url documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with url path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method reads the data and splits it into chunks:

```python
@register
def read(self):
    """Load from url path."""
    loader = WebBaseLoader(web_path=self.file_path)
    if CFG.LANGUAGE == "en":
        text_splitter = CharacterTextSplitter(
            chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
            chunk_overlap=20,
            length_function=len,
        )
    else:
        text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=1000)
    return loader.load_and_split(text_splitter)
```
The data_process() method lets you pre-process the documents in your own way; here it strips newlines and uses BeautifulSoup to drop HTML markup, keeping only the page text:

```python
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        content = d.page_content.replace("\n", "")
        soup = BeautifulSoup(content, "html.parser")
        for tag in soup(["!doctype", "meta"]):
            tag.extract()
        documents[i].page_content = soup.get_text()
        i += 1
    return documents
```
38	docs/modules/knowledge/word/word_embedding.md	Normal file
@@ -0,0 +1,38 @@
WordEmbedding
==================================

Word embedding can import word doc/docx text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

It inherits from SourceEmbedding:

```python
class WordEmbedding(SourceEmbedding):
    """word embedding for reading word documents."""

    def __init__(self, file_path, vector_store_config):
        """Initialize with word path."""
        super().__init__(file_path, vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
```

Implement read() and data_process().

The read() method reads the data and splits it into chunks:

```python
@register
def read(self):
    """Load from word path."""
    loader = UnstructuredWordDocumentLoader(self.file_path)
    textsplitter = CHNDocumentSplitter(
        pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
    )
    return loader.load_and_split(textsplitter)
```
The data_process() method lets you pre-process the documents in your own way; here it strips newlines from every chunk:

```python
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
@@ -86,4 +86,25 @@ class ChatGLMChatAdapter(BaseChatAdpter):

        return chatglm_generate_stream
```

If you want to integrate your own model, you just need to inherit from BaseLLMAdaper and BaseChatAdpter and implement their methods.
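
A minimal sketch of such an adapter pair is shown below. Apart from returning a generate-stream function as ChatGLMChatAdapter does above, the method names and signatures here are assumptions for illustration, not confirmed API:

```python
# Illustrative only: match()/loader() names and signatures are assumed.
class MyModelAdapter(BaseLLMAdaper):
    def match(self, model_path: str) -> bool:
        # Claim any checkpoint whose path mentions our model name.
        return "my-model" in model_path

    def loader(self, model_path: str, from_pretrained_kwargs: dict):
        # Load and return (model, tokenizer) for the checkpoint.
        ...


class MyModelChatAdapter(BaseChatAdpter):
    def match(self, model_path: str) -> bool:
        return "my-model" in model_path

    def get_generate_stream_func(self):
        # Return your model's streaming generate function,
        # mirroring chatglm_generate_stream above.
        return my_model_generate_stream
```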

## Multi Proxy LLMs

### 1. OpenAI proxy

If you haven't deployed a private infrastructure for a large model, or if you want to use DB-GPT in a low-cost and high-efficiency way, you can also use OpenAI's large model as your underlying model.

- If the environment where you deploy DB-GPT has access to OpenAI, then modifying the .env configuration file as below will work.
```
LLM_MODEL=proxy_llm
MODEL_SERVER=127.0.0.1:8000
PROXY_API_KEY=sk-xxx
PROXY_SERVER_URL=https://api.openai.com/v1/chat/completions
```

- If you can't access OpenAI locally but have an OpenAI proxy service, you can configure it as follows.
```
LLM_MODEL=proxy_llm
MODEL_SERVER=127.0.0.1:8000
PROXY_API_KEY=sk-xxx
PROXY_SERVER_URL={your-openai-proxy-server/v1/chat/completions}
```
@@ -1,3 +1,98 @@
# Plugins

The ability of Agents and Plugins is the core of whether large models can be automated. In this project, we natively support the plugin mode, and large models can automatically achieve their goals. At the same time, to take full advantage of the community, the plugins used in this project natively support the Auto-GPT plugin ecosystem; that is, Auto-GPT plugins can run directly in our project.

## Local Plugins

### 1.1 How to write local plugins

- Local plugins use the Auto-GPT plugin template. A simple example follows: first, write a plugin file called "sql_executor.py".
```python
import pymysql
import pymysql.cursors


def get_conn():
    return pymysql.connect(
        host="127.0.0.1",
        port=int("2883"),
        user="mock",
        password="mock",
        database="mock",
        charset="utf8mb4",
        ssl_ca=None,
    )


def ob_sql_executor(sql: str):
    try:
        conn = get_conn()
        with conn.cursor() as cursor:
            cursor.execute(sql)
            result = cursor.fetchall()
            field_names = tuple(i[0] for i in cursor.description)
            result = list(result)
            result.insert(0, field_names)
        return result
    except pymysql.err.ProgrammingError as e:
        return str(e)
```
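
For a quick local smoke test (assuming the mock database configured in get_conn() above is reachable), the command function can be called directly:

```python
# Illustrative check; requires the mock instance configured in get_conn().
rows = ob_sql_executor("SELECT 1")
print(rows)  # the first element is the tuple of column names
```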

Then make the "can_handle_post_prompt" method of the plugin template return True. In the "post_prompt" method, write the prompt information and the mapped plugin function.

```python
"""This is a template for DB-GPT plugins."""
from typing import Any, Dict, List, Optional, Tuple, TypeVar, TypedDict

from auto_gpt_plugin_template import AutoGPTPluginTemplate

PromptGenerator = TypeVar("PromptGenerator")


class Message(TypedDict):
    role: str
    content: str


class DBGPTOceanBase(AutoGPTPluginTemplate):
    """
    This is a DB-GPT plugin to connect OceanBase.
    """

    def __init__(self):
        super().__init__()
        self._name = "DB-GPT-OB-Serverless-Plugin"
        self._version = "0.1.0"
        self._description = "This is a DB-GPT plugin to connect OceanBase."

    def can_handle_post_prompt(self) -> bool:
        return True

    def post_prompt(self, prompt: PromptGenerator) -> PromptGenerator:
        from .sql_executor import ob_sql_executor

        prompt.add_command(
            "ob_sql_executor",
            "Execute SQL in OceanBase Database.",
            {"sql": "<sql>"},
            ob_sql_executor,
        )
        return prompt

    ...
```

### 1.2 How to use local plugins

- Pack your plugin project into `your-plugin.zip` and place it in the `/plugins/` directory of the DB-GPT project. After starting the webserver, you can select and use it in the `Plugin Model` section.

## Public Plugins

### 2.1 How to use public plugins

- By default, after launching the webserver, plugins from the public plugin library `DB-GPT-Plugins` are loaded automatically. For more details, please refer to [DB-GPT-Plugins](https://github.com/csunny/DB-GPT-Plugins).

### 2.2 Contribute to the DB-GPT-Plugins repository

- Please refer to the plugin development process in the public plugin library, and put the configuration parameters in `.plugin_env`.

- We warmly welcome everyone to contribute plugins to the public plugin library!