doc:update knowledge api document

This commit is contained in:
aries_ckt 2023-07-12 16:33:34 +08:00
parent 16d6ce8c89
commit 30adbaf4fd
12 changed files with 90 additions and 12 deletions

View File

@@ -63,6 +63,7 @@ https://github.com/csunny/DB-GPT/assets/13723926/55f31781-1d49-4757-b96e-7ef6d3d
</p>
## Releases
- [2023/07/12]🔥🔥🔥DB-GPT python api package 0.3.0. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/installation.html)
- [2023/07/06]🔥🔥🔥Brand-new DB-GPT product with a brand-new web UI. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html)
- [2023/06/25]🔥support chatglm2-6b model. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)
- [2023/06/14] support gpt4all model, which can run at M1/M2, or cpu machine. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)

View File

@@ -25,14 +25,14 @@ $ docker run --name=mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=aa12345678 -dit my
We use [Chroma embedding database](https://github.com/chroma-core/chroma) as the default for our vector database, so there is no need for special installation. If you choose to connect to other databases, you can follow our tutorial for installation and configuration.
For the entire installation process of DB-GPT, we use the miniconda3 virtual environment. Create a virtual environment and install the Python dependencies.
-```
+```{tip}
python>=3.10
conda create -n dbgpt_env python=3.10
conda activate dbgpt_env
pip install -r requirements.txt
```
Before using DB-GPT Knowledge Management, download the spaCy model:
-```
+```{tip}
python -m spacy download zh_core_web_sm
```
@@ -40,7 +40,7 @@ python -m spacy download zh_core_web_sm
Once the environment is installed, we have to create a new folder "models" in the DB-GPT project, and then we can put all the models downloaded from huggingface in this directory
Note: make sure you have installed git-lfs.
-```
+```{tip}
git clone https://huggingface.co/Tribbiani/vicuna-13b
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
@@ -49,7 +49,7 @@ git clone https://huggingface.co/THUDM/chatglm2-6b
The model files are large and will take a long time to download. While they download, configure the .env file, which is created by copying .env.template:
-```
+```{tip}
cp .env.template .env
```
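The .env file is a plain list of KEY=VALUE settings. As a rough illustration of the format (the key names and values below are hypothetical, not taken from .env.template), a minimal parser sketch:

```python
# Minimal .env parser sketch. Assumes simple KEY=VALUE lines with optional
# "#" comments; keys and values shown are hypothetical examples.
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# LLM settings (hypothetical example values)
LLM_MODEL=vicuna-13b
EMBEDDING_MODEL=text2vec
"""
config = parse_env(sample)
print(config["LLM_MODEL"])  # vicuna-13b
```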

View File

@@ -0,0 +1,29 @@
# Installation
DB-GPT provides a standalone Python API package that you can integrate into your own code.
### Installation from Pip
You can simply pip install:
```{tip}
pip install -i https://pypi.org/ db-gpt==0.3.0
```
Note: make sure python>=3.10.
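A quick runtime check for the version requirement (a small helper sketch, not part of the db-gpt package):

```python
import sys

def meets_requirement(version_info, minimum=(3, 10)):
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

# Check the running interpreter
print("Python >= 3.10:", meets_requirement(sys.version_info))
```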
### Environment Setup
By default, if you use the EmbeddingEngine API, you need to prepare embedding models from Hugging Face.
Note: make sure you have installed git-lfs.
```{tip}
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
```
Version:
- db-gpt 0.3.0
- [embedding_engine api](https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html)
- [multi source embedding](https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html)
- [vector connector](https://db-gpt.readthedocs.io/en/latest/modules/vector.html)

View File

@@ -48,6 +48,7 @@ Getting Started
:hidden:
getting_started/getting_started.md
getting_started/installation.md
getting_started/concepts.md
getting_started/tutorials.md

View File

@@ -105,7 +105,12 @@ Document type can be .txt, .pdf, .md, .doc, .ppt.
Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer configuration is not enough, it is recommended to use text2vec-base-chinese). Therefore, ensure that you download the model and place it in the models directory.
-- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding.
+- `pdf_embedding <./knowledge/pdf/pdf_embedding.html>`_: supported pdf embedding.
- `markdown_embedding <./knowledge/markdown/markdown_embedding.html>`_: supported markdown embedding.
- `word_embedding <./knowledge/word/word_embedding.html>`_: supported word embedding.
- `url_embedding <./knowledge/url/url_embedding.html>`_: supported url embedding.
- `ppt_embedding <./knowledge/ppt/ppt_embedding.html>`_: supported ppt embedding.
- `string_embedding <./knowledge/string/string_embedding.html>`_: supported raw text embedding.
.. toctree::
@@ -118,4 +123,5 @@ Note that the default vector model used is text2vec-large-chinese (which is a la
./knowledge/markdown/markdown_embedding.md
./knowledge/word/word_embedding.md
./knowledge/url/url_embedding.md
./knowledge/ppt/ppt_embedding.md
./knowledge/string/string_embedding.md

View File

@@ -1,4 +1,4 @@
-MarkdownEmbedding
+Markdown
==================================
markdown embedding can import md text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-PDFEmbedding
+PDF
==================================
pdf embedding can import PDF text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-PPTEmbedding
+PPT
==================================
ppt embedding can import ppt text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -0,0 +1,41 @@
String
==================================
string embedding can import long raw text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
First, inherit from SourceEmbedding:
```
class StringEmbedding(SourceEmbedding):
    """string embedding for read string document."""

    def __init__(
        self,
        file_path,
        vector_store_config,
        text_splitter: Optional[TextSplitter] = None,
    ):
        """Initialize raw text word path."""
        super().__init__(file_path=file_path, vector_store_config=vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
Then implement read() and data_process().
The read() method allows you to load the data and split it into chunks:
```
@register
def read(self):
    """Load from String path."""
    metadata = {"source": "raw text"}
    return [Document(page_content=self.file_path, metadata=metadata)]
```
The data_process() method allows you to pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
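Putting the pieces together, here is a self-contained sketch of the read → data_process flow. The Document and SourceEmbedding classes below are simplified stand-ins for DB-GPT's real classes (which also handle index_to_store and the @register decorator), kept minimal so the example runs on its own:

```python
# Self-contained sketch of the read -> data_process flow, using simplified
# stand-ins for DB-GPT's Document and SourceEmbedding (not the real classes).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:  # stand-in for the real Document type
    page_content: str
    metadata: dict = field(default_factory=dict)

class SourceEmbedding:  # stand-in base: run read, then data_process
    def source_embed(self) -> List[Document]:
        return self.data_process(self.read())

class StringEmbedding(SourceEmbedding):
    def __init__(self, file_path: str):
        # for string embedding, file_path is the raw text itself
        self.file_path = file_path

    def read(self) -> List[Document]:
        return [Document(page_content=self.file_path, metadata={"source": "raw text"})]

    def data_process(self, documents: List[Document]) -> List[Document]:
        for i, d in enumerate(documents):
            documents[i].page_content = d.page_content.replace("\n", "")
        return documents

docs = StringEmbedding("line one\nline two").source_embed()
print(docs[0].page_content)  # line oneline two
```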

View File

@@ -1,4 +1,4 @@
-URL Embedding
+URL
==================================
url embedding can import web page text from a URL into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-WordEmbedding
+Word
==================================
word embedding can import word doc/docx text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -24,7 +24,7 @@ class StringEmbedding(SourceEmbedding):
    @register
    def read(self):
        """Load from String path."""
-        metadata = {"source": "db_summary"}
+        metadata = {"source": "raw text"}
        return [Document(page_content=self.file_path, metadata=metadata)]
    @register