doc:update knowledge api document

This commit is contained in:
aries_ckt 2023-07-12 16:33:34 +08:00
parent 16d6ce8c89
commit 30adbaf4fd
12 changed files with 90 additions and 12 deletions

View File

@@ -63,6 +63,7 @@ https://github.com/csunny/DB-GPT/assets/13723926/55f31781-1d49-4757-b96e-7ef6d3d
</p>
## Releases
- [2023/07/12]🔥🔥🔥DB-GPT python api package 0.3.0. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/installation.html)
- [2023/07/06]🔥🔥🔥Brand-new DB-GPT product with a brand-new web UI. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html)
- [2023/06/25]🔥support chatglm2-6b model. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)
- [2023/06/14] support gpt4all model, which can run at M1/M2, or cpu machine. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)

View File

@@ -25,14 +25,14 @@ $ docker run --name=mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=aa12345678 -dit my
We use [Chroma embedding database](https://github.com/chroma-core/chroma) as the default for our vector database, so there is no need for special installation. If you choose to connect to other databases, you can follow our tutorial for installation and configuration.
For the entire installation process of DB-GPT, we use the miniconda3 virtual environment. Create a virtual environment and install the Python dependencies.
-```
+```{tip}
python>=3.10
conda create -n dbgpt_env python=3.10
conda activate dbgpt_env
pip install -r requirements.txt
```
Before using DB-GPT Knowledge Management, download the spaCy model:
-```
+```{tip}
python -m spacy download zh_core_web_sm
```
@@ -40,7 +40,7 @@ python -m spacy download zh_core_web_sm
Once the environment is installed, we have to create a new folder "models" in the DB-GPT project, and then we can put all the models downloaded from huggingface in this directory
Note: make sure you have installed git-lfs.
-```
+```{tip}
git clone https://huggingface.co/Tribbiani/vicuna-13b
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
@@ -49,7 +49,7 @@ git clone https://huggingface.co/THUDM/chatglm2-6b
The model files are large and will take a long time to download. While they download, configure the .env file, which is created by copying .env.template:
-```
+```{tip}
cp .env.template .env
```
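The .env file is a plain list of KEY=VALUE settings. As a rough illustration of the format (the key names and values below are hypothetical, not taken from .env.template), a minimal parser sketch:

```python
# Minimal .env parser sketch. Assumes simple KEY=VALUE lines with optional
# "#" comments; keys and values shown are hypothetical examples.
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# LLM settings (hypothetical example values)
LLM_MODEL=vicuna-13b
EMBEDDING_MODEL=text2vec
"""
config = parse_env(sample)
print(config["LLM_MODEL"])  # vicuna-13b
```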

View File

@@ -0,0 +1,29 @@
# Installation
DB-GPT provides a standalone Python API package that you can integrate into your own code.
### Installation from Pip
You can simply pip install:
```{tip}
pip install -i https://pypi.org/ db-gpt==0.3.0
```
Note: make sure python>=3.10.
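A quick runtime check for the version requirement (a small helper sketch, not part of the db-gpt package):

```python
import sys

def meets_requirement(version_info, minimum=(3, 10)):
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

# Check the running interpreter
print("Python >= 3.10:", meets_requirement(sys.version_info))
```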
### Environment Setup
By default, if you use the EmbeddingEngine API, you need to prepare embedding models from Hugging Face.
Note: make sure you have installed git-lfs.
```{tip}
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
```
Version:
- db-gpt 0.3.0
- [embedding_engine api](https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html)
- [multi source embedding](https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html)
- [vector connector](https://db-gpt.readthedocs.io/en/latest/modules/vector.html)

View File

@@ -48,6 +48,7 @@ Getting Started
:hidden:
getting_started/getting_started.md
getting_started/installation.md
getting_started/concepts.md
getting_started/tutorials.md

View File

@@ -105,7 +105,12 @@ Document type can be .txt, .pdf, .md, .doc, .ppt.
Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer configuration is not enough, it is recommended to use text2vec-base-chinese). Therefore, ensure that you download the model and place it in the models directory.
-- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding.
+- `pdf_embedding <./knowledge/pdf/pdf_embedding.html>`_: supported pdf embedding.
- `markdown_embedding <./knowledge/markdown/markdown_embedding.html>`_: supported markdown embedding.
- `word_embedding <./knowledge/word/word_embedding.html>`_: supported word embedding.
- `url_embedding <./knowledge/url/url_embedding.html>`_: supported url embedding.
- `ppt_embedding <./knowledge/ppt/ppt_embedding.html>`_: supported ppt embedding.
- `string_embedding <./knowledge/string/string_embedding.html>`_: supported raw text embedding.
.. toctree::
@@ -118,4 +123,5 @@ Note that the default vector model used is text2vec-large-chinese (which is a la
./knowledge/markdown/markdown_embedding.md
./knowledge/word/word_embedding.md
./knowledge/url/url_embedding.md
./knowledge/ppt/ppt_embedding.md
./knowledge/string/string_embedding.md

View File

@@ -1,4 +1,4 @@
-MarkdownEmbedding
+Markdown
==================================
markdown embedding can import md text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-PDFEmbedding
+PDF
==================================
pdf embedding can import PDF text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-PPTEmbedding
+PPT
==================================
ppt embedding can import ppt text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -0,0 +1,41 @@
String
==================================
string embedding can import long raw text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
First, inherit from SourceEmbedding:
```
class StringEmbedding(SourceEmbedding):
    """string embedding for read string document."""

    def __init__(
        self,
        file_path,
        vector_store_config,
        text_splitter: Optional[TextSplitter] = None,
    ):
        """Initialize raw text word path."""
        super().__init__(file_path=file_path, vector_store_config=vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
Then implement read() and data_process().
The read() method allows you to load the data and split it into chunks:
```
@register
def read(self):
    """Load from String path."""
    metadata = {"source": "raw text"}
    return [Document(page_content=self.file_path, metadata=metadata)]
```
The data_process() method allows you to pre-process the documents in your own way:
```
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
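Putting the pieces together, here is a self-contained sketch of the read → data_process flow. The Document and SourceEmbedding classes below are simplified stand-ins for DB-GPT's real classes (which also handle index_to_store and the @register decorator), kept minimal so the example runs on its own:

```python
# Self-contained sketch of the read -> data_process flow, using simplified
# stand-ins for DB-GPT's Document and SourceEmbedding (not the real classes).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:  # stand-in for the real Document type
    page_content: str
    metadata: dict = field(default_factory=dict)

class SourceEmbedding:  # stand-in base: run read, then data_process
    def source_embed(self) -> List[Document]:
        return self.data_process(self.read())

class StringEmbedding(SourceEmbedding):
    def __init__(self, file_path: str):
        # for string embedding, file_path is the raw text itself
        self.file_path = file_path

    def read(self) -> List[Document]:
        return [Document(page_content=self.file_path, metadata={"source": "raw text"})]

    def data_process(self, documents: List[Document]) -> List[Document]:
        for i, d in enumerate(documents):
            documents[i].page_content = d.page_content.replace("\n", "")
        return documents

docs = StringEmbedding("line one\nline two").source_embed()
print(docs[0].page_content)  # line oneline two
```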

View File

@@ -1,4 +1,4 @@
-URL Embedding
+URL
==================================
url embedding can import web page text from a URL into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -1,4 +1,4 @@
-WordEmbedding
+Word
==================================
word embedding can import word doc/docx text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.

View File

@@ -24,7 +24,7 @@ class StringEmbedding(SourceEmbedding):
    @register
    def read(self):
        """Load from String path."""
-        metadata = {"source": "db_summary"}
+        metadata = {"source": "raw text"}
        return [Document(page_content=self.file_path, metadata=metadata)]
    @register