feat:dbgpt api 0.3.0 (#319)

1. EmbeddingEngine: provides knowledge_embedding() and similar_search()
2. Multi SourceEmbedding
3. Docs for installation
4. Fix Chroma exit bug
magic.chen 2023-07-14 13:47:06 +08:00 committed by GitHub
commit 75115f1175
52 changed files with 1101 additions and 511 deletions

View File

@ -62,16 +62,6 @@ https://github.com/csunny/DB-GPT/assets/13723926/55f31781-1d49-4757-b96e-7ef6d3d
<img src="./assets/chat_knowledge.png" width="800px" />
</p>
## Releases
- [2023/07/06]🔥🔥🔥 Brand-new DB-GPT product with a brand-new web UI. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html)
- [2023/06/25]🔥 Support for the chatglm2-6b model. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)
- [2023/06/14] Support for the gpt4all model, which can run on M1/M2 or CPU machines. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html)
- [2023/06/01]🔥 Task-chain calls implemented through plugins on top of the Vicuna-13B base model, e.g. creating a database from a single sentence. [demo](./assets/auto_plugin.gif)
- [2023/06/01]🔥 QLoRA Guanaco (7B, 13B, 33B) support.
- [2023/05/28] Learn from data crawled from the Internet. [demo](./assets/dbgpt_demo.gif)
- [2023/05/21] Generate SQL and execute it automatically. [demo](./assets/chat-data.gif)
- [2023/05/15] Chat with documents. [demo](./assets/new_knownledge_en.gif)
- [2023/05/06] SQL generation and diagnosis. [demo](./assets/demo_en.gif)
## Features

View File

@ -65,17 +65,6 @@ https://github.com/csunny/DB-GPT/assets/13723926/55f31781-1d49-4757-b96e-7ef6d3d
<img src="./assets/chat_knowledge.png" width="800px" />
</p>
## Releases
- [2023/07/06]🔥🔥🔥 Brand-new DB-GPT product. [documents](https://db-gpt.readthedocs.io/projects/db-gpt-docs-zh-cn/zh_CN/latest/getting_started/getting_started.html)
- [2023/06/25]🔥 Support for the ChatGLM2-6B model. [documents](https://db-gpt.readthedocs.io/projects/db-gpt-docs-zh-cn/zh_CN/latest/modules/llms.html)
- [2023/06/14]🔥 Support for the gpt4all model, which can run on M1/M2 or CPU machines. [documents](https://db-gpt.readthedocs.io/projects/db-gpt-docs-zh-cn/zh_CN/latest/modules/llms.html)
- [2023/06/01]🔥 Task-chain calls implemented through plugins on top of the Vicuna-13B base model, e.g. creating a database from a single sentence.
- [2023/06/01]🔥 QLoRA Guanaco support; an RTX 4090 can run the 33B model.
- [2023/05/28]🔥 Chat based on a URL. [demo](./assets/chat_url_zh.gif)
- [2023/05/21] SQL generation and automatic execution. [demo](./assets/auto_sql.gif)
- [2023/05/15] Chat with knowledge base. [demo](./assets/new_knownledge.gif)
- [2023/05/06] SQL generation and diagnosis. [demo](./assets/演示.gif)
## Features
We have released a number of key features; the list below showcases the capabilities currently available.

View File

@ -25,22 +25,25 @@ $ docker run --name=mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=aa12345678 -dit my
We use [Chroma embedding database](https://github.com/chroma-core/chroma) as the default for our vector database, so there is no need for special installation. If you choose to connect to other databases, you can follow our tutorial for installation and configuration.
For the entire installation process of DB-GPT, we use the miniconda3 virtual environment. Create a virtual environment and install the Python dependencies.
```bash
# requires python >= 3.10
conda create -n dbgpt_env python=3.10
conda activate dbgpt_env
pip install -r requirements.txt
```
Before using DB-GPT knowledge management, run:
```bash
python -m spacy download zh_core_web_sm
```
Once the environment is installed, create a new folder "models" in the DB-GPT project root, then put all the models downloaded from huggingface into this directory:
```{tip}
Notice: make sure you have installed git-lfs
```
```bash
git clone https://huggingface.co/Tribbiani/vicuna-13b
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
@ -49,7 +52,7 @@ git clone https://huggingface.co/THUDM/chatglm2-6b
The model files are large and will take a long time to download. While they download, configure the .env file, which needs to be created by copying .env.template:
```{tip}
cp .env.template .env
```
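You can then set a few basic parameters in .env, for example pointing LLM_MODEL at the model to be used. A minimal illustrative sketch; the variable names below are the ones this guide mentions, the values are placeholders, and .env.template documents the full set:
```bash
# illustrative values only -- see .env.template for the full list
LLM_MODEL=vicuna-13b
VECTOR_STORE_TYPE=Chroma
# only needed when VECTOR_STORE_TYPE=Milvus
# MILVUS_URL=127.0.0.1
# MILVUS_PORT=19530
```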

View File

@ -0,0 +1,35 @@
# Installation
DB-GPT provides a third-party Python API package that you can integrate into your own code.
### Installation from Pip
You can simply pip install:
```bash
pip install -i https://pypi.org/simple/ db-gpt==0.3.0
```
```{tip}
Notice: make sure python>=3.10
```
### Environment Setup
By default, if you use the EmbeddingEngine API, you will need to prepare embedding models from huggingface:
```{tip}
Notice: make sure you have installed git-lfs
```
```bash
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese
```
Version:
- db-gpt 0.3.0
- [embedding_engine api](https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html)
- [multi source embedding](https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html)
- [vector connector](https://db-gpt.readthedocs.io/en/latest/modules/vector.html)
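As a quick smoke test that the package and an embedding model are wired up correctly, here is a minimal sketch based on the EmbeddingEngine examples in the Knowledge module docs (the document path, model path, and vector store settings are placeholders):
```python
from pilot import EmbeddingEngine, KnowledgeType

embedding_engine = EmbeddingEngine(
    knowledge_source="your_path/test.md",  # placeholder document path
    knowledge_type=KnowledgeType.DOCUMENT.value,
    model_name="your_model_path/all-MiniLM-L6-v2",  # embedding model cloned above
    vector_store_config={
        "vector_store_type": "Chroma",
        "vector_store_name": "your_name",
        "chroma_persist_path": "your_persist_dir",
    },
)
embedding_engine.knowledge_embedding()  # embed the document into the vector store
docs = embedding_engine.similar_search("your question", 5)  # top-5 similar chunks
```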

View File

@ -48,6 +48,7 @@ Getting Started
:hidden:
getting_started/getting_started.md
getting_started/installation.md
getting_started/concepts.md
getting_started/tutorials.md

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-07-05 17:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -19,29 +19,29 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../getting_started/getting_started.md:1 2e1519d628044c07b384e8bbe441863a
#: ../../getting_started/getting_started.md:1 0b2e795438a3413c875fd80191e85bad
msgid "Quickstart Guide"
msgstr "使用指南"
#: ../../getting_started/getting_started.md:3 00e8dc6e242d4f3b8b2fbc5e06f1f14e
#: ../../getting_started/getting_started.md:3 7b84c9776f8a4f9fb55afc640f37f45c
msgid ""
"This tutorial gives you a quick walkthrough about use DB-GPT with you "
"environment and data."
msgstr "本教程为您提供了关于如何使用DB-GPT的使用指南。"
#: ../../getting_started/getting_started.md:5 4b4473a5fbd64cef996d82fa36abe136
#: ../../getting_started/getting_started.md:5 1b2880e1ef674bfdbf39ac9f330aeec9
msgid "Installation"
msgstr "安装"
#: ../../getting_started/getting_started.md:7 5ab3187dd2134afe958d83a431c98f43
#: ../../getting_started/getting_started.md:7 d0a8c6654bfe4bbdb0eb40ceb2ea3388
msgid "To get started, install DB-GPT with the following steps."
msgstr "请按照以下步骤安装DB-GPT"
#: ../../getting_started/getting_started.md:9 7286e3a0da00450c9a6e9f29dbd27130
#: ../../getting_started/getting_started.md:9 0a4e0b06c7fe49a9b2ca56ba2eb7b8ba
msgid "1. Hardware Requirements"
msgstr "1. 硬件要求"
#: ../../getting_started/getting_started.md:10 3f3d279ca8a54c8c8ed16af3e0ffb281
#: ../../getting_started/getting_started.md:10 2b42f6546ef141f696943ba2120584e5
msgid ""
"As our project has the ability to achieve ChatGPT performance of over "
"85%, there are certain hardware requirements. However, overall, the "
@ -49,62 +49,62 @@ msgid ""
"specific hardware requirements for deployment are as follows:"
msgstr "由于我们的项目有能力达到85%以上的ChatGPT性能所以对硬件有一定的要求。但总体来说我们在消费级的显卡上即可完成项目的部署使用具体部署的硬件说明如下:"
#: ../../getting_started/getting_started.md 6e1e882511254687bd46fe45447794d1
#: ../../getting_started/getting_started.md 4df0c44eff8741f39ca0fdeff222f90c
msgid "GPU"
msgstr "GPU"
#: ../../getting_started/getting_started.md f0ee9919e1254bcdbe6e489a5fbf450f
#: ../../getting_started/getting_started.md b740a2991ce546cca43a426b760e9901
msgid "VRAM Size"
msgstr "显存大小"
#: ../../getting_started/getting_started.md eed88601ef0b49b58d95b89928a3810e
#: ../../getting_started/getting_started.md 222b91ff82f14d12acaac5aa238758c8
msgid "Performance"
msgstr "显存大小"
#: ../../getting_started/getting_started.md 4f717383ef2d4e2da9ee2d1c148aa6c5
#: ../../getting_started/getting_started.md c2d2ae6a4c964c4f90a9009160754782
msgid "RTX 4090"
msgstr "RTX 4090"
#: ../../getting_started/getting_started.md d2d9bd1b57694404b39cdef49fd5b570
#: d7d914b8d5e34ac192b94d48f0ee1781
#: ../../getting_started/getting_started.md 529220ec6a294e449dc460ba2e8829a1
#: 5e0c5900842e4d66b2064b13cc31a3ad
msgid "24 GB"
msgstr "24 GB"
#: ../../getting_started/getting_started.md cb86730ab05e4172941c3e771384c4ba
#: ../../getting_started/getting_started.md 84d29eef342f4d6282295c0e32487548
msgid "Smooth conversation inference"
msgstr "可以流畅的进行对话推理,无卡顿"
#: ../../getting_started/getting_started.md 3e32d5c38bf6499cbfedb80944549114
#: ../../getting_started/getting_started.md 5a10effe322e4afb8315415c04dc05a4
msgid "RTX 3090"
msgstr "RTX 3090"
#: ../../getting_started/getting_started.md 1d3caa2a06844997ad55d20863559e9f
#: ../../getting_started/getting_started.md 8924059525ab43329a8bb6659e034d5e
msgid "Smooth conversation inference, better than V100"
msgstr "可以流畅进行对话推理有卡顿感但好于V100"
#: ../../getting_started/getting_started.md b80ec359bd004d5f801ec09ca3b2d0ff
#: ../../getting_started/getting_started.md 10f5bc076f524127a956d7a23f3666ba
msgid "V100"
msgstr "V100"
#: ../../getting_started/getting_started.md aed55a6b8c8d49d9b9c02bfd5c10b062
#: ../../getting_started/getting_started.md 7d664e81984847c7accd08db93fad404
msgid "16 GB"
msgstr "16 GB"
#: ../../getting_started/getting_started.md dcd6daab75fe4bf8b8dd19ea785f0bd6
#: ../../getting_started/getting_started.md 86765bc9ab01409fb7f5edf04f9b32a5
msgid "Conversation inference possible, noticeable stutter"
msgstr "可以进行对话推理,有明显卡顿"
#: ../../getting_started/getting_started.md:18 e39a4b763ed74cea88d54d163ea72ce0
#: ../../getting_started/getting_started.md:18 a0ac5591c0ac4ac6a385e562353daf22
msgid "2. Install"
msgstr "2. 安装"
#: ../../getting_started/getting_started.md:20 9beba274b78a46c6aafb30173372b334
#: ../../getting_started/getting_started.md:20 a64a9a5945074ece872509f8cb425da9
msgid ""
"This project relies on a local MySQL database service, which you need to "
"install locally. We recommend using Docker for installation."
msgstr "本项目依赖一个本地的 MySQL 数据库服务,你需要本地安装,推荐直接使用 Docker 安装。"
#: ../../getting_started/getting_started.md:25 3bce689bb49043eca5b9aa3c5525eaac
#: ../../getting_started/getting_started.md:25 11e799a372ab4d0f8269cd7be98bebc6
msgid ""
"We use [Chroma embedding database](https://github.com/chroma-core/chroma)"
" as the default for our vector database, so there is no need for special "
@ -117,11 +117,11 @@ msgstr ""
"向量数据库我们默认使用的是Chroma内存数据库所以无需特殊安装如果有需要连接其他的同学可以按照我们的教程进行安装配置。整个DB-"
"GPT的安装过程我们使用的是miniconda3的虚拟环境。创建虚拟环境并安装python依赖包"
#: ../../getting_started/getting_started.md:34 61ad49740d0b49afa254cb2d10a0d2ae
#: ../../getting_started/getting_started.md:34 dcab69c83d4c48b9bb19c4336ee74a66
msgid "Before use DB-GPT Knowledge Management"
msgstr "使用知识库管理功能之前"
#: ../../getting_started/getting_started.md:40 656041e456f248a0a472be06357d7f89
#: ../../getting_started/getting_started.md:40 735aeb6ae8aa4344b7ff679548279acc
msgid ""
"Once the environment is installed, we have to create a new folder "
"\"models\" in the DB-GPT project, and then we can put all the models "
@ -130,29 +130,33 @@ msgstr ""
"环境安装完成后我们必须在DB-"
"GPT项目中创建一个新文件夹\"models\"然后我们可以把从huggingface下载的所有模型放到这个目录下。"
#: ../../getting_started/getting_started.md:42 4dfb7d63fdf544f2bf9dd8663efa8d31
#: ../../getting_started/getting_started.md:43 7cbefe131b24488b9be39b3e8ed4f563
#, fuzzy
msgid "Notice make sure you have install git-lfs"
msgstr "确保你已经安装了git-lfs"
#: ../../getting_started/getting_started.md:50 a52c137b8ef54b7ead41a2d8ff81d457
#: ../../getting_started/getting_started.md:53 54ec90ebb969475988451cd66e6ff412
msgid ""
"The model files are large and will take a long time to download. During "
"the download, let's configure the .env file, which needs to be copied and"
" created from the .env.template"
msgstr "模型文件很大,需要很长时间才能下载。在下载过程中,让我们配置.env文件它需要从。env.template中复制和创建。"
#: ../../getting_started/getting_started.md:56 db87d872a47047dc8cd1de390d068ed4
#: ../../getting_started/getting_started.md:56 9bdadbee88af4683a4eb7b4f221fb4b8
msgid "cp .env.template .env"
msgstr "cp .env.template .env"
#: ../../getting_started/getting_started.md:59 6357c4a0154b4f08a079419ac408442d
msgid ""
"You can configure basic parameters in the .env file, for example setting "
"LLM_MODEL to the model to be used"
msgstr "您可以在.env文件中配置基本参数例如将LLM_MODEL设置为要使用的模型。"
#: ../../getting_started/getting_started.md:58 c8865a327b4b44daa55813479c743e3c
#: ../../getting_started/getting_started.md:61 2f349f3ed3184b849ade2a15d5bf0c6c
msgid "3. Run"
msgstr "3. 运行"
#: ../../getting_started/getting_started.md:59 e81dabe730134753a4daa05a7bdd44af
#: ../../getting_started/getting_started.md:62 fe408e4405bd48288e2e746386615925
msgid ""
"You can refer to this document to obtain the Vicuna weights: "
"[Vicuna](https://github.com/lm-sys/FastChat/blob/main/README.md#model-"
@ -161,7 +165,7 @@ msgstr ""
"关于基础模型, 可以根据[Vicuna](https://github.com/lm-"
"sys/FastChat/blob/main/README.md#model-weights) 合成教程进行合成。"
#: ../../getting_started/getting_started.md:61 714cbc9485ea47d0a06aa1a31b9af3e3
#: ../../getting_started/getting_started.md:64 c0acfe28007f459ca21174f968763fa3
msgid ""
"If you have difficulty with this step, you can also directly use the "
"model from [this link](https://huggingface.co/Tribbiani/vicuna-7b) as a "
@ -170,11 +174,11 @@ msgstr ""
"如果此步有困难的同学,也可以直接使用[此链接](https://huggingface.co/Tribbiani/vicuna-"
"7b)上的模型进行替代。"
#: ../../getting_started/getting_started.md:63 2b8f6985fe1a414e95d334d3ee9d0878
#: ../../getting_started/getting_started.md:66 cc0f4c4e43f24b679f857a8d937528ee
msgid "prepare server sql script"
msgstr "准备db-gpt server sql脚本"
#: ../../getting_started/getting_started.md:69 7cb9beb0e15a46759dbcb4606dcb6867
#: ../../getting_started/getting_started.md:72 386948064fe646f2b9f51a262dd64bf2
msgid ""
"set .env configuration set your vector store type, "
"eg:VECTOR_STORE_TYPE=Chroma, now we support Chroma and Milvus(version > "
@ -183,17 +187,17 @@ msgstr ""
"在.env文件设置向量数据库环境变量eg:VECTOR_STORE_TYPE=Chroma, 目前我们支持了 Chroma and "
"Milvus(version >2.1) "
#: ../../getting_started/getting_started.md:72 cdb7ef30e8c9441293e8b3fd95d621ed
#: ../../getting_started/getting_started.md:75 e6f6b06459944f2d8509703af365c664
#, fuzzy
msgid "Run db-gpt server"
msgstr "运行模型服务"
#: ../../getting_started/getting_started.md:77 e7bb3001d46b458aa0c522c4a7a8d45b
#: ../../getting_started/getting_started.md:80 489b595dc08a459ca2fd83b1389d3bbd
#, fuzzy
msgid "Open http://localhost:5000 with your browser to see the product."
msgstr "打开浏览器访问http://localhost:5000"
#: ../../getting_started/getting_started.md:79 68c55e3ecfc642f2869a9917ec65904c
#: ../../getting_started/getting_started.md:82 699afb01c9f243ab837cdc73252f624c
msgid ""
"If you want to access an external LLM service, you need to 1.set the "
"variables LLM_MODEL=YOUR_MODEL_NAME "
@ -201,11 +205,11 @@ msgid ""
"file. 2.execute dbgpt_server.py in light mode"
msgstr "如果你想访问外部的大模型服务1.需要在.env文件设置模型名和外部模型服务地址。2.使用light模式启动服务"
#: ../../getting_started/getting_started.md:86 474aea4023bb44dd970773b110bbf0ee
#: ../../getting_started/getting_started.md:89 7df7f3870e1140d3a17dc322a46d6476
msgid ""
"If you want to learn about dbgpt-webui, read https://github.com/csunny"
"/DB-GPT/tree/new-page-framework/datacenter"
msgstr "如果你想了解DB-GPT前端服务访问https://github.com/csunny"
"/DB-GPT/tree/new-page-framework/datacenter"
msgstr ""
"如果你想了解DB-GPT前端服务访问https://github.com/csunny/DB-GPT/tree/new-page-"
"framework/datacenter"

View File

@ -0,0 +1,85 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, csunny
# This file is distributed under the same license as the DB-GPT package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 👏👏 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../getting_started/installation.md:1 bc5bfc8ebfc847c5a22f2346357cf747
msgid "Installation"
msgstr "安装dbgpt包指南"
#: ../../getting_started/installation.md:2 1aaef0db5ee9426aa337021d782666af
msgid ""
"DB-GPT provides a third-party Python API package that you can integrate "
"into your own code."
msgstr "DB-GPT提供了python第三方包你可以在你的代码中引入"
#: ../../getting_started/installation.md:4 de542f259e20441991a0e5a7d52769b8
msgid "Installation from Pip"
msgstr "使用pip安装"
#: ../../getting_started/installation.md:6 3357f019aa8249b292162de92757eec4
msgid "You can simply pip install:"
msgstr "你可以使用pip install"
#: ../../getting_started/installation.md:12 9c610d593608452f9d7d8d7e462251e3
msgid "Notice:make sure python>=3.10"
msgstr "注意:确保你的python版本>=3.10"
#: ../../getting_started/installation.md:15 b2ed238c29bb40cba990068e8d7ceae7
msgid "Environment Setup"
msgstr "环境设置"
#: ../../getting_started/installation.md:17 4804ad4d8edf44f49b1d35b271635fad
msgid "By default, if you use the EmbeddingEngine api"
msgstr "如果你想使用EmbeddingEngine api"
#: ../../getting_started/installation.md:19 2205f69ec60d4f73bb3a93a583928455
msgid "you will prepare embedding models from huggingface"
msgstr "你需要从huggingface下载embedding models"
#: ../../getting_started/installation.md:22 693c18a83f034dcc8c263674418bcde2
msgid "Notice make sure you have install git-lfs"
msgstr "确保你已经安装了git-lfs"
#: ../../getting_started/installation.md:30 dd8d0880b55e4c48bfc414f8cbdda268
msgid "version:"
msgstr "版本:"
#: ../../getting_started/installation.md:31 731e634b96164efbbc1ce9fa88361b12
msgid "db-gpt0.3.0"
msgstr "db-gpt0.3.0"
#: ../../getting_started/installation.md:32 38fb635be4554d94b527c6762253d46d
msgid ""
"[embedding_engine api](https://db-"
"gpt.readthedocs.io/en/latest/modules/knowledge.html)"
msgstr "[embedding_engine api](https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html)"
#: ../../getting_started/installation.md:33 a60b0ffe21a74ebca05529dc1dd1ba99
msgid ""
"[multi source embedding](https://db-"
"gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html)"
msgstr "[multi source embedding](https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html)"
#: ../../getting_started/installation.md:34 3c752c9305414719bc3f561cf18a75af
msgid ""
"[vector connector](https://db-"
"gpt.readthedocs.io/en/latest/modules/vector.html)"
msgstr "[vector connector](https://db-gpt.readthedocs.io/en/latest/modules/vector.html)"

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-30 17:16+0800\n"
"POT-Creation-Date: 2023-07-12 16:23+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -19,25 +19,25 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../getting_started/tutorials.md:1 e494f27e68fd40efa2864a532087cfef
#: ../../getting_started/tutorials.md:1 cb100b89a2a747cd90759e415c737070
msgid "Tutorials"
msgstr "教程"
#: ../../getting_started/tutorials.md:4 8eecfbf3240b44fcb425034600316cea
#: ../../getting_started/tutorials.md:4 dbc2a2346b384cc3930086f97181b14b
msgid "This is a collection of DB-GPT tutorials on Medium."
msgstr "这是知乎上DB-GPT教程的集合。"
#: ../../getting_started/tutorials.md:6 a40601867a3d4ce886a197f2f337ec0f
#: ../../getting_started/tutorials.md:6 67e5b6dbac654d428e6a8be9d1ec6473
msgid ""
"DB-GPT is divided into several functions, including chat with knowledge "
"base, execute SQL, chat with database, and execute plugins."
msgstr "DB-GPT包含以下功能和知识库聊天执行SQL和数据库聊天以及执行插件。"
#: ../../getting_started/tutorials.md:8 493e6f56a75d45ef8bb15d3049a24994
#: ../../getting_started/tutorials.md:8 744aaec68aa3413c9b17b09714476d32
msgid "Introduction"
msgstr "介绍"
#: ../../getting_started/tutorials.md:9 4526a793cdb94b8f99f41c48cd5ee453
#: ../../getting_started/tutorials.md:9 305bcf5e847a4322a2834b84fa3c694a
#, fuzzy
msgid "[What is DB-GPT](https://www.youtube.com/watch?v=QszhVJerc0I)"
msgstr ""
@ -45,12 +45,12 @@ msgstr ""
"GPT](https://www.bilibili.com/video/BV1SM4y1a7Nj/?buvid=551b023900b290f9497610b2155a2668&is_story_h5=false&mid=%2BVyE%2Fwau5woPcUKieCWS0A%3D%3D&p=1&plat_id=116&share_from=ugc&share_medium=iphone&share_plat=ios&share_session_id=5D08B533-82A4-4D40-9615-7826065B4574&share_source=GENERIC&share_tag=s_i&timestamp=1686307943&unique_k=bhO3lgQ&up_id=31375446)"
" by csunny (https://github.com/csunny/DB-GPT)"
#: ../../getting_started/tutorials.md:11 95313384e5da4f5db96ac990596b2e73
#: ../../getting_started/tutorials.md:11 22fdc6937b2248ae8f5a7ef385aa55d9
#, fuzzy
msgid "Knowledge"
msgstr "知识库"
#: ../../getting_started/tutorials.md:13 e7a141f4df8d4974b0797dd7723c4658
#: ../../getting_started/tutorials.md:13 9bbf0f5aece64389b93b16235abda58e
#, fuzzy
msgid ""
"[How to Create your own knowledge repository](https://db-"
@ -59,55 +59,55 @@ msgstr ""
"[怎么创建自己的知识库](https://db-"
"gpt.readthedocs.io/en/latest/modules/knowledge.html)"
#: ../../getting_started/tutorials.md:15 f7db5b05a2db44e6a98b7d0df0a6f4ee
#: ../../getting_started/tutorials.md:15 ae201d75a3aa485e99b258103245db1c
#, fuzzy
msgid "![Add new Knowledge demonstration](../../assets/new_knownledge.gif)"
msgstr "[新增知识库演示](../../assets/new_knownledge_en.gif)"
#: ../../getting_started/tutorials.md:15 1a1647a7ca23423294823529301dd75f
#: ../../getting_started/tutorials.md:15 e7bfb3396f7b42f1a1be9f29df1773a2
#, fuzzy
msgid "Add new Knowledge demonstration"
msgstr "[新增知识库演示](../../assets/new_knownledge_en.gif)"
#: ../../getting_started/tutorials.md:17 de26224a814e4c6798d3a342b0f0fe3a
#: ../../getting_started/tutorials.md:17 d37acc0486ec40309e7e944bb0458b0a
msgid "SQL Generation"
msgstr "SQL生成"
#: ../../getting_started/tutorials.md:18 f8fe82c554424239beb522f94d285c52
#: ../../getting_started/tutorials.md:18 86a328c9e15f46679a2611f7162f9fbe
#, fuzzy
msgid "![sql generation demonstration](../../assets/demo_en.gif)"
msgstr "[sql生成演示](../../assets/demo_en.gif)"
#: ../../getting_started/tutorials.md:18 41e932b692074fccb8059cadb0ed320e
#: ../../getting_started/tutorials.md:18 03bc8d7320be44f0879a553a324ec26f
#, fuzzy
msgid "sql generation demonstration"
msgstr "[sql生成演示](../../assets/demo_en.gif)"
#: ../../getting_started/tutorials.md:20 78bda916272f4cf99e9b26b4d9ba09ab
#: ../../getting_started/tutorials.md:20 5f3b241f24634c09880d5de014f64f1b
msgid "SQL Execute"
msgstr "SQL执行"
#: ../../getting_started/tutorials.md:21 53cc83de34784c3c8d4d8204eacccbe9
#: ../../getting_started/tutorials.md:21 13a16debf2624f44bfb2e0453c11572d
#, fuzzy
msgid "![sql execute demonstration](../../assets/auto_sql_en.gif)"
msgstr "[sql execute 演示](../../assets/auto_sql_en.gif)"
#: ../../getting_started/tutorials.md:21 535c06f487ed4d15a6cdd17a0154d798
#: ../../getting_started/tutorials.md:21 2d9673cfd48b49a5b1942fdc9de292bf
#, fuzzy
msgid "sql execute demonstration"
msgstr "SQL执行演示"
#: ../../getting_started/tutorials.md:23 0482e6155dc44843adc3a3aa77528f03
#: ../../getting_started/tutorials.md:23 8cc0c647ad804969b470b133708de37f
#, fuzzy
msgid "Plugins"
msgstr "DB插件"
#: ../../getting_started/tutorials.md:24 632617dd88fe4688b789fbb941686c0f
#: ../../getting_started/tutorials.md:24 cad5cc0cb94b42a1a6619bbd2a8b9f4c
#, fuzzy
msgid "![db plugins demonstration](../../assets/chart_db_city_users.png)"
msgid "![db plugins demonstration](../../assets/dashboard.png)"
msgstr "[db plugins 演示](../../assets/dbgpt_bytebase_plugin.gif)"
#: ../../getting_started/tutorials.md:24 020ff499469145f0a34ac468fff91948
#: ../../getting_started/tutorials.md:24 adeee7ea37b743c9b251976124520725
msgid "db plugins demonstration"
msgstr "DB插件演示"

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 15:12+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -19,12 +19,12 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge.rst:2 ../../modules/knowledge.rst:30
#: e98ef6095fc54f8f8dc045cfa1733dc2
#: ../../modules/knowledge.rst:2 ../../modules/knowledge.rst:136
#: 3cc8fa6e9fbd4d889603d99424e9529a
msgid "Knowledge"
msgstr "知识"
#: ../../modules/knowledge.rst:4 51340dd2758e42ee8e96c3935de53438
#: ../../modules/knowledge.rst:4 0465a393d9d541958c39c1d07c885d1f
#, fuzzy
msgid ""
"As the knowledge base is currently the most significant user demand "
@ -32,55 +32,90 @@ msgid ""
"knowledge bases. At the same time, we also provide multiple knowledge "
"base management strategies in this project, such as pdf knowledge,md "
"knowledge, txt knowledge, word knowledge, ppt knowledge:"
msgstr "由于知识库是当前用户需求最显著的场景,我们原生支持知识库的构建和处理。同时,我们还在本项目中提供了多种知识库管理策略,如:pdf,md "
", txt, word, ppt"
msgstr ""
"由于知识库是当前用户需求最显著的场景,我们原生支持知识库的构建和处理。同时,我们还在本项目中提供了多种知识库管理策略,如:pdf,md , "
"txt, word, ppt"
#: ../../modules/knowledge.rst:7 25eeb187843a4d9baa4d0c0a404eec65
#: ../../modules/knowledge.rst:6 e670cbe14d8e4da88ba935e4120c31e0
msgid ""
"We currently support many document formats: raw text, txt, pdf, md, html,"
" doc, ppt, and url. In the future, we will continue to support more types"
" of knowledge, including audio, video, various databases, and big data "
"sources. Of course, we look forward to your active participation in "
"contributing code."
msgstr ""
#: ../../modules/knowledge.rst:9 e0bf601a1a0c458297306db6ff79f931
msgid "**Create your own knowledge repository**"
msgstr "创建你自己的知识库"
#: ../../modules/knowledge.rst:9 bed8a8f08c194ff59a31dc53f67561c1
msgid ""
"1.Place personal knowledge files or folders in the pilot/datasets "
"directory."
msgstr "1.将个人知识文件或文件夹放在pilot/datasets目录中。"
#: ../../modules/knowledge.rst:11 bb26708135d44615be3c1824668010f6
msgid "1.prepare"
msgstr "准备"
#: ../../modules/knowledge.rst:11 6e03e1a2799a432f8319c3aaf33e2867
#: ../../modules/knowledge.rst:13 c150a0378f3e4625908fa0d8a25860e9
#, fuzzy
msgid ""
"We currently support many document formats: txt, pdf, md, html, doc, ppt,"
" and url."
"We currently support many document formats: TEXT(raw text), "
"DOCUMENT(.txt, .pdf, .md, .doc, .ppt, .html), and URL."
msgstr "当前支持txt, pdf, md, html, doc, ppt, url文档格式"
#: ../../modules/knowledge.rst:13 883ebf16fe7f4e1fbc73ef7430104e79
msgid "before execution: python -m spacy download zh_core_web_sm"
msgstr "在执行之前请先执行python -m spacy download zh_core_web_sm"
#: ../../modules/knowledge.rst:15 7f9f02a93d5d4325b3d2d976f4bb28a0
msgid "before execution:"
msgstr "开始前"
#: ../../modules/knowledge.rst:15 59f4bfa8c1064391919ce2af69f2d4c9
msgid ""
"2.Update your .env, set your vector store type, VECTOR_STORE_TYPE=Chroma "
"(now only support Chroma and Milvus, if you set Milvus, please set "
"MILVUS_URL and MILVUS_PORT)"
msgstr "2.更新你的.env设置你的向量存储类型VECTOR_STORE_TYPE=Chroma(现在只支持Chroma和Milvus如果你设置了Milvus请设置MILVUS_URL和MILVUS_PORT)"
#: ../../modules/knowledge.rst:18 be600a4d93094045b78a43307dfc8f5f
#: ../../modules/knowledge.rst:24 59699a8385e04982a992cf0d71f6dcd5
#, fuzzy
msgid "2.Run the knowledge repository script in the tools directory."
msgstr "3.在tools目录执行知识入库脚本"
#: ../../modules/knowledge.rst:20 b27eddbbf6c74993a6653575f57fff18
msgid ""
"python tools/knowledge_init.py note : --vector_name : your vector store "
"name default_value:default"
"2.prepare embedding model, you can download from https://huggingface.co/."
" Notice you have installed git-lfs."
msgstr ""
"提前准备Embedding Model, 你可以在https://huggingface.co/进行下载注意你需要先安装git-lfs.eg:"
" git clone https://huggingface.co/THUDM/chatglm2-6b"
#: ../../modules/knowledge.rst:23 f32dc12aedc94ffc8fee77a4b6e0ec88
#: ../../modules/knowledge.rst:27 2be1a17d0b54476b9dea080d244fd747
msgid ""
"3.Add the knowledge repository in the interface by entering the name of "
"your knowledge repository (if not specified, enter \"default\") so you "
"can use it for Q&A based on your knowledge base."
msgstr "如果选择新增知识库,在界面上新增知识库输入你的知识库名"
"eg: git clone https://huggingface.co/sentence-transformers/all-"
"MiniLM-L6-v2"
msgstr "eg: git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"
#: ../../modules/knowledge.rst:25 5b1412c8beb24784bd2a93fe5c487b7b
#: ../../modules/knowledge.rst:33 d328f6e243624c9488ebd27c9324621b
msgid ""
"3.prepare vector_store instance and vector store config, now we support "
"Chroma, Milvus and Weaviate."
msgstr "提前准备向量数据库环境目前支持Chroma, Milvus and Weaviate向量数据库"
#: ../../modules/knowledge.rst:63 44f97154eff647d399fd30b6f9e3b867
msgid ""
"3.init Url Type EmbeddingEngine api and embedding your document into "
"vector store in your code."
msgstr "初始化 Url类型 EmbeddingEngine api 将url文档embedding向量化到向量数据库 "
#: ../../modules/knowledge.rst:75 e2581b414f0148bca88253c7af9cd591
msgid "If you want to add your source_reader or text_splitter, do this:"
msgstr "如果你想手动添加你自定义的source_reader和text_splitter, 请参考:"
#: ../../modules/knowledge.rst:95 74c110414f924bbfa3d512e45ba2f30f
#, fuzzy
msgid ""
"4.init Document Type EmbeddingEngine api and embedding your document into"
" vector store in your code. Document type can be .txt, .pdf, .md, .doc, "
".ppt."
msgstr ""
"初始化 文档型类型 EmbeddingEngine api 将文档embedding向量化到向量数据库(文档可以是.txt, .pdf, "
".md, .html, .doc, .ppt)"
#: ../../modules/knowledge.rst:108 0afd40098d5f4dfd9e44fe1d8004da25
msgid ""
"5.init TEXT Type EmbeddingEngine api and embedding your document into "
"vector store in your code."
msgstr "初始化TEXT类型 EmbeddingEngine api 将文档embedding向量化到向量数据库"
#: ../../modules/knowledge.rst:120 a66961bf3efd41fa8ea938129446f5a5
msgid "4.similar search based on your knowledge base. ::"
msgstr "在知识库进行相似性搜索"
#: ../../modules/knowledge.rst:126 b7066f408378450db26770f83fbd2716
msgid ""
"Note that the default vector model used is text2vec-large-chinese (which "
"is a large model, so if your personal computer configuration is not "
@ -90,9 +125,79 @@ msgstr ""
"注意这里默认向量模型是text2vec-large-chinese(模型比较大如果个人电脑配置不够建议采用text2vec-base-"
"chinese),因此确保需要将模型download下来放到models目录中。"
#: ../../modules/knowledge.rst:27 67773e32b01c48628c80b6fab8c90146
#: ../../modules/knowledge.rst:128 58481d55cab74936b6e84b24c39b1674
#, fuzzy
msgid ""
"`pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf "
"`pdf_embedding <./knowledge/pdf/pdf_embedding.html>`_: supported pdf "
"embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#: ../../modules/knowledge.rst:129 fbb013c4f1bc46af910c91292f6690cf
#, fuzzy
msgid ""
"`markdown_embedding <./knowledge/markdown/markdown_embedding.html>`_: "
"supported markdown embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#: ../../modules/knowledge.rst:130 59d45732f4914d16b4e01aee0992edf7
#, fuzzy
msgid ""
"`word_embedding <./knowledge/word/word_embedding.html>`_: supported word "
"embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#: ../../modules/knowledge.rst:131 df0e6f311861423e885b38e020a7c0f0
#, fuzzy
msgid ""
"`url_embedding <./knowledge/url/url_embedding.html>`_: supported url "
"embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#: ../../modules/knowledge.rst:132 7c550c1f5bc34fe9986731fb465e12cd
#, fuzzy
msgid ""
"`ppt_embedding <./knowledge/ppt/ppt_embedding.html>`_: supported ppt "
"embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#: ../../modules/knowledge.rst:133 8648684cb191476faeeb548389f79050
#, fuzzy
msgid ""
"`string_embedding <./knowledge/string/string_embedding.html>`_: supported"
" raw text embedding."
msgstr "pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding."
#~ msgid "before execution: python -m spacy download zh_core_web_sm"
#~ msgstr "在执行之前请先执行python -m spacy download zh_core_web_sm"
#~ msgid "2.Run the knowledge repository script in the tools directory."
#~ msgstr "3.在tools目录执行知识入库脚本"
#~ msgid ""
#~ "python tools/knowledge_init.py note : "
#~ "--vector_name : your vector store name"
#~ " default_value:default"
#~ msgstr ""
#~ msgid ""
#~ "3.Add the knowledge repository in the"
#~ " interface by entering the name of"
#~ " your knowledge repository (if not "
#~ "specified, enter \"default\") so you can"
#~ " use it for Q&A based on your"
#~ " knowledge base."
#~ msgstr "如果选择新增知识库,在界面上新增知识库输入你的知识库名"
#~ msgid ""
#~ "1.Place personal knowledge files or "
#~ "folders in the pilot/datasets directory."
#~ msgstr "1.将个人知识文件或文件夹放在pilot/datasets目录中。"
#~ msgid ""
#~ "2.Update your .env, set your vector "
#~ "store type, VECTOR_STORE_TYPE=Chroma (now only"
#~ " support Chroma and Milvus, if you"
#~ " set Milvus, please set MILVUS_URL "
#~ "and MILVUS_PORT)"
#~ msgstr "2.更新你的.env设置你的向量存储类型VECTOR_STORE_TYPE=Chroma(现在只支持Chroma和Milvus如果你设置了Milvus请设置MILVUS_URL和MILVUS_PORT)"

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 14:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -20,12 +20,13 @@ msgstr ""
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge/markdown/markdown_embedding.md:1
#: b5fd3aea05a64590955b958b753bf22a
msgid "MarkdownEmbedding"
#: 6d4eb4d8566b4dbaa301715148342aca
#, fuzzy
msgid "Markdown"
msgstr "MarkdownEmbedding"
#: ../../modules/knowledge/markdown/markdown_embedding.md:3
#: 0f98ce5b34d44c6f9c828e4b497984de
#: 050625646fa14cb1822d0d430fdf06ec
msgid ""
"markdown embedding can import md text into a vector knowledge base. The "
"entire embedding process includes the read (loading data), data_process "
@ -36,20 +37,20 @@ msgstr ""
"数据预处理data_process()和数据进向量数据库index_to_store()"
#: ../../modules/knowledge/markdown/markdown_embedding.md:5
#: 7f5ebfa8c7c146d7a340baca85634e16
#: af1313489c164e968def2f5f1716a522
msgid "inheriting the SourceEmbedding"
msgstr "继承SourceEmbedding"
#: ../../modules/knowledge/markdown/markdown_embedding.md:17
#: 732e946bc9d149a5af802b239304b943
#: ../../modules/knowledge/markdown/markdown_embedding.md:18
#: aebe894f955b44f3ac677ce50d47c846
#, fuzzy
msgid ""
"implement read() and data_process() read() method allows you to read data"
" and split data into chunk"
msgstr "实现read方法可以加载数据"
#: ../../modules/knowledge/markdown/markdown_embedding.md:33
#: f7e53658aee7403688b333b24ff08ce2
#: ../../modules/knowledge/markdown/markdown_embedding.md:41
#: d53a087726be4a0dbb8dadbeb772442b
msgid "data_process() method allows you to pre processing your ways"
msgstr "实现data_process方法可以进行数据预处理"

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 14:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -20,12 +20,12 @@ msgstr ""
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge/pdf/pdf_embedding.md:1
#: fe600a1f3f9f492da81652ebd3d6d52d
msgid "PDFEmbedding"
#: edf96281acc04612a3384b451dc71391
msgid "PDF"
msgstr ""
#: ../../modules/knowledge/pdf/pdf_embedding.md:3
#: a26a7d6ff041476b975bab5c0bf9f506
#: fdc7396cc2eb4186bb28ea8c491738bc
#, fuzzy
msgid ""
"pdfembedding can import PDF text into a vector knowledge base. The entire"
@ -37,20 +37,23 @@ msgstr ""
"数据预处理data_process()和数据进向量数据库index_to_store()"
#: ../../modules/knowledge/pdf/pdf_embedding.md:5
#: 1895f2a6272c43f0b328caba092102a9
#: d4950371bace43d8957bce9757d77b6e
msgid "inheriting the SourceEmbedding"
msgstr "继承SourceEmbedding"
#: ../../modules/knowledge/pdf/pdf_embedding.md:17
#: 2a4a349398354f9cb3e8d9630a4b8696
#: ../../modules/knowledge/pdf/pdf_embedding.md:18
#: 990c46bba6f3438da542417e4addb96f
#, fuzzy
msgid ""
"implement read() and data_process() read() method allows you to read data"
" and split data into chunk"
msgstr "实现read方法可以加载数据"
#: ../../modules/knowledge/pdf/pdf_embedding.md:34
#: 9b5c6d3e9e96443a908a09a8a762ea7a
#: ../../modules/knowledge/pdf/pdf_embedding.md:39
#: 29cf5a37da2f4ad7ab66750970f62d3f
msgid "data_process() method allows you to pre processing your ways"
msgstr "实现data_process方法可以进行数据预处理"
#~ msgid "PDFEmbedding"
#~ msgstr ""

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 14:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -20,12 +20,12 @@ msgstr ""
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge/ppt/ppt_embedding.md:1
#: 2cdb249b2b284064a0c9117d051e35d4
msgid "PPTEmbedding"
#: 86b98a120d0d4796a034c47a23ec8a03
msgid "PPT"
msgstr ""
#: ../../modules/knowledge/ppt/ppt_embedding.md:3
#: 71676e9b35434a849a206788da8f1394
#: af78e8c3a6c24bf79e03da41c6d13fba
msgid ""
"ppt embedding can import ppt text into a vector knowledge base. The "
"entire embedding process includes the read (loading data), data_process "
@ -36,20 +36,23 @@ msgstr ""
"数据预处理data_process()和数据进向量数据库index_to_store()"
#: ../../modules/knowledge/ppt/ppt_embedding.md:5
#: 016aeae4786e4d5bad815670bd109481
#: 0ddb5ec40a4e4864b63e7f578c2f3c34
msgid "inheriting the SourceEmbedding"
msgstr "继承SourceEmbedding"
#: ../../modules/knowledge/ppt/ppt_embedding.md:17
#: 2fb5b9dc912342df8c275cfd0e993fe0
#: ../../modules/knowledge/ppt/ppt_embedding.md:23
#: b74741f4a1814fe19842985a3f960972
#, fuzzy
msgid ""
"implement read() and data_process() read() method allows you to read data"
" and split data into chunk"
msgstr "实现read方法可以加载数据"
#: ../../modules/knowledge/ppt/ppt_embedding.md:31
#: 9a00f72c7ec84bde9971579c720d2628
#: ../../modules/knowledge/ppt/ppt_embedding.md:44
#: bc1e705c60cd4dde921150cb814ac8ae
msgid "data_process() method allows you to pre processing your ways"
msgstr "实现data_process方法可以进行数据预处理"
#~ msgid "PPTEmbedding"
#~ msgstr ""

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 14:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -20,12 +20,12 @@ msgstr ""
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge/url/url_embedding.md:1
#: e6d335e613ec4c3a80b89de67ba93098
msgid "URL Embedding"
#: c1db535b997f4a90a75806f389200a4e
msgid "URL"
msgstr ""
#: ../../modules/knowledge/url/url_embedding.md:3
#: 25e7643335264bdaaa9386ded243d51d
#: a4e3929be4964c35b7d169eaae8f29fe
msgid ""
"url embedding can import PDF text into a vector knowledge base. The "
"entire embedding process includes the read (loading data), data_process "
@ -36,20 +36,23 @@ msgstr ""
"数据预处理data_process()和数据进向量数据库index_to_store()"
#: ../../modules/knowledge/url/url_embedding.md:5
#: 4b8ca6d93ed0412ab1e640bd42b400ac
#: 0c0be35a31e84e76a60e9e4ffb61a414
msgid "inheriting the SourceEmbedding"
msgstr "继承SourceEmbedding"
#: ../../modules/knowledge/url/url_embedding.md:17
#: 5d69d27adc70406db97c398a339f6453
#: ../../modules/knowledge/url/url_embedding.md:23
#: f9916af3adee4da2988e5ed1912f2bdd
#, fuzzy
msgid ""
"implement read() and data_process() read() method allows you to read data"
" and split data into chunk"
msgstr "实现read方法可以加载数据"
#: ../../modules/knowledge/url/url_embedding.md:34
#: 7d055e181d9b4d47965ab249b18bd704
#: ../../modules/knowledge/url/url_embedding.md:44
#: 56c0720ae3d840069daad2ba7edc8122
msgid "data_process() method allows you to pre processing your ways"
msgstr "实现data_process方法可以进行数据预处理"
#~ msgid "URL Embedding"
#~ msgstr ""

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-14 14:51+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -20,12 +20,12 @@ msgstr ""
"Generated-By: Babel 2.12.1\n"
#: ../../modules/knowledge/word/word_embedding.md:1
#: 1b3272def692480bb101060a33d076c6
msgid "WordEmbedding"
#: fa236aa8d2e5471d8436e0ec60f906e8
msgid "Word"
msgstr ""
#: ../../modules/knowledge/word/word_embedding.md:3
#: a7ea0e94e5c74dab9aa7fb80ed42ed39
#: 02d0c183f7f646a7b74e22d0166c8718
msgid ""
"word embedding can import word doc/docx text into a vector knowledge "
"base. The entire embedding process includes the read (loading data), "
@ -36,20 +36,23 @@ msgstr ""
"数据预处理data_process()和数据进向量数据库index_to_store()"
#: ../../modules/knowledge/word/word_embedding.md:5
#: 12ba9527ef0745538dffb6b1dcf96933
#: ffa094cb7739457d88666c5b624bf078
msgid "inheriting the SourceEmbedding"
msgstr "继承SourceEmbedding"
#: ../../modules/knowledge/word/word_embedding.md:17
#: a4e5e7553f4a43b0b79ba0de83268ef0
#: ../../modules/knowledge/word/word_embedding.md:18
#: 146f03d86fd147b7847b7b907d52b408
#, fuzzy
msgid ""
"implement read() and data_process() read() method allows you to read data"
" and split data into chunk"
msgstr "实现read方法可以加载数据"
#: ../../modules/knowledge/word/word_embedding.md:29
#: 188a434dee7543f89cf5f1584f29ca62
#: ../../modules/knowledge/word/word_embedding.md:39
#: b29a213855af4446a64aadc5a3b76739
msgid "data_process() method allows you to pre processing your ways"
msgstr "实现data_process方法可以进行数据预处理"
#~ msgid "WordEmbedding"
#~ msgstr ""

View File

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: DB-GPT 0.3.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-06-13 11:38+0800\n"
"POT-Creation-Date: 2023-07-13 15:39+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@ -19,40 +19,43 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../use_cases/knownledge_based_qa.md:1 ddfe412b92e14324bdc11ffe58114e5f
msgid "Knownledge based qa"
msgstr "知识问答"
#~ msgid "Knownledge based qa"
#~ msgstr "知识问答"
#: ../../use_cases/knownledge_based_qa.md:3 48635316cc704a779089ff7b5cb9a836
msgid ""
"Chat with your own knowledge is a very interesting thing. In the usage "
"scenarios of this chapter, we will introduce how to build your own "
"knowledge base through the knowledge base API. Firstly, building a "
"knowledge store can currently be initialized by executing \"python "
"tool/knowledge_init.py\" to initialize the content of your own knowledge "
"base, which was introduced in the previous knowledge base module. Of "
"course, you can also call our provided knowledge embedding API to store "
"knowledge."
msgstr ""
"用自己的知识聊天是一件很有趣的事情。在本章的使用场景中我们将介绍如何通过知识库API构建自己的知识库。首先构建知识存储目前可以通过执行“python"
" "
"tool/knowledge_init.py”来初始化您自己的知识库的内容这在前面的知识库模块中已经介绍过了。当然你也可以调用我们提供的知识嵌入API来存储知识。"
#~ msgid ""
#~ "Chat with your own knowledge is a"
#~ " very interesting thing. In the usage"
#~ " scenarios of this chapter, we will"
#~ " introduce how to build your own "
#~ "knowledge base through the knowledge "
#~ "base API. Firstly, building a knowledge"
#~ " store can currently be initialized "
#~ "by executing \"python tool/knowledge_init.py\" "
#~ "to initialize the content of your "
#~ "own knowledge base, which was introduced"
#~ " in the previous knowledge base "
#~ "module. Of course, you can also "
#~ "call our provided knowledge embedding "
#~ "API to store knowledge."
#~ msgstr ""
#~ "用自己的知识聊天是一件很有趣的事情。在本章的使用场景中我们将介绍如何通过知识库API构建自己的知识库。首先构建知识存储目前可以通过执行“python"
#~ " "
#~ "tool/knowledge_init.py”来初始化您自己的知识库的内容这在前面的知识库模块中已经介绍过了。当然你也可以调用我们提供的知识嵌入API来存储知识。"
#: ../../use_cases/knownledge_based_qa.md:6 0a5c68429c9343cf8b88f4f1dddb18eb
#, fuzzy
msgid ""
"We currently support many document formats: txt, pdf, md, html, doc, ppt,"
" and url."
msgstr "“我们目前支持四种文件格式: txt, pdf, url, 和md。"
#~ msgid ""
#~ "We currently support many document "
#~ "formats: txt, pdf, md, html, doc, "
#~ "ppt, and url."
#~ msgstr "“我们目前支持四种文件格式: txt, pdf, url, 和md。"
#: ../../use_cases/knownledge_based_qa.md:20 83f3544c06954e5cbc0cc7788f699eb1
msgid ""
"Now we currently support vector databases: Chroma (default) and Milvus. "
"You can switch between them by modifying the \"VECTOR_STORE_TYPE\" field "
"in the .env file."
msgstr "“我们目前支持向量数据库:Chroma(默认)和Milvus。你可以通过修改.env文件中的“VECTOR_STORE_TYPE”参数在它们之间切换。"
#~ msgid ""
#~ "Now we currently support vector "
#~ "databases: Chroma (default) and Milvus. "
#~ "You can switch between them by "
#~ "modifying the \"VECTOR_STORE_TYPE\" field in"
#~ " the .env file."
#~ msgstr "“我们目前支持向量数据库:Chroma(默认)和Milvus。你可以通过修改.env文件中的“VECTOR_STORE_TYPE”参数在它们之间切换。"
#: ../../use_cases/knownledge_based_qa.md:31 ac12f26b81384fc4bf44ccce1c0d86b4
msgid "Below is an example of using the knowledge base API to query knowledge:"
msgstr "下面是一个使用知识库API进行查询的例子:"
#~ msgid "Below is an example of using the knowledge base API to query knowledge:"
#~ msgstr "下面是一个使用知识库API进行查询的例子:"

View File

@ -3,28 +3,134 @@ Knowledge
| As the knowledge base is currently the most significant user demand scenario, we natively support the construction and processing of knowledge bases. At the same time, we also provide multiple knowledge base management strategies in this project, such as pdf knowledge,md knowledge, txt knowledge, word knowledge, ppt knowledge:
We currently support many document formats: raw text, txt, pdf, md, html, doc, ppt, and url.
In the future, we will continue to support more types of knowledge, including audio, video, various databases, and big data sources. Of course, we look forward to your active participation in contributing code.
**Create your own knowledge repository**
1. Prepare
We currently support many document formats: TEXT (raw text), DOCUMENT (.txt, .pdf, .md, .doc, .ppt, .html), and URL.
Before execution:
::

   pip install db-gpt -i https://pypi.org/
   python -m spacy download zh_core_web_sm
   from pilot import EmbeddingEngine, KnowledgeType
2. Prepare an embedding model; you can download one from https://huggingface.co/. Notice: make sure you have installed git-lfs.
e.g. git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
::
   embedding_model = "your_embedding_model_path/all-MiniLM-L6-v2"
3. Prepare a vector_store instance and vector store config; we currently support Chroma, Milvus and Weaviate.
::
   # Chroma
   vector_store_config = {
       "vector_store_type": "Chroma",
       "vector_store_name": "your_name",  # you can define yourself
       "chroma_persist_path": "your_persist_dir",
   }

   # Milvus
   vector_store_config = {
       "vector_store_type": "Milvus",
       "vector_store_name": "your_name",  # you can define yourself
       "milvus_url": "your_url",
       "milvus_port": "your_port",
       "milvus_username": "your_username",  # optional
       "milvus_password": "your_password",  # optional
       "milvus_secure": "your_secure",      # optional
   }

   # Weaviate
   vector_store_config = {
       "vector_store_type": "Weaviate",
       "vector_store_name": "your_name",  # you can define yourself
       "weaviate_url": "your_url",
       "weaviate_port": "your_port",
       "weaviate_username": "your_username",  # optional
       "weaviate_password": "your_password",  # optional
   }
4. Init a URL-type EmbeddingEngine api and embed your document into the vector store in your code.
::
   url = "https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html"
   embedding_engine = EmbeddingEngine(
       knowledge_source=url,
       knowledge_type=KnowledgeType.URL.value,
       model_name=embedding_model,
       vector_store_config=vector_store_config,
   )
   embedding_engine.knowledge_embedding()
If you want to add your source_reader or text_splitter, do this:
::
   url = "https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html"
   source_reader = WebBaseLoader(web_path=url)
   text_splitter = RecursiveCharacterTextSplitter(
       chunk_size=100, chunk_overlap=50
   )
   embedding_engine = EmbeddingEngine(
       knowledge_source=url,
       knowledge_type=KnowledgeType.URL.value,
       model_name=embedding_model,
       vector_store_config=vector_store_config,
       source_reader=source_reader,
       text_splitter=text_splitter,
   )
5. Init a Document-type EmbeddingEngine api and embed your document into the vector store in your code. The document type can be .txt, .pdf, .md, .doc, or .ppt.
::
   document_path = "your_path/test.md"
   embedding_engine = EmbeddingEngine(
       knowledge_source=document_path,
       knowledge_type=KnowledgeType.DOCUMENT.value,
       model_name=embedding_model,
       vector_store_config=vector_store_config,
   )
   embedding_engine.knowledge_embedding()
6. Init a TEXT-type EmbeddingEngine api and embed your raw text into the vector store in your code.
::
   raw_text = "a long passage"
   embedding_engine = EmbeddingEngine(
       knowledge_source=raw_text,
       knowledge_type=KnowledgeType.TEXT.value,
       model_name=embedding_model,
       vector_store_config=vector_store_config,
   )
   embedding_engine.knowledge_embedding()
7. Similar search based on your knowledge base.
::
   query = "please introduce the oceanbase"
   topk = 5
   docs = embedding_engine.similar_search(query, topk)
Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer configuration is not enough, it is recommended to use text2vec-base-chinese). Therefore, ensure that you download the model and place it in the models directory.
- `pdf_embedding <./knowledge/pdf/pdf_embedding.html>`_: supported pdf embedding.
- `markdown_embedding <./knowledge/markdown/markdown_embedding.html>`_: supported markdown embedding.
- `word_embedding <./knowledge/word/word_embedding.html>`_: supported word embedding.
- `url_embedding <./knowledge/url/url_embedding.html>`_: supported url embedding.
- `ppt_embedding <./knowledge/ppt/ppt_embedding.html>`_: supported ppt embedding.
- `string_embedding <./knowledge/string/string_embedding.html>`_: supported raw text embedding.
.. toctree::
@ -38,3 +144,4 @@ Note that the default vector model used is text2vec-large-chinese (which is a la
./knowledge/word/word_embedding.md
./knowledge/url/url_embedding.md
./knowledge/ppt/ppt_embedding.md
./knowledge/string/string_embedding.md

View File

@ -1,4 +1,4 @@
Markdown
==================================
markdown embedding can import md text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
@ -6,13 +6,14 @@ inheriting the SourceEmbedding
```
class MarkdownEmbedding(SourceEmbedding):
    """markdown embedding for read markdown document."""

    def __init__(self, file_path, vector_store_config, text_splitter):
        """Initialize with markdown path."""
        super().__init__(file_path, vector_store_config, text_splitter)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
implement read() and data_process()
The read() method allows you to read data and split it into chunks.
@ -22,12 +23,19 @@ read() method allows you to read data and split data into chunk
def read(self):
    """Load from markdown path."""
    loader = EncodeTextLoader(self.file_path)
    if self.text_splitter is None:
        try:
            self.text_splitter = SpacyTextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=100,
                chunk_overlap=100,
            )
        except Exception:
            self.text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=100, chunk_overlap=50
            )
    return loader.load_and_split(self.text_splitter)
```
The data_process() method allows you to pre-process the data in your own way.
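As an illustration, a minimal data_process() override might strip newlines from every chunk before indexing; the sketch below mirrors the StringEmbedding.data_process() implementation elsewhere in this commit and is illustrative rather than the committed MarkdownEmbedding code:
```
@register
def data_process(self, documents: List[Document]):
    # minimal sketch: strip newlines from each chunk before indexing
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents
```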

View File

@ -1,4 +1,4 @@
PDF
==================================
pdf embedding can import PDF text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
@ -7,11 +7,12 @@ inheriting the SourceEmbedding
class PDFEmbedding(SourceEmbedding):
    """pdf embedding for read pdf document."""

    def __init__(self, file_path, vector_store_config, text_splitter):
        """Initialize with pdf path."""
        super().__init__(file_path, vector_store_config, text_splitter)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
implement read() and data_process()
@ -21,15 +22,19 @@ read() method allows you to read data and split data into chunk
def read(self):
    """Load from pdf path."""
    loader = PyPDFLoader(self.file_path)
    if self.text_splitter is None:
        try:
            self.text_splitter = SpacyTextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=100,
                chunk_overlap=100,
            )
        except Exception:
            self.text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=100, chunk_overlap=50
            )
    return loader.load_and_split(self.text_splitter)
```
The data_process() method allows you to pre-process the data in your own way.
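Because the constructor now accepts a text_splitter, you can also bypass the fallback logic in read() and supply your own splitter. A hypothetical sketch; the file path and vector_store_config are placeholders in the style of the Knowledge module docs:
```
from langchain.text_splitter import RecursiveCharacterTextSplitter

# placeholder vector store config; see the Knowledge module docs for all options
vector_store_config = {
    "vector_store_type": "Chroma",
    "vector_store_name": "your_name",
    "chroma_persist_path": "your_persist_dir",
}
pdf_embedding = PDFEmbedding(
    file_path="your_path/test.pdf",
    vector_store_config=vector_store_config,
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50),
)
```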

View File

@ -1,4 +1,4 @@
PPT
==================================
ppt embedding can import ppt text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
@ -7,11 +7,17 @@ inheriting the SourceEmbedding
class PPTEmbedding(SourceEmbedding):
    """ppt embedding for read ppt document."""

    def __init__(
        self,
        file_path,
        vector_store_config,
        text_splitter: Optional[TextSplitter] = None,
    ):
        """Initialize with ppt path."""
        super().__init__(file_path, vector_store_config, text_splitter=text_splitter)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
implement read() and data_process()
@ -21,12 +27,19 @@ read() method allows you to read data and split data into chunk
def read(self):
    """Load from ppt path."""
    loader = UnstructuredPowerPointLoader(self.file_path)
    if self.text_splitter is None:
        try:
            self.text_splitter = SpacyTextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=100,
                chunk_overlap=100,
            )
        except Exception:
            self.text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=100, chunk_overlap=50
            )
    return loader.load_and_split(self.text_splitter)
```
The data_process() method allows you to pre-process the data in your own way.

View File

@ -0,0 +1,41 @@
String
==================================
string embedding can import a long raw text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
inheriting the SourceEmbedding
```
class StringEmbedding(SourceEmbedding):
    """string embedding for read string document."""

    def __init__(
        self,
        file_path,
        vector_store_config,
        text_splitter: Optional[TextSplitter] = None,
    ):
        """Initialize with raw text."""
        super().__init__(file_path=file_path, vector_store_config=vector_store_config)
        self.file_path = file_path
        self.vector_store_config = vector_store_config
        self.text_splitter = text_splitter or None
```
implement read() and data_process()
The read() method allows you to read data and split it into chunks.
```
@register
def read(self):
    """Load from String path."""
    metadata = {"source": "raw text"}
    return [Document(page_content=self.file_path, metadata=metadata)]
```
The data_process() method allows you to pre-process the data in your own way.
```
@register
def data_process(self, documents: List[Document]):
    i = 0
    for d in documents:
        documents[i].page_content = d.page_content.replace("\n", "")
        i += 1
    return documents
```
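In practice you would usually drive this class through the EmbeddingEngine TEXT knowledge type rather than instantiating it directly; a minimal sketch following the Knowledge module docs (the model path and vector store settings are placeholders):
```
from pilot import EmbeddingEngine, KnowledgeType

embedding_engine = EmbeddingEngine(
    knowledge_source="a long passage",  # the raw text itself
    knowledge_type=KnowledgeType.TEXT.value,
    model_name="your_model_path/all-MiniLM-L6-v2",
    vector_store_config={
        "vector_store_type": "Chroma",
        "vector_store_name": "your_name",
        "chroma_persist_path": "your_persist_dir",
    },
)
embedding_engine.knowledge_embedding()  # stores the raw text in the vector store
```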

View File

@ -1,4 +1,4 @@
URL Embedding
URL
==================================
url embedding can import web page text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
@ -7,11 +7,17 @@ inheriting the SourceEmbedding
class URLEmbedding(SourceEmbedding):
"""url embedding for read url document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with url path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize url word path."""
super().__init__(file_path, vector_store_config, text_splitter=None)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.text_splitter = text_splitter or None
```
implement read() and data_process()
@ -21,15 +27,19 @@ read() method allows you to read data and split data into chunks
def read(self):
"""Load from url path."""
loader = WebBaseLoader(web_path=self.file_path)
if CFG.LANGUAGE == "en":
text_splitter = CharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
text_splitter = CHNDocumentSplitter(pdf=True, sentence_size=1000)
return loader.load_and_split(text_splitter)
if self.text_splitter is None:
try:
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=100,
chunk_overlap=100,
)
except Exception:
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(self.text_splitter)
```
The data_process() method lets you pre-process the documents in your own way; a plausible sketch strips residual HTML from each fetched page (assuming BeautifulSoup, which the URL source file imports):
```
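@register
def data_process(self, documents: List[Document]):
# plausible sketch: drop newlines and strip residual HTML markup from each
# fetched page chunk (BeautifulSoup is an assumption here)
i = 0
for d in documents:
content = d.page_content.replace("\n", "")
soup = BeautifulSoup(content, "html.parser")
documents[i].page_content = soup.get_text()
i += 1
return documents
```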

View File

@ -1,4 +1,4 @@
WordEmbedding
Word
==================================
word embedding can import word doc/docx text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
@ -7,11 +7,12 @@ inheriting the SourceEmbedding
class WordEmbedding(SourceEmbedding):
"""word embedding for read word document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with word path."""
super().__init__(file_path, vector_store_config)
def __init__(self, file_path, vector_store_config, text_splitter):
"""Initialize with pdf path."""
super().__init__(file_path, vector_store_config, text_splitter)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.text_splitter = text_splitter or None
```
implement read() and data_process()
@ -21,10 +22,19 @@ read() method allows you to read data and split data into chunks
def read(self):
"""Load from word path."""
loader = UnstructuredWordDocumentLoader(self.file_path)
textsplitter = CHNDocumentSplitter(
pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
)
return loader.load_and_split(textsplitter)
if self.text_splitter is None:
try:
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=100,
chunk_overlap=100,
)
except Exception:
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(self.text_splitter)
```
The data_process() method lets you pre-process the documents in your own way; a minimal newline-stripping sketch:
```
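@register
def data_process(self, documents: List[Document]):
# illustrative sketch: strip newlines from every chunk
i = 0
for d in documents:
documents[i].page_content = d.page_content.replace("\n", "")
i += 1
return documents
```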

View File

@ -1,43 +0,0 @@
# Knowledge-based QA
Chatting with your own knowledge is a very interesting capability. In this chapter we introduce how to build your own knowledge base through the knowledge base API. A knowledge store can currently be initialized by executing "python tool/knowledge_init.py", which loads your own content as introduced in the earlier knowledge base module. You can also call the provided knowledge embedding API to store knowledge.
We currently support many document formats: txt, pdf, md, html, doc, ppt, and url.
```
vector_store_config = {
"vector_store_name": name
}
file_path = "your file path"
knowledge_embedding_client = KnowledgeEmbedding(file_path=file_path, model_name=LLM_MODEL_CONFIG["text2vec"], vector_store_config=vector_store_config)
knowledge_embedding_client.knowledge_embedding()
```
We currently support two vector databases: Chroma (the default) and Milvus. You can switch between them by modifying the "VECTOR_STORE_TYPE" field in the .env file.
```
#*******************************************************************#
#** VECTOR STORE SETTINGS **#
#*******************************************************************#
VECTOR_STORE_TYPE=Chroma
#MILVUS_URL=127.0.0.1
#MILVUS_PORT=19530
```
Below is an example of using the knowledge base API to query knowledge:
```
vector_store_config = {
"vector_store_name": name
}
query = "your query"
knowledge_embedding_client = KnowledgeEmbedding(file_path="", model_name=LLM_MODEL_CONFIG["text2vec"], vector_store_config=vector_store_config)
knowledge_embedding_client.similar_search(query, 10)
```

View File

@ -1,3 +1,4 @@
from pilot.embedding_engine import SourceEmbedding, register
from pilot.embedding_engine import EmbeddingEngine, KnowledgeType
__all__ = ["SourceEmbedding", "register"]
__all__ = ["SourceEmbedding", "register", "EmbeddingEngine", "KnowledgeType"]

View File

@ -344,7 +344,14 @@ class Database:
return [
d[0]
for d in results
if d[0] not in ["information_schema", "performance_schema", "sys", "mysql"]
if d[0]
not in [
"information_schema",
"performance_schema",
"sys",
"mysql",
"knowledge_management",
]
]
def convert_sql_write_to_select(self, write_sql):
@ -421,7 +428,13 @@ class Database:
session = self._db_sessions()
cursor = session.execute(text(f"SHOW CREATE TABLE {table_name}"))
ans = cursor.fetchall()
return ans[0][1]
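# trim noisy MySQL table options (ENGINE / DEFAULT CHARSET / COLLATE) from
# the DDL so the schema text handed to the model stays concise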
res = ans[0][1]
res = re.sub(r"\s*ENGINE\s*=\s*InnoDB\s*", " ", res, flags=re.IGNORECASE)
res = re.sub(
r"\s*DEFAULT\s*CHARSET\s*=\s*\w+\s*", " ", res, flags=re.IGNORECASE
)
res = re.sub(r"\s*COLLATE\s*=\s*\w+\s*", " ", res, flags=re.IGNORECASE)
return res
def get_fields(self, table_name):
"""Get column fields about specified table."""

View File

@ -1,3 +1,5 @@
from pilot.embedding_engine.source_embedding import SourceEmbedding, register
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
from pilot.embedding_engine.knowledge_type import KnowledgeType
__all__ = ["SourceEmbedding", "register"]
__all__ = ["SourceEmbedding", "register", "EmbeddingEngine", "KnowledgeType"]

View File

@ -2,6 +2,11 @@ from typing import Dict, List, Optional
from langchain.document_loaders import CSVLoader
from langchain.schema import Document
from langchain.text_splitter import (
TextSplitter,
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
)
from pilot.embedding_engine import SourceEmbedding, register
@ -13,19 +18,36 @@ class CSVEmbedding(SourceEmbedding):
self,
file_path,
vector_store_config,
embedding_args: Optional[Dict] = None,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize with csv path."""
super().__init__(file_path, vector_store_config)
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.embedding_args = embedding_args
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from csv path."""
loader = CSVLoader(file_path=self.file_path)
return loader.load()
if self.source_reader is None:
self.source_reader = CSVLoader(self.file_path)
if self.text_splitter is None:
try:
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=100,
chunk_overlap=100,
)
except Exception:
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -2,21 +2,28 @@ from typing import Optional
from chromadb.errors import NotEnoughElementsException
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import TextSplitter
from pilot.configs.config import Config
from pilot.embedding_engine.knowledge_type import get_knowledge_embedding, KnowledgeType
from pilot.vector_store.connector import VectorStoreConnector
CFG = Config()
class EmbeddingEngine:
"""EmbeddingEngine provide a chain process include(read->text_split->data_process->index_store) for knowledge document embedding into vector store.
1.knowledge_embedding:knowledge document source into vector store.(Chroma, Milvus, Weaviate)
2.similar_search: similarity search from vector_store
how to use reference:https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html
how to integrate:https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html
"""
class KnowledgeEmbedding:
def __init__(
self,
model_name,
vector_store_config,
knowledge_type: Optional[str] = KnowledgeType.DOCUMENT.value,
knowledge_source: Optional[str] = None,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize with knowledge embedding client, model_name, vector_store_config, knowledge_type, knowledge_source"""
self.knowledge_source = knowledge_source
@ -25,27 +32,36 @@ class KnowledgeEmbedding:
self.knowledge_type = knowledge_type
self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)
self.vector_store_config["embeddings"] = self.embeddings
self.source_reader = source_reader
self.text_splitter = text_splitter
def knowledge_embedding(self):
"""source embedding is chain process.read->text_split->data_process->index_store"""
self.knowledge_embedding_client = self.init_knowledge_embedding()
self.knowledge_embedding_client.source_embedding()
def knowledge_embedding_batch(self, docs):
"""Deprecation"""
# docs = self.knowledge_embedding_client.read_batch()
return self.knowledge_embedding_client.index_to_store(docs)
def read(self):
"""Deprecation"""
self.knowledge_embedding_client = self.init_knowledge_embedding()
return self.knowledge_embedding_client.read_batch()
def init_knowledge_embedding(self):
return get_knowledge_embedding(
self.knowledge_type, self.knowledge_source, self.vector_store_config
self.knowledge_type,
self.knowledge_source,
self.vector_store_config,
self.source_reader,
self.text_splitter,
)
def similar_search(self, text, topk):
vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
try:
ans = vector_client.similar_search(text, topk)
@ -55,12 +71,12 @@ class KnowledgeEmbedding:
def vector_exist(self):
vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
return vector_client.vector_name_exists()
def delete_by_ids(self, ids):
vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
vector_client.delete_by_ids(ids=ids)

View File

@ -11,6 +11,7 @@ from pilot.embedding_engine.word_embedding import WordEmbedding
DocumentEmbeddingType = {
".txt": (MarkdownEmbedding, {}),
".md": (MarkdownEmbedding, {}),
".html": (MarkdownEmbedding, {}),
".pdf": (PDFEmbedding, {}),
".doc": (WordEmbedding, {}),
".docx": (WordEmbedding, {}),
@ -25,10 +26,23 @@ class KnowledgeType(Enum):
URL = "URL"
TEXT = "TEXT"
OSS = "OSS"
S3 = "S3"
NOTION = "NOTION"
MYSQL = "MYSQL"
TIDB = "TIDB"
CLICKHOUSE = "CLICKHOUSE"
OCEANBASE = "OCEANBASE"
ELASTICSEARCH = "ELASTICSEARCH"
HIVE = "HIVE"
PRESTO = "PRESTO"
KAFKA = "KAFKA"
SPARK = "SPARK"
YOUTUBE = "YOUTUBE"
def get_knowledge_embedding(knowledge_type, knowledge_source, vector_store_config):
def get_knowledge_embedding(
knowledge_type, knowledge_source, vector_store_config, source_reader, text_splitter
):
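# dispatch on knowledge_type: DOCUMENT sources are resolved through
# DocumentEmbeddingType by file extension; URL and TEXT get dedicated
# embeddings; the remaining types are not integrated yet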
match knowledge_type:
case KnowledgeType.DOCUMENT.value:
extension = "." + knowledge_source.rsplit(".", 1)[-1]
@ -37,6 +51,8 @@ def get_knowledge_embedding(knowledge_type, knowledge_source, vector_store_confi
embedding = knowledge_class(
knowledge_source,
vector_store_config=vector_store_config,
source_reader=source_reader,
text_splitter=text_splitter,
**knowledge_args,
)
return embedding
@ -45,18 +61,43 @@ def get_knowledge_embedding(knowledge_type, knowledge_source, vector_store_confi
embedding = URLEmbedding(
file_path=knowledge_source,
vector_store_config=vector_store_config,
source_reader=source_reader,
text_splitter=text_splitter,
)
return embedding
case KnowledgeType.TEXT.value:
embedding = StringEmbedding(
file_path=knowledge_source,
vector_store_config=vector_store_config,
source_reader=source_reader,
text_splitter=text_splitter,
)
return embedding
case KnowledgeType.OSS.value:
raise Exception("OSS has not been integrated yet")
case KnowledgeType.S3.value:
raise Exception("S3 has not been integrated yet")
case KnowledgeType.NOTION.value:
raise Exception("NOTION has not been integrated yet")
case KnowledgeType.MYSQL.value:
raise Exception("MYSQL has not been integrated yet")
case KnowledgeType.TIDB.value:
raise Exception("TIDB has not been integrated yet")
case KnowledgeType.CLICKHOUSE.value:
raise Exception("CLICKHOUSE has not been integrated yet")
case KnowledgeType.OCEANBASE.value:
raise Exception("OCEANBASE has not been integrated yet")
case KnowledgeType.ELASTICSEARCH.value:
raise Exception("ELASTICSEARCH has not been integrated yet")
case KnowledgeType.HIVE.value:
raise Exception("HIVE has not been integrated yet")
case KnowledgeType.PRESTO.value:
raise Exception("PRESTO has not been integrated yet")
case KnowledgeType.KAFKA.value:
raise Exception("KAFKA has not been integrated yet")
case KnowledgeType.SPARK.value:
raise Exception("SPARK has not been integrated yet")
case KnowledgeType.YOUTUBE.value:
raise Exception("YOUTUBE has not been integrated yet")
case _:
raise Exception("unknown knowledge type")

View File

@ -1,7 +1,7 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
from typing import List
from typing import List, Optional
import markdown
from bs4 import BeautifulSoup
@ -10,48 +10,50 @@ from langchain.text_splitter import (
SpacyTextSplitter,
CharacterTextSplitter,
RecursiveCharacterTextSplitter,
TextSplitter,
)
from pilot.configs.config import Config
from pilot.embedding_engine import SourceEmbedding, register
from pilot.embedding_engine.EncodeTextLoader import EncodeTextLoader
CFG = Config()
class MarkdownEmbedding(SourceEmbedding):
"""markdown embedding for read markdown document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with markdown path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize raw text word path."""
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
# self.encoding = encoding
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from markdown path."""
loader = EncodeTextLoader(self.file_path)
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
if self.source_reader is None:
self.source_reader = EncodeTextLoader(self.file_path)
if self.text_splitter is None:
try:
text_splitter = SpacyTextSplitter(
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_size=100,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(text_splitter)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -1,56 +1,55 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from typing import List
from typing import List, Optional
from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter, RecursiveCharacterTextSplitter
from langchain.text_splitter import (
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
TextSplitter,
)
from pilot.configs.config import Config
from pilot.embedding_engine import SourceEmbedding, register
CFG = Config()
class PDFEmbedding(SourceEmbedding):
"""pdf embedding for read pdf document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with pdf path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize pdf word path."""
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from pdf path."""
loader = PyPDFLoader(self.file_path)
# textsplitter = CHNDocumentSplitter(
# pdf=True, sentence_size=CFG.KNOWLEDGE_CHUNK_SIZE
# )
# textsplitter = SpacyTextSplitter(
# pipeline="zh_core_web_sm",
# chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
# chunk_overlap=100,
# )
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
if self.source_reader is None:
self.source_reader = PyPDFLoader(self.file_path)
if self.text_splitter is None:
try:
text_splitter = SpacyTextSplitter(
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_size=100,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(text_splitter)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -1,54 +1,55 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from typing import List
from typing import List, Optional
from langchain.document_loaders import UnstructuredPowerPointLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter, RecursiveCharacterTextSplitter
from langchain.text_splitter import (
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
TextSplitter,
)
from pilot.configs.config import Config
from pilot.embedding_engine import SourceEmbedding, register
from pilot.embedding_engine.chn_document_splitter import CHNDocumentSplitter
CFG = Config()
class PPTEmbedding(SourceEmbedding):
"""ppt embedding for read ppt document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with pdf path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize ppt word path."""
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from ppt path."""
loader = UnstructuredPowerPointLoader(self.file_path)
# textsplitter = SpacyTextSplitter(
# pipeline="zh_core_web_sm",
# chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
# chunk_overlap=200,
# )
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
if self.source_reader is None:
self.source_reader = UnstructuredPowerPointLoader(self.file_path)
if self.text_splitter is None:
try:
text_splitter = SpacyTextSplitter(
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_size=100,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(text_splitter)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -4,11 +4,11 @@ from abc import ABC, abstractmethod
from typing import Dict, List, Optional
from chromadb.errors import NotEnoughElementsException
from pilot.configs.config import Config
from langchain.text_splitter import TextSplitter
from pilot.vector_store.connector import VectorStoreConnector
registered_methods = []
CFG = Config()
def register(method):
@ -25,12 +25,16 @@ class SourceEmbedding(ABC):
def __init__(
self,
file_path,
vector_store_config,
vector_store_config: Dict,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
embedding_args: Optional[Dict] = None,
):
"""Initialize with Loader url, model_name, vector_store_config"""
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
self.embedding_args = embedding_args
self.embeddings = vector_store_config["embeddings"]
@ -44,8 +48,8 @@ class SourceEmbedding(ABC):
"""pre process data."""
@register
def text_split(self, text):
"""text split chunk"""
def text_splitter(self, text_splitter: TextSplitter):
"""add text split chunk"""
pass
@register
@ -57,7 +61,7 @@ class SourceEmbedding(ABC):
def index_to_store(self, docs):
"""index to vector store"""
self.vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
return self.vector_client.load_document(docs)
@ -65,7 +69,7 @@ class SourceEmbedding(ABC):
def similar_search(self, doc, topk):
"""vector store similarity_search"""
self.vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
try:
ans = self.vector_client.similar_search(doc, topk)
@ -75,7 +79,7 @@ class SourceEmbedding(ABC):
def vector_name_exist(self):
self.vector_client = VectorStoreConnector(
CFG.VECTOR_STORE_TYPE, self.vector_store_config
self.vector_store_config["vector_store_type"], self.vector_store_config
)
return self.vector_client.vector_name_exists()

View File

@ -1,24 +1,55 @@
from typing import List
from typing import List, Optional
from langchain.schema import Document
from langchain.text_splitter import (
TextSplitter,
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
)
from pilot import SourceEmbedding, register
from pilot.embedding_engine import SourceEmbedding, register
class StringEmbedding(SourceEmbedding):
"""string embedding for read string document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with pdf path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize raw text word path."""
super().__init__(
file_path=file_path,
vector_store_config=vector_store_config,
source_reader=None,
text_splitter=None,
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from String path."""
metadata = {"source": "db_summary"}
return [Document(page_content=self.file_path, metadata=metadata)]
metadata = {"source": "raw text"}
docs = [Document(page_content=self.file_path, metadata=metadata)]
if self.text_splitter is None:
try:
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=500,
chunk_overlap=100,
)
except Exception:
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return self.text_splitter.split_documents(docs)
return docs
@register
def data_process(self, documents: List[Document]):

View File

@ -1,49 +1,54 @@
from typing import List
from typing import List, Optional
from bs4 import BeautifulSoup
from langchain.document_loaders import WebBaseLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter, RecursiveCharacterTextSplitter
from langchain.text_splitter import (
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
TextSplitter,
)
from pilot.configs.config import Config
from pilot.configs.model_config import KNOWLEDGE_CHUNK_SPLIT_SIZE
from pilot.embedding_engine import SourceEmbedding, register
from pilot.embedding_engine.chn_document_splitter import CHNDocumentSplitter
CFG = Config()
class URLEmbedding(SourceEmbedding):
"""url embedding for read url document."""
def __init__(self, file_path, vector_store_config):
"""Initialize with url path."""
super().__init__(file_path, vector_store_config)
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize url word path."""
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from url path."""
loader = WebBaseLoader(web_path=self.file_path)
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
if self.source_reader is None:
self.source_reader = WebBaseLoader(web_path=self.file_path)
if self.text_splitter is None:
try:
text_splitter = SpacyTextSplitter(
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_size=100,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(text_splitter)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -1,48 +1,55 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from typing import List
from typing import List, Optional
from langchain.document_loaders import UnstructuredWordDocumentLoader
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter, RecursiveCharacterTextSplitter
from langchain.text_splitter import (
SpacyTextSplitter,
RecursiveCharacterTextSplitter,
TextSplitter,
)
from pilot.configs.config import Config
from pilot.embedding_engine import SourceEmbedding, register
CFG = Config()
class WordEmbedding(SourceEmbedding):
"""word embedding for read word document."""
def __init__(self, file_path, vector_store_config):
def __init__(
self,
file_path,
vector_store_config,
source_reader: Optional = None,
text_splitter: Optional[TextSplitter] = None,
):
"""Initialize with word path."""
super().__init__(file_path, vector_store_config)
super().__init__(
file_path, vector_store_config, source_reader=None, text_splitter=None
)
self.file_path = file_path
self.vector_store_config = vector_store_config
self.source_reader = source_reader or None
self.text_splitter = text_splitter or None
@register
def read(self):
"""Load from word path."""
loader = UnstructuredWordDocumentLoader(self.file_path)
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
if self.source_reader is None:
self.source_reader = UnstructuredWordDocumentLoader(self.file_path)
if self.text_splitter is None:
try:
text_splitter = SpacyTextSplitter(
self.text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_size=100,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100, chunk_overlap=50
)
return loader.load_and_split(text_splitter)
return self.source_reader.load_and_split(self.text_splitter)
@register
def data_process(self, documents: List[Document]):

View File

@ -50,7 +50,7 @@ prompt = PromptTemplate(
output_parser=DbChatOutputParser(
sep=PROMPT_SEP, is_stream_out=PROMPT_NEED_NEED_STREAM_OUT
),
example_selector=sql_data_example,
# example_selector=sql_data_example,
temperature=PROMPT_TEMPERATURE,
)
CFG.prompt_templates.update({prompt.template_scene: prompt})

View File

@ -17,7 +17,7 @@ from pilot.configs.model_config import (
)
from pilot.scene.chat_knowledge.custom.prompt import prompt
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
CFG = Config()
@ -37,10 +37,10 @@ class ChatNewKnowledge(BaseChat):
self.knowledge_name = knowledge_name
vector_store_config = {
"vector_store_name": knowledge_name,
"text_field": "content",
"vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
self.knowledge_embedding_client = KnowledgeEmbedding(
self.knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG["text2vec"],
vector_store_config=vector_store_config,
)

View File

@ -19,7 +19,7 @@ from pilot.configs.model_config import (
)
from pilot.scene.chat_knowledge.default.prompt import prompt
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
CFG = Config()
@ -38,9 +38,10 @@ class ChatDefaultKnowledge(BaseChat):
)
vector_store_config = {
"vector_store_name": "default",
"vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
self.knowledge_embedding_client = KnowledgeEmbedding(
self.knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG["text2vec"],
vector_store_config=vector_store_config,
)

View File

@ -18,7 +18,7 @@ from pilot.configs.model_config import (
)
from pilot.scene.chat_knowledge.url.prompt import prompt
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
CFG = Config()
@ -38,9 +38,10 @@ class ChatUrlKnowledge(BaseChat):
self.url = url
vector_store_config = {
"vector_store_name": url.replace(":", ""),
"vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
self.knowledge_embedding_client = KnowledgeEmbedding(
self.knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config=vector_store_config,
knowledge_type=KnowledgeType.URL.value,

View File

@ -19,7 +19,7 @@ from pilot.configs.model_config import (
)
from pilot.scene.chat_knowledge.v1.prompt import prompt
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
CFG = Config()
@ -38,9 +38,10 @@ class ChatKnowledge(BaseChat):
)
vector_store_config = {
"vector_store_name": knowledge_space,
"vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
self.knowledge_embedding_client = KnowledgeEmbedding(
self.knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config=vector_store_config,
)

View File

@ -1,3 +1,4 @@
import atexit
import traceback
import os
import shutil
@ -36,7 +37,7 @@ CFG = Config()
logger = build_logger("webserver", LOGDIR + "webserver.log")
def signal_handler(sig, frame):
def signal_handler():
print("in order to avoid chroma db atexit problem")
os._exit(0)
@ -96,7 +97,6 @@ if __name__ == "__main__":
action="store_true",
help="enable light mode",
)
signal.signal(signal.SIGINT, signal_handler)
# init server config
args = parser.parse_args()
@ -114,3 +114,4 @@ if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=args.port)
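# note: signal_handler() is invoked directly here rather than registered as a
# handler, so os._exit(0) fires as soon as uvicorn stops; this appears to be
# the workaround for the chroma atexit problem mentioned in signal_handler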
signal.signal(signal.SIGINT, signal_handler())

View File

@ -10,7 +10,7 @@ from pilot.configs.config import Config
from pilot.configs.model_config import LLM_MODEL_CONFIG, KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.openapi.api_v1.api_view_model import Result
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
from pilot.server.knowledge.service import KnowledgeService
from pilot.server.knowledge.request.request import (
@ -143,7 +143,7 @@ def document_list(space_name: str, query_request: ChunkQueryRequest):
@router.post("/knowledge/{vector_name}/query")
def similar_query(space_name: str, query_request: KnowledgeQueryRequest):
print(f"Received params: {space_name}, {query_request}")
client = KnowledgeEmbedding(
client = EmbeddingEngine(
model_name=embeddings, vector_store_config={"vector_store_name": space_name}
)
docs = client.similar_search(query_request.query, query_request.top_k)

View File

@ -1,9 +1,11 @@
import threading
from datetime import datetime
from langchain.text_splitter import RecursiveCharacterTextSplitter, SpacyTextSplitter
from pilot.configs.config import Config
from pilot.configs.model_config import LLM_MODEL_CONFIG
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.configs.model_config import LLM_MODEL_CONFIG, KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
from pilot.logs import logger
from pilot.server.knowledge.chunk_db import (
DocumentChunkEntity,
@ -122,13 +124,34 @@ class KnowledgeService:
raise Exception(
f" doc:{doc.doc_name} status is {doc.status}, can not sync"
)
client = KnowledgeEmbedding(
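# choose a splitter before building the engine: character-based splitting for
# English, spaCy sentence splitting otherwise (falling back to character
# splitting if the zh_core_web_sm pipeline is unavailable)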
if CFG.LANGUAGE == "en":
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=20,
length_function=len,
)
else:
try:
text_splitter = SpacyTextSplitter(
pipeline="zh_core_web_sm",
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE,
chunk_overlap=100,
)
except Exception:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=CFG.KNOWLEDGE_CHUNK_SIZE, chunk_overlap=50
)
client = EmbeddingEngine(
knowledge_source=doc.content,
knowledge_type=doc.doc_type.upper(),
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config={
"vector_store_name": space_name,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
},
text_splitter=text_splitter,
)
chunk_docs = client.read()
# update document status

View File

@ -37,7 +37,7 @@ from pilot.conversation import (
from pilot.server.gradio_css import code_highlight_css
from pilot.server.gradio_patch import Chatbot as grChatbot
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
from pilot.utils import build_logger
from pilot.vector_store.extract_tovec import (
get_vector_storelist,
@ -659,13 +659,14 @@ def knowledge_embedding_store(vs_id, files):
shutil.move(
file.name, os.path.join(KNOWLEDGE_UPLOAD_ROOT_PATH, vs_id, filename)
)
knowledge_embedding_client = KnowledgeEmbedding(
knowledge_embedding_client = EmbeddingEngine(
knowledge_source=os.path.join(KNOWLEDGE_UPLOAD_ROOT_PATH, vs_id, filename),
knowledge_type=KnowledgeType.DOCUMENT.value,
model_name=LLM_MODEL_CONFIG["text2vec"],
vector_store_config={
"vector_store_name": vector_store_name["vs_name"],
"vector_store_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
},
)
knowledge_embedding_client.knowledge_embedding()

View File

@ -4,10 +4,10 @@ import uuid
from langchain.embeddings import HuggingFaceEmbeddings, logger
from pilot.configs.config import Config
from pilot.configs.model_config import LLM_MODEL_CONFIG
from pilot.configs.model_config import LLM_MODEL_CONFIG, KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.scene.base import ChatScene
from pilot.scene.base_chat import BaseChat
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
from pilot.embedding_engine.string_embedding import StringEmbedding
from pilot.summary.mysql_db_summary import MysqlSummary
from pilot.scene.chat_factory import ChatFactory
@ -33,6 +33,8 @@ class DBSummaryClient:
)
vector_store_config = {
"vector_store_name": dbname + "_summary",
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"embeddings": embeddings,
}
embedding = StringEmbedding(
@ -60,6 +62,8 @@ class DBSummaryClient:
) in db_summary_client.get_table_summary().items():
table_vector_store_config = {
"vector_store_name": dbname + "_" + table_name + "_ts",
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"embeddings": embeddings,
}
embedding = StringEmbedding(
@ -73,8 +77,10 @@ class DBSummaryClient:
def get_db_summary(self, dbname, query, topk):
vector_store_config = {
"vector_store_name": dbname + "_profile",
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
knowledge_embedding_client = KnowledgeEmbedding(
knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config=vector_store_config,
)
@ -86,8 +92,11 @@ class DBSummaryClient:
"""get user query related tables info"""
vector_store_config = {
"vector_store_name": dbname + "_summary",
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
knowledge_embedding_client = KnowledgeEmbedding(
knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config=vector_store_config,
)
@ -109,9 +118,11 @@ class DBSummaryClient:
for table in related_tables:
vector_store_config = {
"vector_store_name": dbname + "_" + table + "_ts",
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
knowledge_embedding_client = KnowledgeEmbedding(
file_path="",
knowledge_embedding_client = EmbeddingEngine(
model_name=LLM_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
vector_store_config=vector_store_config,
)
@ -128,6 +139,8 @@ class DBSummaryClient:
def init_db_profile(self, db_summary_client, dbname, embeddings):
profile_store_config = {
"vector_store_name": dbname + "_profile",
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"embeddings": embeddings,
}
embedding = StringEmbedding(

View File

@ -1,7 +1,6 @@
import os
from langchain.vectorstores import Chroma
from pilot.configs.model_config import KNOWLEDGE_UPLOAD_ROOT_PATH
from pilot.logs import logger
from pilot.vector_store.vector_store_base import VectorStoreBase
@ -13,7 +12,7 @@ class ChromaStore(VectorStoreBase):
self.ctx = ctx
self.embeddings = ctx["embeddings"]
self.persist_dir = os.path.join(
KNOWLEDGE_UPLOAD_ROOT_PATH, ctx["vector_store_name"] + ".vectordb"
ctx["chroma_persist_path"], ctx["vector_store_name"] + ".vectordb"
)
self.vector_store_client = Chroma(
persist_directory=self.persist_dir, embedding_function=self.embeddings

View File

@ -1,12 +1,18 @@
from pilot.vector_store.chroma_store import ChromaStore
# from pilot.vector_store.milvus_store import MilvusStore
from pilot.vector_store.milvus_store import MilvusStore
connector = {"Chroma": ChromaStore, "Milvus": None}
connector = {"Chroma": ChromaStore, "Milvus": MilvusStore}
class VectorStoreConnector:
"""vector store connector, can connect different vector db provided load document api_v1 and similar search api_v1."""
"""VectorStoreConnector, can connect different vector db provided load document api_v1 and similar search api_v1.
1.load_document:knowledge document source into vector store.(Chroma, Milvus, Weaviate)
2.similar_search: similarity search from vector_store
how to use reference:https://db-gpt.readthedocs.io/en/latest/modules/vector.html
how to integrate:https://db-gpt.readthedocs.io/en/latest/modules/vector/milvus/milvus.html
"""
def __init__(self, vector_store_type, ctx: {}) -> None:
"""initialize vector store connector."""

View File

@ -3,13 +3,9 @@ from typing import Any, Iterable, List, Optional, Tuple
from langchain.docstore.document import Document
from pymilvus import Collection, DataType, connections, utility
from pilot.configs.config import Config
from pilot.vector_store.vector_store_base import VectorStoreBase
CFG = Config()
class MilvusStore(VectorStoreBase):
"""Milvus database"""
@ -22,10 +18,10 @@ class MilvusStore(VectorStoreBase):
# self.configure(cfg)
connect_kwargs = {}
self.uri = CFG.MILVUS_URL
self.port = CFG.MILVUS_PORT
self.username = CFG.MILVUS_USERNAME
self.password = CFG.MILVUS_PASSWORD
self.uri = ctx.get("milvus_url", None)
self.port = ctx.get("milvus_port", None)
self.username = ctx.get("milvus_username", None)
self.password = ctx.get("milvus_password", None)
self.collection_name = ctx.get("vector_store_name", None)
self.secure = ctx.get("secure", None)
self.embedding = ctx.get("embeddings", None)

View File

@ -17,9 +17,9 @@ def parse_requirements(file_name: str) -> List[str]:
setuptools.setup(
name="DB-GPT",
name="db-gpt",
packages=find_packages(),
version="0.3.0",
version="0.3.1",
author="csunny",
author_email="cfqcsunny@gmail.com",
description="DB-GPT is an experimental open-source project that uses localized GPT large models to interact with your data and environment."

View File

@ -0,0 +1,20 @@
from pilot import EmbeddingEngine, KnowledgeType
url = "https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html"
embedding_model = "text2vec"
vector_store_type = "Chroma"
chroma_persist_path = "your_persist_path"
vector_store_config = {
"vector_store_name": url.replace(":", ""),
"vector_store_type": vector_store_type,
"chroma_persist_path": chroma_persist_path,
}
embedding_engine = EmbeddingEngine(
knowledge_source=url,
knowledge_type=KnowledgeType.URL.value,
model_name=embedding_model,
vector_store_config=vector_store_config,
)
# embedding url content to vector store
embedding_engine.knowledge_embedding()
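# the same engine can then query the store back; the query string and topk
# below are illustrative values
docs = embedding_engine.similar_search("what is db-gpt", 5)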

View File

@ -15,8 +15,9 @@ from pilot.configs.config import Config
from pilot.configs.model_config import (
DATASETS_DIR,
LLM_MODEL_CONFIG,
KNOWLEDGE_UPLOAD_ROOT_PATH,
)
from pilot.embedding_engine.knowledge_embedding import KnowledgeEmbedding
from pilot.embedding_engine.embedding_engine import EmbeddingEngine
knowledge_space_service = KnowledgeService()
@ -37,7 +38,7 @@ class LocalKnowledgeInit:
for root, _, files in os.walk(file_path, topdown=False):
for file in files:
filename = os.path.join(root, file)
ke = KnowledgeEmbedding(
ke = EmbeddingEngine(
knowledge_source=filename,
knowledge_type=KnowledgeType.DOCUMENT.value,
model_name=self.model_name,
@ -68,7 +69,11 @@ if __name__ == "__main__":
args = parser.parse_args()
vector_name = args.vector_name
store_type = CFG.VECTOR_STORE_TYPE
vector_store_config = {"vector_store_name": vector_name}
vector_store_config = {
"vector_store_name": vector_name,
"vector_store_type": CFG.VECTOR_STORE_TYPE,
"chroma_persist_path": KNOWLEDGE_UPLOAD_ROOT_PATH,
}
print(vector_store_config)
kv = LocalKnowledgeInit(vector_store_config=vector_store_config)
kv.knowledge_persist(file_path=DATASETS_DIR)