diff --git a/README.md b/README.md index 9538418ca..2c056f988 100644 --- a/README.md +++ b/README.md @@ -63,6 +63,7 @@ https://github.com/csunny/DB-GPT/assets/13723926/55f31781-1d49-4757-b96e-7ef6d3d

## Releases +- [2023/07/12]🔥🔥🔥DB-GPT Python API package 0.3.0. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/installation.html) - [2023/07/06]🔥🔥🔥Brand-new DB-GPT product with a brand-new web UI. [documents](https://db-gpt.readthedocs.io/en/latest/getting_started/getting_started.html) - [2023/06/25]🔥Support the chatglm2-6b model. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html) - [2023/06/14] Support the gpt4all model, which can run on M1/M2 or CPU machines. [documents](https://db-gpt.readthedocs.io/en/latest/modules/llms.html) diff --git a/docs/getting_started/getting_started.md b/docs/getting_started/getting_started.md index dcd5c9913..672f0f74e 100644 --- a/docs/getting_started/getting_started.md +++ b/docs/getting_started/getting_started.md @@ -25,14 +25,14 @@ $ docker run --name=mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=aa12345678 -dit my We use [Chroma embedding database](https://github.com/chroma-core/chroma) as the default for our vector database, so there is no need for special installation. If you choose to connect to other databases, you can follow our tutorial for installation and configuration. For the entire installation process of DB-GPT, we use the miniconda3 virtual environment. Create a virtual environment and install the Python dependencies.
-``` +```{tip} python>=3.10 conda create -n dbgpt_env python=3.10 conda activate dbgpt_env pip install -r requirements.txt ``` Before using DB-GPT Knowledge Management -``` +```{tip} python -m spacy download zh_core_web_sm ``` @@ -40,7 +40,7 @@ python -m spacy download zh_core_web_sm Once the environment is installed, we have to create a new folder "models" in the DB-GPT project, and then we can put all the models downloaded from HuggingFace in this directory. Note: make sure you have installed git-lfs -``` +```{tip} git clone https://huggingface.co/Tribbiani/vicuna-13b git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese @@ -49,7 +49,7 @@ git clone https://huggingface.co/THUDM/chatglm2-6b The model files are large and will take a long time to download. During the download, let's configure the .env file, which is created by copying .env.template -``` +```{tip} cp .env.template .env ``` diff --git a/docs/getting_started/installation.md b/docs/getting_started/installation.md new file mode 100644 index 000000000..17188e2b7 --- /dev/null +++ b/docs/getting_started/installation.md @@ -0,0 +1,29 @@ +# Installation +DB-GPT provides a third-party Python API package that you can integrate into your own code.
+ +### Installation from Pip + +You can simply install it with pip: +```{tip} +pip install -i https://pypi.org/simple/ db-gpt==0.3.0 +``` +Note: make sure python>=3.10 + + +### Environment Setup + +By default, if you use the EmbeddingEngine API, + +you need to prepare embedding models from HuggingFace. + +Note: make sure you have installed git-lfs +```{tip} +git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 +git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese +``` +version: +- db-gpt 0.3.0 + - [embedding_engine api](https://db-gpt.readthedocs.io/en/latest/modules/knowledge.html) + - [multi source embedding](https://db-gpt.readthedocs.io/en/latest/modules/knowledge/pdf/pdf_embedding.html) + - [vector connector](https://db-gpt.readthedocs.io/en/latest/modules/vector.html) + diff --git a/docs/index.rst b/docs/index.rst index 50f1a6214..b52edc93c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -48,6 +48,7 @@ Getting Started :hidden: getting_started/getting_started.md + getting_started/installation.md getting_started/concepts.md getting_started/tutorials.md diff --git a/docs/modules/knowledge.rst b/docs/modules/knowledge.rst index bfb6a7cb4..e4714c54b 100644 --- a/docs/modules/knowledge.rst +++ b/docs/modules/knowledge.rst @@ -105,7 +105,12 @@ Document type can be .txt, .pdf, .md, .doc, .ppt. Note that the default vector model used is text2vec-large-chinese (which is a large model, so if your personal computer configuration is not enough, it is recommended to use text2vec-base-chinese). Therefore, ensure that you download the model and place it in the models directory. -- `pdf_embedding <./knowledge/pdf_embedding.html>`_: supported pdf embedding. +- `pdf_embedding <./knowledge/pdf/pdf_embedding.html>`_: supports pdf embedding. +- `markdown_embedding <./knowledge/markdown/markdown_embedding.html>`_: supports markdown embedding. +- `word_embedding <./knowledge/word/word_embedding.html>`_: supports word embedding.
+- `url_embedding <./knowledge/url/url_embedding.html>`_: supports url embedding. +- `ppt_embedding <./knowledge/ppt/ppt_embedding.html>`_: supports ppt embedding. +- `string_embedding <./knowledge/string/string_embedding.html>`_: supports raw text embedding. .. toctree:: @@ -118,4 +123,5 @@ Note that the default vector model used is text2vec-large-chinese (which is a la ./knowledge/markdown/markdown_embedding.md ./knowledge/word/word_embedding.md ./knowledge/url/url_embedding.md - ./knowledge/ppt/ppt_embedding.md \ No newline at end of file + ./knowledge/ppt/ppt_embedding.md + ./knowledge/string/string_embedding.md \ No newline at end of file diff --git a/docs/modules/knowledge/markdown/markdown_embedding.md b/docs/modules/knowledge/markdown/markdown_embedding.md index 72a5d2c38..870772eb2 100644 --- a/docs/modules/knowledge/markdown/markdown_embedding.md +++ b/docs/modules/knowledge/markdown/markdown_embedding.md @@ -1,4 +1,4 @@ -MarkdownEmbedding +Markdown ================================== markdown embedding can import md text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods. diff --git a/docs/modules/knowledge/pdf/pdf_embedding.md b/docs/modules/knowledge/pdf/pdf_embedding.md index 0b349b406..d6366f65f 100644 --- a/docs/modules/knowledge/pdf/pdf_embedding.md +++ b/docs/modules/knowledge/pdf/pdf_embedding.md @@ -1,4 +1,4 @@ -PDFEmbedding +PDF ================================== pdf embedding can import PDF text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
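The embedding classes listed in knowledge.rst all share the same three-step lifecycle described there: read (loading data), data_process (data processing), and index_to_store (embedding to the vector database). A minimal sketch of that shared shape, where the class name and method bodies are illustrative assumptions rather than DB-GPT's actual SourceEmbedding implementation:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class EmbeddingPipelineSketch(ABC):
    """Hypothetical sketch of the shared lifecycle:
    read -> data_process -> index_to_store."""

    @abstractmethod
    def read(self) -> List[Any]:
        """Load the source (pdf/markdown/word/url/ppt/string) into document chunks."""

    def data_process(self, documents: List[Any]) -> List[Any]:
        """Optional per-format cleanup; the default is a pass-through."""
        return documents

    @abstractmethod
    def index_to_store(self, documents: List[Any]) -> Any:
        """Embed the processed chunks into the vector database."""

    def run(self) -> Any:
        # The three steps always execute in this order.
        return self.index_to_store(self.data_process(self.read()))
```

Each per-format class then only needs to override read() (and optionally data_process()) while the storage step stays common.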
diff --git a/docs/modules/knowledge/ppt/ppt_embedding.md b/docs/modules/knowledge/ppt/ppt_embedding.md index c23b25a78..84e5b0135 100644 --- a/docs/modules/knowledge/ppt/ppt_embedding.md +++ b/docs/modules/knowledge/ppt/ppt_embedding.md @@ -1,4 +1,4 @@ -PPTEmbedding +PPT ================================== ppt embedding can import ppt text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods. diff --git a/docs/modules/knowledge/string/string_embedding.md b/docs/modules/knowledge/string/string_embedding.md new file mode 100644 index 000000000..385a79ff8 --- /dev/null +++ b/docs/modules/knowledge/string/string_embedding.md @@ -0,0 +1,41 @@ +String +================================== +string embedding can import long raw text into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods.
+ +Inherit from SourceEmbedding: +``` +class StringEmbedding(SourceEmbedding): + """String embedding for reading a raw-text document.""" + + def __init__( + self, + file_path, + vector_store_config, + text_splitter: Optional[TextSplitter] = None, + ): + """Initialize raw text word path.""" + super().__init__(file_path=file_path, vector_store_config=vector_store_config) + self.file_path = file_path + self.vector_store_config = vector_store_config + self.text_splitter = text_splitter or None +``` + +Implement read() and data_process(). +The read() method allows you to read the data and split it into chunks: +``` + @register + def read(self): + """Load from String path.""" + metadata = {"source": "raw text"} + return [Document(page_content=self.file_path, metadata=metadata)] +``` +The data_process() method allows you to pre-process the documents in your own way: +``` + @register + def data_process(self, documents: List[Document]): + i = 0 + for d in documents: + documents[i].page_content = d.page_content.replace("\n", "") + i += 1 + return documents +``` diff --git a/docs/modules/knowledge/url/url_embedding.md b/docs/modules/knowledge/url/url_embedding.md index d485d30b1..f23fa82d5 100644 --- a/docs/modules/knowledge/url/url_embedding.md +++ b/docs/modules/knowledge/url/url_embedding.md @@ -1,4 +1,4 @@ -URL Embedding +URL ================================== url embedding can import web page text from a URL into a vector knowledge base. The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods. diff --git a/docs/modules/knowledge/word/word_embedding.md b/docs/modules/knowledge/word/word_embedding.md index 38a966f0e..b2df34d31 100644 --- a/docs/modules/knowledge/word/word_embedding.md +++ b/docs/modules/knowledge/word/word_embedding.md @@ -1,4 +1,4 @@ -WordEmbedding +Word ================================== word embedding can import word doc/docx text into a vector knowledge base.
The entire embedding process includes the read (loading data), data_process (data processing), and index_to_store (embedding to the vector database) methods. diff --git a/pilot/embedding_engine/string_embedding.py b/pilot/embedding_engine/string_embedding.py index 64d81899b..6a7b0c959 100644 --- a/pilot/embedding_engine/string_embedding.py +++ b/pilot/embedding_engine/string_embedding.py @@ -24,7 +24,7 @@ class StringEmbedding(SourceEmbedding): @register def read(self): """Load from String path.""" - metadata = {"source": "db_summary"} + metadata = {"source": "raw text"} return [Document(page_content=self.file_path, metadata=metadata)] @register
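The pilot/embedding_engine/string_embedding.py hunk above only swaps the metadata source tag from "db_summary" to "raw text". A standalone re-creation of the two patched methods, so their behavior can be checked without installing DB-GPT; the Document dataclass here is a stand-in for the real document type, which is an assumption of this sketch:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Stand-in for the real Document type used by DB-GPT.
    page_content: str
    metadata: dict = field(default_factory=dict)


def read(raw_text: str) -> List[Document]:
    # Mirrors StringEmbedding.read(): for string embedding, "file_path"
    # carries the raw text itself, and after this patch the source tag
    # is "raw text" rather than "db_summary".
    return [Document(page_content=raw_text, metadata={"source": "raw text"})]


def data_process(documents: List[Document]) -> List[Document]:
    # Mirrors StringEmbedding.data_process(): strip newlines from every
    # chunk before it is embedded.
    for i, d in enumerate(documents):
        documents[i].page_content = d.page_content.replace("\n", "")
    return documents


docs = data_process(read("DB-GPT imports\nraw text"))
print(docs[0].page_content)  # -> DB-GPT importsraw text
print(docs[0].metadata)      # -> {'source': 'raw text'}
```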