mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-09 15:03:21 +00:00
community[minor]: Add Apache Doris as vector store (#17527)
--------- Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
21
docs/docs/integrations/providers/apache_doris.mdx
Normal file
21
docs/docs/integrations/providers/apache_doris.mdx
Normal file
@@ -0,0 +1,21 @@
|
||||
# Apache Doris
|
||||
|
||||
>[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.
|
||||
It delivers lightning-fast analytics on real-time data at scale.
|
||||
|
||||
>Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
|
||||
```bash
|
||||
pip install pymysql
|
||||
```
|
||||
|
||||
## Vector Store
|
||||
|
||||
See a [usage example](/docs/integrations/vectorstores/apache_doris).
|
||||
|
||||
```python
|
||||
from langchain_community.vectorstores import ApacheDoris
|
||||
```
|
322
docs/docs/integrations/vectorstores/apache_doris.ipynb
Normal file
322
docs/docs/integrations/vectorstores/apache_doris.ipynb
Normal file
@@ -0,0 +1,322 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "84180ad0-66cd-43e5-b0b8-2067a29e16ba",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"# Apache Doris\n",
|
||||
"\n",
|
||||
">[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.\n",
|
||||
"It delivers lightning-fast analytics on real-time data at scale.\n",
|
||||
"\n",
|
||||
">Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.\n",
|
||||
"\n",
|
||||
"Here we'll show how to use the Apache Doris Vector Store."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1685854f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%pip install --upgrade --quiet pymysql"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2c891bba",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Set `update_vectordb = False` at the beginning. If there is no docs updated, then we don't need to rebuild the embeddings of docs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f4e6ca20-79dd-482a-8f68-af9d7dd59c7c",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install sqlalchemy\n",
|
||||
"!pip install langchain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "96f7c7a2-4811-4fdf-87f5-c60772f51fe1",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-02-14T12:54:01.392500Z",
|
||||
"start_time": "2024-02-14T12:53:58.866615Z"
|
||||
},
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import RetrievalQA\n",
|
||||
"from langchain.text_splitter import TokenTextSplitter\n",
|
||||
"from langchain_community.document_loaders import (\n",
|
||||
" DirectoryLoader,\n",
|
||||
" UnstructuredMarkdownLoader,\n",
|
||||
")\n",
|
||||
"from langchain_community.vectorstores.apache_doris import (\n",
|
||||
" ApacheDoris,\n",
|
||||
" ApacheDorisSettings,\n",
|
||||
")\n",
|
||||
"from langchain_openai import OpenAI, OpenAIEmbeddings\n",
|
||||
"\n",
|
||||
"update_vectordb = False"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ee821c00",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load docs and split them into tokens"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "34ba0cfd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load all markdown files under the `docs` directory\n",
|
||||
"\n",
|
||||
"for Apache Doris documents, you can clone repo from https://github.com/apache/doris, and there is `docs` directory in it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "799edf20-bcf4-4a65-bff7-b907f6bdba20",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-02-14T12:55:24.128917Z",
|
||||
"start_time": "2024-02-14T12:55:19.463831Z"
|
||||
},
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = DirectoryLoader(\n",
|
||||
" \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n",
|
||||
")\n",
|
||||
"documents = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b415fe2a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Split docs into tokens, and set `update_vectordb = True` because there are new docs/tokens."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "0dc5ba83-62ef-4f61-a443-e872f251e7da",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# load text splitter and split docs into snippets of text\n",
|
||||
"text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
|
||||
"split_docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"# tell vectordb to update text embeddings\n",
|
||||
"update_vectordb = True"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "46966e25-9449-4a36-87d1-c0b25dce2994",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"split_docs[-20]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "99422e95-b407-43eb-aa68-9a62363fc82f",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e780d77f-3f96-4690-a10f-f87566f7ccc6",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"## Create vectordb instance"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "15702d9c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Use Apache Doris as vectordb"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "ced7dbe1",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-02-14T12:55:39.508287Z",
|
||||
"start_time": "2024-02-14T12:55:39.500370Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def gen_apache_doris(update_vectordb, embeddings, settings):\n",
|
||||
" if update_vectordb:\n",
|
||||
" docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)\n",
|
||||
" else:\n",
|
||||
" docsearch = ApacheDoris(embeddings, settings)\n",
|
||||
" return docsearch"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "15d86fda",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Convert tokens into embeddings and put them into vectordb"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ff1322ea",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here we use Apache Doris as vectordb, you can configure Apache Doris instance via `ApacheDorisSettings`.\n",
|
||||
"\n",
|
||||
"Configuring Apache Doris instance is pretty much like configuring mysql instance. You need to specify:\n",
|
||||
"1. host/port\n",
|
||||
"2. username(default: 'root')\n",
|
||||
"3. password(default: '')\n",
|
||||
"4. database(default: 'default')\n",
|
||||
"5. table(default: 'langchain')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "b34f8c31-c173-4902-8168-2e838ddfb9e9",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-02-14T12:56:02.671291Z",
|
||||
"start_time": "2024-02-14T12:55:48.350294Z"
|
||||
},
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from getpass import getpass\n",
|
||||
"\n",
|
||||
"os.environ[\"OPENAI_API_KEY\"] = getpass()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c53ab3f2-9e34-4424-8b07-6292bde67e14",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"update_vectordb = True\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()\n",
|
||||
"\n",
|
||||
"# configure Apache Doris settings(host/port/user/pw/db)\n",
|
||||
"settings = ApacheDorisSettings()\n",
|
||||
"settings.port = 9030\n",
|
||||
"settings.host = \"172.30.34.130\"\n",
|
||||
"settings.username = \"root\"\n",
|
||||
"settings.password = \"\"\n",
|
||||
"settings.database = \"langchain\"\n",
|
||||
"docsearch = gen_apache_doris(update_vectordb, embeddings, settings)\n",
|
||||
"\n",
|
||||
"print(docsearch)\n",
|
||||
"\n",
|
||||
"update_vectordb = False"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bde66626",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Build QA and ask question to it"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "84921814",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI()\n",
|
||||
"qa = RetrievalQA.from_chain_type(\n",
|
||||
" llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
|
||||
")\n",
|
||||
"query = \"what is apache doris\"\n",
|
||||
"resp = qa.run(query)\n",
|
||||
"print(resp)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
Reference in New Issue
Block a user