[Feature][VectorStore] Support StarRocks as vector db (#6119)

Here are some examples of using StarRocks as a vector database:

```
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import StarRocks
from langchain.vectorstores.starrocks import StarRocksSettings

embeddings = OpenAIEmbeddings()

# configure StarRocks connection settings
settings = StarRocksSettings()
settings.port = 41003
settings.host = '127.0.0.1'
settings.username = 'root'
settings.password = ''
settings.database = 'zya'

# build new embeddings from documents and store them
# (split_docs is a list of Documents produced by a text splitter)
docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)

# or reuse embeddings that are already stored in the database
docsearch = StarRocks(embeddings, settings)
```
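
For readers new to vector stores, here is a minimal sketch of what a similarity search over stored embeddings amounts to: score every stored row against the query embedding and keep the best matches. It is plain Python and does not call StarRocks or LangChain. The `Row` type and the `cosine_similarity` and `top_k` helpers are hypothetical names made up for illustration, and cosine similarity is an assumption; the integration in this PR may rank results differently, for example inside the database.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Row:
    """Hypothetical in-memory stand-in for one stored row."""
    id: str
    document: str
    embedding: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(rows: List[Row], query_embedding: List[float], k: int = 4) -> List[Row]:
    """Brute-force search: score every row, return the k most similar."""
    scored = [(cosine_similarity(r.embedding, query_embedding), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]


# Toy two-dimensional embeddings, only to show the ranking step.
rows = [
    Row("a", "doc about profiles", [1.0, 0.0]),
    Row("b", "doc about docker", [0.0, 1.0]),
    Row("c", "doc about both", [1.0, 1.0]),
]
best = top_k(rows, [1.0, 0.1], k=2)
print([r.id for r in best])
```

A brute-force scan like this is linear in the number of rows, which is why a fast execution engine matters once the table holds thousands of chunks.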

#### Who can review?

Tag maintainers/contributors who might be interested:

@dev2049 

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
dirtysalt
2023-06-22 00:02:33 +08:00
committed by GitHub
parent 7a4ff424fc
commit 57cc3d1d3d
3 changed files with 773 additions and 0 deletions

@@ -0,0 +1,313 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "59723cea",
"metadata": {},
"source": [
"# StarRocks\n",
"\n",
"[StarRocks | A High-Performance Analytical Database](https://www.starrocks.io/)\n",
"\n",
"StarRocks is a next-generation, sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.\n",
"\n",
"StarRocks is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Because of its vectorized execution engine, it can also serve as a fast vector database.\n",
"\n",
"Here we'll show how to use the StarRocks Vector Store."
]
},
{
"cell_type": "markdown",
"id": "1685854f",
"metadata": {},
"source": [
"\n",
"## Import all used modules"
]
},
{
"cell_type": "markdown",
"id": "2c891bba",
"metadata": {},
"source": [
"Set `update_vectordb = False` at the beginning. If no docs have been updated, we don't need to rebuild their embeddings."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3c85fb93",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/dirlt/utils/py3env/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (5.1.0)/charset_normalizer (2.0.9) doesn't match a supported version!\n",
" warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n"
]
}
],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import StarRocks\n",
"from langchain.vectorstores.starrocks import StarRocksSettings\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter\n",
"from langchain import OpenAI, VectorDBQA\n",
"from langchain.document_loaders import DirectoryLoader\n",
"from langchain.chains import RetrievalQA\n",
"from langchain.document_loaders import TextLoader, UnstructuredMarkdownLoader\n",
"\n",
"update_vectordb = False"
]
},
{
"cell_type": "markdown",
"id": "ee821c00",
"metadata": {},
"source": [
"## Load docs and split them into tokens"
]
},
{
"cell_type": "markdown",
"id": "34ba0cfd",
"metadata": {},
"source": [
"Load all Markdown files under the `docs` directory.\n",
"\n",
"For the StarRocks documentation, you can clone the repo from https://github.com/StarRocks/starrocks, which contains a `docs` directory."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "85912696",
"metadata": {},
"outputs": [],
"source": [
"loader = DirectoryLoader('./docs', glob='**/*.md', loader_cls=UnstructuredMarkdownLoader)\n",
"documents = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "b415fe2a",
"metadata": {},
"source": [
"Split the docs into token-based chunks, and set `update_vectordb = True` because there are new docs/chunks."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "07e8acff",
"metadata": {},
"outputs": [],
"source": [
"# load text splitter and split docs into snippets of text\n",
"text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
"split_docs = text_splitter.split_documents(documents)\n",
"\n",
"# tell vectordb to update text embeddings\n",
"update_vectordb = True"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1f365370",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Compile StarRocks with Docker\\n\\nThis topic describes how to compile StarRocks using Docker.\\n\\nOverview\\n\\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\\n\\nStarRocks version and DEV ENV image\\n\\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\\n\\nFor Ubuntu 22.04:\\n\\n| Branch name | Image name |\\n | --------------- | ----------------------------------- |\\n | main | starrocks/dev-env-ubuntu:latest |\\n | branch-3.0 | starrocks/dev-env-ubuntu:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-ubuntu:2.5-latest |\\n\\nFor CentOS 7.9:\\n\\n| Branch name | Image name |\\n | --------------- | ------------------------------------ |\\n | main | starrocks/dev-env-centos7:latest |\\n | branch-3.0 | starrocks/dev-env-centos7:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-centos7:2.5-latest |\\n\\nPrerequisites\\n\\nBefore compiling StarRocks, make sure the following requirements are satisfied:\\n\\nHardware\\n\\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"split_docs[-20]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "50012b29",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# docs = 657, # splits = 2802\n"
]
}
],
"source": [
"print('# docs = %d, # splits = %d' % (len(documents), len(split_docs)))"
]
},
{
"cell_type": "markdown",
"id": "5371f152",
"metadata": {},
"source": [
"## Create vectordb instance"
]
},
{
"cell_type": "markdown",
"id": "15702d9c",
"metadata": {},
"source": [
"### Use StarRocks as vectordb"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ced7dbe1",
"metadata": {},
"outputs": [],
"source": [
"def gen_starrocks(update_vectordb, embeddings, settings):\n",
"    if update_vectordb:\n",
"        docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)\n",
"    else:\n",
"        docsearch = StarRocks(embeddings, settings)\n",
"    return docsearch\n"
]
},
{
"cell_type": "markdown",
"id": "15d86fda",
"metadata": {},
"source": [
"## Convert tokens into embeddings and put them into vectordb"
]
},
{
"cell_type": "markdown",
"id": "ff1322ea",
"metadata": {},
"source": [
"Here we use StarRocks as the vector database. You can configure the StarRocks instance via `StarRocksSettings`.\n",
"\n",
"Configuring a StarRocks instance is much like configuring a MySQL instance. You need to specify:\n",
"1. host/port\n",
"2. username (default: 'root')\n",
"3. password (default: '')\n",
"4. database (default: 'default')\n",
"5. table (default: 'langchain')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "26410d9b",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Inserting data...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2802/2802 [02:26<00:00, 19.11it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[92m\u001b[1mzya.langchain @ 127.0.0.1:41003\u001b[0m\n",
"\n",
"\u001b[1musername: root\u001b[0m\n",
"\n",
"Table Schema:\n",
"----------------------------------------------------------------------------\n",
"|\u001b[94mname \u001b[0m|\u001b[96mtype \u001b[0m|\u001b[96mkey \u001b[0m|\n",
"----------------------------------------------------------------------------\n",
"|\u001b[94mid \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mtrue \u001b[0m|\n",
"|\u001b[94mdocument \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
"|\u001b[94membedding \u001b[0m|\u001b[96marray<float> \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
"|\u001b[94mmetadata \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
"----------------------------------------------------------------------------\n",
"\n"
]
}
],
"source": [
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# configure StarRocks settings (host/port/user/pw/db)\n",
"settings = StarRocksSettings()\n",
"settings.port = 41003\n",
"settings.host = '127.0.0.1'\n",
"settings.username = 'root'\n",
"settings.password = ''\n",
"settings.database = 'zya'\n",
"docsearch = gen_starrocks(update_vectordb, embeddings, settings)\n",
"\n",
"print(docsearch)\n",
"\n",
"update_vectordb = False"
]
},
{
"cell_type": "markdown",
"id": "bde66626",
"metadata": {},
"source": [
"## Build a QA chain and ask it a question"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "84921814",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" No, profile is not enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`\n"
]
}
],
"source": [
"llm = OpenAI()\n",
"qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever())\n",
"query = \"is profile enabled by default? if not, how to enable profile?\"\n",
"resp = qa.run(query)\n",
"print(resp)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}