[Feature][VectorStore] Support StarRocks as vector db (#6119)

Fixes # (issue)

#### Before submitting

Here are some examples of using StarRocks as a vectordb:

```
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import StarRocks
from langchain.vectorstores.starrocks import StarRocksSettings

embeddings = OpenAIEmbeddings()

# configure starrocks settings
settings = StarRocksSettings()
settings.port = 41003
settings.host = '127.0.0.1'
settings.username = 'root'
settings.password = ''
settings.database = 'zya'

# to fill new embeddings (split_docs is a list of already-split documents)
docsearch = StarRocks.from_documents(split_docs, embeddings, config = settings)

# or to use already-built embeddings in database.
docsearch = StarRocks(embeddings, settings)
```

#### Who can review?

Tag maintainers/contributors who might be interested: @dev2049

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
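
For readers new to vector stores, here is a minimal sketch of the kind of ranking a similarity search performs over stored embeddings. It is plain Python with no database, and it assumes cosine similarity as the metric; it is not the query this integration actually issues. The helper names (`cosine_similarity`, `top_k`) and the three-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (document, embedding) row and keep the k most similar."""
    scored = [(doc, cosine_similarity(query, emb)) for doc, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy rows standing in for the (document, embedding) columns of the table.
rows = [
    ("how to enable profile", [1.0, 0.0, 0.0]),
    ("compile with docker", [0.0, 1.0, 0.0]),
    ("profile settings", [0.9, 0.1, 0.0]),
]
best = top_k([1.0, 0.0, 0.0], rows, k=2)
for doc, score in best:
    print(f"{score:.3f}  {doc}")
```

A real store performs this scan inside the database engine over every stored row, which is where a fast execution engine helps.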
2025-09-06 05:25:04 +00:00 · 2023-06-22 00:02:33 +08:00
parent 7a4ff424fc
commit 57cc3d1d3d
3 changed files with 773 additions and 0 deletions
--- a/docs/extras/modules/data_connection/vectorstores/integrations/starrocks.ipynb
+++ b/docs/extras/modules/data_connection/vectorstores/integrations/starrocks.ipynb
@@ -0,0 +1,313 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "59723cea",
+   "metadata": {},
+   "source": [
+    "# StarRocks\n",
+    "\n",
+    "[StarRocks | A High-Performance Analytical Database](https://www.starrocks.io/)\n",
+    "\n",
+    "StarRocks is a next-gen, sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.\n",
+    "\n",
+    "StarRocks is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Because it has a very fast vectorized execution engine, it can also be used as a fast vectordb.\n",
+    "\n",
+    "Here we'll show how to use the StarRocks Vector Store."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1685854f",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Import all used modules"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c891bba",
+   "metadata": {},
+   "source": [
+    "Set `update_vectordb = False` at the beginning. If no docs have been updated, we don't need to rebuild their embeddings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "3c85fb93",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/dirlt/utils/py3env/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (5.1.0)/charset_normalizer (2.0.9) doesn't match a supported version!\n",
+      "  warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.vectorstores import StarRocks\n",
+    "from langchain.vectorstores.starrocks import StarRocksSettings\n",
+    "from langchain.vectorstores import Chroma\n",
+    "from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter\n",
+    "from langchain import OpenAI\n",
+    "from langchain.document_loaders import DirectoryLoader\n",
+    "from langchain.chains import RetrievalQA\n",
+    "from langchain.document_loaders import TextLoader, UnstructuredMarkdownLoader\n",
+    "\n",
+    "update_vectordb = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee821c00",
+   "metadata": {},
+   "source": [
+    "## Load docs and split them into chunks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "34ba0cfd",
+   "metadata": {},
+   "source": [
+    "Load all markdown files under the `docs` directory.\n",
+    "\n",
+    "To get the StarRocks documentation, clone the repo from https://github.com/StarRocks/starrocks; it contains a `docs` directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "85912696",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = DirectoryLoader('./docs', glob='**/*.md', loader_cls=UnstructuredMarkdownLoader)\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b415fe2a",
+   "metadata": {},
+   "source": [
+    "Split the docs into chunks of a fixed token count, and set `update_vectordb = True` because there are new chunks to embed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "07e8acff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load text splitter and split docs into snippets of text\n",
+    "text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
+    "split_docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "# tell vectordb to update text embeddings\n",
+    "update_vectordb = True"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "1f365370",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(page_content='Compile StarRocks with Docker\\n\\nThis topic describes how to compile StarRocks using Docker.\\n\\nOverview\\n\\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\\n\\nStarRocks version and DEV ENV image\\n\\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\\n\\nFor Ubuntu 22.04:\\n\\n| Branch name | Image name              |\\n  | --------------- | ----------------------------------- |\\n  | main            | starrocks/dev-env-ubuntu:latest     |\\n  | branch-3.0      | starrocks/dev-env-ubuntu:3.0-latest |\\n  | branch-2.5      | starrocks/dev-env-ubuntu:2.5-latest |\\n\\nFor CentOS 7.9:\\n\\n| Branch name | Image name                       |\\n  | --------------- | ------------------------------------ |\\n  | main            | starrocks/dev-env-centos7:latest     |\\n  | branch-3.0      | starrocks/dev-env-centos7:3.0-latest |\\n  | branch-2.5      | starrocks/dev-env-centos7:2.5-latest |\\n\\nPrerequisites\\n\\nBefore compiling StarRocks, make sure the following requirements are satisfied:\\n\\nHardware\\n\\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "split_docs[-20]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "50012b29",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "# docs  = 657, # splits = 2802\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('# docs  = %d, # splits = %d' % (len(documents), len(split_docs)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5371f152",
+   "metadata": {},
+   "source": [
+    "## Create vectordb instance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15702d9c",
+   "metadata": {},
+   "source": [
+    "### Use StarRocks as vectordb"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "ced7dbe1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def gen_starrocks(update_vectordb, embeddings, settings):\n",
+    "    if update_vectordb:\n",
+    "        docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)\n",
+    "    else:\n",
+    "        docsearch = StarRocks(embeddings, settings)\n",
+    "    return docsearch\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15d86fda",
+   "metadata": {},
+   "source": [
+    "## Convert the chunks into embeddings and store them in the vectordb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff1322ea",
+   "metadata": {},
+   "source": [
+    "Here we use StarRocks as the vectordb. You can configure the StarRocks instance via `StarRocksSettings`.\n",
+    "\n",
+    "Configuring a StarRocks instance is much like configuring a MySQL instance. You need to specify:\n",
+    "1. host/port\n",
+    "2. username (default: 'root')\n",
+    "3. password (default: '')\n",
+    "4. database (default: 'default')\n",
+    "5. table (default: 'langchain')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "26410d9b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Inserting data...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2802/2802 [02:26<00:00, 19.11it/s]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[92m\u001b[1mzya.langchain @ 127.0.0.1:41003\u001b[0m\n",
+      "\n",
+      "\u001b[1musername: root\u001b[0m\n",
+      "\n",
+      "Table Schema:\n",
+      "----------------------------------------------------------------------------\n",
+      "|\u001b[94mname                    \u001b[0m|\u001b[96mtype                    \u001b[0m|\u001b[96mkey                     \u001b[0m|\n",
+      "----------------------------------------------------------------------------\n",
+      "|\u001b[94mid                      \u001b[0m|\u001b[96mvarchar(65533)          \u001b[0m|\u001b[96mtrue                    \u001b[0m|\n",
+      "|\u001b[94mdocument                \u001b[0m|\u001b[96mvarchar(65533)          \u001b[0m|\u001b[96mfalse                   \u001b[0m|\n",
+      "|\u001b[94membedding               \u001b[0m|\u001b[96marray<float>            \u001b[0m|\u001b[96mfalse                   \u001b[0m|\n",
+      "|\u001b[94mmetadata                \u001b[0m|\u001b[96mvarchar(65533)          \u001b[0m|\u001b[96mfalse                   \u001b[0m|\n",
+      "----------------------------------------------------------------------------\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "embeddings = OpenAIEmbeddings()\n",
+    "\n",
+    "# configure starrocks settings (host/port/user/pw/db)\n",
+    "settings = StarRocksSettings()\n",
+    "settings.port = 41003\n",
+    "settings.host = '127.0.0.1'\n",
+    "settings.username = 'root'\n",
+    "settings.password = ''\n",
+    "settings.database = 'zya'\n",
+    "docsearch = gen_starrocks(update_vectordb, embeddings, settings)\n",
+    "\n",
+    "print(docsearch)\n",
+    "\n",
+    "update_vectordb = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bde66626",
+   "metadata": {},
+   "source": [
+    "## Build a QA chain and ask it a question"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "84921814",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " No, profile is not enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`\n"
+     ]
+    }
+   ],
+   "source": [
+    "llm = OpenAI()\n",
+    "qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever())\n",
+    "query = \"is profile enabled by default? if not, how to enable profile?\"\n",
+    "resp = qa.run(query)\n",
+    "print(resp)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}