mirror of https://github.com/hwchase17/langchain.git, synced 2025-07-17 02:03:44 +00:00

Harrison/myscale (#3352)

Co-authored-by: Fangrui Liu <fangruil@moqi.ai>
Co-authored-by: 刘 方瑞 <fangrui.liu@outlook.com>
Co-authored-by: Fangrui.Liu <fangrui.liu@ubc.ca>

parent 6200a2a00e, commit a6664be79c

docs/ecosystem/myscale.md (new file, 65 lines)
@ -0,0 +1,65 @@

# MyScale

This page covers how to use the MyScale vector database within LangChain.
It is broken into two parts: installation and setup, and then references to specific MyScale wrappers.

With MyScale, you can manage both structured and unstructured (vectorized) data, and perform joint queries and analytics on both types of data using SQL. Plus, MyScale's cloud-native OLAP architecture, built on top of ClickHouse, enables lightning-fast data processing even on massive datasets.

## Introduction

[Overview of MyScale and high-performance vector search](https://docs.myscale.com/en/overview/)

You can now register on our SaaS and [start a cluster now!](https://docs.myscale.com/en/quickstart/)

If you are also interested in how we managed to integrate SQL and vector search, please refer to [this document](https://docs.myscale.com/en/vector-reference/) for further syntax reference.

We also provide a live demo on Hugging Face! Please check out our [Hugging Face space](https://huggingface.co/myscale); it searches millions of vectors within a blink!

## Installation and Setup

- Install the Python SDK with `pip install clickhouse-connect`

### Setting up environments

There are two ways to set up parameters for the MyScale index.

1. Environment Variables

Before you run the app, please set the environment variable with `export`:
`export MYSCALE_URL='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`

You can easily find your account, password and other info on our SaaS. For details, please refer to [this document](https://docs.myscale.com/en/cluster-management/).

Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_` and is case insensitive. (A sketch that relies only on these variables follows the example below.)

2. Create a `MyScaleSettings` object with parameters

```python
from langchain.vectorstores import MyScale, MyScaleSettings

config = MyScaleSettings(host="<your-backend-url>", port=8443, ...)
index = MyScale(embedding_function, config)
index.add_documents(...)
```
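
As a minimal sketch of option 1 (an editor-added example, under the assumption that the `MYSCALE_`-prefixed environment variables matching the `MyScaleSettings` field names are already exported), the config argument can be omitted entirely, since `MyScaleSettings` is a pydantic `BaseSettings` and reads values from the environment:

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import MyScale

# No explicit config: MyScale constructs a default MyScaleSettings, which
# picks up MYSCALE_-prefixed environment variables. OpenAIEmbeddings is
# just one possible embedding choice here.
index = MyScale(OpenAIEmbeddings())
```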

## Wrappers

Supported functions:

- `add_texts`
- `add_documents`
- `from_texts`
- `from_documents`
- `similarity_search`
- `asimilarity_search`
- `similarity_search_by_vector`
- `asimilarity_search_by_vector`
- `similarity_search_with_relevance_scores`

### VectorStore

There exists a wrapper around the MyScale database, allowing you to use it as a vectorstore, whether for semantic search or similar example retrieval.

To import this vectorstore:
```python
from langchain.vectorstores import MyScale
```

For a more detailed walkthrough of the MyScale wrapper, see [this notebook](../modules/indexes/vectorstores/examples/myscale.ipynb)
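
As a quick orientation to the functions listed above, here is a hedged end-to-end sketch (an editor-added example, not from the original docs). It assumes a reachable MyScale cluster configured through environment variables, and uses `OpenAIEmbeddings` purely as an example embedding model:

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import MyScale

embeddings = OpenAIEmbeddings()

# Build an index from raw strings; metadata dictionaries are optional.
store = MyScale.from_texts(
    ["foo", "bar", "baz"],
    embeddings,
    metadatas=[{"doc_id": i} for i in range(3)],
)

docs = store.similarity_search("foo", k=2)
by_vector = store.similarity_search_by_vector(embeddings.embed_query("foo"), k=2)

# The scored variant returns (Document, distance) pairs; a WHERE clause can
# be pushed down via where_str. Never build where_str from untrusted input
# (SQL injection).
scored = store.similarity_search_with_relevance_scores(
    "foo", k=2, where_str=f"{store.metadata_column}.doc_id < 2"
)
for doc, dist in scored:
    print(dist, doc.page_content)

store.drop()  # helper that drops the backing table
```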

docs/modules/indexes/vectorstores/examples/myscale.ipynb (new file, 267 lines)

@ -0,0 +1,267 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "683953b3",
   "metadata": {},
   "source": [
    "# MyScale\n",
    "\n",
    "This notebook shows how to use functionality related to the MyScale vector database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "aac9563e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import MyScale\n",
    "from langchain.document_loaders import TextLoader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a9d16fa3",
   "metadata": {},
   "source": [
    "## Setting up environments\n",
    "\n",
    "There are two ways to set up parameters for the MyScale index.\n",
    "\n",
    "1. Environment Variables\n",
    "\n",
    "    Before you run the app, please set the environment variable with `export`:\n",
    "    `export MYSCALE_URL='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`\n",
    "\n",
    "    You can easily find your account, password and other info on our SaaS. For details, please refer to [this document](https://docs.myscale.com/en/cluster-management/).\n",
    "\n",
    "    Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_` and is case insensitive.\n",
    "\n",
    "2. Create a `MyScaleSettings` object with parameters\n",
    "\n",
    "\n",
    "    ```python\n",
    "    from langchain.vectorstores import MyScale, MyScaleSettings\n",
    "    config = MyScaleSettings(host=\"<your-backend-url>\", port=8443, ...)\n",
    "    index = MyScale(embedding_function, config)\n",
    "    index.add_documents(...)\n",
    "    ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a3c3999a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "loader = TextLoader('../../../state_of_the_union.txt')\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6e104aee",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inserting data...: 100%|██████████| 42/42 [00:18<00:00, 2.21it/s]\n"
     ]
    }
   ],
   "source": [
    "for d in docs:\n",
    "    d.metadata = {'some': 'metadata'}\n",
    "docsearch = MyScale.from_documents(docs, embeddings)\n",
    "\n",
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "docs = docsearch.similarity_search(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9c608226",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "As Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment they’re conducting on our children for profit. \n",
      "\n",
      "It’s time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. \n",
      "\n",
      "And let’s get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. \n",
      "\n",
      "Third, support our veterans. \n",
      "\n",
      "Veterans are the best of us. \n",
      "\n",
      "I’ve always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. \n",
      "\n",
      "My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free. \n",
      "\n",
      "Our troops in Iraq and Afghanistan faced many dangers.\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e3a8b105",
   "metadata": {},
   "source": [
    "## Get connection info and data schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69996818",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(str(docsearch))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f59360c0",
   "metadata": {},
   "source": [
    "## Filtering\n",
    "\n",
    "You have direct access to the MyScale SQL WHERE statement: you can write a `WHERE` clause following standard SQL.\n",
    "\n",
    "**NOTE**: Please be aware of SQL injection; this interface must not be called directly by the end user.\n",
    "\n",
    "If you customized your `column_map` in your setting, you can search with a filter like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "232055f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inserting data...: 100%|██████████| 42/42 [00:15<00:00, 2.69it/s]\n"
     ]
    }
   ],
   "source": [
    "from langchain.vectorstores import MyScale, MyScaleSettings\n",
    "from langchain.document_loaders import TextLoader\n",
    "\n",
    "loader = TextLoader('../../../state_of_the_union.txt')\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "for i, d in enumerate(docs):\n",
    "    d.metadata = {'doc_id': i}\n",
    "\n",
    "docsearch = MyScale.from_documents(docs, embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ddbcee77",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.252379834651947 {'doc_id': 6, 'some': ''} And I’m taking robus...\n",
      "0.25022566318511963 {'doc_id': 1, 'some': ''} Groups of citizens b...\n",
      "0.2469480037689209 {'doc_id': 8, 'some': ''} And so many families...\n",
      "0.2428302764892578 {'doc_id': 0, 'some': 'metadata'} As Frances Haugen, w...\n"
     ]
    }
   ],
   "source": [
    "meta = docsearch.metadata_column\n",
    "output = docsearch.similarity_search_with_relevance_scores(\n",
    "    'What did the president say about Ketanji Brown Jackson?',\n",
    "    k=4, where_str=f\"{meta}.doc_id<10\")\n",
    "for d, dist in output:\n",
    "    print(dist, d.metadata, d.page_content[:20] + '...')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a359ed74",
   "metadata": {},
   "source": [
    "## Deleting your data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb6a9d36",
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch.drop()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48dbd8e0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@ -45,6 +45,8 @@ The following use cases require specific installs and api keys:
 - Set up Elasticsearch backend. If you want to do locally, [this](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/getting-started.html) is a good guide.
 - _FAISS_:
   - Install requirements with `pip install faiss` for Python 3.7 and `pip install faiss-cpu` for Python 3.10+.
+- _MyScale_:
+  - Install requirements with `pip install clickhouse-connect`. For documentation, please refer to [this document](https://docs.myscale.com/en/overview/).
 - _Manifest_:
   - Install requirements with `pip install manifest-ml` (Note: this is only available in Python 3.8+ currently).
 - _OpenSearch_:
@ -8,6 +8,7 @@ from langchain.vectorstores.deeplake import DeepLake
 from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
 from langchain.vectorstores.faiss import FAISS
 from langchain.vectorstores.milvus import Milvus
+from langchain.vectorstores.myscale import MyScale, MyScaleSettings
 from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch
 from langchain.vectorstores.pinecone import Pinecone
 from langchain.vectorstores.qdrant import Qdrant
@ -29,6 +30,8 @@ __all__ = [
     "AtlasDB",
     "DeepLake",
     "Annoy",
+    "MyScale",
+    "MyScaleSettings",
     "SupabaseVectorStore",
     "AnalyticDB",
 ]
langchain/vectorstores/myscale.py (new file, 433 lines)

@ -0,0 +1,433 @@
"""Wrapper around MyScale vector database."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from hashlib import sha1
|
||||||
|
from threading import Thread
|
||||||
|
from typing import Any, Dict, Iterable, List, Optional, Tuple
|
||||||
|
|
||||||
|
from pydantic import BaseSettings
|
||||||
|
|
||||||
|
from langchain.docstore.document import Document
|
||||||
|
from langchain.embeddings.base import Embeddings
|
||||||
|
from langchain.vectorstores.base import VectorStore
|
||||||
|
|
||||||
|
logger = logging.getLogger()
|
||||||
|
|
||||||
|
|
||||||
|
def has_mul_sub_str(s: str, *args: Any) -> bool:
|
||||||
|
for a in args:
|
||||||
|
if a not in s:
|
||||||
|
return False
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
class MyScaleSettings(BaseSettings):
|
||||||
|
"""MyScale Client Configuration
|
||||||
|
|
||||||
|
Attribute:
|
||||||
|
myscale_host (str) : An URL to connect to MyScale backend.
|
||||||
|
Defaults to 'localhost'.
|
||||||
|
myscale_port (int) : URL port to connect with HTTP. Defaults to 8443.
|
||||||
|
username (str) : Usernamed to login. Defaults to None.
|
||||||
|
password (str) : Password to login. Defaults to None.
|
||||||
|
index_type (str): index type string.
|
||||||
|
index_param (dict): index build parameter.
|
||||||
|
database (str) : Database name to find the table. Defaults to 'default'.
|
||||||
|
table (str) : Table name to operate on.
|
||||||
|
Defaults to 'vector_table'.
|
||||||
|
metric (str) : Metric to compute distance,
|
||||||
|
supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'.
|
||||||
|
column_map (Dict) : Column type map to project column name onto langchain
|
||||||
|
semantics. Must have keys: `text`, `id`, `vector`,
|
||||||
|
must be same size to number of columns. For example:
|
||||||
|
.. code-block:: python
|
||||||
|
{
|
||||||
|
'id': 'text_id',
|
||||||
|
'vector': 'text_embedding',
|
||||||
|
'text': 'text_plain',
|
||||||
|
'metadata': 'metadata_dictionary_in_json',
|
||||||
|
}
|
||||||
|
|
||||||
|
Defaults to identity map.
|
||||||
|
"""
|
||||||
|
|
||||||
|
host: str = "localhost"
|
||||||
|
port: int = 8443
|
||||||
|
|
||||||
|
username: Optional[str] = None
|
||||||
|
password: Optional[str] = None
|
||||||
|
|
||||||
|
index_type: str = "IVFFLAT"
|
||||||
|
index_param: Optional[Dict[str, str]] = None
|
||||||
|
|
||||||
|
column_map: Dict[str, str] = {
|
||||||
|
"id": "id",
|
||||||
|
"text": "text",
|
||||||
|
"vector": "vector",
|
||||||
|
"metadata": "metadata",
|
||||||
|
}
|
||||||
|
|
||||||
|
database: str = "default"
|
||||||
|
table: str = "langchain"
|
||||||
|
metric: str = "cosine"
|
||||||
|
|
||||||
|
def __getitem__(self, item: str) -> Any:
|
||||||
|
return getattr(self, item)
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
env_file = ".env"
|
||||||
|
env_prefix = "myscale_"
|
||||||
|
env_file_encoding = "utf-8"
|
||||||
|
|
||||||
|
|
||||||
|
class MyScale(VectorStore):
|
||||||
|
"""Wrapper around MyScale vector database
|
||||||
|
|
||||||
|
You need a `clickhouse-connect` python package, and a valid account
|
||||||
|
to connect to MyScale.
|
||||||
|
|
||||||
|
MyScale can not only search with simple vector indexes,
|
||||||
|
it also supports complex query with multiple conditions,
|
||||||
|
constraints and even sub-queries.
|
||||||
|
|
||||||
|
For more information, please visit
|
||||||
|
[myscale official site](https://docs.myscale.com/en/overview/)
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
embedding: Embeddings,
|
||||||
|
config: Optional[MyScaleSettings] = None,
|
||||||
|
**kwargs: Any,
|
||||||
|
) -> None:
|
||||||
|
"""MyScale Wrapper to LangChain
|
||||||
|
|
||||||
|
embedding_function (Embeddings):
|
||||||
|
config (MyScaleSettings): Configuration to MyScale Client
|
||||||
|
Other keyword arguments will pass into
|
||||||
|
[clickhouse-connect](https://docs.myscale.com/)
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
from clickhouse_connect import get_client
|
||||||
|
except ImportError:
|
||||||
|
raise ValueError(
|
||||||
|
"Could not import clickhouse connect python package. "
|
||||||
|
"Please install it with `pip install clickhouse-connect`."
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
self.pgbar = tqdm
|
||||||
|
except ImportError:
|
||||||
|
# Just in case if tqdm is not installed
|
||||||
|
self.pgbar = lambda x: x
|
||||||
|
super().__init__()
|
||||||
|
if config is not None:
|
||||||
|
self.config = config
|
||||||
|
else:
|
||||||
|
self.config = MyScaleSettings()
|
||||||
|
assert self.config
|
||||||
|
assert self.config.host and self.config.port
|
||||||
|
assert (
|
||||||
|
self.config.column_map
|
||||||
|
and self.config.database
|
||||||
|
and self.config.table
|
||||||
|
and self.config.metric
|
||||||
|
)
|
||||||
|
for k in ["id", "vector", "text", "metadata"]:
|
||||||
|
assert k in self.config.column_map
|
||||||
|
assert self.config.metric in ["ip", "cosine", "l2"]
|
||||||
|
|
||||||
|
# initialize the schema
|
||||||
|
dim = len(embedding.embed_query("try this out"))
|
||||||
|
|
||||||
|
index_params = (
|
||||||
|
", " + ",".join([f"'{k}={v}'" for k, v in self.config.index_param.items()])
|
||||||
|
if self.config.index_param
|
||||||
|
else ""
|
||||||
|
)
|
||||||
|
schema_ = f"""
|
||||||
|
CREATE TABLE IF NOT EXISTS {self.config.database}.{self.config.table}(
|
||||||
|
{self.config.column_map['id']} String,
|
||||||
|
{self.config.column_map['text']} String,
|
||||||
|
{self.config.column_map['vector']} Array(Float32),
|
||||||
|
{self.config.column_map['metadata']} JSON,
|
||||||
|
CONSTRAINT cons_vec_len CHECK length(\
|
||||||
|
{self.config.column_map['vector']}) = {dim},
|
||||||
|
VECTOR INDEX vidx {self.config.column_map['vector']} \
|
||||||
|
TYPE {self.config.index_type}(\
|
||||||
|
'metric_type={self.config.metric}'{index_params})
|
||||||
|
) ENGINE = MergeTree ORDER BY {self.config.column_map['id']}
|
||||||
|
"""
|
||||||
|
self.dim = dim
|
||||||
|
self.BS = "\\"
|
||||||
|
self.must_escape = ("\\", "'")
|
||||||
|
self.embedding_function = embedding.embed_query
|
||||||
|
self.dist_order = "ASC" if self.config.metric in ["cosine", "l2"] else "DESC"
|
||||||
|
|
||||||
|
# Create a connection to myscale
|
||||||
|
self.client = get_client(
|
||||||
|
host=self.config.host,
|
||||||
|
port=self.config.port,
|
||||||
|
username=self.config.username,
|
||||||
|
password=self.config.password,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
self.client.command("SET allow_experimental_object_type=1")
|
||||||
|
self.client.command(schema_)
|
||||||
|
|
||||||
|
def escape_str(self, value: str) -> str:
|
||||||
|
return "".join(f"{self.BS}{c}" if c in self.must_escape else c for c in value)
|
||||||
|
|
||||||
|
def _build_istr(self, transac: Iterable, column_names: Iterable[str]) -> str:
|
||||||
|
ks = ",".join(column_names)
|
||||||
|
_data = []
|
||||||
|
for n in transac:
|
||||||
|
n = ",".join([f"'{self.escape_str(str(_n))}'" for _n in n])
|
||||||
|
_data.append(f"({n})")
|
||||||
|
i_str = f"""
|
||||||
|
INSERT INTO TABLE
|
||||||
|
{self.config.database}.{self.config.table}({ks})
|
||||||
|
VALUES
|
||||||
|
{','.join(_data)}
|
||||||
|
"""
|
||||||
|
return i_str
|
||||||
|
|
||||||
|
def _insert(self, transac: Iterable, column_names: Iterable[str]) -> None:
|
||||||
|
_i_str = self._build_istr(transac, column_names)
|
||||||
|
self.client.command(_i_str)
|
||||||
|
|
||||||
|
def add_texts(
|
||||||
|
self,
|
||||||
|
texts: Iterable[str],
|
||||||
|
metadatas: Optional[List[dict]] = None,
|
||||||
|
batch_size: int = 32,
|
||||||
|
ids: Optional[Iterable[str]] = None,
|
||||||
|
**kwargs: Any,
|
||||||
|
) -> List[str]:
|
||||||
|
"""Run more texts through the embeddings and add to the vectorstore.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
texts: Iterable of strings to add to the vectorstore.
|
||||||
|
ids: Optional list of ids to associate with the texts.
|
||||||
|
batch_size: Batch size of insertion
|
||||||
|
metadata: Optional column data to be inserted
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of ids from adding the texts into the vectorstore.
|
||||||
|
|
||||||
|
"""
|
||||||
|
# Embed and create the documents
|
||||||
|
ids = ids or [sha1(t.encode("utf-8")).hexdigest() for t in texts]
|
||||||
|
colmap_ = self.config.column_map
|
||||||
|
|
||||||
|
transac = []
|
||||||
|
column_names = {
|
||||||
|
colmap_["id"]: ids,
|
||||||
|
colmap_["text"]: texts,
|
||||||
|
colmap_["vector"]: map(self.embedding_function, texts),
|
||||||
|
}
|
||||||
|
metadatas = metadatas or [{} for _ in texts]
|
||||||
|
column_names[colmap_["metadata"]] = map(json.dumps, metadatas)
|
||||||
|
assert len(set(colmap_) - set(column_names)) >= 0
|
||||||
|
keys, values = zip(*column_names.items())
|
||||||
|
try:
|
||||||
|
t = None
|
||||||
|
for v in self.pgbar(
|
||||||
|
zip(*values), desc="Inserting data...", total=len(metadatas)
|
||||||
|
):
|
||||||
|
assert len(v[keys.index(self.config.column_map["vector"])]) == self.dim
|
||||||
|
transac.append(v)
|
||||||
|
if len(transac) == batch_size:
|
||||||
|
if t:
|
||||||
|
t.join()
|
||||||
|
t = Thread(target=self._insert, args=[transac, keys])
|
||||||
|
t.start()
|
||||||
|
transac = []
|
||||||
|
if len(transac) > 0:
|
||||||
|
if t:
|
||||||
|
t.join()
|
||||||
|
self._insert(transac, keys)
|
||||||
|
return [i for i in ids]
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
|
||||||
|
return []
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_texts(
|
||||||
|
cls,
|
||||||
|
texts: List[str],
|
||||||
|
embedding: Embeddings,
|
||||||
|
metadatas: Optional[List[Dict[Any, Any]]] = None,
|
||||||
|
config: Optional[MyScaleSettings] = None,
|
||||||
|
text_ids: Optional[Iterable[str]] = None,
|
||||||
|
batch_size: int = 32,
|
||||||
|
**kwargs: Any,
|
||||||
|
) -> MyScale:
|
||||||
|
"""Create Myscale wrapper with existing texts
|
||||||
|
|
||||||
|
Args:
|
||||||
|
embedding_function (Embeddings): Function to extract text embedding
|
||||||
|
texts (Iterable[str]): List or tuple of strings to be added
|
||||||
|
config (MyScaleSettings, Optional): Myscale configuration
|
||||||
|
text_ids (Optional[Iterable], optional): IDs for the texts.
|
||||||
|
Defaults to None.
|
||||||
|
batch_size (int, optional): Batchsize when transmitting data to MyScale.
|
||||||
|
Defaults to 32.
|
||||||
|
metadata (List[dict], optional): metadata to texts. Defaults to None.
|
||||||
|
Other keyword arguments will pass into
|
||||||
|
[clickhouse-connect](https://clickhouse.com/docs/en/integrations/python#clickhouse-connect-driver-api)
|
||||||
|
Returns:
|
||||||
|
MyScale Index
|
||||||
|
"""
|
||||||
|
ctx = cls(embedding, config, **kwargs)
|
||||||
|
ctx.add_texts(texts, ids=text_ids, batch_size=batch_size, metadatas=metadatas)
|
||||||
|
return ctx
|
||||||
|
|
||||||
|
def __repr__(self) -> str:
|
||||||
|
"""Text representation for myscale, prints backends, username and schemas.
|
||||||
|
Easy to use with `str(Myscale())`
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
repr: string to show connection info and data schema
|
||||||
|
"""
|
||||||
|
_repr = f"\033[92m\033[1m{self.config.database}.{self.config.table} @ "
|
||||||
|
_repr += f"{self.config.host}:{self.config.port}\033[0m\n\n"
|
||||||
|
_repr += f"\033[1musername: {self.config.username}\033[0m\n\nTable Schema:\n"
|
||||||
|
_repr += "-" * 51 + "\n"
|
||||||
|
for r in self.client.query(
|
||||||
|
f"DESC {self.config.database}.{self.config.table}"
|
||||||
|
).named_results():
|
||||||
|
_repr += (
|
||||||
|
f"|\033[94m{r['name']:24s}\033[0m|\033[96m{r['type']:24s}\033[0m|\n"
|
||||||
|
)
|
||||||
|
_repr += "-" * 51 + "\n"
|
||||||
|
return _repr
|
||||||
|
|
||||||
|
def _build_qstr(
|
||||||
|
self, q_emb: List[float], topk: int, where_str: Optional[str] = None
|
||||||
|
) -> str:
|
||||||
|
q_emb_str = ",".join(map(str, q_emb))
|
||||||
|
if where_str:
|
||||||
|
where_str = f"PREWHERE {where_str}"
|
||||||
|
else:
|
||||||
|
where_str = ""
|
||||||
|
|
||||||
|
q_str = f"""
|
||||||
|
SELECT {self.config.column_map['text']},
|
||||||
|
{self.config.column_map['metadata']}, dist
|
||||||
|
FROM {self.config.database}.{self.config.table}
|
||||||
|
{where_str}
|
||||||
|
ORDER BY distance({self.config.column_map['vector']}, [{q_emb_str}])
|
||||||
|
AS dist {self.dist_order}
|
||||||
|
LIMIT {topk}
|
||||||
|
"""
|
||||||
|
return q_str
|
||||||
|
|
||||||
|
def similarity_search(
|
||||||
|
self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
|
||||||
|
) -> List[Document]:
|
||||||
|
"""Perform a similarity search with MyScale
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query (str): query string
|
||||||
|
k (int, optional): Top K neighbors to retrieve. Defaults to 4.
|
||||||
|
where_str (Optional[str], optional): where condition string.
|
||||||
|
Defaults to None.
|
||||||
|
|
||||||
|
NOTE: Please do not let end-user to fill this and always be aware
|
||||||
|
of SQL injection. When dealing with metadatas, remember to
|
||||||
|
use `{self.metadata_column}.attribute` instead of `attribute`
|
||||||
|
alone. The default name for it is `metadata`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List[Document]: List of Documents
|
||||||
|
"""
|
||||||
|
return self.similarity_search_by_vector(
|
||||||
|
self.embedding_function(query), k, where_str, **kwargs
|
||||||
|
)
|
||||||
|
|
||||||
|
def similarity_search_by_vector(
|
||||||
|
self,
|
||||||
|
embedding: List[float],
|
||||||
|
k: int = 4,
|
||||||
|
where_str: Optional[str] = None,
|
||||||
|
**kwargs: Any,
|
||||||
|
) -> List[Document]:
|
||||||
|
"""Perform a similarity search with MyScale by vectors
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query (str): query string
|
||||||
|
k (int, optional): Top K neighbors to retrieve. Defaults to 4.
|
||||||
|
where_str (Optional[str], optional): where condition string.
|
||||||
|
Defaults to None.
|
||||||
|
|
||||||
|
NOTE: Please do not let end-user to fill this and always be aware
|
||||||
|
of SQL injection. When dealing with metadatas, remember to
|
||||||
|
use `{self.metadata_column}.attribute` instead of `attribute`
|
||||||
|
alone. The default name for it is `metadata`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List[Document]: List of (Document, similarity)
|
||||||
|
"""
|
||||||
|
q_str = self._build_qstr(embedding, k, where_str)
|
||||||
|
try:
|
||||||
|
return [
|
||||||
|
Document(
|
||||||
|
page_content=r[self.config.column_map["text"]],
|
||||||
|
metadata=r[self.config.column_map["metadata"]],
|
||||||
|
)
|
||||||
|
for r in self.client.query(q_str).named_results()
|
||||||
|
]
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
|
||||||
|
return []
|
||||||
|
|
||||||
|
def similarity_search_with_relevance_scores(
|
||||||
|
self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
|
||||||
|
) -> List[Tuple[Document, float]]:
|
||||||
|
"""Perform a similarity search with MyScale
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query (str): query string
|
||||||
|
k (int, optional): Top K neighbors to retrieve. Defaults to 4.
|
||||||
|
where_str (Optional[str], optional): where condition string.
|
||||||
|
Defaults to None.
|
||||||
|
|
||||||
|
NOTE: Please do not let end-user to fill this and always be aware
|
||||||
|
of SQL injection. When dealing with metadatas, remember to
|
||||||
|
use `{self.metadata_column}.attribute` instead of `attribute`
|
||||||
|
alone. The default name for it is `metadata`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List[Document]: List of documents
|
||||||
|
"""
|
||||||
|
q_str = self._build_qstr(self.embedding_function(query), k, where_str)
|
||||||
|
try:
|
||||||
|
return [
|
||||||
|
(
|
||||||
|
Document(
|
||||||
|
page_content=r[self.config.column_map["text"]],
|
||||||
|
metadata=r[self.config.column_map["metadata"]],
|
||||||
|
),
|
||||||
|
r["dist"],
|
||||||
|
)
|
||||||
|
for r in self.client.query(q_str).named_results()
|
||||||
|
]
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
|
||||||
|
return []
|
||||||
|
|
||||||
|
def drop(self) -> None:
|
||||||
|
"""
|
||||||
|
Helper function: Drop data
|
||||||
|
"""
|
||||||
|
self.client.command(
|
||||||
|
f"DROP TABLE IF EXISTS {self.config.database}.{self.config.table}"
|
||||||
|
)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def metadata_column(self) -> str:
|
||||||
|
return self.config.column_map["metadata"]
|
poetry.lock (generated, 12 lines changed)
@ -1055,7 +1055,7 @@ colorama = {version = "*", markers = "platform_system == \"Windows\""}
 name = "clickhouse-connect"
 version = "0.5.20"
 description = "ClickHouse core driver, SqlAlchemy, and Superset libraries"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "~=3.7"
 files = [
@ -3519,7 +3519,7 @@ dev = ["Sphinx (==5.3.0)", "colorama (==0.4.5)", "colorama (==0.4.6)", "freezegu
 name = "lz4"
 version = "4.3.2"
 description = "LZ4 Bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@ -6293,7 +6293,7 @@ dev = ["atomicwrites (==1.2.1)", "attrs (==19.2.0)", "coverage (==6.5.0)", "hatc
 name = "pytz"
 version = "2023.3"
 description = "World timezone definitions, modern and historical"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "*"
 files = [
@ -9212,7 +9212,7 @@ testing = ["big-O", "flake8 (<5)", "jaraco.functools", "jaraco.itertools", "more
 name = "zstandard"
 version = "0.21.0"
 description = "Zstandard bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@ -9268,7 +9268,7 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
 cffi = ["cffi (>=1.11)"]

 [extras]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"]
 cohere = ["cohere"]
 llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"]
 openai = ["openai"]
@ -9277,4 +9277,4 @@ qdrant = ["qdrant-client"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "8b0be7a924d83d9afc5e21e95aa529258a3ae916418e0c1c159732291a615af8"
+content-hash = "da027a1b27f348548ca828c6da40795e2f57a7a7858bdeac1a08573d3e031e12"
@ -34,6 +34,7 @@ jinja2 = {version = "^3", optional = true}
 tiktoken = {version = "^0.3.2", optional = true, python="^3.9"}
 pinecone-client = {version = "^2", optional = true}
 pinecone-text = {version = "^0.4.2", optional = true}
+clickhouse-connect = {version="^0.5.14", optional=true}
 weaviate-client = {version = "^3", optional = true}
 google-api-python-client = {version = "2.70.0", optional = true}
 wolframalpha = {version = "5.0.0", optional = true}
@ -106,6 +107,7 @@ elasticsearch = {extras = ["async"], version = "^8.6.2"}
 redis = "^4.5.4"
 pinecone-client = "^2.2.1"
 pinecone-text = "^0.4.2"
+clickhouse-connect = "^0.5.14"
 pgvector = "^0.1.6"
 transformers = "^4.27.4"
 pandas = "^2.0.0"
@ -142,7 +144,7 @@ llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifes
 qdrant = ["qdrant-client"]
 openai = ["openai"]
 cohere = ["cohere"]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"]

 [tool.ruff]
 select = [
tests/integration_tests/vectorstores/test_myscale.py (new file, 108 lines)

@ -0,0 +1,108 @@
"""Test MyScale functionality."""
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from langchain.docstore.document import Document
|
||||||
|
from langchain.vectorstores import MyScale, MyScaleSettings
|
||||||
|
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
|
||||||
|
|
||||||
|
|
||||||
|
def test_myscale() -> None:
|
||||||
|
"""Test end to end construction and search."""
|
||||||
|
texts = ["foo", "bar", "baz"]
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale"
|
||||||
|
docsearch = MyScale.from_texts(texts, FakeEmbeddings(), config=config)
|
||||||
|
output = docsearch.similarity_search("foo", k=1)
|
||||||
|
assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
|
||||||
|
docsearch.drop()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_myscale_async() -> None:
|
||||||
|
"""Test end to end construction and search."""
|
||||||
|
texts = ["foo", "bar", "baz"]
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale_async"
|
||||||
|
docsearch = MyScale.from_texts(
|
||||||
|
texts=texts, embedding=FakeEmbeddings(), config=config
|
||||||
|
)
|
||||||
|
output = await docsearch.asimilarity_search("foo", k=1)
|
||||||
|
assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
|
||||||
|
docsearch.drop()
|
||||||
|
|
||||||
|
|
||||||
|
def test_myscale_with_metadatas() -> None:
|
||||||
|
"""Test end to end construction and search."""
|
||||||
|
texts = ["foo", "bar", "baz"]
|
||||||
|
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale_with_metadatas"
|
||||||
|
docsearch = MyScale.from_texts(
|
||||||
|
texts=texts,
|
||||||
|
embedding=FakeEmbeddings(),
|
||||||
|
config=config,
|
||||||
|
metadatas=metadatas,
|
||||||
|
)
|
||||||
|
output = docsearch.similarity_search("foo", k=1)
|
||||||
|
assert output == [Document(page_content="foo", metadata={"page": "0"})]
|
||||||
|
docsearch.drop()
|
||||||
|
|
||||||
|
|
||||||
|
def test_myscale_with_metadatas_with_relevance_scores() -> None:
|
||||||
|
"""Test end to end construction and scored search."""
|
||||||
|
texts = ["foo", "bar", "baz"]
|
||||||
|
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale_with_metadatas_with_relevance_scores"
|
||||||
|
docsearch = MyScale.from_texts(
|
||||||
|
texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config
|
||||||
|
)
|
||||||
|
output = docsearch.similarity_search_with_relevance_scores("foo", k=1)
|
||||||
|
assert output[0][0] == Document(page_content="foo", metadata={"page": "0"})
|
||||||
|
docsearch.drop()
|
||||||
|
|
||||||
|
|
||||||
|
def test_myscale_search_filter() -> None:
|
||||||
|
"""Test end to end construction and search with metadata filtering."""
|
||||||
|
texts = ["far", "bar", "baz"]
|
||||||
|
metadatas = [{"first_letter": "{}".format(text[0])} for text in texts]
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale_search_filter"
|
||||||
|
docsearch = MyScale.from_texts(
|
||||||
|
texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config
|
||||||
|
)
|
||||||
|
output = docsearch.similarity_search(
|
||||||
|
"far", k=1, where_str=f"{docsearch.metadata_column}.first_letter='f'"
|
||||||
|
)
|
||||||
|
assert output == [Document(page_content="far", metadata={"first_letter": "f"})]
|
||||||
|
output = docsearch.similarity_search(
|
||||||
|
"bar", k=1, where_str=f"{docsearch.metadata_column}.first_letter='b'"
|
||||||
|
)
|
||||||
|
assert output == [Document(page_content="bar", metadata={"first_letter": "b"})]
|
||||||
|
docsearch.drop()
|
||||||
|
|
||||||
|
|
||||||
|
def test_myscale_with_persistence() -> None:
|
||||||
|
"""Test end to end construction and search, with persistence."""
|
||||||
|
config = MyScaleSettings()
|
||||||
|
config.table = "test_myscale_with_persistence"
|
||||||
|
texts = [
|
||||||
|
"foo",
|
||||||
|
"bar",
|
||||||
|
"baz",
|
||||||
|
]
|
||||||
|
docsearch = MyScale.from_texts(
|
||||||
|
texts=texts, embedding=FakeEmbeddings(), config=config
|
||||||
|
)
|
||||||
|
|
||||||
|
output = docsearch.similarity_search("foo", k=1)
|
||||||
|
assert output == [Document(page_content="foo", metadata={"_dummy": 0})]
|
||||||
|
|
||||||
|
# Get a new VectorStore with same config
|
||||||
|
# it will reuse the table spontaneously
|
||||||
|
# unless you drop it
|
||||||
|
docsearch = MyScale(embedding=FakeEmbeddings(), config=config)
|
||||||
|
output = docsearch.similarity_search("foo", k=1)
|
||||||
|
|
||||||
|
# Clean up
|
||||||
|
docsearch.drop()
|