diff --git a/docs/ecosystem/myscale.md b/docs/ecosystem/myscale.md
new file mode 100644
index 00000000000..696d144ced2
--- /dev/null
+++ b/docs/ecosystem/myscale.md
@@ -0,0 +1,65 @@
+# MyScale
+
+This page covers how to use the MyScale vector database within LangChain.
+It is broken into two parts: installation and setup, and then references to specific MyScale wrappers.
+
+With MyScale, you can manage both structured and unstructured (vectorized) data, and perform joint queries and analytics on both types of data using SQL. Plus, MyScale's cloud-native OLAP architecture, built on top of ClickHouse, enables lightning-fast data processing even on massive datasets.
+
+## Introduction
+
+[Overview of MyScale and high-performance vector search](https://docs.myscale.com/en/overview/)
+
+You can now register on our SaaS and [start a cluster now!](https://docs.myscale.com/en/quickstart/)
+
+If you are also interested in how we managed to integrate SQL and vector search, please refer to [this document](https://docs.myscale.com/en/vector-reference/) for further syntax reference.
+
+We also have a live demo on Hugging Face! Please check out our [Hugging Face space](https://huggingface.co/myscale)! It searches millions of vectors within a blink!
+
+## Installation and Setup
+- Install the Python SDK with `pip install clickhouse-connect`
+
+### Setting up environments
+
+There are two ways to set up parameters for the MyScale index.
+
+1. Environment Variables
+
+    Before you run the app, please set the environment variables with `export`:
+    `export MYSCALE_URL='' MYSCALE_PORT= MYSCALE_USERNAME= MYSCALE_PASSWORD= ...`
+
+    You can easily find your account, password and other info on our SaaS. For details please refer to [this document](https://docs.myscale.com/en/cluster-management/).
+    Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_` and is case insensitive.
+
+2. Create a `MyScaleSettings` object with parameters
+
+
+    ```python
+    from langchain.vectorstores import MyScale, MyScaleSettings
+    config = MyScaleSettings(host="", port=8443, ...)
+    index = MyScale(embedding_function, config)
+    index.add_documents(...)
+    ```
+
+## Wrappers
+
+Supported functions:
+- `add_texts`
+- `add_documents`
+- `from_texts`
+- `from_documents`
+- `similarity_search`
+- `asimilarity_search`
+- `similarity_search_by_vector`
+- `asimilarity_search_by_vector`
+- `similarity_search_with_relevance_scores`
+
+### VectorStore
+
+There exists a wrapper around the MyScale database, allowing you to use it as a vectorstore,
+whether for semantic search or similar example retrieval.
+
+To import this vectorstore:
+```python
+from langchain.vectorstores import MyScale
+```
+
+For a more detailed walkthrough of the MyScale wrapper, see [this notebook](../modules/indexes/vectorstores/examples/myscale.ipynb)
diff --git a/docs/modules/indexes/vectorstores/examples/myscale.ipynb b/docs/modules/indexes/vectorstores/examples/myscale.ipynb
new file mode 100644
index 00000000000..b3ae66dffae
--- /dev/null
+++ b/docs/modules/indexes/vectorstores/examples/myscale.ipynb
@@ -0,0 +1,267 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "683953b3",
+   "metadata": {},
+   "source": [
+    "# MyScale\n",
+    "\n",
+    "This notebook shows how to use functionality related to the MyScale vector database.",
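+    "\n",
+    "\n",
+    "As a quick orientation, here is a minimal sketch of the flow this notebook walks through (the connection values are placeholders; it assumes a running MyScale cluster and an `OPENAI_API_KEY` in your environment):\n",
+    "\n",
+    "```python\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.vectorstores import MyScale, MyScaleSettings\n",
+    "\n",
+    "config = MyScaleSettings(host=\"your-cluster-host\", port=8443,\n",
+    "                         username=\"user\", password=\"password\")\n",
+    "index = MyScale(OpenAIEmbeddings(), config)  # creates the table if needed\n",
+    "index.add_texts([\"hello myscale\"])\n",
+    "print(index.similarity_search(\"hello\", k=1))\n",
+    "```"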
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "aac9563e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "from langchain.vectorstores import MyScale\n",
+    "from langchain.document_loaders import TextLoader"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a9d16fa3",
+   "metadata": {},
+   "source": [
+    "## Setting up environments\n",
+    "\n",
+    "There are two ways to set up parameters for the MyScale index.\n",
+    "\n",
+    "1. Environment Variables\n",
+    "\n",
+    "    Before you run the app, please set the environment variables with `export`:\n",
+    "    `export MYSCALE_URL='' MYSCALE_PORT= MYSCALE_USERNAME= MYSCALE_PASSWORD= ...`\n",
+    "\n",
+    "    You can easily find your account, password and other info on our SaaS. For details please refer to [this document](https://docs.myscale.com/en/cluster-management/).\n",
+    "\n",
+    "    Every attribute under `MyScaleSettings` can be set with the prefix `MYSCALE_` and is case insensitive.\n",
+    "\n",
+    "2. Create a `MyScaleSettings` object with parameters\n",
+    "\n",
+    "\n",
+    "    ```python\n",
+    "    from langchain.vectorstores import MyScale, MyScaleSettings\n",
+    "    config = MyScaleSettings(host=\"\", port=8443, ...)\n",
+    "    index = MyScale(embedding_function, config)\n",
+    "    index.add_documents(...)\n",
+    "    ```",
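+    "\n",
+    "If the `MYSCALE_*` environment variables above are set, you can omit the settings object entirely; a sketch (assuming the variables are exported before the app starts):\n",
+    "\n",
+    "```python\n",
+    "index = MyScale(embedding_function)  # settings are read from the environment\n",
+    "```"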
\n", + "\n", + "Our troops in Iraq and Afghanistan faced many dangers.\n" + ] + } + ], + "source": [ + "print(docs[0].page_content)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e3a8b105", + "metadata": {}, + "source": [ + "## Get connection info and data schema" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69996818", + "metadata": {}, + "outputs": [], + "source": [ + "print(str(docsearch))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f59360c0", + "metadata": {}, + "source": [ + "## Filtering\n", + "\n", + "You can have direct access to myscale SQL where statement. You can write `WHERE` clause following standard SQL.\n", + "\n", + "**NOTE**: Please be aware of SQL injection, this interface must not be directly called by end-user.\n", + "\n", + "If you custimized your `column_map` under your setting, you search with filter like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "232055f6", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Inserting data...: 100%|██████████| 42/42 [00:15<00:00, 2.69it/s]\n" + ] + } + ], + "source": [ + "from langchain.vectorstores import MyScale, MyScaleSettings\n", + "from langchain.document_loaders import TextLoader\n", + "\n", + "loader = TextLoader('../../../state_of_the_union.txt')\n", + "documents = loader.load()\n", + "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "embeddings = OpenAIEmbeddings()\n", + "\n", + "for i, d in enumerate(docs):\n", + " d.metadata = {'doc_id': i}\n", + "\n", + "docsearch = MyScale.from_documents(docs, embeddings)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "ddbcee77", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.252379834651947 {'doc_id': 6, 'some': ''} And I’m taking robus...\n", + "0.25022566318511963 {'doc_id': 1, 'some': ''} Groups of citizens b...\n", + "0.2469480037689209 {'doc_id': 8, 'some': ''} And so many families...\n", + "0.2428302764892578 {'doc_id': 0, 'some': 'metadata'} As Frances Haugen, w...\n" + ] + } + ], + "source": [ + "meta = docsearch.metadata_column\n", + "output = docsearch.similarity_search_with_relevance_scores('What did the president say about Ketanji Brown Jackson?', \n", + " k=4, where_str=f\"{meta}.doc_id<10\")\n", + "for d, dist in output:\n", + " print(dist, d.metadata, d.page_content[:20] + '...')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "a359ed74", + "metadata": {}, + "source": [ + "## Deleting your data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb6a9d36", + "metadata": {}, + "outputs": [], + "source": [ + "docsearch.drop()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48dbd8e0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/reference/integrations.md b/docs/reference/integrations.md index e487ae49185..8e57ee69ec6 100644 --- 
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a359ed74",
+   "metadata": {},
+   "source": [
+    "## Deleting your data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb6a9d36",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docsearch.drop()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48dbd8e0",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/reference/integrations.md b/docs/reference/integrations.md
index e487ae49185..8e57ee69ec6 100644
--- a/docs/reference/integrations.md
+++ b/docs/reference/integrations.md
@@ -45,6 +45,8 @@ The following use cases require specific installs and api keys:
   - Set up Elasticsearch backend. If you want to do locally, [this](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/getting-started.html) is a good guide.
 - _FAISS_:
   - Install requirements with `pip install faiss` for Python 3.7 and `pip install faiss-cpu` for Python 3.10+.
+- _MyScale_:
+  - Install requirements with `pip install clickhouse-connect`. For documentation, please refer to [this document](https://docs.myscale.com/en/overview/).
 - _Manifest_:
   - Install requirements with `pip install manifest-ml` (Note: this is only available in Python 3.8+ currently).
 - _OpenSearch_:
diff --git a/langchain/vectorstores/__init__.py b/langchain/vectorstores/__init__.py
index 55b317cb46a..30d1ca7ecdc 100644
--- a/langchain/vectorstores/__init__.py
+++ b/langchain/vectorstores/__init__.py
@@ -8,6 +8,7 @@ from langchain.vectorstores.deeplake import DeepLake
 from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
 from langchain.vectorstores.faiss import FAISS
 from langchain.vectorstores.milvus import Milvus
+from langchain.vectorstores.myscale import MyScale, MyScaleSettings
 from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch
 from langchain.vectorstores.pinecone import Pinecone
 from langchain.vectorstores.qdrant import Qdrant
@@ -29,6 +30,8 @@ __all__ = [
     "AtlasDB",
     "DeepLake",
     "Annoy",
+    "MyScale",
+    "MyScaleSettings",
     "SupabaseVectorStore",
     "AnalyticDB",
 ]
diff --git a/langchain/vectorstores/myscale.py b/langchain/vectorstores/myscale.py
new file mode 100644
index 00000000000..3ae8d275dbf
--- /dev/null
+++ b/langchain/vectorstores/myscale.py
@@ -0,0 +1,433 @@
+"""Wrapper around MyScale vector database."""
+from __future__ import annotations
+
+import json
+import logging
+from hashlib import sha1
+from threading import Thread
+from typing import Any, Dict, Iterable, List, Optional, Tuple
+
+from pydantic import BaseSettings
+
+from langchain.docstore.document import Document
+from langchain.embeddings.base import Embeddings
+from langchain.vectorstores.base import VectorStore
+
+logger = logging.getLogger()
+
+
+def has_mul_sub_str(s: str, *args: Any) -> bool:
+    """Check if a string contains all of the given substrings."""
+    for a in args:
+        if a not in s:
+            return False
+    return True
+
+
+class MyScaleSettings(BaseSettings):
+    """MyScale Client Configuration
+
+    Attributes:
+        host (str) : A URL to connect to the MyScale backend.
+            Defaults to 'localhost'.
+        port (int) : URL port to connect with HTTP. Defaults to 8443.
+        username (str) : Username to login. Defaults to None.
+        password (str) : Password to login. Defaults to None.
+        index_type (str) : Index type string.
+        index_param (dict) : Index build parameters.
+        database (str) : Database name to find the table. Defaults to 'default'.
+        table (str) : Table name to operate on.
+            Defaults to 'langchain'.
+        metric (str) : Metric to compute distance,
+            supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'.
+        column_map (Dict) : Column type map to project column names onto langchain
+            semantics. Must have the keys `text`, `id`, and `vector`, and must be
+            the same size as the number of columns. For example:
+            .. code-block:: python
+                {
+                    'id': 'text_id',
+                    'vector': 'text_embedding',
+                    'text': 'text_plain',
+                    'metadata': 'metadata_dictionary_in_json',
+                }
+
+            Defaults to the identity map.
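+
+        Every attribute can also be set through environment variables with the
+        ``MYSCALE_`` prefix (case insensitive), optionally loaded from a ``.env``
+        file; for example, ``export MYSCALE_HOST=...`` sets ``host``.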
+ """ + + host: str = "localhost" + port: int = 8443 + + username: Optional[str] = None + password: Optional[str] = None + + index_type: str = "IVFFLAT" + index_param: Optional[Dict[str, str]] = None + + column_map: Dict[str, str] = { + "id": "id", + "text": "text", + "vector": "vector", + "metadata": "metadata", + } + + database: str = "default" + table: str = "langchain" + metric: str = "cosine" + + def __getitem__(self, item: str) -> Any: + return getattr(self, item) + + class Config: + env_file = ".env" + env_prefix = "myscale_" + env_file_encoding = "utf-8" + + +class MyScale(VectorStore): + """Wrapper around MyScale vector database + + You need a `clickhouse-connect` python package, and a valid account + to connect to MyScale. + + MyScale can not only search with simple vector indexes, + it also supports complex query with multiple conditions, + constraints and even sub-queries. + + For more information, please visit + [myscale official site](https://docs.myscale.com/en/overview/) + """ + + def __init__( + self, + embedding: Embeddings, + config: Optional[MyScaleSettings] = None, + **kwargs: Any, + ) -> None: + """MyScale Wrapper to LangChain + + embedding_function (Embeddings): + config (MyScaleSettings): Configuration to MyScale Client + Other keyword arguments will pass into + [clickhouse-connect](https://docs.myscale.com/) + """ + try: + from clickhouse_connect import get_client + except ImportError: + raise ValueError( + "Could not import clickhouse connect python package. " + "Please install it with `pip install clickhouse-connect`." + ) + try: + from tqdm import tqdm + + self.pgbar = tqdm + except ImportError: + # Just in case if tqdm is not installed + self.pgbar = lambda x: x + super().__init__() + if config is not None: + self.config = config + else: + self.config = MyScaleSettings() + assert self.config + assert self.config.host and self.config.port + assert ( + self.config.column_map + and self.config.database + and self.config.table + and self.config.metric + ) + for k in ["id", "vector", "text", "metadata"]: + assert k in self.config.column_map + assert self.config.metric in ["ip", "cosine", "l2"] + + # initialize the schema + dim = len(embedding.embed_query("try this out")) + + index_params = ( + ", " + ",".join([f"'{k}={v}'" for k, v in self.config.index_param.items()]) + if self.config.index_param + else "" + ) + schema_ = f""" + CREATE TABLE IF NOT EXISTS {self.config.database}.{self.config.table}( + {self.config.column_map['id']} String, + {self.config.column_map['text']} String, + {self.config.column_map['vector']} Array(Float32), + {self.config.column_map['metadata']} JSON, + CONSTRAINT cons_vec_len CHECK length(\ + {self.config.column_map['vector']}) = {dim}, + VECTOR INDEX vidx {self.config.column_map['vector']} \ + TYPE {self.config.index_type}(\ + 'metric_type={self.config.metric}'{index_params}) + ) ENGINE = MergeTree ORDER BY {self.config.column_map['id']} + """ + self.dim = dim + self.BS = "\\" + self.must_escape = ("\\", "'") + self.embedding_function = embedding.embed_query + self.dist_order = "ASC" if self.config.metric in ["cosine", "l2"] else "DESC" + + # Create a connection to myscale + self.client = get_client( + host=self.config.host, + port=self.config.port, + username=self.config.username, + password=self.config.password, + **kwargs, + ) + self.client.command("SET allow_experimental_object_type=1") + self.client.command(schema_) + + def escape_str(self, value: str) -> str: + return "".join(f"{self.BS}{c}" if c in self.must_escape else c for c in 
+
+    def _build_istr(self, transac: Iterable, column_names: Iterable[str]) -> str:
+        ks = ",".join(column_names)
+        _data = []
+        for n in transac:
+            n = ",".join([f"'{self.escape_str(str(_n))}'" for _n in n])
+            _data.append(f"({n})")
+        i_str = f"""
+                INSERT INTO TABLE
+                    {self.config.database}.{self.config.table}({ks})
+                VALUES
+                {','.join(_data)}
+                """
+        return i_str
+
+    def _insert(self, transac: Iterable, column_names: Iterable[str]) -> None:
+        _i_str = self._build_istr(transac, column_names)
+        self.client.command(_i_str)
+
+    def add_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[dict]] = None,
+        batch_size: int = 32,
+        ids: Optional[Iterable[str]] = None,
+        **kwargs: Any,
+    ) -> List[str]:
+        """Run more texts through the embeddings and add to the vectorstore.
+
+        Args:
+            texts: Iterable of strings to add to the vectorstore.
+            ids: Optional list of ids to associate with the texts.
+            batch_size: Batch size of insertion.
+            metadatas: Optional list of metadata dicts to attach to the texts.
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
+
+        """
+        # Embed and create the documents
+        ids = ids or [sha1(t.encode("utf-8")).hexdigest() for t in texts]
+        colmap_ = self.config.column_map
+
+        transac = []
+        column_names = {
+            colmap_["id"]: ids,
+            colmap_["text"]: texts,
+            colmap_["vector"]: map(self.embedding_function, texts),
+        }
+        metadatas = metadatas or [{} for _ in texts]
+        column_names[colmap_["metadata"]] = map(json.dumps, metadatas)
+        assert len(set(colmap_) - set(column_names)) >= 0
+        keys, values = zip(*column_names.items())
+        try:
+            t = None
+            for v in self.pgbar(
+                zip(*values), desc="Inserting data...", total=len(metadatas)
+            ):
+                assert len(v[keys.index(self.config.column_map["vector"])]) == self.dim
+                transac.append(v)
+                if len(transac) == batch_size:
+                    if t:
+                        t.join()
+                    t = Thread(target=self._insert, args=[transac, keys])
+                    t.start()
+                    transac = []
+            if len(transac) > 0:
+                if t:
+                    t.join()
+                self._insert(transac, keys)
+            return [i for i in ids]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    @classmethod
+    def from_texts(
+        cls,
+        texts: List[str],
+        embedding: Embeddings,
+        metadatas: Optional[List[Dict[Any, Any]]] = None,
+        config: Optional[MyScaleSettings] = None,
+        text_ids: Optional[Iterable[str]] = None,
+        batch_size: int = 32,
+        **kwargs: Any,
+    ) -> MyScale:
+        """Create MyScale wrapper with existing texts
+
+        Args:
+            texts (Iterable[str]): List or tuple of strings to be added.
+            embedding (Embeddings): Function to extract text embeddings.
+            config (MyScaleSettings, optional): MyScale configuration.
+            text_ids (Optional[Iterable], optional): IDs for the texts.
+                Defaults to None.
+            batch_size (int, optional): Batch size when transmitting data to MyScale.
+                Defaults to 32.
+            metadatas (List[dict], optional): Metadata for the texts. Defaults to None.
+            Other keyword arguments will pass into
+                [clickhouse-connect](https://clickhouse.com/docs/en/integrations/python#clickhouse-connect-driver-api)
+        Returns:
+            MyScale Index
+        """
+        ctx = cls(embedding, config, **kwargs)
+        ctx.add_texts(texts, ids=text_ids, batch_size=batch_size, metadatas=metadatas)
+        return ctx
+
+    def __repr__(self) -> str:
+        """Text representation for MyScale; prints backends, username and schemas.
+        Easy to use with `str(MyScale())`.
+
+        Returns:
+            repr: string to show connection info and data schema
+        """
+        _repr = f"\033[92m\033[1m{self.config.database}.{self.config.table} @ "
+        _repr += f"{self.config.host}:{self.config.port}\033[0m\n\n"
+        _repr += f"\033[1musername: {self.config.username}\033[0m\n\nTable Schema:\n"
+        _repr += "-" * 51 + "\n"
+        for r in self.client.query(
+            f"DESC {self.config.database}.{self.config.table}"
+        ).named_results():
+            _repr += (
+                f"|\033[94m{r['name']:24s}\033[0m|\033[96m{r['type']:24s}\033[0m|\n"
+            )
+        _repr += "-" * 51 + "\n"
+        return _repr
+
+    def _build_qstr(
+        self, q_emb: List[float], topk: int, where_str: Optional[str] = None
+    ) -> str:
+        q_emb_str = ",".join(map(str, q_emb))
+        if where_str:
+            where_str = f"PREWHERE {where_str}"
+        else:
+            where_str = ""
+
+        q_str = f"""
+            SELECT {self.config.column_map['text']},
+                {self.config.column_map['metadata']}, dist
+            FROM {self.config.database}.{self.config.table}
+            {where_str}
+            ORDER BY distance({self.config.column_map['vector']}, [{q_emb_str}])
+                AS dist {self.dist_order}
+            LIMIT {topk}
+            """
+        return q_str
+
+    def similarity_search(
+        self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
+    ) -> List[Document]:
+        """Perform a similarity search with MyScale
+
+        Args:
+            query (str): query string
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                Defaults to None.
+
+            NOTE: Please do not let the end user fill this out, and always be aware
+                of SQL injection. When dealing with metadatas, remember to
+                use `{self.metadata_column}.attribute` instead of `attribute`
+                alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Document]: List of Documents
+        """
+        return self.similarity_search_by_vector(
+            self.embedding_function(query), k, where_str, **kwargs
+        )
+
+    def similarity_search_by_vector(
+        self,
+        embedding: List[float],
+        k: int = 4,
+        where_str: Optional[str] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        """Perform a similarity search with MyScale by vectors
+
+        Args:
+            embedding (List[float]): query vector
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                Defaults to None.
+
+            NOTE: Please do not let the end user fill this out, and always be aware
+                of SQL injection. When dealing with metadatas, remember to
+                use `{self.metadata_column}.attribute` instead of `attribute`
+                alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Document]: List of Documents
+        """
+        q_str = self._build_qstr(embedding, k, where_str)
+        try:
+            return [
+                Document(
+                    page_content=r[self.config.column_map["text"]],
+                    metadata=r[self.config.column_map["metadata"]],
+                )
+                for r in self.client.query(q_str).named_results()
+            ]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    def similarity_search_with_relevance_scores(
+        self, query: str, k: int = 4, where_str: Optional[str] = None, **kwargs: Any
+    ) -> List[Tuple[Document, float]]:
+        """Perform a similarity search with MyScale
+
+        Args:
+            query (str): query string
+            k (int, optional): Top K neighbors to retrieve. Defaults to 4.
+            where_str (Optional[str], optional): where condition string.
+                Defaults to None.
+
+            NOTE: Please do not let the end user fill this out, and always be aware
+                of SQL injection. When dealing with metadatas, remember to
+                use `{self.metadata_column}.attribute` instead of `attribute`
+                alone. The default name for it is `metadata`.
+
+        Returns:
+            List[Tuple[Document, float]]: List of (document, relevance score) pairs
+        """
+        q_str = self._build_qstr(self.embedding_function(query), k, where_str)
+        try:
+            return [
+                (
+                    Document(
+                        page_content=r[self.config.column_map["text"]],
+                        metadata=r[self.config.column_map["metadata"]],
+                    ),
+                    r["dist"],
+                )
+                for r in self.client.query(q_str).named_results()
+            ]
+        except Exception as e:
+            logger.error(f"\033[91m\033[1m{type(e)}\033[0m \033[95m{str(e)}\033[0m")
+            return []
+
+    def drop(self) -> None:
+        """
+        Helper function: Drop data
+        """
+        self.client.command(
+            f"DROP TABLE IF EXISTS {self.config.database}.{self.config.table}"
+        )
+
+    @property
+    def metadata_column(self) -> str:
+        return self.config.column_map["metadata"]
diff --git a/poetry.lock b/poetry.lock
index 920a6fce159..4851247223c 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -1055,7 +1055,7 @@ colorama = {version = "*", markers = "platform_system == \"Windows\""}
 name = "clickhouse-connect"
 version = "0.5.20"
 description = "ClickHouse core driver, SqlAlchemy, and Superset libraries"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "~=3.7"
 files = [
@@ -3519,7 +3519,7 @@ dev = ["Sphinx (==5.3.0)", "colorama (==0.4.5)", "colorama (==0.4.6)", "freezegu
 name = "lz4"
 version = "4.3.2"
 description = "LZ4 Bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@@ -6293,7 +6293,7 @@ dev = ["atomicwrites (==1.2.1)", "attrs (==19.2.0)", "coverage (==6.5.0)", "hatc
 name = "pytz"
 version = "2023.3"
 description = "World timezone definitions, modern and historical"
-category = "dev"
+category = "main"
 optional = false
 python-versions = "*"
 files = [
@@ -9212,7 +9212,7 @@ testing = ["big-O", "flake8 (<5)", "jaraco.functools", "jaraco.itertools", "more
 name = "zstandard"
 version = "0.21.0"
 description = "Zstandard bindings for Python"
-category = "dev"
+category = "main"
 optional = false
 python-versions = ">=3.7"
 files = [
@@ -9268,7 +9268,7 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
 cffi = ["cffi (>=1.11)"]
 
 [extras]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"]
 cohere = ["cohere"]
 llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub",
"manifest-ml", "torch", "transformers"] openai = ["openai"] @@ -9277,4 +9277,4 @@ qdrant = ["qdrant-client"] [metadata] lock-version = "2.0" python-versions = ">=3.8.1,<4.0" -content-hash = "8b0be7a924d83d9afc5e21e95aa529258a3ae916418e0c1c159732291a615af8" +content-hash = "da027a1b27f348548ca828c6da40795e2f57a7a7858bdeac1a08573d3e031e12" diff --git a/pyproject.toml b/pyproject.toml index 5e554660391..d067dde5476 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -34,6 +34,7 @@ jinja2 = {version = "^3", optional = true} tiktoken = {version = "^0.3.2", optional = true, python="^3.9"} pinecone-client = {version = "^2", optional = true} pinecone-text = {version = "^0.4.2", optional = true} +clickhouse-connect = {version="^0.5.14", optional=true} weaviate-client = {version = "^3", optional = true} google-api-python-client = {version = "2.70.0", optional = true} wolframalpha = {version = "5.0.0", optional = true} @@ -106,6 +107,7 @@ elasticsearch = {extras = ["async"], version = "^8.6.2"} redis = "^4.5.4" pinecone-client = "^2.2.1" pinecone-text = "^0.4.2" +clickhouse-connect = "^0.5.14" pgvector = "^0.1.6" transformers = "^4.27.4" pandas = "^2.0.0" @@ -142,7 +144,7 @@ llms = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "manifes qdrant = ["qdrant-client"] openai = ["openai"] cohere = ["cohere"] -all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity"] +all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence_transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect"] [tool.ruff] select = [ diff --git a/tests/integration_tests/vectorstores/test_myscale.py b/tests/integration_tests/vectorstores/test_myscale.py new file mode 100644 index 00000000000..0ed72742462 --- /dev/null +++ b/tests/integration_tests/vectorstores/test_myscale.py @@ -0,0 +1,108 @@ +"""Test MyScale functionality.""" +import pytest + +from langchain.docstore.document import Document +from langchain.vectorstores import MyScale, MyScaleSettings +from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings + + +def test_myscale() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + config = MyScaleSettings() + config.table = "test_myscale" + docsearch = MyScale.from_texts(texts, FakeEmbeddings(), config=config) + output = docsearch.similarity_search("foo", k=1) + assert output == [Document(page_content="foo", 
metadata={"_dummy": 0})] + docsearch.drop() + + +@pytest.mark.asyncio +async def test_myscale_async() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + config = MyScaleSettings() + config.table = "test_myscale_async" + docsearch = MyScale.from_texts( + texts=texts, embedding=FakeEmbeddings(), config=config + ) + output = await docsearch.asimilarity_search("foo", k=1) + assert output == [Document(page_content="foo", metadata={"_dummy": 0})] + docsearch.drop() + + +def test_myscale_with_metadatas() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + config = MyScaleSettings() + config.table = "test_myscale_with_metadatas" + docsearch = MyScale.from_texts( + texts=texts, + embedding=FakeEmbeddings(), + config=config, + metadatas=metadatas, + ) + output = docsearch.similarity_search("foo", k=1) + assert output == [Document(page_content="foo", metadata={"page": "0"})] + docsearch.drop() + + +def test_myscale_with_metadatas_with_relevance_scores() -> None: + """Test end to end construction and scored search.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + config = MyScaleSettings() + config.table = "test_myscale_with_metadatas_with_relevance_scores" + docsearch = MyScale.from_texts( + texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config + ) + output = docsearch.similarity_search_with_relevance_scores("foo", k=1) + assert output[0][0] == Document(page_content="foo", metadata={"page": "0"}) + docsearch.drop() + + +def test_myscale_search_filter() -> None: + """Test end to end construction and search with metadata filtering.""" + texts = ["far", "bar", "baz"] + metadatas = [{"first_letter": "{}".format(text[0])} for text in texts] + config = MyScaleSettings() + config.table = "test_myscale_search_filter" + docsearch = MyScale.from_texts( + texts=texts, embedding=FakeEmbeddings(), metadatas=metadatas, config=config + ) + output = docsearch.similarity_search( + "far", k=1, where_str=f"{docsearch.metadata_column}.first_letter='f'" + ) + assert output == [Document(page_content="far", metadata={"first_letter": "f"})] + output = docsearch.similarity_search( + "bar", k=1, where_str=f"{docsearch.metadata_column}.first_letter='b'" + ) + assert output == [Document(page_content="bar", metadata={"first_letter": "b"})] + docsearch.drop() + + +def test_myscale_with_persistence() -> None: + """Test end to end construction and search, with persistence.""" + config = MyScaleSettings() + config.table = "test_myscale_with_persistence" + texts = [ + "foo", + "bar", + "baz", + ] + docsearch = MyScale.from_texts( + texts=texts, embedding=FakeEmbeddings(), config=config + ) + + output = docsearch.similarity_search("foo", k=1) + assert output == [Document(page_content="foo", metadata={"_dummy": 0})] + + # Get a new VectorStore with same config + # it will reuse the table spontaneously + # unless you drop it + docsearch = MyScale(embedding=FakeEmbeddings(), config=config) + output = docsearch.similarity_search("foo", k=1) + + # Clean up + docsearch.drop()