Add integration for Timescale Vector(Postgres) (#10650)

**Description:**
This commit adds a vector store for the Postgres-based vector database
(`TimescaleVector`).

Timescale Vector(https://www.timescale.com/ai) is PostgreSQL++ for AI
applications. It enables you to efficiently store and query billions of
vector embeddings in `PostgreSQL`:
- Enhances `pgvector` with faster and more accurate similarity search on
1B+ vectors via DiskANN inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based
partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and
relational data.

Timescale Vector scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata,
vector embeddings, and time-series data in a single database.
- Benefits from rock-solid PostgreSQL foundation with enterprise-grade
feature liked streaming backups and replication, high-availability and
row-level security.
- Enables a worry-free experience with enterprise-grade security and
compliance.

Timescale Vector is available on Timescale, the cloud PostgreSQL
platform. (There is no self-hosted version at this time.) LangChain
users get a 90-day free trial for Timescale Vector.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Avthar Sewrathan <avthar@timescale.com>
This commit is contained in:
Matvey Arye 2023-09-21 09:33:37 -05:00 committed by GitHub
parent 55570e54e1
commit 6e02c45ca4
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
10 changed files with 3863 additions and 470 deletions

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,534 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "13afcae7",
"metadata": {},
"source": [
"# Timescale Vector (Postgres) self-querying \n",
"\n",
"[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications. It enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.\n",
"\n",
"This notebook shows how to use the Postgres vector database (`TimescaleVector`) to perform self-querying. In the notebook we'll demo the `SelfQueryRetriever` wrapped around a TimescaleVector vector store. \n",
"\n",
"## What is Timescale Vector?\n",
"**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**\n",
"\n",
"Timescale Vector enables you to efficiently store and query millions of vector embeddings in `PostgreSQL`.\n",
"- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via DiskANN inspired indexing algorithm.\n",
"- Enables fast time-based vector search via automatic time-based partitioning and indexing.\n",
"- Provides a familiar SQL interface for querying vector embeddings and relational data.\n",
"\n",
"Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:\n",
"- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.\n",
"- Benefits from rock-solid PostgreSQL foundation with enterprise-grade feature liked streaming backups and replication, high-availability and row-level security.\n",
"- Enables a worry-free experience with enterprise-grade security and compliance.\n",
"\n",
"## How to access Timescale Vector\n",
"Timescale Vector is available on [Timescale](https://www.timescale.com/ai), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
"\n",
"LangChain users get a 90-day free trial for Timescale Vector.\n",
"- To get started, [signup](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!\n",
"- See the [Timescale Vector explainer blog](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for more details and performance benchmarks.\n",
"- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in python.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a TimescaleVector vectorstore\n",
"First we'll want to create a Timescale Vector vectorstore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `timescale-vector` package."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "63a8af5b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install lark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "22431060-52c4-48a7-a97b-9f542b8b0928",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install timescale-vector "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "83811610-7df3-4ede-b268-68a6a83ba9e2",
"metadata": {},
"source": [
"In this example, we'll use `OpenAIEmbeddings`, so let's load your OpenAI API key."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dd01b61b-7d32-4a55-85d6-b2d2d4f18840",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Get openAI api key by reading local .env file\n",
"# The .env file should contain a line starting with `OPENAI_API_KEY=sk-`\n",
"import os\n",
"from dotenv import load_dotenv, find_dotenv\n",
"_ = load_dotenv(find_dotenv())\n",
"\n",
"OPENAI_API_KEY = os.environ['OPENAI_API_KEY']\n",
"# Alternatively, use getpass to enter the key in a prompt\n",
"#import os\n",
"#import getpass\n",
"#os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "766e9c4b",
"metadata": {},
"source": [
"To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet or `.env` file you downloaded after creating a new database. \n",
"\n",
"If you haven't already, [signup for Timescale](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), and create a new database.\n",
"\n",
"The URI will look something like this: `postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6bd6877e",
"metadata": {},
"outputs": [],
"source": [
"# Get the service url by reading local .env file\n",
"# The .env file should contain a line starting with `TIMESCALE_SERVICE_URL=postgresql://`\n",
"_ = load_dotenv(find_dotenv())\n",
"TIMESCALE_SERVICE_URL = os.environ[\"TIMESCALE_SERVICE_URL\"]\n",
"\n",
"# Alternatively, use getpass to enter the key in a prompt\n",
"#import os\n",
"#import getpass\n",
"#TIMESCALE_SERVICE_URL = getpass.getpass(\"Timescale Service URL:\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "cb4a5787",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores.timescalevector import TimescaleVector\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a4f863f5",
"metadata": {},
"source": [
"Here's the sample documents we'll use for this demo. The data is about movies, and has both content and metadata fields with information about particular movie."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "bcbe04d9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = [\n",
" Document(\n",
" page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
" metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
" metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n",
" ),\n",
" Document(\n",
" page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
" metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n",
" ),\n",
" Document(\n",
" page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
" metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n",
" ),\n",
" Document(\n",
" page_content=\"Toys come alive and have a blast doing so\",\n",
" metadata={\"year\": 1995, \"genre\": \"animated\"},\n",
" ),\n",
" Document(\n",
" page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
" metadata={\n",
" \"year\": 1979,\n",
" \"rating\": 9.9,\n",
" \"director\": \"Andrei Tarkovsky\",\n",
" \"genre\": \"science fiction\",\n",
" \"rating\": 9.9,\n",
" },\n",
" ),\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7d0d771e",
"metadata": {},
"source": [
"Finally, we'll create our Timescale Vector vectorstore. Note that the collection name will be the name of the PostgreSQL table in which the documents are stored in."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2428d1ba",
"metadata": {},
"outputs": [],
"source": [
"COLLECTION_NAME = \"langchain_self_query_demo\"\n",
"vectorstore = TimescaleVector.from_documents(\n",
" embedding=embeddings,\n",
" documents=docs,\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=TIMESCALE_SERVICE_URL,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5ecaab6d",
"metadata": {},
"source": [
"## Creating our self-querying retriever\n",
"Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "86e34dbf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Give LLM info about the metadata fields\n",
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"genre\",\n",
" description=\"The genre of the movie\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"year\",\n",
" description=\"The year the movie was released\",\n",
" type=\"integer\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"director\",\n",
" description=\"The name of the movie director\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
" ),\n",
"]\n",
"document_content_description = \"Brief summary of a movie\"\n",
"\n",
"# Instantiate the self-query retriever from an LLM\n",
"llm = OpenAI(temperature=0)\n",
"retriever = SelfQueryRetriever.from_llm(\n",
" llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ea9df8d4",
"metadata": {},
"source": [
"## Self Querying Retrieval with Timescale Vector\n",
"And now we can try actually using our retriever!\n",
"\n",
"Run the queries below and note how you can specify a query, filter, composite filter (filters with AND, OR) in natural language and the self-query retriever will translate that query into SQL and perform the search on the Timescale Vector (Postgres) vectorstore.\n",
"\n",
"This illustrates the power of the self-query retriever. You can use it to perform complex searches over your vectorstore without you or your users having to write any SQL directly!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "38a126e9",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/libs/langchain/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='dinosaur' filter=None limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
" Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
" Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n",
" Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example only specifies a relevant query\n",
"retriever.get_relevant_documents(\"What are some movies about dinosaurs\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "fc3f1e6e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'}),\n",
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'})]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example only specifies a filter\n",
"retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b19d4da0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'}),\n",
" Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'})]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a query and a filter\n",
"retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f900e40e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='rating', value=8.5), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction')]) limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'})]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a composite filter\n",
"retriever.get_relevant_documents(\n",
" \"What's a highly rated (above 8.5) science fiction film?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "12a51522",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')]) limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a query and composite filter\n",
"retriever.get_relevant_documents(\n",
" \"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\"\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51",
"metadata": {},
"source": [
"### Filter k\n",
"\n",
"We can also use the self query retriever to specify `k`: the number of documents to fetch.\n",
"\n",
"We can do this by passing `enable_limit=True` to the constructor."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "bff36b88-b506-4877-9c63-e5a1a8d78e64",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"retriever = SelfQueryRetriever.from_llm(\n",
" llm,\n",
" vectorstore,\n",
" document_content_description,\n",
" metadata_field_info,\n",
" enable_limit=True,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "2758d229-4f97-499c-819f-888acaf8ee10",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='dinosaur' filter=None limit=2\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
" Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7})]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a query with a LIMIT value\n",
"retriever.get_relevant_documents(\"what are two movies about dinosaurs\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -18,6 +18,7 @@ from langchain.retrievers.self_query.pinecone import PineconeTranslator
from langchain.retrievers.self_query.qdrant import QdrantTranslator
from langchain.retrievers.self_query.redis import RedisTranslator
from langchain.retrievers.self_query.supabase import SupabaseVectorTranslator
from langchain.retrievers.self_query.timescalevector import TimescaleVectorTranslator
from langchain.retrievers.self_query.vectara import VectaraTranslator
from langchain.retrievers.self_query.weaviate import WeaviateTranslator
from langchain.schema import BaseRetriever, Document
@ -33,6 +34,7 @@ from langchain.vectorstores import (
Qdrant,
Redis,
SupabaseVectorStore,
TimescaleVector,
Vectara,
VectorStore,
Weaviate,
@ -53,6 +55,7 @@ def _get_builtin_translator(vectorstore: VectorStore) -> Visitor:
ElasticsearchStore: ElasticsearchTranslator,
Milvus: MilvusTranslator,
SupabaseVectorStore: SupabaseVectorTranslator,
TimescaleVector: TimescaleVectorTranslator,
}
if isinstance(vectorstore, Qdrant):
return QdrantTranslator(metadata_key=vectorstore.metadata_payload_key)

View File

@ -0,0 +1,84 @@
from __future__ import annotations
from typing import TYPE_CHECKING, Tuple, Union
from langchain.chains.query_constructor.ir import (
Comparator,
Comparison,
Operation,
Operator,
StructuredQuery,
Visitor,
)
if TYPE_CHECKING:
from timescale_vector import client
class TimescaleVectorTranslator(Visitor):
"""Translate the internal query language elements to valid filters."""
allowed_operators = [Operator.AND, Operator.OR, Operator.NOT]
"""Subset of allowed logical operators."""
allowed_comparators = [
Comparator.EQ,
Comparator.GT,
Comparator.GTE,
Comparator.LT,
Comparator.LTE,
]
COMPARATOR_MAP = {
Comparator.EQ: "==",
Comparator.GT: ">",
Comparator.GTE: ">=",
Comparator.LT: "<",
Comparator.LTE: "<=",
}
OPERATOR_MAP = {Operator.AND: "AND", Operator.OR: "OR", Operator.NOT: "NOT"}
def _format_func(self, func: Union[Operator, Comparator]) -> str:
self._validate_func(func)
if isinstance(func, Operator):
value = self.OPERATOR_MAP[func.value] # type: ignore
elif isinstance(func, Comparator):
value = self.COMPARATOR_MAP[func.value] # type: ignore
return f"{value}"
def visit_operation(self, operation: Operation) -> client.Predicates:
try:
from timescale_vector import client
except ImportError as e:
raise ImportError(
"Cannot import timescale-vector. Please install with `pip install "
"timescale-vector`."
) from e
args = [arg.accept(self) for arg in operation.arguments]
return client.Predicates(*args, operator=self._format_func(operation.operator))
def visit_comparison(self, comparison: Comparison) -> client.Predicates:
try:
from timescale_vector import client
except ImportError as e:
raise ImportError(
"Cannot import timescale-vector. Please install with `pip install "
"timescale-vector`."
) from e
return client.Predicates(
(
comparison.attribute,
self._format_func(comparison.comparator),
comparison.value,
)
)
def visit_structured_query(
self, structured_query: StructuredQuery
) -> Tuple[str, dict]:
if structured_query.filter is None:
kwargs = {}
else:
kwargs = {"predicates": structured_query.filter.accept(self)}
return structured_query.query, kwargs

View File

@ -70,6 +70,7 @@ from langchain.vectorstores.supabase import SupabaseVectorStore
from langchain.vectorstores.tair import Tair
from langchain.vectorstores.tencentvectordb import TencentVectorDB
from langchain.vectorstores.tigris import Tigris
from langchain.vectorstores.timescalevector import TimescaleVector
from langchain.vectorstores.typesense import Typesense
from langchain.vectorstores.usearch import USearch
from langchain.vectorstores.vald import Vald
@ -135,6 +136,7 @@ __all__ = [
"SupabaseVectorStore",
"Tair",
"Tigris",
"TimescaleVector",
"Typesense",
"USearch",
"Vald",

View File

@ -0,0 +1,871 @@
"""VectorStore wrapper around a Postgres-TimescaleVector database."""
from __future__ import annotations
import enum
import logging
import uuid
from datetime import timedelta
from typing import (
TYPE_CHECKING,
Any,
Callable,
Dict,
Iterable,
List,
Optional,
Tuple,
Type,
Union,
)
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.utils import get_from_dict_or_env
from langchain.vectorstores.base import VectorStore
from langchain.vectorstores.utils import DistanceStrategy
if TYPE_CHECKING:
from timescale_vector import Predicates
DEFAULT_DISTANCE_STRATEGY = DistanceStrategy.COSINE
ADA_TOKEN_COUNT = 1536
_LANGCHAIN_DEFAULT_COLLECTION_NAME = "langchain_store"
class TimescaleVector(VectorStore):
"""VectorStore implementation using the timescale vector client to store vectors
in Postgres.
To use, you should have the ``timescale_vector`` python package installed.
Args:
service_url: Service url on timescale cloud.
embedding: Any embedding function implementing
`langchain.embeddings.base.Embeddings` interface.
collection_name: The name of the collection to use. (default: langchain_store)
This will become the table name used for the collection.
distance_strategy: The distance strategy to use. (default: COSINE)
pre_delete_collection: If True, will delete the collection if it exists.
(default: False). Useful for testing.
Example:
.. code-block:: python
from langchain.vectorstores import TimescaleVector
from langchain.embeddings.openai import OpenAIEmbeddings
SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"
COLLECTION_NAME = "state_of_the_union_test"
embeddings = OpenAIEmbeddings()
vectorestore = TimescaleVector.from_documents(
embedding=embeddings,
documents=docs,
collection_name=COLLECTION_NAME,
service_url=SERVICE_URL,
)
""" # noqa: E501
def __init__(
self,
service_url: str,
embedding: Embeddings,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
num_dimensions: int = ADA_TOKEN_COUNT,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
pre_delete_collection: bool = False,
logger: Optional[logging.Logger] = None,
relevance_score_fn: Optional[Callable[[float], float]] = None,
time_partition_interval: Optional[timedelta] = None,
) -> None:
try:
from timescale_vector import client
except ImportError:
raise ImportError(
"Could not import timescale_vector python package. "
"Please install it with `pip install timescale-vector`."
)
self.service_url = service_url
self.embedding = embedding
self.collection_name = collection_name
self.num_dimensions = num_dimensions
self._distance_strategy = distance_strategy
self.pre_delete_collection = pre_delete_collection
self.logger = logger or logging.getLogger(__name__)
self.override_relevance_score_fn = relevance_score_fn
self._time_partition_interval = time_partition_interval
self.sync_client = client.Sync(
self.service_url,
self.collection_name,
self.num_dimensions,
self._distance_strategy.value.lower(),
time_partition_interval=self._time_partition_interval,
)
self.async_client = client.Async(
self.service_url,
self.collection_name,
self.num_dimensions,
self._distance_strategy.value.lower(),
time_partition_interval=self._time_partition_interval,
)
self.__post_init__()
def __post_init__(
self,
) -> None:
"""
Initialize the store.
"""
self.sync_client.create_tables()
if self.pre_delete_collection:
self.sync_client.delete_all()
@property
def embeddings(self) -> Embeddings:
return self.embedding
def drop_tables(self) -> None:
self.sync_client.drop_table()
@classmethod
def __from(
cls,
texts: List[str],
embeddings: List[List[float]],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
service_url: Optional[str] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
num_dimensions = len(embeddings[0])
if ids is None:
ids = [str(uuid.uuid1()) for _ in texts]
if not metadatas:
metadatas = [{} for _ in texts]
if service_url is None:
service_url = cls.get_service_url(kwargs)
store = cls(
service_url=service_url,
num_dimensions=num_dimensions,
collection_name=collection_name,
embedding=embedding,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
store.add_embeddings(
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
)
return store
@classmethod
async def __afrom(
cls,
texts: List[str],
embeddings: List[List[float]],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
service_url: Optional[str] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
num_dimensions = len(embeddings[0])
if ids is None:
ids = [str(uuid.uuid1()) for _ in texts]
if not metadatas:
metadatas = [{} for _ in texts]
if service_url is None:
service_url = cls.get_service_url(kwargs)
store = cls(
service_url=service_url,
num_dimensions=num_dimensions,
collection_name=collection_name,
embedding=embedding,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
await store.aadd_embeddings(
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
)
return store
def add_embeddings(
self,
texts: Iterable[str],
embeddings: List[List[float]],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""Add embeddings to the vectorstore.
Args:
texts: Iterable of strings to add to the vectorstore.
embeddings: List of list of embedding vectors.
metadatas: List of metadatas associated with the texts.
kwargs: vectorstore specific parameters
"""
if ids is None:
ids = [str(uuid.uuid1()) for _ in texts]
if not metadatas:
metadatas = [{} for _ in texts]
records = list(zip(ids, metadatas, texts, embeddings))
self.sync_client.upsert(records)
return ids
async def aadd_embeddings(
self,
texts: Iterable[str],
embeddings: List[List[float]],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""Add embeddings to the vectorstore.
Args:
texts: Iterable of strings to add to the vectorstore.
embeddings: List of list of embedding vectors.
metadatas: List of metadatas associated with the texts.
kwargs: vectorstore specific parameters
"""
if ids is None:
ids = [str(uuid.uuid1()) for _ in texts]
if not metadatas:
metadatas = [{} for _ in texts]
records = list(zip(ids, metadatas, texts, embeddings))
await self.async_client.upsert(records)
return ids
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""Run more texts through the embeddings and add to the vectorstore.
Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
kwargs: vectorstore specific parameters
Returns:
List of ids from adding the texts into the vectorstore.
"""
embeddings = self.embedding.embed_documents(list(texts))
return self.add_embeddings(
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
)
async def aadd_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""Run more texts through the embeddings and add to the vectorstore.
Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
kwargs: vectorstore specific parameters
Returns:
List of ids from adding the texts into the vectorstore.
"""
embeddings = self.embedding.embed_documents(list(texts))
return await self.aadd_embeddings(
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
)
def similarity_search(
self,
query: str,
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Document]:
"""Run similarity search with TimescaleVector with distance.
Args:
query (str): Query text to search for.
k (int): Number of results to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query.
"""
embedding = self.embedding.embed_query(text=query)
return self.similarity_search_by_vector(
embedding=embedding,
k=k,
filter=filter,
predicates=predicates,
**kwargs,
)
async def asimilarity_search(
self,
query: str,
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Document]:
"""Run similarity search with TimescaleVector with distance.
Args:
query (str): Query text to search for.
k (int): Number of results to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query.
"""
embedding = self.embedding.embed_query(text=query)
return await self.asimilarity_search_by_vector(
embedding=embedding,
k=k,
filter=filter,
predicates=predicates,
**kwargs,
)
def similarity_search_with_score(
self,
query: str,
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
"""Return docs most similar to query.
Args:
query: Text to look up documents similar to.
k: Number of Documents to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query and score for each
"""
embedding = self.embedding.embed_query(query)
docs = self.similarity_search_with_score_by_vector(
embedding=embedding,
k=k,
filter=filter,
predicates=predicates,
**kwargs,
)
return docs
async def asimilarity_search_with_score(
self,
query: str,
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
"""Return docs most similar to query.
Args:
query: Text to look up documents similar to.
k: Number of Documents to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query and score for each
"""
embedding = self.embedding.embed_query(query)
return await self.asimilarity_search_with_score_by_vector(
embedding=embedding,
k=k,
filter=filter,
predicates=predicates,
**kwargs,
)
def date_to_range_filter(self, **kwargs: Any) -> Any:
constructor_args = {
key: kwargs[key]
for key in [
"start_date",
"end_date",
"time_delta",
"start_inclusive",
"end_inclusive",
]
if key in kwargs
}
if not constructor_args or len(constructor_args) == 0:
return None
try:
from timescale_vector import client
except ImportError:
raise ImportError(
"Could not import timescale_vector python package. "
"Please install it with `pip install timescale-vector`."
)
return client.UUIDTimeRange(**constructor_args)
def similarity_search_with_score_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
try:
from timescale_vector import client
except ImportError:
raise ImportError(
"Could not import timescale_vector python package. "
"Please install it with `pip install timescale-vector`."
)
results = self.sync_client.search(
embedding,
limit=k,
filter=filter,
predicates=predicates,
uuid_time_filter=self.date_to_range_filter(**kwargs),
)
docs = [
(
Document(
page_content=result[client.SEARCH_RESULT_CONTENTS_IDX],
metadata=result[client.SEARCH_RESULT_METADATA_IDX],
),
result[client.SEARCH_RESULT_DISTANCE_IDX],
)
for result in results
]
return docs
async def asimilarity_search_with_score_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
try:
from timescale_vector import client
except ImportError:
raise ImportError(
"Could not import timescale_vector python package. "
"Please install it with `pip install timescale-vector`."
)
results = await self.async_client.search(
embedding,
limit=k,
filter=filter,
predicates=predicates,
uuid_time_filter=self.date_to_range_filter(**kwargs),
)
docs = [
(
Document(
page_content=result[client.SEARCH_RESULT_CONTENTS_IDX],
metadata=result[client.SEARCH_RESULT_METADATA_IDX],
),
result[client.SEARCH_RESULT_DISTANCE_IDX],
)
for result in results
]
return docs
def similarity_search_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Document]:
"""Return docs most similar to embedding vector.
Args:
embedding: Embedding to look up documents similar to.
k: Number of Documents to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query vector.
"""
docs_and_scores = self.similarity_search_with_score_by_vector(
embedding=embedding, k=k, filter=filter, predicates=predicates, **kwargs
)
return [doc for doc, _ in docs_and_scores]
async def asimilarity_search_by_vector(
self,
embedding: List[float],
k: int = 4,
filter: Optional[Union[dict, list]] = None,
predicates: Optional[Predicates] = None,
**kwargs: Any,
) -> List[Document]:
"""Return docs most similar to embedding vector.
Args:
embedding: Embedding to look up documents similar to.
k: Number of Documents to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
List of Documents most similar to the query vector.
"""
docs_and_scores = await self.asimilarity_search_with_score_by_vector(
embedding=embedding, k=k, filter=filter, predicates=predicates, **kwargs
)
return [doc for doc, _ in docs_and_scores]
@classmethod
def from_texts(
cls: Type[TimescaleVector],
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
ids: Optional[List[str]] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
"""
Return VectorStore initialized from texts and embeddings.
Postgres connection string is required
"Either pass it as a parameter
or set the TIMESCALE_SERVICE_URL environment variable.
"""
embeddings = embedding.embed_documents(list(texts))
return cls.__from(
texts,
embeddings,
embedding,
metadatas=metadatas,
ids=ids,
collection_name=collection_name,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
@classmethod
async def afrom_texts(
cls: Type[TimescaleVector],
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
ids: Optional[List[str]] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
"""
Return VectorStore initialized from texts and embeddings.
Postgres connection string is required
"Either pass it as a parameter
or set the TIMESCALE_SERVICE_URL environment variable.
"""
embeddings = embedding.embed_documents(list(texts))
return await cls.__afrom(
texts,
embeddings,
embedding,
metadatas=metadatas,
ids=ids,
collection_name=collection_name,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
@classmethod
def from_embeddings(
cls,
text_embeddings: List[Tuple[str, List[float]]],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
ids: Optional[List[str]] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
"""Construct TimescaleVector wrapper from raw documents and pre-
generated embeddings.
Return VectorStore initialized from documents and embeddings.
Postgres connection string is required
"Either pass it as a parameter
or set the TIMESCALE_SERVICE_URL environment variable.
Example:
.. code-block:: python
from langchain.vectorstores import TimescaleVector
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text_embeddings = embeddings.embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
tvs = TimescaleVector.from_embeddings(text_embedding_pairs, embeddings)
"""
texts = [t[0] for t in text_embeddings]
embeddings = [t[1] for t in text_embeddings]
return cls.__from(
texts,
embeddings,
embedding,
metadatas=metadatas,
ids=ids,
collection_name=collection_name,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
@classmethod
async def afrom_embeddings(
cls,
text_embeddings: List[Tuple[str, List[float]]],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
ids: Optional[List[str]] = None,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
"""Construct TimescaleVector wrapper from raw documents and pre-
generated embeddings.
Return VectorStore initialized from documents and embeddings.
Postgres connection string is required
"Either pass it as a parameter
or set the TIMESCALE_SERVICE_URL environment variable.
Example:
.. code-block:: python
from langchain.vectorstores import TimescaleVector
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text_embeddings = embeddings.embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
tvs = TimescaleVector.from_embeddings(text_embedding_pairs, embeddings)
"""
texts = [t[0] for t in text_embeddings]
embeddings = [t[1] for t in text_embeddings]
return await cls.__afrom(
texts,
embeddings,
embedding,
metadatas=metadatas,
ids=ids,
collection_name=collection_name,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
**kwargs,
)
@classmethod
def from_existing_index(
cls: Type[TimescaleVector],
embedding: Embeddings,
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
pre_delete_collection: bool = False,
**kwargs: Any,
) -> TimescaleVector:
"""
Get intsance of an existing TimescaleVector store.This method will
return the instance of the store without inserting any new
embeddings
"""
service_url = cls.get_service_url(kwargs)
store = cls(
service_url=service_url,
collection_name=collection_name,
embedding=embedding,
distance_strategy=distance_strategy,
pre_delete_collection=pre_delete_collection,
)
return store
@classmethod
def get_service_url(cls, kwargs: Dict[str, Any]) -> str:
service_url: str = get_from_dict_or_env(
data=kwargs,
key="service_url",
env_key="TIMESCALE_SERVICE_URL",
)
if not service_url:
raise ValueError(
"Postgres connection string is required"
"Either pass it as a parameter"
"or set the TIMESCALE_SERVICE_URL environment variable."
)
return service_url
@classmethod
def service_url_from_db_params(
cls,
host: str,
port: int,
database: str,
user: str,
password: str,
) -> str:
"""Return connection string from database parameters."""
return f"postgresql://{user}:{password}@{host}:{port}/{database}"
def _select_relevance_score_fn(self) -> Callable[[float], float]:
"""
The 'correct' relevance function
may differ depending on a few things, including:
- the distance / similarity metric used by the VectorStore
- the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
- embedding dimensionality
- etc.
"""
if self.override_relevance_score_fn is not None:
return self.override_relevance_score_fn
# Default strategy is to rely on distance strategy provided
# in vectorstore constructor
if self._distance_strategy == DistanceStrategy.COSINE:
return self._cosine_relevance_score_fn
elif self._distance_strategy == DistanceStrategy.EUCLIDEAN_DISTANCE:
return self._euclidean_relevance_score_fn
elif self._distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
return self._max_inner_product_relevance_score_fn
else:
raise ValueError(
"No supported normalization function"
f" for distance_strategy of {self._distance_strategy}."
"Consider providing relevance_score_fn to TimescaleVector constructor."
)
def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
"""Delete by vector ID or other criteria.
Args:
ids: List of ids to delete.
**kwargs: Other keyword arguments that subclasses might use.
Returns:
Optional[bool]: True if deletion is successful,
False otherwise, None if not implemented.
"""
if ids is None:
raise ValueError("No ids provided to delete.")
self.sync_client.delete_by_ids(ids)
return True
# todo should this be part of delete|()?
def delete_by_metadata(
self, filter: Union[Dict[str, str], List[Dict[str, str]]], **kwargs: Any
) -> Optional[bool]:
"""Delete by vector ID or other criteria.
Args:
ids: List of ids to delete.
**kwargs: Other keyword arguments that subclasses might use.
Returns:
Optional[bool]: True if deletion is successful,
False otherwise, None if not implemented.
"""
self.sync_client.delete_by_metadata(filter)
return True
class IndexType(str, enum.Enum):
"""Enumerator for the supported Index types"""
TIMESCALE_VECTOR = "tsv"
PGVECTOR_IVFFLAT = "ivfflat"
PGVECTOR_HNSW = "hnsw"
DEFAULT_INDEX_TYPE = IndexType.TIMESCALE_VECTOR
def create_index(
self, index_type: Union[IndexType, str] = DEFAULT_INDEX_TYPE, **kwargs: Any
) -> None:
try:
from timescale_vector import client
except ImportError:
raise ImportError(
"Could not import timescale_vector python package. "
"Please install it with `pip install timescale-vector`."
)
index_type = (
index_type.value if isinstance(index_type, self.IndexType) else index_type
)
if index_type == self.IndexType.PGVECTOR_IVFFLAT.value:
self.sync_client.create_embedding_index(client.IvfflatIndex(**kwargs))
if index_type == self.IndexType.PGVECTOR_HNSW.value:
self.sync_client.create_embedding_index(client.HNSWIndex(**kwargs))
if index_type == self.IndexType.TIMESCALE_VECTOR.value:
self.sync_client.create_embedding_index(
client.TimescaleVectorIndex(**kwargs)
)
def drop_index(self) -> None:
self.sync_client.drop_embedding_index()

File diff suppressed because it is too large Load Diff

View File

@ -129,6 +129,7 @@ markdownify = {version = "^0.11.6", optional = true}
assemblyai = {version = "^0.17.0", optional = true}
dashvector = {version = "^1.0.1", optional = true}
sqlite-vss = {version = "^0.1.2", optional = true}
timescale-vector = {version = "^0.0.1", optional = true}
[tool.poetry.group.test.dependencies]
@ -345,6 +346,7 @@ extended_testing = [
"markdownify",
"dashvector",
"sqlite-vss",
"timescale-vector",
]
[tool.ruff]

View File

@ -0,0 +1,433 @@
"""Test TimescaleVector functionality."""
import os
from datetime import datetime, timedelta
from typing import List
import pytest
from langchain.docstore.document import Document
from langchain.vectorstores.timescalevector import TimescaleVector
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
SERVICE_URL = TimescaleVector.service_url_from_db_params(
host=os.environ.get("TEST_TIMESCALE_HOST", "localhost"),
port=int(os.environ.get("TEST_TIMESCALE_PORT", "5432")),
database=os.environ.get("TEST_TIMESCALE_DATABASE", "postgres"),
user=os.environ.get("TEST_TIMESCALE_USER", "postgres"),
password=os.environ.get("TEST_TIMESCALE_PASSWORD", "postgres"),
)
ADA_TOKEN_COUNT = 1536
class FakeEmbeddingsWithAdaDimension(FakeEmbeddings):
"""Fake embeddings functionality for testing."""
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Return simple embeddings."""
return [
[float(1.0)] * (ADA_TOKEN_COUNT - 1) + [float(i)] for i in range(len(texts))
]
def embed_query(self, text: str) -> List[float]:
"""Return simple embeddings."""
return [float(1.0)] * (ADA_TOKEN_COUNT - 1) + [float(0.0)]
def test_timescalevector() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search("foo", k=1)
assert output == [Document(page_content="foo")]
def test_timescalevector_from_documents() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
docsearch = TimescaleVector.from_documents(
documents=docs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search("foo", k=1)
assert output == [Document(page_content="foo", metadata={"a": "b"})]
@pytest.mark.asyncio
async def test_timescalevector_afrom_documents() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
docsearch = await TimescaleVector.afrom_documents(
documents=docs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = await docsearch.asimilarity_search("foo", k=1)
assert output == [Document(page_content="foo", metadata={"a": "b"})]
def test_timescalevector_embeddings() -> None:
"""Test end to end construction with embeddings and search."""
texts = ["foo", "bar", "baz"]
text_embeddings = FakeEmbeddingsWithAdaDimension().embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
docsearch = TimescaleVector.from_embeddings(
text_embeddings=text_embedding_pairs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search("foo", k=1)
assert output == [Document(page_content="foo")]
@pytest.mark.asyncio
async def test_timescalevector_aembeddings() -> None:
"""Test end to end construction with embeddings and search."""
texts = ["foo", "bar", "baz"]
text_embeddings = FakeEmbeddingsWithAdaDimension().embed_documents(texts)
text_embedding_pairs = list(zip(texts, text_embeddings))
docsearch = await TimescaleVector.afrom_embeddings(
text_embeddings=text_embedding_pairs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = await docsearch.asimilarity_search("foo", k=1)
assert output == [Document(page_content="foo")]
def test_timescalevector_with_metadatas() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search("foo", k=1)
assert output == [Document(page_content="foo", metadata={"page": "0"})]
def test_timescalevector_with_metadatas_with_scores() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_score("foo", k=1)
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
@pytest.mark.asyncio
async def test_timescalevector_awith_metadatas_with_scores() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = await TimescaleVector.afrom_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = await docsearch.asimilarity_search_with_score("foo", k=1)
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
def test_timescalevector_with_filter_match() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection_filter",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "0"})
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
def test_timescalevector_with_filter_distant_match() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection_filter",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "2"})
assert output == [
(Document(page_content="baz", metadata={"page": "2"}), 0.0013003906671379406)
]
def test_timescalevector_with_filter_no_match() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection_filter",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "5"})
assert output == []
def test_timescalevector_with_filter_in_set() -> None:
"""Test end to end construction and search."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection_filter",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_score(
"foo", k=2, filter=[{"page": "0"}, {"page": "2"}]
)
assert output == [
(Document(page_content="foo", metadata={"page": "0"}), 0.0),
(Document(page_content="baz", metadata={"page": "2"}), 0.0013003906671379406),
]
def test_timescalevector_relevance_score() -> None:
"""Test to make sure the relevance score is scaled to 0-1."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
assert output == [
(Document(page_content="foo", metadata={"page": "0"}), 1.0),
(Document(page_content="bar", metadata={"page": "1"}), 0.9996744261675065),
(Document(page_content="baz", metadata={"page": "2"}), 0.9986996093328621),
]
@pytest.mark.asyncio
async def test_timescalevector_relevance_score_async() -> None:
"""Test to make sure the relevance score is scaled to 0-1."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = await TimescaleVector.afrom_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
output = await docsearch.asimilarity_search_with_relevance_scores("foo", k=3)
assert output == [
(Document(page_content="foo", metadata={"page": "0"}), 1.0),
(Document(page_content="bar", metadata={"page": "1"}), 0.9996744261675065),
(Document(page_content="baz", metadata={"page": "2"}), 0.9986996093328621),
]
def test_timescalevector_retriever_search_threshold() -> None:
"""Test using retriever for searching with threshold."""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
)
retriever = docsearch.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={"k": 3, "score_threshold": 0.999},
)
output = retriever.get_relevant_documents("summer")
assert output == [
Document(page_content="foo", metadata={"page": "0"}),
Document(page_content="bar", metadata={"page": "1"}),
]
def test_timescalevector_retriever_search_threshold_custom_normalization_fn() -> None:
"""Test searching with threshold and custom normalization function"""
texts = ["foo", "bar", "baz"]
metadatas = [{"page": str(i)} for i in range(len(texts))]
docsearch = TimescaleVector.from_texts(
texts=texts,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
metadatas=metadatas,
service_url=SERVICE_URL,
pre_delete_collection=True,
relevance_score_fn=lambda d: d * 0,
)
retriever = docsearch.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={"k": 3, "score_threshold": 0.5},
)
output = retriever.get_relevant_documents("foo")
assert output == []
def test_timescalevector_delete() -> None:
"""Test deleting functionality."""
texts = ["bar", "baz"]
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
docsearch = TimescaleVector.from_documents(
documents=docs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
texts = ["foo"]
meta = [{"b": "c"}]
ids = docsearch.add_texts(texts, meta)
output = docsearch.similarity_search("bar", k=10)
assert len(output) == 3
docsearch.delete(ids)
output = docsearch.similarity_search("bar", k=10)
assert len(output) == 2
docsearch.delete_by_metadata({"a": "b"})
output = docsearch.similarity_search("bar", k=10)
assert len(output) == 0
def test_timescalevector_with_index() -> None:
"""Test deleting functionality."""
texts = ["bar", "baz"]
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
docsearch = TimescaleVector.from_documents(
documents=docs,
collection_name="test_collection",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
)
texts = ["foo"]
meta = [{"b": "c"}]
docsearch.add_texts(texts, meta)
docsearch.create_index()
output = docsearch.similarity_search("bar", k=10)
assert len(output) == 3
docsearch.drop_index()
docsearch.create_index(
index_type=TimescaleVector.IndexType.TIMESCALE_VECTOR,
max_alpha=1.0,
num_neighbors=50,
)
docsearch.drop_index()
docsearch.create_index("tsv", max_alpha=1.0, num_neighbors=50)
docsearch.drop_index()
docsearch.create_index("ivfflat", num_lists=20, num_records=1000)
docsearch.drop_index()
docsearch.create_index("hnsw", m=16, ef_construction=64)
def test_timescalevector_time_partitioning() -> None:
"""Test deleting functionality."""
from timescale_vector import client
texts = ["bar", "baz"]
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
docsearch = TimescaleVector.from_documents(
documents=docs,
collection_name="test_collection_time_partitioning",
embedding=FakeEmbeddingsWithAdaDimension(),
service_url=SERVICE_URL,
pre_delete_collection=True,
time_partition_interval=timedelta(hours=1),
)
texts = ["foo"]
meta = [{"b": "c"}]
ids = [client.uuid_from_time(datetime.now() - timedelta(hours=3))]
docsearch.add_texts(texts, meta, ids)
output = docsearch.similarity_search("bar", k=10)
assert len(output) == 3
output = docsearch.similarity_search(
"bar", k=10, start_date=datetime.now() - timedelta(hours=1)
)
assert len(output) == 2
output = docsearch.similarity_search(
"bar", k=10, end_date=datetime.now() - timedelta(hours=1)
)
assert len(output) == 1
output = docsearch.similarity_search(
"bar", k=10, start_date=datetime.now() - timedelta(minutes=200)
)
assert len(output) == 3
output = docsearch.similarity_search(
"bar",
k=10,
start_date=datetime.now() - timedelta(minutes=200),
time_delta=timedelta(hours=1),
)
assert len(output) == 1

View File

@ -0,0 +1,97 @@
from typing import Dict, Tuple
import pytest as pytest
from langchain.chains.query_constructor.ir import (
Comparator,
Comparison,
Operation,
Operator,
StructuredQuery,
)
from langchain.retrievers.self_query.timescalevector import TimescaleVectorTranslator
DEFAULT_TRANSLATOR = TimescaleVectorTranslator()
@pytest.mark.requires("timescale_vector")
def test_visit_comparison() -> None:
from timescale_vector import client
comp = Comparison(comparator=Comparator.LT, attribute="foo", value=1)
expected = client.Predicates(("foo", "<", 1))
actual = DEFAULT_TRANSLATOR.visit_comparison(comp)
assert expected == actual
@pytest.mark.requires("timescale_vector")
def test_visit_operation() -> None:
from timescale_vector import client
op = Operation(
operator=Operator.AND,
arguments=[
Comparison(comparator=Comparator.LT, attribute="foo", value=2),
Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
Comparison(comparator=Comparator.GT, attribute="abc", value=2.0),
],
)
expected = client.Predicates(
client.Predicates(("foo", "<", 2)),
client.Predicates(("bar", "==", "baz")),
client.Predicates(("abc", ">", 2.0)),
)
actual = DEFAULT_TRANSLATOR.visit_operation(op)
assert expected == actual
@pytest.mark.requires("timescale_vector")
def test_visit_structured_query() -> None:
from timescale_vector import client
query = "What is the capital of France?"
structured_query = StructuredQuery(
query=query,
filter=None,
)
expected: Tuple[str, Dict] = (query, {})
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
assert expected == actual
comp = Comparison(comparator=Comparator.LT, attribute="foo", value=1)
expected = (
query,
{"predicates": client.Predicates(("foo", "<", 1))},
)
structured_query = StructuredQuery(
query=query,
filter=comp,
)
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
assert expected == actual
op = Operation(
operator=Operator.AND,
arguments=[
Comparison(comparator=Comparator.LT, attribute="foo", value=2),
Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
Comparison(comparator=Comparator.GT, attribute="abc", value=2.0),
],
)
structured_query = StructuredQuery(
query=query,
filter=op,
)
expected = (
query,
{
"predicates": client.Predicates(
client.Predicates(("foo", "<", 2)),
client.Predicates(("bar", "==", "baz")),
client.Predicates(("abc", ">", 2.0)),
)
},
)
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
assert expected == actual