mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-15 17:33:53 +00:00
Add integration for Timescale Vector(Postgres) (#10650)
**Description:**
This commit adds a vector store for the Postgres-based vector database (`TimescaleVector`).

Timescale Vector (https://www.timescale.com/ai) is PostgreSQL++ for AI applications. It enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`:
- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.

Timescale Vector is available on Timescale, the cloud PostgreSQL platform. (There is no self-hosted version at this time.) LangChain users get a 90-day free trial for Timescale Vector.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Avthar Sewrathan <avthar@timescale.com>
parent
55570e54e1
commit
6e02c45ca4
1696
docs/extras/integrations/vectorstores/timescalevector.ipynb
Normal file
File diff suppressed because it is too large
@ -0,0 +1,534 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "13afcae7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Timescale Vector (Postgres) self-querying \n",
|
||||
"\n",
|
||||
"[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications. It enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.\n",
|
||||
"\n",
|
||||
"This notebook shows how to use the Postgres vector database (`TimescaleVector`) to perform self-querying. In the notebook we'll demo the `SelfQueryRetriever` wrapped around a TimescaleVector vector store. \n",
|
||||
"\n",
|
||||
"## What is Timescale Vector?\n",
|
||||
"**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**\n",
|
||||
"\n",
|
||||
"Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.\n",
|
||||
"- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.\n",
|
||||
"- Enables fast time-based vector search via automatic time-based partitioning and indexing.\n",
|
||||
"- Provides a familiar SQL interface for querying vector embeddings and relational data.\n",
|
||||
"\n",
|
||||
"Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:\n",
|
||||
"- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.\n",
|
||||
"- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.\n",
|
||||
"- Enables a worry-free experience with enterprise-grade security and compliance.\n",
|
||||
"\n",
|
||||
"## How to access Timescale Vector\n",
|
||||
"Timescale Vector is available on [Timescale](https://www.timescale.com/ai), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
|
||||
"\n",
|
||||
"LangChain users get a 90-day free trial for Timescale Vector.\n",
|
||||
"- To get started, [sign up](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for Timescale, create a new database, and follow this notebook!\n",
|
||||
"- See the [Timescale Vector explainer blog](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for more details and performance benchmarks.\n",
|
||||
"- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "68e75fb9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Creating a TimescaleVector vectorstore\n",
|
||||
"First we'll want to create a Timescale Vector vectorstore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
|
||||
"\n",
|
||||
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `timescale-vector` package."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "63a8af5b",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip install lark"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "22431060-52c4-48a7-a97b-9f542b8b0928",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip install timescale-vector "
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "83811610-7df3-4ede-b268-68a6a83ba9e2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this example, we'll use `OpenAIEmbeddings`, so let's load your OpenAI API key."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "dd01b61b-7d32-4a55-85d6-b2d2d4f18840",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get the OpenAI API key by reading the local .env file\n",
|
||||
"# The .env file should contain a line starting with `OPENAI_API_KEY=sk-`\n",
|
||||
"import os\n",
|
||||
"from dotenv import load_dotenv, find_dotenv\n",
|
||||
"_ = load_dotenv(find_dotenv())\n",
|
||||
"\n",
|
||||
"OPENAI_API_KEY = os.environ['OPENAI_API_KEY']\n",
|
||||
"# Alternatively, use getpass to enter the key in a prompt\n",
|
||||
"#import os\n",
|
||||
"#import getpass\n",
|
||||
"#os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "766e9c4b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet or `.env` file you downloaded after creating a new database. \n",
|
||||
"\n",
|
||||
"If you haven't already, [sign up for Timescale](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), and create a new database.\n",
|
||||
"\n",
|
||||
"The URI will look something like this: `postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "6bd6877e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get the service url by reading local .env file\n",
|
||||
"# The .env file should contain a line starting with `TIMESCALE_SERVICE_URL=postgresql://`\n",
|
||||
"_ = load_dotenv(find_dotenv())\n",
|
||||
"TIMESCALE_SERVICE_URL = os.environ[\"TIMESCALE_SERVICE_URL\"]\n",
|
||||
"\n",
|
||||
"# Alternatively, use getpass to enter the service URL in a prompt\n",
|
||||
"#import os\n",
|
||||
"#import getpass\n",
|
||||
"#TIMESCALE_SERVICE_URL = getpass.getpass(\"Timescale Service URL:\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "cb4a5787",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.schema import Document\n",
|
||||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||||
"from langchain.vectorstores.timescalevector import TimescaleVector\n",
|
||||
"\n",
|
||||
"embeddings = OpenAIEmbeddings()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "a4f863f5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here are the sample documents we'll use for this demo. The data is about movies, and has both content and metadata fields with information about a particular movie."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "bcbe04d9",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"docs = [\n",
|
||||
" Document(\n",
|
||||
" page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
|
||||
" metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
|
||||
" metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
|
||||
" metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
|
||||
" metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"Toys come alive and have a blast doing so\",\n",
|
||||
" metadata={\"year\": 1995, \"genre\": \"animated\"},\n",
|
||||
" ),\n",
|
||||
" Document(\n",
|
||||
" page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
|
||||
" metadata={\n",
|
||||
" \"year\": 1979,\n",
|
||||
" \"rating\": 9.9,\n",
|
||||
" \"director\": \"Andrei Tarkovsky\",\n",
|
||||
" \"genre\": \"science fiction\",\n",
|
||||
" },\n",
|
||||
" ),\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "7d0d771e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we'll create our Timescale Vector vectorstore. Note that the collection name will be the name of the PostgreSQL table in which the documents are stored."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "2428d1ba",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"COLLECTION_NAME = \"langchain_self_query_demo\"\n",
|
||||
"vectorstore = TimescaleVector.from_documents(\n",
|
||||
" embedding=embeddings,\n",
|
||||
" documents=docs,\n",
|
||||
" collection_name=COLLECTION_NAME,\n",
|
||||
" service_url=TIMESCALE_SERVICE_URL,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "5ecaab6d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Creating our self-querying retriever\n",
|
||||
"Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "86e34dbf",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.llms import OpenAI\n",
|
||||
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
|
||||
"from langchain.chains.query_constructor.base import AttributeInfo\n",
|
||||
"\n",
|
||||
"# Give LLM info about the metadata fields\n",
|
||||
"metadata_field_info = [\n",
|
||||
" AttributeInfo(\n",
|
||||
" name=\"genre\",\n",
|
||||
" description=\"The genre of the movie\",\n",
|
||||
" type=\"string or list[string]\",\n",
|
||||
" ),\n",
|
||||
" AttributeInfo(\n",
|
||||
" name=\"year\",\n",
|
||||
" description=\"The year the movie was released\",\n",
|
||||
" type=\"integer\",\n",
|
||||
" ),\n",
|
||||
" AttributeInfo(\n",
|
||||
" name=\"director\",\n",
|
||||
" description=\"The name of the movie director\",\n",
|
||||
" type=\"string\",\n",
|
||||
" ),\n",
|
||||
" AttributeInfo(\n",
|
||||
" name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
|
||||
" ),\n",
|
||||
"]\n",
|
||||
"document_content_description = \"Brief summary of a movie\"\n",
|
||||
"\n",
|
||||
"# Instantiate the self-query retriever from an LLM\n",
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"retriever = SelfQueryRetriever.from_llm(\n",
|
||||
" llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "ea9df8d4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Self Querying Retrieval with Timescale Vector\n",
|
||||
"And now we can try actually using our retriever!\n",
|
||||
"\n",
|
||||
"Run the queries below and note how you can specify a query, a filter, or a composite filter (filters combined with AND, OR) in natural language; the self-query retriever translates that query into SQL and performs the search on the Timescale Vector (Postgres) vectorstore.\n",
|
||||
"\n",
|
||||
"This illustrates the power of the self-query retriever. You can use it to perform complex searches over your vectorstore without you or your users having to write any SQL directly!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "38a126e9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/libs/langchain/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
|
||||
" warnings.warn(\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query='dinosaur' filter=None limit=None\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
|
||||
" Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
|
||||
" Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n",
|
||||
" Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example only specifies a relevant query\n",
|
||||
"retriever.get_relevant_documents(\"What are some movies about dinosaurs\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "fc3f1e6e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
|
||||
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
|
||||
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'}),\n",
|
||||
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example only specifies a filter\n",
|
||||
"retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "b19d4da0",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig') limit=None\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'}),\n",
|
||||
" Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example specifies a query and a filter\n",
|
||||
"retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "f900e40e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='rating', value=8.5), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction')]) limit=None\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}),\n",
|
||||
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'genre': 'science fiction', 'rating': 9.9, 'director': 'Andrei Tarkovsky'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example specifies a composite filter\n",
|
||||
"retriever.get_relevant_documents(\n",
|
||||
" \"What's a highly rated (above 8.5) science fiction film?\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "12a51522",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')]) limit=None\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example specifies a query and composite filter\n",
|
||||
"retriever.get_relevant_documents(\n",
|
||||
" \"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Filter k\n",
|
||||
"\n",
|
||||
"We can also use the self-query retriever to specify `k`: the number of documents to fetch.\n",
|
||||
"\n",
|
||||
"We can do this by passing `enable_limit=True` to the constructor."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "bff36b88-b506-4877-9c63-e5a1a8d78e64",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"retriever = SelfQueryRetriever.from_llm(\n",
|
||||
" llm,\n",
|
||||
" vectorstore,\n",
|
||||
" document_content_description,\n",
|
||||
" metadata_field_info,\n",
|
||||
" enable_limit=True,\n",
|
||||
" verbose=True,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "2758d229-4f97-499c-819f-888acaf8ee10",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"query='dinosaur' filter=None limit=2\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7}),\n",
|
||||
" Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'genre': 'science fiction', 'rating': 7.7})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# This example specifies a query with a LIMIT value\n",
|
||||
"retriever.get_relevant_documents(\"what are two movies about dinosaurs\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
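To make the filter semantics used in the notebook queries concrete, here is a minimal in-memory sketch. It is not how Timescale Vector evaluates predicates (that happens in SQL on the database side); the `MOVIES` list, the tuple-based filter shape, and the `matches` helper are all hypothetical stand-ins introduced for illustration. It shows which of the sample movies satisfy a single comparison and a composite `AND` filter.

```python
import operator

# Hypothetical stand-ins for the metadata of the notebook's sample documents.
MOVIES = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "rating": 9.9, "director": "Andrei Tarkovsky", "genre": "science fiction"},
]

# Comparator symbols mapped to plain Python comparison functions.
COMPARATORS = {
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(metadata, flt):
    """Evaluate a filter against one metadata dict.

    A filter is either a comparison ``(attribute, comparator, value)`` or an
    operation ``("AND" | "OR", [sub_filters])``. This shape is an assumption
    made for this sketch only.
    """
    if len(flt) == 2:
        op, sub_filters = flt
        results = [matches(metadata, sub) for sub in sub_filters]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        raise ValueError(f"Unsupported operator: {op}")
    attribute, comparator, value = flt
    if attribute not in metadata:
        # A document without the attribute cannot satisfy the comparison.
        return False
    return COMPARATORS[comparator](metadata[attribute], value)


# "I want to watch a movie rated higher than 8.5"
high_rated = [m["year"] for m in MOVIES if matches(m, ("rating", ">", 8.5))]

# "What's a highly rated (above 8.5) science fiction film?"
composite = (
    "AND",
    [("rating", ">=", 8.5), ("genre", "==", "science fiction")],
)
high_rated_scifi = [m["year"] for m in MOVIES if matches(m, composite)]

print(high_rated)
print(high_rated_scifi)
```

Treating a missing attribute as a non-match is a choice made for this sketch; it mirrors the notebook output, where movies without a `rating` field never appear in rating-filtered results.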
|
@ -18,6 +18,7 @@ from langchain.retrievers.self_query.pinecone import PineconeTranslator
|
||||
from langchain.retrievers.self_query.qdrant import QdrantTranslator
|
||||
from langchain.retrievers.self_query.redis import RedisTranslator
|
||||
from langchain.retrievers.self_query.supabase import SupabaseVectorTranslator
|
||||
from langchain.retrievers.self_query.timescalevector import TimescaleVectorTranslator
|
||||
from langchain.retrievers.self_query.vectara import VectaraTranslator
|
||||
from langchain.retrievers.self_query.weaviate import WeaviateTranslator
|
||||
from langchain.schema import BaseRetriever, Document
|
||||
@ -33,6 +34,7 @@ from langchain.vectorstores import (
|
||||
Qdrant,
|
||||
Redis,
|
||||
SupabaseVectorStore,
|
||||
TimescaleVector,
|
||||
Vectara,
|
||||
VectorStore,
|
||||
Weaviate,
|
||||
@ -53,6 +55,7 @@ def _get_builtin_translator(vectorstore: VectorStore) -> Visitor:
|
||||
ElasticsearchStore: ElasticsearchTranslator,
|
||||
Milvus: MilvusTranslator,
|
||||
SupabaseVectorStore: SupabaseVectorTranslator,
|
||||
TimescaleVector: TimescaleVectorTranslator,
|
||||
}
|
||||
if isinstance(vectorstore, Qdrant):
|
||||
return QdrantTranslator(metadata_key=vectorstore.metadata_payload_key)
|
||||
|
@ -0,0 +1,84 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import TYPE_CHECKING, Tuple, Union
|
||||
|
||||
from langchain.chains.query_constructor.ir import (
|
||||
Comparator,
|
||||
Comparison,
|
||||
Operation,
|
||||
Operator,
|
||||
StructuredQuery,
|
||||
Visitor,
|
||||
)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from timescale_vector import client
|
||||
|
||||
|
||||
class TimescaleVectorTranslator(Visitor):
|
||||
"""Translate the internal query language elements to valid filters."""
|
||||
|
||||
allowed_operators = [Operator.AND, Operator.OR, Operator.NOT]
|
||||
"""Subset of allowed logical operators."""
|
||||
|
||||
allowed_comparators = [
|
||||
Comparator.EQ,
|
||||
Comparator.GT,
|
||||
Comparator.GTE,
|
||||
Comparator.LT,
|
||||
Comparator.LTE,
|
||||
]
|
||||
|
||||
COMPARATOR_MAP = {
|
||||
Comparator.EQ: "==",
|
||||
Comparator.GT: ">",
|
||||
Comparator.GTE: ">=",
|
||||
Comparator.LT: "<",
|
||||
Comparator.LTE: "<=",
|
||||
}
|
||||
|
||||
OPERATOR_MAP = {Operator.AND: "AND", Operator.OR: "OR", Operator.NOT: "NOT"}
|
||||
|
||||
def _format_func(self, func: Union[Operator, Comparator]) -> str:
|
||||
self._validate_func(func)
|
||||
if isinstance(func, Operator):
|
||||
value = self.OPERATOR_MAP[func.value] # type: ignore
|
||||
elif isinstance(func, Comparator):
|
||||
value = self.COMPARATOR_MAP[func.value] # type: ignore
|
||||
return f"{value}"
|
||||
|
||||
def visit_operation(self, operation: Operation) -> client.Predicates:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError as e:
|
||||
raise ImportError(
|
||||
"Cannot import timescale-vector. Please install with `pip install "
|
||||
"timescale-vector`."
|
||||
) from e
|
||||
args = [arg.accept(self) for arg in operation.arguments]
|
||||
return client.Predicates(*args, operator=self._format_func(operation.operator))
|
||||
|
||||
def visit_comparison(self, comparison: Comparison) -> client.Predicates:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError as e:
|
||||
raise ImportError(
|
||||
"Cannot import timescale-vector. Please install with `pip install "
|
||||
"timescale-vector`."
|
||||
) from e
|
||||
return client.Predicates(
|
||||
(
|
||||
comparison.attribute,
|
||||
self._format_func(comparison.comparator),
|
||||
comparison.value,
|
||||
)
|
||||
)
|
||||
|
||||
def visit_structured_query(
|
||||
self, structured_query: StructuredQuery
|
||||
) -> Tuple[str, dict]:
|
||||
if structured_query.filter is None:
|
||||
kwargs = {}
|
||||
else:
|
||||
kwargs = {"predicates": structured_query.filter.accept(self)}
|
||||
return structured_query.query, kwargs
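The visitor pattern used by `TimescaleVectorTranslator` can be sketched without any third-party dependency. In the block below, `Comparison`, `Operation`, and `SketchTranslator` are simplified, hypothetical stand-ins for the LangChain IR classes and the translator above, and nested tuples stand in for `client.Predicates`; the real `timescale_vector` objects are not constructed here. The sketch only shows how each node's `accept` dispatches to the matching `visit_*` method and how comparator and operator names are mapped to their symbols.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """Stand-in for the IR comparison node (assumption for this sketch)."""

    comparator: str  # one of "eq", "gt", "gte", "lt", "lte"
    attribute: str
    value: Any

    def accept(self, visitor: "SketchTranslator") -> tuple:
        return visitor.visit_comparison(self)


@dataclass
class Operation:
    """Stand-in for the IR operation node (assumption for this sketch)."""

    operator: str  # one of "and", "or", "not"
    arguments: List[Union["Comparison", "Operation"]]

    def accept(self, visitor: "SketchTranslator") -> tuple:
        return visitor.visit_operation(self)


class SketchTranslator:
    """Maps IR nodes to nested tuples instead of client.Predicates."""

    COMPARATOR_MAP = {"eq": "==", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}
    OPERATOR_MAP = {"and": "AND", "or": "OR", "not": "NOT"}

    def visit_comparison(self, comparison: Comparison) -> tuple:
        # Leaf node: (attribute, comparator symbol, value).
        return (
            comparison.attribute,
            self.COMPARATOR_MAP[comparison.comparator],
            comparison.value,
        )

    def visit_operation(self, operation: Operation) -> tuple:
        # Inner node: translate the children first, then wrap them.
        args = [arg.accept(self) for arg in operation.arguments]
        return (self.OPERATOR_MAP[operation.operator], args)


query_filter = Operation(
    "and",
    [
        Comparison("gte", "rating", 8.5),
        Comparison("eq", "genre", "science fiction"),
    ],
)
translated = query_filter.accept(SketchTranslator())
print(translated)
```

An unsupported comparator or operator name raises `KeyError` in this sketch, which plays a role similar to the `_validate_func` check in the translator above.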
|
@ -70,6 +70,7 @@ from langchain.vectorstores.supabase import SupabaseVectorStore
|
||||
from langchain.vectorstores.tair import Tair
|
||||
from langchain.vectorstores.tencentvectordb import TencentVectorDB
|
||||
from langchain.vectorstores.tigris import Tigris
|
||||
from langchain.vectorstores.timescalevector import TimescaleVector
|
||||
from langchain.vectorstores.typesense import Typesense
|
||||
from langchain.vectorstores.usearch import USearch
|
||||
from langchain.vectorstores.vald import Vald
|
||||
@ -135,6 +136,7 @@ __all__ = [
|
||||
"SupabaseVectorStore",
|
||||
"Tair",
|
||||
"Tigris",
|
||||
"TimescaleVector",
|
||||
"Typesense",
|
||||
"USearch",
|
||||
"Vald",
|
||||
|
871
libs/langchain/langchain/vectorstores/timescalevector.py
Normal file
@ -0,0 +1,871 @@
|
||||
"""VectorStore wrapper around a Postgres-TimescaleVector database."""
|
||||
from __future__ import annotations
|
||||
|
||||
import enum
|
||||
import logging
|
||||
import uuid
|
||||
from datetime import timedelta
|
||||
from typing import (
|
||||
TYPE_CHECKING,
|
||||
Any,
|
||||
Callable,
|
||||
Dict,
|
||||
Iterable,
|
||||
List,
|
||||
Optional,
|
||||
Tuple,
|
||||
Type,
|
||||
Union,
|
||||
)
|
||||
|
||||
from langchain.docstore.document import Document
|
||||
from langchain.embeddings.base import Embeddings
|
||||
from langchain.utils import get_from_dict_or_env
|
||||
from langchain.vectorstores.base import VectorStore
|
||||
from langchain.vectorstores.utils import DistanceStrategy
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from timescale_vector import Predicates
|
||||
|
||||
|
||||
DEFAULT_DISTANCE_STRATEGY = DistanceStrategy.COSINE
|
||||
|
||||
ADA_TOKEN_COUNT = 1536
|
||||
|
||||
_LANGCHAIN_DEFAULT_COLLECTION_NAME = "langchain_store"
|
||||
|
||||
|
||||
class TimescaleVector(VectorStore):
|
||||
"""VectorStore implementation using the timescale vector client to store vectors
|
||||
in Postgres.
|
||||
|
||||
To use, you should have the ``timescale_vector`` python package installed.
|
||||
|
||||
Args:
|
||||
service_url: Service URL on Timescale Cloud.
|
||||
embedding: Any embedding function implementing
|
||||
`langchain.embeddings.base.Embeddings` interface.
|
||||
collection_name: The name of the collection to use. (default: langchain_store)
|
||||
This will become the table name used for the collection.
|
||||
distance_strategy: The distance strategy to use. (default: COSINE)
|
||||
pre_delete_collection: If True, will delete the collection if it exists.
|
||||
(default: False). Useful for testing.
|
||||
|
||||
Example:
|
||||
.. code-block:: python
|
||||
|
||||
from langchain.vectorstores import TimescaleVector
|
||||
from langchain.embeddings.openai import OpenAIEmbeddings
|
||||
|
||||
SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"
|
||||
COLLECTION_NAME = "state_of_the_union_test"
|
||||
embeddings = OpenAIEmbeddings()
|
||||
vectorstore = TimescaleVector.from_documents(
|
||||
embedding=embeddings,
|
||||
documents=docs,
|
||||
collection_name=COLLECTION_NAME,
|
||||
service_url=SERVICE_URL,
|
||||
)
|
||||
""" # noqa: E501
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
service_url: str,
|
||||
embedding: Embeddings,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
num_dimensions: int = ADA_TOKEN_COUNT,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
pre_delete_collection: bool = False,
|
||||
logger: Optional[logging.Logger] = None,
|
||||
relevance_score_fn: Optional[Callable[[float], float]] = None,
|
||||
time_partition_interval: Optional[timedelta] = None,
|
||||
) -> None:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import timescale_vector python package. "
|
||||
"Please install it with `pip install timescale-vector`."
|
||||
)
|
||||
|
||||
self.service_url = service_url
|
||||
self.embedding = embedding
|
||||
self.collection_name = collection_name
|
||||
self.num_dimensions = num_dimensions
|
||||
self._distance_strategy = distance_strategy
|
||||
self.pre_delete_collection = pre_delete_collection
|
||||
self.logger = logger or logging.getLogger(__name__)
|
||||
self.override_relevance_score_fn = relevance_score_fn
|
||||
self._time_partition_interval = time_partition_interval
|
||||
self.sync_client = client.Sync(
|
||||
self.service_url,
|
||||
self.collection_name,
|
||||
self.num_dimensions,
|
||||
self._distance_strategy.value.lower(),
|
||||
time_partition_interval=self._time_partition_interval,
|
||||
)
|
||||
self.async_client = client.Async(
|
||||
self.service_url,
|
||||
self.collection_name,
|
||||
self.num_dimensions,
|
||||
self._distance_strategy.value.lower(),
|
||||
time_partition_interval=self._time_partition_interval,
|
||||
)
|
||||
self.__post_init__()
|
||||
|
||||
def __post_init__(
|
||||
self,
|
||||
) -> None:
|
||||
"""
|
||||
Initialize the store.
|
||||
"""
|
||||
self.sync_client.create_tables()
|
||||
if self.pre_delete_collection:
|
||||
self.sync_client.delete_all()
|
||||
|
||||
@property
|
||||
def embeddings(self) -> Embeddings:
|
||||
return self.embedding
|
||||
|
||||
def drop_tables(self) -> None:
|
||||
self.sync_client.drop_table()
|
||||
|
||||
@classmethod
|
||||
def __from(
|
||||
cls,
|
||||
texts: List[str],
|
||||
embeddings: List[List[float]],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
service_url: Optional[str] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
num_dimensions = len(embeddings[0])
|
||||
|
||||
if ids is None:
|
||||
ids = [str(uuid.uuid1()) for _ in texts]
|
||||
|
||||
if not metadatas:
|
||||
metadatas = [{} for _ in texts]
|
||||
|
||||
if service_url is None:
|
||||
service_url = cls.get_service_url(kwargs)
|
||||
|
||||
store = cls(
|
||||
service_url=service_url,
|
||||
num_dimensions=num_dimensions,
|
||||
collection_name=collection_name,
|
||||
embedding=embedding,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
store.add_embeddings(
|
||||
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
|
||||
)
|
||||
|
||||
return store
|
||||
|
||||
@classmethod
|
||||
async def __afrom(
|
||||
cls,
|
||||
texts: List[str],
|
||||
embeddings: List[List[float]],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
service_url: Optional[str] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
num_dimensions = len(embeddings[0])
|
||||
|
||||
if ids is None:
|
||||
ids = [str(uuid.uuid1()) for _ in texts]
|
||||
|
||||
if not metadatas:
|
||||
metadatas = [{} for _ in texts]
|
||||
|
||||
if service_url is None:
|
||||
service_url = cls.get_service_url(kwargs)
|
||||
|
||||
store = cls(
|
||||
service_url=service_url,
|
||||
num_dimensions=num_dimensions,
|
||||
collection_name=collection_name,
|
||||
embedding=embedding,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
await store.aadd_embeddings(
|
||||
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
|
||||
)
|
||||
|
||||
return store
|
||||
|
||||
def add_embeddings(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
embeddings: List[List[float]],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
|
||||
"""Add embeddings to the vectorstore.
|
||||
|
||||
Args:
|
||||
texts: Iterable of strings to add to the vectorstore.
|
||||
embeddings: List of embedding vectors, one per text.
|
||||
metadatas: List of metadatas associated with the texts.
|
||||
kwargs: vectorstore specific parameters
|
||||
"""
|
||||
if ids is None:
|
||||
ids = [str(uuid.uuid1()) for _ in texts]
|
||||
|
||||
if not metadatas:
|
||||
metadatas = [{} for _ in texts]
|
||||
|
||||
records = list(zip(ids, metadatas, texts, embeddings))
|
||||
self.sync_client.upsert(records)
|
||||
|
||||
return ids
|
||||
|
||||
async def aadd_embeddings(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
embeddings: List[List[float]],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
|
||||
"""Add embeddings to the vectorstore.
|
||||
|
||||
Args:
|
||||
texts: Iterable of strings to add to the vectorstore.
|
||||
embeddings: List of embedding vectors, one per text.
|
||||
metadatas: List of metadatas associated with the texts.
|
||||
kwargs: vectorstore specific parameters
|
||||
"""
|
||||
if ids is None:
|
||||
ids = [str(uuid.uuid1()) for _ in texts]
|
||||
|
||||
if not metadatas:
|
||||
metadatas = [{} for _ in texts]
|
||||
|
||||
records = list(zip(ids, metadatas, texts, embeddings))
|
||||
await self.async_client.upsert(records)
|
||||
|
||||
return ids
|
||||
|
||||
def add_texts(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
|
||||
"""Run more texts through the embeddings and add to the vectorstore.
|
||||
|
||||
Args:
|
||||
texts: Iterable of strings to add to the vectorstore.
|
||||
metadatas: Optional list of metadatas associated with the texts.
|
||||
kwargs: vectorstore specific parameters
|
||||
|
||||
Returns:
|
||||
List of ids from adding the texts into the vectorstore.
|
||||
"""
|
||||
texts = list(texts)  # materialize once so a generator is not consumed twice
|
||||
embeddings = self.embedding.embed_documents(texts)
|
||||
return self.add_embeddings(
|
||||
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
|
||||
)
|
||||
|
||||
async def aadd_texts(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
|
||||
"""Run more texts through the embeddings and add to the vectorstore.
|
||||
|
||||
Args:
|
||||
texts: Iterable of strings to add to the vectorstore.
|
||||
metadatas: Optional list of metadatas associated with the texts.
|
||||
kwargs: vectorstore specific parameters
|
||||
|
||||
Returns:
|
||||
List of ids from adding the texts into the vectorstore.
|
||||
"""
|
||||
texts = list(texts)  # materialize once so a generator is not consumed twice
|
||||
embeddings = self.embedding.embed_documents(texts)
|
||||
return await self.aadd_embeddings(
|
||||
texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
|
||||
)
|
||||
|
||||
def similarity_search(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
"""Run similarity search with TimescaleVector.
|
||||
|
||||
Args:
|
||||
query (str): Query text to search for.
|
||||
k (int): Number of results to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query.
|
||||
"""
|
||||
embedding = self.embedding.embed_query(text=query)
|
||||
return self.similarity_search_by_vector(
|
||||
embedding=embedding,
|
||||
k=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
async def asimilarity_search(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
"""Run similarity search with TimescaleVector.
|
||||
|
||||
Args:
|
||||
query (str): Query text to search for.
|
||||
k (int): Number of results to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query.
|
||||
"""
|
||||
embedding = self.embedding.embed_query(text=query)
|
||||
return await self.asimilarity_search_by_vector(
|
||||
embedding=embedding,
|
||||
k=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def similarity_search_with_score(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Tuple[Document, float]]:
|
||||
"""Return docs most similar to query.
|
||||
|
||||
Args:
|
||||
query: Text to look up documents similar to.
|
||||
k: Number of Documents to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query and score for each
|
||||
"""
|
||||
embedding = self.embedding.embed_query(query)
|
||||
docs = self.similarity_search_with_score_by_vector(
|
||||
embedding=embedding,
|
||||
k=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
**kwargs,
|
||||
)
|
||||
return docs
|
||||
|
||||
async def asimilarity_search_with_score(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Tuple[Document, float]]:
|
||||
"""Return docs most similar to query.
|
||||
|
||||
Args:
|
||||
query: Text to look up documents similar to.
|
||||
k: Number of Documents to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query and score for each
|
||||
"""
|
||||
embedding = self.embedding.embed_query(query)
|
||||
return await self.asimilarity_search_with_score_by_vector(
|
||||
embedding=embedding,
|
||||
k=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def date_to_range_filter(self, **kwargs: Any) -> Any:
|
||||
constructor_args = {
|
||||
key: kwargs[key]
|
||||
for key in [
|
||||
"start_date",
|
||||
"end_date",
|
||||
"time_delta",
|
||||
"start_inclusive",
|
||||
"end_inclusive",
|
||||
]
|
||||
if key in kwargs
|
||||
}
|
||||
if not constructor_args:
|
||||
return None
|
||||
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import timescale_vector python package. "
|
||||
"Please install it with `pip install timescale-vector`."
|
||||
)
|
||||
return client.UUIDTimeRange(**constructor_args)
|
||||
|
||||
def similarity_search_with_score_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Tuple[Document, float]]:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import timescale_vector python package. "
|
||||
"Please install it with `pip install timescale-vector`."
|
||||
)
|
||||
|
||||
results = self.sync_client.search(
|
||||
embedding,
|
||||
limit=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
uuid_time_filter=self.date_to_range_filter(**kwargs),
|
||||
)
|
||||
|
||||
docs = [
|
||||
(
|
||||
Document(
|
||||
page_content=result[client.SEARCH_RESULT_CONTENTS_IDX],
|
||||
metadata=result[client.SEARCH_RESULT_METADATA_IDX],
|
||||
),
|
||||
result[client.SEARCH_RESULT_DISTANCE_IDX],
|
||||
)
|
||||
for result in results
|
||||
]
|
||||
return docs
|
||||
|
||||
async def asimilarity_search_with_score_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Tuple[Document, float]]:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import timescale_vector python package. "
|
||||
"Please install it with `pip install timescale-vector`."
|
||||
)
|
||||
|
||||
results = await self.async_client.search(
|
||||
embedding,
|
||||
limit=k,
|
||||
filter=filter,
|
||||
predicates=predicates,
|
||||
uuid_time_filter=self.date_to_range_filter(**kwargs),
|
||||
)
|
||||
|
||||
docs = [
|
||||
(
|
||||
Document(
|
||||
page_content=result[client.SEARCH_RESULT_CONTENTS_IDX],
|
||||
metadata=result[client.SEARCH_RESULT_METADATA_IDX],
|
||||
),
|
||||
result[client.SEARCH_RESULT_DISTANCE_IDX],
|
||||
)
|
||||
for result in results
|
||||
]
|
||||
return docs
|
||||
|
||||
def similarity_search_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
"""Return docs most similar to embedding vector.
|
||||
|
||||
Args:
|
||||
embedding: Embedding to look up documents similar to.
|
||||
k: Number of Documents to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query vector.
|
||||
"""
|
||||
docs_and_scores = self.similarity_search_with_score_by_vector(
|
||||
embedding=embedding, k=k, filter=filter, predicates=predicates, **kwargs
|
||||
)
|
||||
return [doc for doc, _ in docs_and_scores]
|
||||
|
||||
async def asimilarity_search_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
filter: Optional[Union[dict, list]] = None,
|
||||
predicates: Optional[Predicates] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
"""Return docs most similar to embedding vector.
|
||||
|
||||
Args:
|
||||
embedding: Embedding to look up documents similar to.
|
||||
k: Number of Documents to return. Defaults to 4.
|
||||
filter (Optional[Union[dict, list]]): Filter by metadata. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List of Documents most similar to the query vector.
|
||||
"""
|
||||
docs_and_scores = await self.asimilarity_search_with_score_by_vector(
|
||||
embedding=embedding, k=k, filter=filter, predicates=predicates, **kwargs
|
||||
)
|
||||
return [doc for doc, _ in docs_and_scores]
|
||||
|
||||
@classmethod
|
||||
def from_texts(
|
||||
cls: Type[TimescaleVector],
|
||||
texts: List[str],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
ids: Optional[List[str]] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
"""
|
||||
Return VectorStore initialized from texts and embeddings.
|
||||
A Postgres connection string is required:
|
||||
either pass it as the service_url parameter
|
||||
or set the TIMESCALE_SERVICE_URL environment variable.
|
||||
"""
|
||||
embeddings = embedding.embed_documents(list(texts))
|
||||
|
||||
return cls.__from(
|
||||
texts,
|
||||
embeddings,
|
||||
embedding,
|
||||
metadatas=metadatas,
|
||||
ids=ids,
|
||||
collection_name=collection_name,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
async def afrom_texts(
|
||||
cls: Type[TimescaleVector],
|
||||
texts: List[str],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
ids: Optional[List[str]] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
"""
|
||||
Return VectorStore initialized from texts and embeddings.
|
||||
A Postgres connection string is required:
|
||||
either pass it as the service_url parameter
|
||||
or set the TIMESCALE_SERVICE_URL environment variable.
|
||||
"""
|
||||
embeddings = embedding.embed_documents(list(texts))
|
||||
|
||||
return await cls.__afrom(
|
||||
texts,
|
||||
embeddings,
|
||||
embedding,
|
||||
metadatas=metadatas,
|
||||
ids=ids,
|
||||
collection_name=collection_name,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_embeddings(
|
||||
cls,
|
||||
text_embeddings: List[Tuple[str, List[float]]],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
ids: Optional[List[str]] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
"""Construct TimescaleVector wrapper from raw documents and pre-
|
||||
generated embeddings.
|
||||
|
||||
Return VectorStore initialized from documents and embeddings.
|
||||
A Postgres connection string is required:
|
||||
either pass it as the service_url parameter
|
||||
or set the TIMESCALE_SERVICE_URL environment variable.
|
||||
|
||||
Example:
|
||||
.. code-block:: python
|
||||
|
||||
from langchain.vectorstores import TimescaleVector
|
||||
from langchain.embeddings import OpenAIEmbeddings
|
||||
embeddings = OpenAIEmbeddings()
|
||||
text_embeddings = embeddings.embed_documents(texts)
|
||||
text_embedding_pairs = list(zip(texts, text_embeddings))
|
||||
tvs = TimescaleVector.from_embeddings(text_embedding_pairs, embeddings)
|
||||
"""
|
||||
texts = [t[0] for t in text_embeddings]
|
||||
embeddings = [t[1] for t in text_embeddings]
|
||||
|
||||
return cls.__from(
|
||||
texts,
|
||||
embeddings,
|
||||
embedding,
|
||||
metadatas=metadatas,
|
||||
ids=ids,
|
||||
collection_name=collection_name,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
async def afrom_embeddings(
|
||||
cls,
|
||||
text_embeddings: List[Tuple[str, List[float]]],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
ids: Optional[List[str]] = None,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
"""Construct TimescaleVector wrapper from raw documents and pre-
|
||||
generated embeddings.
|
||||
|
||||
Return VectorStore initialized from documents and embeddings.
|
||||
A Postgres connection string is required:
|
||||
either pass it as the service_url parameter
|
||||
or set the TIMESCALE_SERVICE_URL environment variable.
|
||||
|
||||
Example:
|
||||
.. code-block:: python
|
||||
|
||||
from langchain.vectorstores import TimescaleVector
|
||||
from langchain.embeddings import OpenAIEmbeddings
|
||||
embeddings = OpenAIEmbeddings()
|
||||
text_embeddings = embeddings.embed_documents(texts)
|
||||
text_embedding_pairs = list(zip(texts, text_embeddings))
|
||||
tvs = await TimescaleVector.afrom_embeddings(text_embedding_pairs, embeddings)
|
||||
"""
|
||||
texts = [t[0] for t in text_embeddings]
|
||||
embeddings = [t[1] for t in text_embeddings]
|
||||
|
||||
return await cls.__afrom(
|
||||
texts,
|
||||
embeddings,
|
||||
embedding,
|
||||
metadatas=metadatas,
|
||||
ids=ids,
|
||||
collection_name=collection_name,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_existing_index(
|
||||
cls: Type[TimescaleVector],
|
||||
embedding: Embeddings,
|
||||
collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
|
||||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
|
||||
pre_delete_collection: bool = False,
|
||||
**kwargs: Any,
|
||||
) -> TimescaleVector:
|
||||
"""
|
||||
Get an instance of an existing TimescaleVector store. This method will
|
||||
return the instance of the store without inserting any new
|
||||
embeddings
|
||||
"""
|
||||
|
||||
service_url = cls.get_service_url(kwargs)
|
||||
|
||||
store = cls(
|
||||
service_url=service_url,
|
||||
collection_name=collection_name,
|
||||
embedding=embedding,
|
||||
distance_strategy=distance_strategy,
|
||||
pre_delete_collection=pre_delete_collection,
|
||||
)
|
||||
|
||||
return store
|
||||
|
||||
@classmethod
|
||||
def get_service_url(cls, kwargs: Dict[str, Any]) -> str:
|
||||
service_url: str = get_from_dict_or_env(
|
||||
data=kwargs,
|
||||
key="service_url",
|
||||
env_key="TIMESCALE_SERVICE_URL",
|
||||
)
|
||||
|
||||
if not service_url:
|
||||
raise ValueError(
|
||||
"Postgres connection string is required. "
|
||||
"Either pass it as a parameter "
|
||||
"or set the TIMESCALE_SERVICE_URL environment variable."
|
||||
)
|
||||
|
||||
return service_url
|
||||
|
||||
@classmethod
|
||||
def service_url_from_db_params(
|
||||
cls,
|
||||
host: str,
|
||||
port: int,
|
||||
database: str,
|
||||
user: str,
|
||||
password: str,
|
||||
) -> str:
|
||||
"""Return connection string from database parameters."""
|
||||
return f"postgresql://{user}:{password}@{host}:{port}/{database}"
|
||||
|
||||
def _select_relevance_score_fn(self) -> Callable[[float], float]:
|
||||
"""
|
||||
The 'correct' relevance function
|
||||
may differ depending on a few things, including:
|
||||
- the distance / similarity metric used by the VectorStore
|
||||
- the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
|
||||
- embedding dimensionality
|
||||
- etc.
|
||||
"""
|
||||
if self.override_relevance_score_fn is not None:
|
||||
return self.override_relevance_score_fn
|
||||
|
||||
# Default strategy is to rely on distance strategy provided
|
||||
# in vectorstore constructor
|
||||
if self._distance_strategy == DistanceStrategy.COSINE:
|
||||
return self._cosine_relevance_score_fn
|
||||
elif self._distance_strategy == DistanceStrategy.EUCLIDEAN_DISTANCE:
|
||||
return self._euclidean_relevance_score_fn
|
||||
elif self._distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
|
||||
return self._max_inner_product_relevance_score_fn
|
||||
else:
|
||||
raise ValueError(
|
||||
"No supported normalization function"
|
||||
f" for distance_strategy of {self._distance_strategy}. "
|
||||
"Consider providing relevance_score_fn to TimescaleVector constructor."
|
||||
)
|
||||
|
||||
def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
|
||||
"""Delete by vector ID or other criteria.
|
||||
|
||||
Args:
|
||||
ids: List of ids to delete.
|
||||
**kwargs: Other keyword arguments that subclasses might use.
|
||||
|
||||
Returns:
|
||||
Optional[bool]: True if deletion is successful,
|
||||
False otherwise, None if not implemented.
|
||||
"""
|
||||
if ids is None:
|
||||
raise ValueError("No ids provided to delete.")
|
||||
|
||||
self.sync_client.delete_by_ids(ids)
|
||||
return True
|
||||
|
||||
# TODO: should this be part of delete()?
|
||||
def delete_by_metadata(
|
||||
self, filter: Union[Dict[str, str], List[Dict[str, str]]], **kwargs: Any
|
||||
) -> Optional[bool]:
|
||||
"""Delete by metadata filter.
|
||||
|
||||
Args:
|
||||
filter: Metadata filter (a dict, or a list of dicts) selecting the records to delete.
|
||||
**kwargs: Other keyword arguments that subclasses might use.
|
||||
|
||||
Returns:
|
||||
Optional[bool]: True if deletion is successful,
|
||||
False otherwise, None if not implemented.
|
||||
"""
|
||||
|
||||
self.sync_client.delete_by_metadata(filter)
|
||||
return True
|
||||
|
||||
class IndexType(str, enum.Enum):
|
||||
"""Enumerator for the supported Index types"""
|
||||
|
||||
TIMESCALE_VECTOR = "tsv"
|
||||
PGVECTOR_IVFFLAT = "ivfflat"
|
||||
PGVECTOR_HNSW = "hnsw"
|
||||
|
||||
DEFAULT_INDEX_TYPE = IndexType.TIMESCALE_VECTOR
|
||||
|
||||
def create_index(
|
||||
self, index_type: Union[IndexType, str] = DEFAULT_INDEX_TYPE, **kwargs: Any
|
||||
) -> None:
|
||||
try:
|
||||
from timescale_vector import client
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import timescale_vector python package. "
|
||||
"Please install it with `pip install timescale-vector`."
|
||||
)
|
||||
|
||||
index_type = (
|
||||
index_type.value if isinstance(index_type, self.IndexType) else index_type
|
||||
)
|
||||
if index_type == self.IndexType.PGVECTOR_IVFFLAT.value:
|
||||
self.sync_client.create_embedding_index(client.IvfflatIndex(**kwargs))
|
||||
|
||||
if index_type == self.IndexType.PGVECTOR_HNSW.value:
|
||||
self.sync_client.create_embedding_index(client.HNSWIndex(**kwargs))
|
||||
|
||||
if index_type == self.IndexType.TIMESCALE_VECTOR.value:
|
||||
self.sync_client.create_embedding_index(
|
||||
client.TimescaleVectorIndex(**kwargs)
|
||||
)
|
||||
|
||||
def drop_index(self) -> None:
|
||||
self.sync_client.drop_embedding_index()
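Review note on the default ids: when no `ids` are passed, the module above generates them with `uuid.uuid1()`, and the search methods forward `start_date` / `end_date` style keyword arguments as a UUID-based time range. The sketch below only illustrates why that pairing is plausible: a version-1 UUID embeds its creation time. `uuid1_to_datetime` is a hypothetical helper written for this note, not part of `timescale_vector`, and how the client filters internally is an assumption.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count 100 ns ticks from the Gregorian epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_to_datetime(value: uuid.UUID) -> datetime:
    """Hypothetical helper: recover the creation time stored in a version-1 UUID."""
    if value.version != 1:
        raise ValueError("only version-1 UUIDs carry a timestamp")
    # value.time is in 100 ns units; integer-divide by 10 to get microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


before = datetime.now(timezone.utc)
record_id = uuid.uuid1()  # same call the module uses for default ids
created_at = uuid1_to_datetime(record_id)
```

If this reading is right, callers who supply their own non-time-based ids should not expect the date-range arguments to be meaningful.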
|
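Review note on `service_url_from_db_params`: the method above interpolates `user` and `password` into the URL verbatim, so a password containing `@`, `:` or `/` would likely produce a malformed connection string. A minimal sketch of one way to guard against that, assuming percent-encoding of the credentials is acceptable to the consumer of the URL; `build_service_url` is a hypothetical name, not an existing LangChain or Timescale function.

```python
from urllib.parse import quote, urlsplit


def build_service_url(
    host: str, port: int, database: str, user: str, password: str
) -> str:
    """Hypothetical variant that percent-encodes the credentials."""
    # safe="" also encodes "/", which would otherwise end the authority part.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


url = build_service_url("localhost", 5432, "postgres", "postgres", "p@ss:w/rd")
parts = urlsplit(url)  # host and port stay parseable despite the odd password
```

Whether the encoded form is decoded correctly downstream depends on the driver, so this is worth checking against the client before adopting.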
611
libs/langchain/poetry.lock
generated
File diff suppressed because it is too large
@ -129,6 +129,7 @@ markdownify = {version = "^0.11.6", optional = true}
|
||||
assemblyai = {version = "^0.17.0", optional = true}
|
||||
dashvector = {version = "^1.0.1", optional = true}
|
||||
sqlite-vss = {version = "^0.1.2", optional = true}
|
||||
timescale-vector = {version = "^0.0.1", optional = true}
|
||||
|
||||
|
||||
[tool.poetry.group.test.dependencies]
|
||||
@ -345,6 +346,7 @@ extended_testing = [
|
||||
"markdownify",
|
||||
"dashvector",
|
||||
"sqlite-vss",
|
||||
"timescale-vector",
|
||||
]
|
||||
|
||||
[tool.ruff]
|
||||
|
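Review note on the optional dependency: because `timescale-vector` is declared as `optional = true` in the hunk above, the vector store imports it lazily and re-raises `ImportError` with an install hint, repeating the same try/except in several methods. A small sketch of how that guard could be factored out, assuming a shared helper is wanted; `import_optional` is a hypothetical name, not something this commit defines.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Hypothetical helper: import an optional package or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays visible in tracebacks.
        raise ImportError(
            f"Could not import {module_name} python package. "
            f"Please install it with `pip install {pip_name}`."
        ) from exc
```

A caller would then write something like `client = import_optional("timescale_vector.client", "timescale-vector")` at the point of use.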
@ -0,0 +1,433 @@
|
||||
"""Test TimescaleVector functionality."""
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
from typing import List
|
||||
|
||||
import pytest
|
||||
|
||||
from langchain.docstore.document import Document
|
||||
from langchain.vectorstores.timescalevector import TimescaleVector
|
||||
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
|
||||
|
||||
SERVICE_URL = TimescaleVector.service_url_from_db_params(
|
||||
host=os.environ.get("TEST_TIMESCALE_HOST", "localhost"),
|
||||
port=int(os.environ.get("TEST_TIMESCALE_PORT", "5432")),
|
||||
database=os.environ.get("TEST_TIMESCALE_DATABASE", "postgres"),
|
||||
user=os.environ.get("TEST_TIMESCALE_USER", "postgres"),
|
||||
password=os.environ.get("TEST_TIMESCALE_PASSWORD", "postgres"),
|
||||
)
|
||||
|
||||
|
||||
ADA_TOKEN_COUNT = 1536
|
||||
|
||||
|
||||
class FakeEmbeddingsWithAdaDimension(FakeEmbeddings):
|
||||
"""Fake embeddings functionality for testing."""
|
||||
|
||||
def embed_documents(self, texts: List[str]) -> List[List[float]]:
|
||||
"""Return simple embeddings."""
|
||||
return [
|
||||
[float(1.0)] * (ADA_TOKEN_COUNT - 1) + [float(i)] for i in range(len(texts))
|
||||
]
|
||||
|
||||
def embed_query(self, text: str) -> List[float]:
|
||||
"""Return simple embeddings."""
|
||||
return [float(1.0)] * (ADA_TOKEN_COUNT - 1) + [float(0.0)]
|
||||
|
||||
|
||||
def test_timescalevector() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo")]
|
||||
|
||||
|
||||
def test_timescalevector_from_documents() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
|
||||
docsearch = TimescaleVector.from_documents(
|
||||
documents=docs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo", metadata={"a": "b"})]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timescalevector_afrom_documents() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
|
||||
docsearch = await TimescaleVector.afrom_documents(
|
||||
documents=docs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = await docsearch.asimilarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo", metadata={"a": "b"})]
|
||||
|
||||
|
||||
def test_timescalevector_embeddings() -> None:
|
||||
"""Test end to end construction with embeddings and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
text_embeddings = FakeEmbeddingsWithAdaDimension().embed_documents(texts)
|
||||
text_embedding_pairs = list(zip(texts, text_embeddings))
|
||||
docsearch = TimescaleVector.from_embeddings(
|
||||
text_embeddings=text_embedding_pairs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo")]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timescalevector_aembeddings() -> None:
|
||||
"""Test end to end construction with embeddings and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
text_embeddings = FakeEmbeddingsWithAdaDimension().embed_documents(texts)
|
||||
text_embedding_pairs = list(zip(texts, text_embeddings))
|
||||
docsearch = await TimescaleVector.afrom_embeddings(
|
||||
text_embeddings=text_embedding_pairs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = await docsearch.asimilarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo")]
|
||||
|
||||
|
||||
def test_timescalevector_with_metadatas() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search("foo", k=1)
|
||||
assert output == [Document(page_content="foo", metadata={"page": "0"})]
|
||||
|
||||
|
||||
def test_timescalevector_with_metadatas_with_scores() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search_with_score("foo", k=1)
|
||||
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timescalevector_awith_metadatas_with_scores() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = await TimescaleVector.afrom_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = await docsearch.asimilarity_search_with_score("foo", k=1)
|
||||
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
|
||||
|
||||
|
||||
def test_timescalevector_with_filter_match() -> None:
|
||||
"""Test end to end construction and search."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection_filter",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "0"})
|
||||
assert output == [(Document(page_content="foo", metadata={"page": "0"}), 0.0)]
|
||||
|
||||
|
||||
def test_timescalevector_with_filter_distant_match() -> None:
|
||||
"""Test search with a metadata filter that matches only a more distant document."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection_filter",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "2"})
|
||||
assert output == [
|
||||
(Document(page_content="baz", metadata={"page": "2"}), 0.0013003906671379406)
|
||||
]
|
||||
|
||||
|
||||
def test_timescalevector_with_filter_no_match() -> None:
|
||||
"""Test search with a metadata filter that matches no documents."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection_filter",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search_with_score("foo", k=1, filter={"page": "5"})
|
||||
assert output == []
|
||||
|
||||
|
||||
def test_timescalevector_with_filter_in_set() -> None:
|
||||
"""Test search with a list of metadata filters, any of which may match."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection_filter",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
output = docsearch.similarity_search_with_score(
|
||||
"foo", k=2, filter=[{"page": "0"}, {"page": "2"}]
|
||||
)
|
||||
assert output == [
|
||||
(Document(page_content="foo", metadata={"page": "0"}), 0.0),
|
||||
(Document(page_content="baz", metadata={"page": "2"}), 0.0013003906671379406),
|
||||
]
|
||||
|
||||
|
||||
def test_timescalevector_relevance_score() -> None:
|
||||
"""Test to make sure the relevance score is scaled to 0-1."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
|
||||
output = docsearch.similarity_search_with_relevance_scores("foo", k=3)
|
||||
assert output == [
|
||||
(Document(page_content="foo", metadata={"page": "0"}), 1.0),
|
||||
(Document(page_content="bar", metadata={"page": "1"}), 0.9996744261675065),
|
||||
(Document(page_content="baz", metadata={"page": "2"}), 0.9986996093328621),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_timescalevector_relevance_score_async() -> None:
|
||||
"""Test to make sure the relevance score is scaled to 0-1."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = await TimescaleVector.afrom_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
|
||||
output = await docsearch.asimilarity_search_with_relevance_scores("foo", k=3)
|
||||
assert output == [
|
||||
(Document(page_content="foo", metadata={"page": "0"}), 1.0),
|
||||
(Document(page_content="bar", metadata={"page": "1"}), 0.9996744261675065),
|
||||
(Document(page_content="baz", metadata={"page": "2"}), 0.9986996093328621),
|
||||
]
|
||||
|
||||
|
||||
def test_timescalevector_retriever_search_threshold() -> None:
|
||||
"""Test using retriever for searching with threshold."""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
|
||||
retriever = docsearch.as_retriever(
|
||||
search_type="similarity_score_threshold",
|
||||
search_kwargs={"k": 3, "score_threshold": 0.999},
|
||||
)
|
||||
output = retriever.get_relevant_documents("summer")
|
||||
assert output == [
|
||||
Document(page_content="foo", metadata={"page": "0"}),
|
||||
Document(page_content="bar", metadata={"page": "1"}),
|
||||
]
|
||||
|
||||
|
||||
def test_timescalevector_retriever_search_threshold_custom_normalization_fn() -> None:
|
||||
"""Test searching with threshold and custom normalization function"""
|
||||
texts = ["foo", "bar", "baz"]
|
||||
metadatas = [{"page": str(i)} for i in range(len(texts))]
|
||||
docsearch = TimescaleVector.from_texts(
|
||||
texts=texts,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
metadatas=metadatas,
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
relevance_score_fn=lambda d: d * 0,
|
||||
)
|
||||
|
||||
retriever = docsearch.as_retriever(
|
||||
search_type="similarity_score_threshold",
|
||||
search_kwargs={"k": 3, "score_threshold": 0.5},
|
||||
)
|
||||
output = retriever.get_relevant_documents("foo")
|
||||
assert output == []
|
||||
|
||||
|
||||
def test_timescalevector_delete() -> None:
|
||||
"""Test deleting functionality."""
|
||||
texts = ["bar", "baz"]
|
||||
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
|
||||
docsearch = TimescaleVector.from_documents(
|
||||
documents=docs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
texts = ["foo"]
|
||||
meta = [{"b": "c"}]
|
||||
ids = docsearch.add_texts(texts, meta)
|
||||
|
||||
output = docsearch.similarity_search("bar", k=10)
|
||||
assert len(output) == 3
|
||||
docsearch.delete(ids)
|
||||
|
||||
output = docsearch.similarity_search("bar", k=10)
|
||||
assert len(output) == 2
|
||||
|
||||
docsearch.delete_by_metadata({"a": "b"})
|
||||
output = docsearch.similarity_search("bar", k=10)
|
||||
assert len(output) == 0
|
||||
|
||||
|
||||
def test_timescalevector_with_index() -> None:
|
||||
"""Test creating and dropping each supported index type."""
|
||||
texts = ["bar", "baz"]
|
||||
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
|
||||
docsearch = TimescaleVector.from_documents(
|
||||
documents=docs,
|
||||
collection_name="test_collection",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
)
|
||||
texts = ["foo"]
|
||||
meta = [{"b": "c"}]
|
||||
docsearch.add_texts(texts, meta)
|
||||
|
||||
docsearch.create_index()
|
||||
|
||||
output = docsearch.similarity_search("bar", k=10)
|
||||
assert len(output) == 3
|
||||
|
||||
docsearch.drop_index()
|
||||
docsearch.create_index(
|
||||
index_type=TimescaleVector.IndexType.TIMESCALE_VECTOR,
|
||||
max_alpha=1.0,
|
||||
num_neighbors=50,
|
||||
)
|
||||
|
||||
docsearch.drop_index()
|
||||
docsearch.create_index("tsv", max_alpha=1.0, num_neighbors=50)
|
||||
|
||||
docsearch.drop_index()
|
||||
docsearch.create_index("ivfflat", num_lists=20, num_records=1000)
|
||||
|
||||
docsearch.drop_index()
|
||||
docsearch.create_index("hnsw", m=16, ef_construction=64)
|
||||
|
||||
|
||||
def test_timescalevector_time_partitioning() -> None:
|
||||
"""Test time-partitioned collections and time-range filtered search."""
|
||||
from timescale_vector import client
|
||||
|
||||
texts = ["bar", "baz"]
|
||||
docs = [Document(page_content=t, metadata={"a": "b"}) for t in texts]
|
||||
docsearch = TimescaleVector.from_documents(
|
||||
documents=docs,
|
||||
collection_name="test_collection_time_partitioning",
|
||||
embedding=FakeEmbeddingsWithAdaDimension(),
|
||||
service_url=SERVICE_URL,
|
||||
pre_delete_collection=True,
|
||||
time_partition_interval=timedelta(hours=1),
|
||||
)
|
||||
texts = ["foo"]
|
||||
meta = [{"b": "c"}]
|
||||
|
||||
ids = [client.uuid_from_time(datetime.now() - timedelta(hours=3))]
|
||||
docsearch.add_texts(texts, meta, ids)
|
||||
|
||||
output = docsearch.similarity_search("bar", k=10)
|
||||
assert len(output) == 3
|
||||
|
||||
output = docsearch.similarity_search(
|
||||
"bar", k=10, start_date=datetime.now() - timedelta(hours=1)
|
||||
)
|
||||
assert len(output) == 2
|
||||
|
||||
output = docsearch.similarity_search(
|
||||
"bar", k=10, end_date=datetime.now() - timedelta(hours=1)
|
||||
)
|
||||
assert len(output) == 1
|
||||
|
||||
output = docsearch.similarity_search(
|
||||
"bar", k=10, start_date=datetime.now() - timedelta(minutes=200)
|
||||
)
|
||||
assert len(output) == 3
|
||||
|
||||
output = docsearch.similarity_search(
|
||||
"bar",
|
||||
k=10,
|
||||
start_date=datetime.now() - timedelta(minutes=200),
|
||||
time_delta=timedelta(hours=1),
|
||||
)
|
||||
assert len(output) == 1
|
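The time-range assertions in `test_timescalevector_time_partitioning` can be reasoned about without a database. The following is a minimal sketch, assuming that `start_date` and `end_date` act as inclusive bounds and that `time_delta` supplies whichever bound is missing. `in_window` and `count_in_window` are hypothetical helpers written for illustration; they are not part of `TimescaleVector` or the `timescale_vector` client, and the inclusive-bound behavior is an assumption rather than something the tests confirm.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def in_window(
    ts: datetime,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
    time_delta: Optional[timedelta] = None,
) -> bool:
    """Return True if ts falls inside the requested time range.

    Hypothetical helper: inclusive bounds are an assumption.
    """
    # A time_delta combined with a single bound supplies the missing bound.
    if time_delta is not None:
        if start_date is not None and end_date is None:
            end_date = start_date + time_delta
        elif end_date is not None and start_date is None:
            start_date = end_date - time_delta
    if start_date is not None and ts < start_date:
        return False
    if end_date is not None and ts > end_date:
        return False
    return True


def count_in_window(stamps: List[datetime], **kwargs) -> int:
    """Count the timestamps that pass the time filter."""
    return sum(in_window(ts, **kwargs) for ts in stamps)


# Mirror the test data: two documents stamped "now", one stamped three hours ago.
now = datetime(2023, 9, 1, 12, 0, 0)  # fixed reference time for reproducibility
stamps = [now, now, now - timedelta(hours=3)]
```

Under these assumptions, a `start_date` one hour back keeps the two recent documents, an `end_date` one hour back keeps only the older one, and a `start_date` 200 minutes back combined with a one-hour `time_delta` covers only the document from three hours ago.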
@ -0,0 +1,97 @@
|
||||
from typing import Dict, Tuple
|
||||
|
||||
import pytest
|
||||
|
||||
from langchain.chains.query_constructor.ir import (
|
||||
Comparator,
|
||||
Comparison,
|
||||
Operation,
|
||||
Operator,
|
||||
StructuredQuery,
|
||||
)
|
||||
from langchain.retrievers.self_query.timescalevector import TimescaleVectorTranslator
|
||||
|
||||
DEFAULT_TRANSLATOR = TimescaleVectorTranslator()
|
||||
|
||||
|
||||
@pytest.mark.requires("timescale_vector")
|
||||
def test_visit_comparison() -> None:
|
||||
from timescale_vector import client
|
||||
|
||||
comp = Comparison(comparator=Comparator.LT, attribute="foo", value=1)
|
||||
expected = client.Predicates(("foo", "<", 1))
|
||||
actual = DEFAULT_TRANSLATOR.visit_comparison(comp)
|
||||
assert expected == actual
|
||||
|
||||
|
||||
@pytest.mark.requires("timescale_vector")
|
||||
def test_visit_operation() -> None:
|
||||
from timescale_vector import client
|
||||
|
||||
op = Operation(
|
||||
operator=Operator.AND,
|
||||
arguments=[
|
||||
Comparison(comparator=Comparator.LT, attribute="foo", value=2),
|
||||
Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
|
||||
Comparison(comparator=Comparator.GT, attribute="abc", value=2.0),
|
||||
],
|
||||
)
|
||||
expected = client.Predicates(
|
||||
client.Predicates(("foo", "<", 2)),
|
||||
client.Predicates(("bar", "==", "baz")),
|
||||
client.Predicates(("abc", ">", 2.0)),
|
||||
)
|
||||
|
||||
actual = DEFAULT_TRANSLATOR.visit_operation(op)
|
||||
assert expected == actual
|
||||
|
||||
|
||||
@pytest.mark.requires("timescale_vector")
|
||||
def test_visit_structured_query() -> None:
|
||||
from timescale_vector import client
|
||||
|
||||
query = "What is the capital of France?"
|
||||
structured_query = StructuredQuery(
|
||||
query=query,
|
||||
filter=None,
|
||||
)
|
||||
expected: Tuple[str, Dict] = (query, {})
|
||||
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
|
||||
assert expected == actual
|
||||
|
||||
comp = Comparison(comparator=Comparator.LT, attribute="foo", value=1)
|
||||
expected = (
|
||||
query,
|
||||
{"predicates": client.Predicates(("foo", "<", 1))},
|
||||
)
|
||||
structured_query = StructuredQuery(
|
||||
query=query,
|
||||
filter=comp,
|
||||
)
|
||||
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
|
||||
assert expected == actual
|
||||
|
||||
op = Operation(
|
||||
operator=Operator.AND,
|
||||
arguments=[
|
||||
Comparison(comparator=Comparator.LT, attribute="foo", value=2),
|
||||
Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
|
||||
Comparison(comparator=Comparator.GT, attribute="abc", value=2.0),
|
||||
],
|
||||
)
|
||||
structured_query = StructuredQuery(
|
||||
query=query,
|
||||
filter=op,
|
||||
)
|
||||
expected = (
|
||||
query,
|
||||
{
|
||||
"predicates": client.Predicates(
|
||||
client.Predicates(("foo", "<", 2)),
|
||||
client.Predicates(("bar", "==", "baz")),
|
||||
client.Predicates(("abc", ">", 2.0)),
|
||||
)
|
||||
},
|
||||
)
|
||||
actual = DEFAULT_TRANSLATOR.visit_structured_query(structured_query)
|
||||
assert expected == actual
|
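The translator tests above expect each comparison to become an `(attribute, operator, value)` triple, with `LT`, `EQ`, and `GT` rendered as `<`, `==`, and `>`. The following is a minimal sketch of that mapping using stand-in types only. `Cmp`, `SimpleComparison`, `OPERATOR_SYMBOLS`, and `to_predicate_tuple` are hypothetical names introduced for illustration; they are not LangChain's `Comparator` or `Comparison`, and the result is a plain tuple rather than a `client.Predicates` object.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple, Union

Value = Union[int, float, str]


class Cmp(Enum):
    """Stand-in for the comparators exercised in the tests (hypothetical)."""

    LT = "lt"
    EQ = "eq"
    GT = "gt"


# Operator strings taken from the expected values in the tests above.
OPERATOR_SYMBOLS = {Cmp.LT: "<", Cmp.EQ: "==", Cmp.GT: ">"}


@dataclass(frozen=True)
class SimpleComparison:
    """Stand-in for a structured-query comparison (hypothetical)."""

    comparator: Cmp
    attribute: str
    value: Value


def to_predicate_tuple(comp: SimpleComparison) -> Tuple[str, str, Value]:
    """Turn a comparison into an (attribute, operator, value) triple."""
    return (comp.attribute, OPERATOR_SYMBOLS[comp.comparator], comp.value)
```

An `AND` operation would then correspond to combining several such triples, which the tests express by nesting `client.Predicates` instances.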