Add RAG template for Timescale Vector (#12651)

--------- Co-authored-by: Matvey Arye <mat@timescale.com>
2025-08-31 18:38:48 +00:00 · 2023-10-31 09:56:29 -07:00
parent 14e8c74736
commit da94c750c5
9 changed files with 1774 additions and 2 deletions
--- a/docs/docs/integrations/vectorstores/timescalevector.ipynb
+++ b/docs/docs/integrations/vectorstores/timescalevector.ipynb
@@ -10,7 +10,7 @@
    "This notebook shows how to use the Postgres vector database `Timescale Vector`. You'll learn how to use TimescaleVector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) how to create indexes to speed up queries.\n",
    "\n",
    "## What is Timescale Vector?\n",
-    "**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**\n",
+    "**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**\n",
    "\n",
    "Timescale Vector enables you to efficiently store and query millions of vector embeddings in `PostgreSQL`.\n",
    "- Enhances `pgvector` with faster and more accurate similarity search on 100M+ vectors via `DiskANN` inspired indexing algorithm.\n",
@@ -23,7 +23,7 @@
    "- Enables a worry-free experience with enterprise-grade security and compliance.\n",
    "\n",
    "## How to access Timescale Vector\n",
-    "Timescale Vector is available on [Timescale](https://www.timescale.com/ai), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
+    "Timescale Vector is available on [Timescale](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
    "\n",
    "LangChain users get a 90-day free trial for Timescale Vector.\n",
    "- To get started, [signup](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!\n",
--- a/templates/rag-timescale-hybrid-search-time/LICENSE
+++ b/templates/rag-timescale-hybrid-search-time/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 LangChain, Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/templates/rag-timescale-hybrid-search-time/README.md
+++ b/templates/rag-timescale-hybrid-search-time/README.md
@@ -0,0 +1,63 @@
+# RAG with Timescale Vector using hybrid search
+
+This template shows how to use timescale-vector with the self-query retriever to perform hybrid search on similarity and time.
+This is useful any time your data has a strong time-based component. Some examples of such data are:
+- News articles (politics, business, etc)
+- Blog posts, documentation or other published material (public or private).
+- Social media posts
+- Changelogs of any kind
+- Messages
+
+Such items are often searched by both similarity and time. For example: Show me all news about Toyota trucks from 2022.
+
+[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) provides superior performance when searching for embeddings within a particular
+timeframe by leveraging automatic table partitioning to isolate data for particular time ranges.
+
+LangChain's self-query retriever can deduce time ranges (as well as other search criteria) from the text of user queries.
+
+## What is Timescale Vector?
+**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**
+
+Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.
+- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
+- Enables fast time-based vector search via automatic time-based partitioning and indexing.
+- Provides a familiar SQL interface for querying vector embeddings and relational data.
+
+Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:
+- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
+- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
+- Enables a worry-free experience with enterprise-grade security and compliance.
+
+### How to access Timescale Vector
+Timescale Vector is available on [Timescale](https://www.timescale.com/products?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)
+
+- LangChain users get a 90-day free trial for Timescale Vector.
+- To get started, [sign up](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for Timescale, create a new database, and follow this template.
+- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python.
+
+### Using Timescale Vector with this template
+
+This template uses TimescaleVector as a vectorstore and requires that `TIMESCALE_SERVICE_URL` is set.
+
+## LLM
+
+Be sure that `OPENAI_API_KEY` is set in order to use the OpenAI models.
+
+## Loading sample data
+
+We have provided a sample dataset you can use for demoing this template. It consists of the git history of the timescale project.
+
+To load this dataset, set the `LOAD_SAMPLE_DATA` environment variable.
+
+## Loading your own dataset
+
+To load your own dataset you will have to modify the code in the `DATASET SPECIFIC CODE` section of `chain.py`.
+This code defines the name of the collection, how to load the data, and the human-language description of both the
+contents of the collection and all of the metadata. The human-language descriptions are used by the self-query retriever
+to help the LLM convert the question into filters on the metadata when searching the data in Timescale Vector.
+
+## Using in your own applications
+
+This is a standard LangServe template. Instructions on how to use it with your LangServe applications are [here](https://github.com/langchain-ai/langchain/blob/master/templates/README.md).
+
+
--- a/templates/rag-timescale-hybrid-search-time/poetry.lock
+++ b/templates/rag-timescale-hybrid-search-time/poetry.lock
--- a/templates/rag-timescale-hybrid-search-time/pyproject.toml
+++ b/templates/rag-timescale-hybrid-search-time/pyproject.toml
@@ -0,0 +1,32 @@
+[tool.poetry]
+name = "rag_timescale_hybrid_search_time"
+version = "0.0.1"
+description = ""
+authors = []
+readme = "README.md"
+
+[tool.poetry.dependencies]
+python = ">=3.8.1,<4.0"
+langchain = ">=0.0.313, <0.1"
+openai = "^0.28.1"
+fastapi = "^0.104.0"
+sse-starlette = "^1.6.5"
+
+[tool.poetry.group.dev.dependencies]
+langc = {git = "https://github.com/pingpong-templates/cli"}
+poethepoet = "^0.24.1"
+
+[tool.langserve]
+export_module = "rag_timescale_hybrid_search_time.chain"
+export_attr = "chain"
+
+[tool.poe.tasks.start]
+cmd="poetry run uvicorn langc.dev_scripts:create_demo_server --reload --port $port --host $host"
+args = [
+    {name = "port", help = "port to run on", default = "8000"},
+    {name = "host", help = "host to run on", default = "127.0.0.1"}
+]
+
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
--- a/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/__init__.py
+++ b/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/__init__.py
@@ -0,0 +1,3 @@
+from rag_timescale_hybrid_search_time import chain
+
+__all__ = ["chain"]
--- a/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/chain.py
+++ b/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/chain.py
@@ -0,0 +1,112 @@
+# ruff: noqa: E501
+
+import os
+from datetime import timedelta
+
+from langchain.chains.query_constructor.base import AttributeInfo
+from langchain.chat_models import ChatOpenAI
+from langchain.embeddings.openai import OpenAIEmbeddings
+from langchain.llms import OpenAI
+from langchain.prompts import ChatPromptTemplate
+from langchain.retrievers.self_query.base import SelfQueryRetriever
+from langchain.schema.output_parser import StrOutputParser
+from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
+from langchain.vectorstores.timescalevector import TimescaleVector
+from pydantic import BaseModel
+
+from .load_sample_dataset import load_ts_git_dataset
+
+# to enable debug uncomment the following lines:
+# from langchain.globals import set_debug
+# set_debug(True)
+
+# from dotenv import find_dotenv, load_dotenv
+# _ = load_dotenv(find_dotenv())
+
+if os.environ.get("TIMESCALE_SERVICE_URL", None) is None:
+    raise Exception("Missing `TIMESCALE_SERVICE_URL` environment variable.")
+
+SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
+LOAD_SAMPLE_DATA = os.environ.get("LOAD_SAMPLE_DATA", False)
+
+
+# DATASET SPECIFIC CODE
+# Load the sample dataset. You will have to change this to load your own dataset.
+collection_name = "timescale_commits"
+partition_interval = timedelta(days=7)
+if LOAD_SAMPLE_DATA:
+    load_ts_git_dataset(
+        SERVICE_URL,
+        collection_name=collection_name,
+        num_records=500,
+        partition_interval=partition_interval,
+    )
+
+# This will change depending on the metadata stored in your dataset.
+document_content_description = "The git log commit summary containing the commit hash, author, date of commit, change summary and change details"
+metadata_field_info = [
+    AttributeInfo(
+        name="id",
+        description="A UUID v1 generated from the date of the commit",
+        type="uuid",
+    ),
+    AttributeInfo(
+        # This is a special attribute representing the timestamp of the UUID.
+        name="__uuid_timestamp",
+        description="The timestamp of the commit. Specify in YYYY-MM-DDTHH:MM:SSZ format",
+        type="datetime.datetime",
+    ),
+    AttributeInfo(
+        name="author_name",
+        description="The name of the author of the commit",
+        type="string",
+    ),
+    AttributeInfo(
+        name="author_email",
+        description="The email address of the author of the commit",
+        type="string",
+    ),
+]
+# END DATASET SPECIFIC CODE
+
+embeddings = OpenAIEmbeddings()
+vectorstore = TimescaleVector(
+    embedding=embeddings,
+    collection_name=collection_name,
+    service_url=SERVICE_URL,
+    time_partition_interval=partition_interval,
+)
+
+llm = OpenAI(temperature=0)
+retriever = SelfQueryRetriever.from_llm(
+    llm,
+    vectorstore,
+    document_content_description,
+    metadata_field_info,
+    enable_limit=True,
+    verbose=True,
+)
+
+template = """Answer the question based only on the following context:
+{context}
+
+Question: {question}
+"""
+prompt = ChatPromptTemplate.from_template(template)
+
+model = ChatOpenAI(temperature=0, model="gpt-4")
+
+# RAG chain
+chain = (
+    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
+    | prompt
+    | model
+    | StrOutputParser()
+)
+
+
+class Question(BaseModel):
+    __root__: str
+
+
+chain = chain.with_types(input_type=Question)
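
A note on the `LOAD_SAMPLE_DATA` check in `chain.py`: `os.environ.get("LOAD_SAMPLE_DATA", False)` returns the raw string, so any non-empty value (including `"false"` or `"0"`) enables sample loading. If stricter parsing is wanted, something like the following could work. This is only a sketch; `env_flag` and the accepted value set are hypothetical and not part of this commit.

```python
import os

# Hypothetical set of strings treated as "enabled"; adjust to taste.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: interpret an environment variable as a boolean."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


# Demonstration with a value that the current check treats as enabled.
os.environ["LOAD_SAMPLE_DATA"] = "false"
current_behaviour = bool(os.environ.get("LOAD_SAMPLE_DATA", False))
stricter_behaviour = env_flag("LOAD_SAMPLE_DATA")
print(current_behaviour)   # True, because the string is non-empty
print(stricter_behaviour)  # False
```

With the current code, the simplest way to disable loading is to leave the variable unset.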
--- a/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/load_sample_dataset.py
+++ b/templates/rag-timescale-hybrid-search-time/rag_timescale_hybrid_search_time/load_sample_dataset.py
@@ -0,0 +1,84 @@
+import os
+import tempfile
+from datetime import datetime, timedelta
+
+import requests
+from langchain.document_loaders import JSONLoader
+from langchain.embeddings.openai import OpenAIEmbeddings
+from langchain.text_splitter import CharacterTextSplitter
+from langchain.vectorstores.timescalevector import TimescaleVector
+from timescale_vector import client
+
+
+def parse_date(date_string: str) -> datetime:
+    if date_string is None:
+        return None
+    time_format = "%a %b %d %H:%M:%S %Y %z"
+    return datetime.strptime(date_string, time_format)
+
+
+def extract_metadata(record: dict, metadata: dict) -> dict:
+    dt = parse_date(record["date"])
+    metadata["id"] = str(client.uuid_from_time(dt))
+    if dt is not None:
+        metadata["date"] = dt.isoformat()
+    else:
+        metadata["date"] = None
+    metadata["author"] = record["author"]
+    metadata["commit_hash"] = record["commit"]
+    return metadata
+
+
+def load_ts_git_dataset(
+    service_url,
+    collection_name="timescale_commits",
+    num_records: int = 500,
+    partition_interval=timedelta(days=7),
+):
+    json_url = "https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
+    tmp_file = "ts_git_log.json"
+
+    temp_dir = tempfile.gettempdir()
+    json_file_path = os.path.join(temp_dir, tmp_file)
+
+    if not os.path.exists(json_file_path):
+        response = requests.get(json_url)
+        if response.status_code == 200:
+            with open(json_file_path, "w") as json_file:
+                json_file.write(response.text)
+        else:
+            print(f"Failed to download JSON file. Status code: {response.status_code}")
+
+    loader = JSONLoader(
+        file_path=json_file_path,
+        jq_schema=".commit_history[]",
+        text_content=False,
+        metadata_func=extract_metadata,
+    )
+
+    documents = loader.load()
+
+    # Remove documents with None dates
+    documents = [doc for doc in documents if doc.metadata["date"] is not None]
+
+    if num_records > 0:
+        documents = documents[:num_records]
+
+    # Split the documents into chunks for embedding
+    text_splitter = CharacterTextSplitter(
+        chunk_size=1000,
+        chunk_overlap=200,
+    )
+    docs = text_splitter.split_documents(documents)
+
+    embeddings = OpenAIEmbeddings()
+
+    # Create a Timescale Vector instance from the collection of documents
+    TimescaleVector.from_documents(
+        embedding=embeddings,
+        ids=[doc.metadata["id"] for doc in docs],
+        documents=docs,
+        collection_name=collection_name,
+        service_url=service_url,
+        time_partition_interval=partition_interval,
+    )
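
For reference, the format string used by `parse_date` matches git's default date layout. The sketch below shows how such a string parses and what `extract_metadata` would then store under `date`. The sample value is made up for illustration, and it assumes the dataset's dates follow this layout and that an English/C locale is active (since `%a` and `%b` are locale-dependent).

```python
from datetime import datetime, timezone

# Same format string as parse_date in load_sample_dataset.py.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y %z"

# Made-up sample in git's default date layout (not taken from the dataset).
sample = "Tue Sep 5 21:03:21 2023 +0530"

dt = datetime.strptime(sample, TIME_FORMAT)
print(dt.isoformat())   # 2023-09-05T21:03:21+05:30

# The parsed value is timezone-aware, so it converts cleanly to UTC.
utc = dt.astimezone(timezone.utc)
print(utc.isoformat())  # 2023-09-05T15:33:21+00:00
```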
--- a/templates/rag-timescale-hybrid-search-time/tests/__init__.py
+++ b/templates/rag-timescale-hybrid-search-time/tests/__init__.py