mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-03 12:07:36 +00:00
Add RAG template for Timescale Vector (#12651)
Co-authored-by: Matvey Arye <mat@timescale.com>
@@ -10,7 +10,7 @@
 "This notebook shows how to use the Postgres vector database `Timescale Vector`. You'll learn how to use TimescaleVector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) how to create indexes to speed up queries.\n",
 "\n",
 "## What is Timescale Vector?\n",
-"**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**\n",
+"**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**\n",
 "\n",
 "Timescale Vector enables you to efficiently store and query millions of vector embeddings in `PostgreSQL`.\n",
 "- Enhances `pgvector` with faster and more accurate similarity search on 100M+ vectors via `DiskANN` inspired indexing algorithm.\n",
@@ -23,7 +23,7 @@
 "- Enables a worry-free experience with enterprise-grade security and compliance.\n",
 "\n",
 "## How to access Timescale Vector\n",
-"Timescale Vector is available on [Timescale](https://www.timescale.com/ai), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
+"Timescale Vector is available on [Timescale](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
 "\n",
 "LangChain users get a 90-day free trial for Timescale Vector.\n",
 "- To get started, [signup](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!\n",
21
templates/rag-timescale-hybrid-search-time/LICENSE
Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
63
templates/rag-timescale-hybrid-search-time/README.md
Normal file
@@ -0,0 +1,63 @@
# RAG with Timescale Vector using hybrid search

This template shows how to use Timescale Vector with the self-query retriever to perform hybrid search on similarity and time.
This is useful any time your data has a strong time-based component. Some examples of such data are:
- News articles (politics, business, etc.)
- Blog posts, documentation or other published material (public or private)
- Social media posts
- Changelogs of any kind
- Messages

Such items are often searched by both similarity and time. For example: "Show me all news about Toyota trucks from 2022."

[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) provides superior performance when searching for embeddings within a particular
timeframe by leveraging automatic table partitioning to isolate data for particular time ranges.

LangChain's self-query retriever allows deducing time ranges (as well as other search criteria) from the text of user queries.

## What is Timescale Vector?
**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**

Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.
- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.

### How to access Timescale Vector
Timescale Vector is available on [Timescale](https://www.timescale.com/products?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

- LangChain users get a 90-day free trial for Timescale Vector.
- To get started, [sign up](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for Timescale, create a new database and follow this notebook!
- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python.

### Using Timescale Vector with this template

This template uses TimescaleVector as a vectorstore and requires that `TIMESCALE_SERVICE_URL` is set.

## LLM

Be sure that `OPENAI_API_KEY` is set in order to use the OpenAI models.

## Loading sample data

We have provided a sample dataset you can use for demoing this template. It consists of the git history of the timescale project.

To load this dataset, set the `LOAD_SAMPLE_DATA` environment variable to any non-empty value.

## Loading your own dataset

To load your own dataset you will have to modify the code in the `DATASET SPECIFIC CODE` section of `chain.py`.
This code defines the name of the collection, how to load the data, and the human-language description of both the
contents of the collection and all of the metadata. The human-language descriptions are used by the self-query retriever
to help the LLM convert the question into filters on the metadata when searching the data in Timescale Vector.

## Using in your own applications

This is a standard LangServe template. Instructions on how to use it with your LangServe applications are [here](https://github.com/langchain-ai/langchain/blob/master/templates/README.md).
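The README describes hybrid search as a time-range restriction combined with similarity ranking. The sketch below illustrates that idea with a toy in-memory list; it is not how Timescale Vector or the self-query retriever implements it. The `hybrid_search` and `cosine_similarity` helpers, the record layout, and the two-dimensional "embeddings" are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(records, query_vec, start, end, k=2):
    """Keep records with start <= timestamp < end, then return the k most similar."""
    in_range = [r for r in records if start <= r["timestamp"] < end]
    in_range.sort(
        key=lambda r: cosine_similarity(r["embedding"], query_vec), reverse=True
    )
    return in_range[:k]


utc = timezone.utc
# Hypothetical records: made-up text, timestamps and 2-d embeddings.
records = [
    {"text": "Toyota truck recall", "timestamp": datetime(2022, 3, 1, tzinfo=utc), "embedding": [1.0, 0.0]},
    {"text": "Toyota truck launch", "timestamp": datetime(2021, 6, 1, tzinfo=utc), "embedding": [1.0, 0.0]},
    {"text": "Election results", "timestamp": datetime(2022, 11, 9, tzinfo=utc), "embedding": [0.0, 1.0]},
]

# "News about Toyota trucks from 2022": a 2022 time window plus a query vector.
hits = hybrid_search(
    records,
    query_vec=[0.9, 0.1],
    start=datetime(2022, 1, 1, tzinfo=utc),
    end=datetime(2023, 1, 1, tzinfo=utc),
    k=1,
)
print([h["text"] for h in hits])
```

The 2021 record is excluded by the time filter even though its embedding matches the query exactly, which is the behavior a pure similarity search cannot provide.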
1457
templates/rag-timescale-hybrid-search-time/poetry.lock
generated
Normal file
File diff suppressed because it is too large
32
templates/rag-timescale-hybrid-search-time/pyproject.toml
Normal file
@@ -0,0 +1,32 @@
[tool.poetry]
name = "rag_timescale_hybrid_search_time"
version = "0.0.1"
description = ""
authors = []
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.313, <0.1"
openai = "^0.28.1"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"

[tool.poetry.group.dev.dependencies]
langc = {git = "https://github.com/pingpong-templates/cli"}
poethepoet = "^0.24.1"

[tool.langserve]
export_module = "rag_timescale_hybrid_search_time.chain"
export_attr = "chain"

[tool.poe.tasks.start]
cmd="poetry run uvicorn langc.dev_scripts:create_demo_server --reload --port $port --host $host"
args = [
    {name = "port", help = "port to run on", default = "8000"},
    {name = "host", help = "host to run on", default = "127.0.0.1"}
]

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,3 @@
from rag_timescale_hybrid_search_time import chain

__all__ = ["chain"]
@@ -0,0 +1,112 @@
# ruff: noqa: E501

import os
from datetime import timedelta

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.vectorstores.timescalevector import TimescaleVector
from pydantic import BaseModel

from .load_sample_dataset import load_ts_git_dataset

# to enable debug uncomment the following lines:
# from langchain.globals import set_debug
# set_debug(True)

# from dotenv import find_dotenv, load_dotenv
# _ = load_dotenv(find_dotenv())

if os.environ.get("TIMESCALE_SERVICE_URL", None) is None:
    raise Exception("Missing `TIMESCALE_SERVICE_URL` environment variable.")

SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
LOAD_SAMPLE_DATA = os.environ.get("LOAD_SAMPLE_DATA", False)


# DATASET SPECIFIC CODE
# Load the sample dataset. You will have to change this to load your own dataset.
collection_name = "timescale_commits"
partition_interval = timedelta(days=7)
if LOAD_SAMPLE_DATA:
    load_ts_git_dataset(
        SERVICE_URL,
        collection_name=collection_name,
        num_records=500,
        partition_interval=partition_interval,
    )

# This will change depending on the metadata stored in your dataset.
document_content_description = "The git log commit summary containing the commit hash, author, date of commit, change summary and change details"
metadata_field_info = [
    AttributeInfo(
        name="id",
        description="A UUID v1 generated from the date of the commit",
        type="uuid",
    ),
    AttributeInfo(
        # This is a special attribute that represents the timestamp of the uuid.
        name="__uuid_timestamp",
        description="The timestamp of the commit. Specify in YYYY-MM-DDTHH:MM:SSZ format",
        type="datetime.datetime",
    ),
    AttributeInfo(
        name="author_name",
        description="The name of the author of the commit",
        type="string",
    ),
    AttributeInfo(
        name="author_email",
        description="The email address of the author of the commit",
        type="string",
    ),
]
# END DATASET SPECIFIC CODE

embeddings = OpenAIEmbeddings()
vectorstore = TimescaleVector(
    embedding=embeddings,
    collection_name=collection_name,
    service_url=SERVICE_URL,
    time_partition_interval=partition_interval,
)

llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0, model="gpt-4")

# RAG chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
)


class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)
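One detail worth noting in `chain.py`: `os.environ.get("LOAD_SAMPLE_DATA", False)` returns the raw string, so any non-empty value, including `"false"` or `"0"`, is truthy and triggers the sample load. If stricter parsing is wanted, a helper along the following lines could be used. This is a sketch only; `env_flag` and its list of falsy spellings are assumptions and are not part of the template or of LangChain.

```python
import os

# Hypothetical set of spellings treated as "off"; adjust to taste.
_FALSY_VALUES = {"", "0", "false", "no", "off"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag.

    Returns `default` when the variable is unset; otherwise returns False
    for common falsy spellings (case-insensitive) and True for anything else.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY_VALUES


# Example: the variable name matches the template, the helper does not.
os.environ["LOAD_SAMPLE_DATA"] = "false"
print(env_flag("LOAD_SAMPLE_DATA"))
```

With this helper, `LOAD_SAMPLE_DATA=false` no longer loads the dataset, while `LOAD_SAMPLE_DATA=1` still does.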
@@ -0,0 +1,84 @@
import os
import tempfile
from datetime import datetime, timedelta

import requests
from langchain.document_loaders import JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector
from timescale_vector import client


def parse_date(date_string: str) -> datetime:
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    return datetime.strptime(date_string, time_format)


def extract_metadata(record: dict, metadata: dict) -> dict:
    dt = parse_date(record["date"])
    metadata["id"] = str(client.uuid_from_time(dt))
    if dt is not None:
        metadata["date"] = dt.isoformat()
    else:
        metadata["date"] = None
    metadata["author"] = record["author"]
    metadata["commit_hash"] = record["commit"]
    return metadata


def load_ts_git_dataset(
    service_url,
    collection_name="timescale_commits",
    num_records: int = 500,
    partition_interval=timedelta(days=7),
):
    json_url = "https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
    tmp_file = "ts_git_log.json"

    temp_dir = tempfile.gettempdir()
    json_file_path = os.path.join(temp_dir, tmp_file)

    if not os.path.exists(json_file_path):
        response = requests.get(json_url)
        if response.status_code == 200:
            with open(json_file_path, "w") as json_file:
                json_file.write(response.text)
        else:
            print(f"Failed to download JSON file. Status code: {response.status_code}")

    loader = JSONLoader(
        file_path=json_file_path,
        jq_schema=".commit_history[]",
        text_content=False,
        metadata_func=extract_metadata,
    )

    documents = loader.load()

    # Remove documents with None dates
    documents = [doc for doc in documents if doc.metadata["date"] is not None]

    if num_records > 0:
        documents = documents[:num_records]

    # Split the documents into chunks for embedding
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()

    # Create a Timescale Vector instance from the collection of documents
    TimescaleVector.from_documents(
        embedding=embeddings,
        ids=[doc.metadata["id"] for doc in docs],
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=partition_interval,
    )
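The loader parses git-log style dates and derives each document's `id` from the commit time via `client.uuid_from_time`, which is what lets the store partition and filter by time. The sketch below shows the general idea using only the standard library: it parses a date with the same format string and packs the timestamp into a version 1 UUID. It is not the `timescale_vector` implementation; `uuid1_from_datetime`, `datetime_from_uuid1`, the sample date string, and the fixed `node`/`clock_seq` values of zero are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID v1 timestamps count 100-nanosecond ticks since the Gregorian epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_from_datetime(dt: datetime, node: int = 0, clock_seq: int = 0) -> uuid.UUID:
    """Build a version 1 UUID whose timestamp field encodes `dt` (timezone-aware)."""
    delta = dt - GREGORIAN_EPOCH
    ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | (1 << 12)  # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version, clock_seq_hi_variant, clock_seq_low, node)
    )


def datetime_from_uuid1(value: uuid.UUID) -> datetime:
    """Recover the (microsecond-precision) timestamp stored in a version 1 UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


# Hypothetical date string in the format the loader's parse_date expects.
commit_time = datetime.strptime("Tue Sep 5 21:03:21 2023 +0000", "%a %b %d %H:%M:%S %Y %z")
commit_id = uuid1_from_datetime(commit_time)
print(commit_id, datetime_from_uuid1(commit_id).isoformat())
```

Because the timestamp can be read back out of the identifier, a time-range filter can be evaluated from the `id` alone, which matches the role of the special `__uuid_timestamp` attribute in `chain.py`.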