mirror of
https://github.com/hwchase17/langchain.git
synced 2025-08-30 17:29:56 +00:00
Add RAG template for Timescale Vector (#12651)
Co-authored-by: Matvey Arye <mat@timescale.com>
parent
14e8c74736
commit
da94c750c5
@@ -10,7 +10,7 @@
 "This notebook shows how to use the Postgres vector database `Timescale Vector`. You'll learn how to use TimescaleVector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) how to create indexes to speed up queries.\n",
 "\n",
 "## What is Timescale Vector?\n",
-"**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**\n",
+"**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**\n",
 "\n",
 "Timescale Vector enables you to efficiently store and query millions of vector embeddings in `PostgreSQL`.\n",
 "- Enhances `pgvector` with faster and more accurate similarity search on 100M+ vectors via `DiskANN` inspired indexing algorithm.\n",
@@ -23,7 +23,7 @@
 "- Enables a worry-free experience with enterprise-grade security and compliance.\n",
 "\n",
 "## How to access Timescale Vector\n",
-"Timescale Vector is available on [Timescale](https://www.timescale.com/ai), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
+"Timescale Vector is available on [Timescale](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
 "\n",
 "LangChain users get a 90-day free trial for Timescale Vector.\n",
 "- To get started, [signup](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!\n",
21
templates/rag-timescale-hybrid-search-time/LICENSE
Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
63
templates/rag-timescale-hybrid-search-time/README.md
Normal file
@@ -0,0 +1,63 @@
# RAG with Timescale Vector using hybrid search

This template shows how to use timescale-vector with the self-query retriever to perform hybrid search on similarity and time.
This is useful any time your data has a strong time-based component. Some examples of such data are:
- News articles (politics, business, etc.)
- Blog posts, documentation or other published material (public or private).
- Social media posts
- Changelogs of any kind
- Messages

Such items are often searched by both similarity and time. For example: Show me all news about Toyota trucks from 2022.

[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) provides superior performance when searching for embeddings within a particular
timeframe by leveraging automatic table partitioning to isolate data for particular time-ranges.
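
To make the idea of time-based partitioning concrete, here is a minimal sketch in plain Python. It is an illustration only and not how Timescale Vector is implemented: the helper names `partition_bounds` and `partitions_for_range`, and the choice of the Unix epoch as the alignment point, are assumptions made for this example. It shows that a query restricted to a time range only needs to look at the fixed-width partitions that overlap that range.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption for this sketch: partitions are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_bounds(ts: datetime, interval: timedelta) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of the fixed-width partition containing ts."""
    index = (ts - EPOCH) // interval  # whole intervals elapsed since the epoch
    start = EPOCH + index * interval
    return start, start + interval


def partitions_for_range(start: datetime, end: datetime, interval: timedelta) -> int:
    """Count the partitions that a [start, end) time filter has to touch."""
    current, _ = partition_bounds(start, interval)
    count = 0
    while current < end:
        count += 1
        current += interval
    return count


week = timedelta(days=7)
year_start = datetime(2022, 1, 1, tzinfo=timezone.utc)
year_end = datetime(2023, 1, 1, tzinfo=timezone.utc)
# A filter on the year 2022 touches only the weekly partitions overlapping 2022.
print(partitions_for_range(year_start, year_end, week))
```

With a 7-day interval (the value this template uses for `partition_interval`), a one-year filter touches a few dozen partitions no matter how many years of data the table holds.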

LangChain's self-query retriever allows deducing time-ranges (as well as other search criteria) from the text of user queries.

## What is Timescale Vector?
**[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is PostgreSQL++ for AI applications.**

Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.
- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.

### How to access Timescale Vector
Timescale Vector is available on [Timescale](https://www.timescale.com/products?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

- LangChain users get a 90-day free trial for Timescale Vector.
- To get started, [sign up](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!
- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python.

### Using Timescale Vector with this template

This template uses TimescaleVector as a vectorstore and requires that `TIMESCALE_SERVICE_URL` is set.

## LLM

Be sure that `OPENAI_API_KEY` is set in order to use the OpenAI models.

## Loading sample data

We have provided a sample dataset you can use for demoing this template. It consists of the git history of the timescale project.

To load this dataset, set the `LOAD_SAMPLE_DATA` environment variable.
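
Note that `chain.py` reads this variable with `os.environ.get("LOAD_SAMPLE_DATA", False)`, so any non-empty value, including `false` or `0`, enables loading. If you prefer stricter parsing, a sketch along the following lines could be used instead. The helper `env_flag` and the set of accepted values are assumptions made for this example and are not part of the template.

```python
import os

# Hypothetical set of values treated as "enabled"; adjust to taste.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


# Unlike a plain truthiness check, "false" and "0" are treated as disabled.
load_sample_data = env_flag("LOAD_SAMPLE_DATA")
```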

## Loading your own dataset

To load your own dataset you will have to modify the code in the `DATASET SPECIFIC CODE` section of `chain.py`.
This code defines the name of the collection, how to load the data, and the human-language description of both the
contents of the collection and all of the metadata. The human-language descriptions are used by the self-query retriever
to help the LLM convert the question into filters on the metadata when searching the data in Timescale Vector.
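
As a rough mental model of what "converting the question into filters" means, the sketch below applies a hand-written filter to in-memory records. It is not LangChain's API: `CommitQuery`, `apply_filters`, the sample records and the author names are all hypothetical, and in the real template the LLM produces the filter and Timescale Vector evaluates it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class CommitQuery:
    """Hypothetical stand-in for the filters extracted from a question."""

    query: str
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    author_name: Optional[str] = None


def apply_filters(records: List[dict], cq: CommitQuery) -> List[dict]:
    """Keep only the records whose metadata satisfies every filter that is set."""
    kept = []
    for record in records:
        if cq.start is not None and record["timestamp"] < cq.start:
            continue
        if cq.end is not None and record["timestamp"] >= cq.end:
            continue
        if cq.author_name is not None and record["author_name"] != cq.author_name:
            continue
        kept.append(record)
    return kept


# "What did Alice change in 2023?" might be turned into:
cq = CommitQuery(
    query="changes",
    start=datetime(2023, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 1, 1, tzinfo=timezone.utc),
    author_name="Alice",
)
records = [
    {"author_name": "Alice", "timestamp": datetime(2023, 5, 2, tzinfo=timezone.utc)},
    {"author_name": "Alice", "timestamp": datetime(2022, 5, 2, tzinfo=timezone.utc)},
    {"author_name": "Bob", "timestamp": datetime(2023, 5, 2, tzinfo=timezone.utc)},
]
print(len(apply_filters(records, cq)))  # prints 1
```

The better the metadata descriptions in `metadata_field_info`, the more reliably the LLM can fill in fields like these from a free-text question.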

## Using in your own applications

This is a standard LangServe template. Instructions on how to use it with your LangServe applications are [here](https://github.com/langchain-ai/langchain/blob/master/templates/README.md).

1457
templates/rag-timescale-hybrid-search-time/poetry.lock
generated
Normal file
File diff suppressed because it is too large
32
templates/rag-timescale-hybrid-search-time/pyproject.toml
Normal file
@@ -0,0 +1,32 @@
[tool.poetry]
name = "rag_timescale_hybrid_search_time"
version = "0.0.1"
description = ""
authors = []
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.313, <0.1"
openai = "^0.28.1"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"

[tool.poetry.group.dev.dependencies]
langc = {git = "https://github.com/pingpong-templates/cli"}
poethepoet = "^0.24.1"

[tool.langserve]
export_module = "rag_timescale_hybrid_search_time.chain"
export_attr = "chain"

[tool.poe.tasks.start]
cmd="poetry run uvicorn langc.dev_scripts:create_demo_server --reload --port $port --host $host"
args = [
    {name = "port", help = "port to run on", default = "8000"},
    {name = "host", help = "host to run on", default = "127.0.0.1"}
]

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
@@ -0,0 +1,3 @@
from rag_timescale_hybrid_search_time import chain

__all__ = ["chain"]
@@ -0,0 +1,112 @@
# ruff: noqa: E501

import os
from datetime import timedelta

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.vectorstores.timescalevector import TimescaleVector
from pydantic import BaseModel

from .load_sample_dataset import load_ts_git_dataset

# to enable debug uncomment the following lines:
# from langchain.globals import set_debug
# set_debug(True)

# from dotenv import find_dotenv, load_dotenv
# _ = load_dotenv(find_dotenv())

if os.environ.get("TIMESCALE_SERVICE_URL", None) is None:
    raise Exception("Missing `TIMESCALE_SERVICE_URL` environment variable.")

SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]
LOAD_SAMPLE_DATA = os.environ.get("LOAD_SAMPLE_DATA", False)


# DATASET SPECIFIC CODE
# Load the sample dataset. You will have to change this to load your own dataset.
collection_name = "timescale_commits"
partition_interval = timedelta(days=7)
if LOAD_SAMPLE_DATA:
    load_ts_git_dataset(
        SERVICE_URL,
        collection_name=collection_name,
        num_records=500,
        partition_interval=partition_interval,
    )

# This will change depending on the metadata stored in your dataset.
document_content_description = "The git log commit summary containing the commit hash, author, date of commit, change summary and change details"
metadata_field_info = [
    AttributeInfo(
        name="id",
        description="A UUID v1 generated from the date of the commit",
        type="uuid",
    ),
    AttributeInfo(
        # This is a special attribute representing the timestamp of the uuid.
        name="__uuid_timestamp",
        description="The timestamp of the commit. Specify in YYYY-MM-DDTHH:MM:SSZ format",
        type="datetime.datetime",
    ),
    AttributeInfo(
        name="author_name",
        description="The name of the author of the commit",
        type="string",
    ),
    AttributeInfo(
        name="author_email",
        description="The email address of the author of the commit",
        type="string",
    ),
]
# END DATASET SPECIFIC CODE

embeddings = OpenAIEmbeddings()
vectorstore = TimescaleVector(
    embedding=embeddings,
    collection_name=collection_name,
    service_url=SERVICE_URL,
    time_partition_interval=partition_interval,
)

llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0, model="gpt-4")

# RAG chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
)


class Question(BaseModel):
    __root__: str


chain = chain.with_types(input_type=Question)
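The `id` and `__uuid_timestamp` attributes above rely on the fact that a version-1 UUID embeds a timestamp. The following standard-library sketch shows how a timestamp can be packed into and recovered from such a UUID. It is an illustration under stated assumptions, not the implementation of `client.uuid_from_time` in `timescale_vector`; the function names `uuid1_from_datetime` and `datetime_from_uuid1` and the fixed `node` and `clock_seq` defaults are hypothetical.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID v1 timestamps count 100-nanosecond ticks since the start of the
# Gregorian calendar.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_from_datetime(dt: datetime, node: int = 0, clock_seq: int = 0) -> uuid.UUID:
    """Build a version-1 UUID whose timestamp field encodes a timezone-aware dt."""
    delta = dt - GREGORIAN_EPOCH
    ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = (ticks >> 48) & 0x0FFF
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = (clock_seq >> 8) & 0x3F
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version, clock_seq_hi_variant, clock_seq_low, node),
        version=1,  # sets the version and variant bits
    )


def datetime_from_uuid1(u: uuid.UUID) -> datetime:
    """Recover the timestamp stored in a version-1 UUID (microsecond precision)."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


commit_time = datetime(2023, 9, 5, 21, 3, 21, tzinfo=timezone.utc)
commit_id = uuid1_from_datetime(commit_time)
print(commit_id.version)  # prints 1
print(datetime_from_uuid1(commit_id) == commit_time)  # prints True
```

Because the timestamp can be read back from the ID, a time filter on `__uuid_timestamp` does not need a separate date column.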
@@ -0,0 +1,84 @@
import os
import tempfile
from datetime import datetime, timedelta

import requests
from langchain.document_loaders import JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector
from timescale_vector import client


def parse_date(date_string: str) -> datetime:
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    return datetime.strptime(date_string, time_format)


def extract_metadata(record: dict, metadata: dict) -> dict:
    dt = parse_date(record["date"])
    metadata["id"] = str(client.uuid_from_time(dt))
    if dt is not None:
        metadata["date"] = dt.isoformat()
    else:
        metadata["date"] = None
    metadata["author"] = record["author"]
    metadata["commit_hash"] = record["commit"]
    return metadata


def load_ts_git_dataset(
    service_url,
    collection_name="timescale_commits",
    num_records: int = 500,
    partition_interval=timedelta(days=7),
):
    json_url = "https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
    tmp_file = "ts_git_log.json"

    temp_dir = tempfile.gettempdir()
    json_file_path = os.path.join(temp_dir, tmp_file)

    if not os.path.exists(json_file_path):
        response = requests.get(json_url)
        if response.status_code == 200:
            with open(json_file_path, "w") as json_file:
                json_file.write(response.text)
        else:
            print(f"Failed to download JSON file. Status code: {response.status_code}")

    loader = JSONLoader(
        file_path=json_file_path,
        jq_schema=".commit_history[]",
        text_content=False,
        metadata_func=extract_metadata,
    )

    documents = loader.load()

    # Remove documents with None dates
    documents = [doc for doc in documents if doc.metadata["date"] is not None]

    if num_records > 0:
        documents = documents[:num_records]

    # Split the documents into chunks for embedding
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()

    # Create a Timescale Vector instance from the collection of documents
    TimescaleVector.from_documents(
        embedding=embeddings,
        ids=[doc.metadata["id"] for doc in docs],
        documents=docs,
        collection_name=collection_name,
        service_url=service_url,
        time_partition_interval=partition_interval,
    )
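For reference, the format string used by `parse_date` matches dates in the style of git's default log output. The short sketch below shows the parsing step in isolation. The sample string is a made-up value for illustration, not a record from the dataset.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%a %b %d %H:%M:%S %Y %z"  # same format string as parse_date

# Hypothetical sample value in the style of git log's default date output.
sample = "Tue Sep 5 21:03:21 2023 +0530"
parsed = datetime.strptime(sample, TIME_FORMAT)

print(parsed.isoformat())  # prints 2023-09-05T21:03:21+05:30
print(parsed.astimezone(timezone.utc).isoformat())  # prints 2023-09-05T15:33:21+00:00
```

The `%z` directive keeps the UTC offset, so the resulting `datetime` is timezone-aware. Note that `%a` and `%b` are locale-dependent, so the parse assumes English day and month abbreviations.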