Pgvector template (#13267)

Including pgvector template, adapting what is covered in the
[cookbook](https://github.com/langchain-ai/langchain/blob/master/cookbook/retrieval_in_sql.ipynb).

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Erick Friis <erick@langchain.dev>
Manuel Soria 2023-11-14 12:47:48 -03:00 committed by GitHub
parent be854225c7
commit 58f5a4d30a
9 changed files with 2114 additions and 0 deletions

templates/sql-pgvector/.gitignore vendored Normal file

@@ -0,0 +1 @@
__pycache__

templates/sql-pgvector/LICENSE Normal file
@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2023 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

templates/sql-pgvector/README.md Normal file
@@ -0,0 +1,105 @@
# sql-pgvector
This template enables users to combine PostgreSQL with semantic search / RAG, using `pgvector`.
It uses the [PGVector](https://github.com/pgvector/pgvector) extension, as shown in the [RAG empowered SQL cookbook](cookbook/retrieval_in_sql.ipynb).
## Environment Setup
If you are using `ChatOpenAI` as your LLM, make sure `OPENAI_API_KEY` is set in your environment. You can change both the LLM and the embeddings model inside `chain.py`.
You can also configure the following environment variables for use by the template (defaults are in parentheses):
- `POSTGRES_USER` (postgres)
- `POSTGRES_PASSWORD` (test)
- `POSTGRES_DB` (vectordb)
- `POSTGRES_HOST` (localhost)
- `POSTGRES_PORT` (5432)
If you don't have a Postgres instance, you can run one locally in Docker:
```bash
docker run \
--name some-postgres \
-e POSTGRES_PASSWORD=test \
-e POSTGRES_USER=postgres \
-e POSTGRES_DB=vectordb \
-p 5432:5432 \
postgres:16
```
To start it again later, use the `--name` defined above:
```bash
docker start some-postgres
```
### PostgreSQL Database setup
Apart from having the `pgvector` extension enabled, you will need to do some setup before being able to run semantic search within your SQL queries.
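If the extension is not yet enabled in your database, a minimal sketch of a one-time setup looks like this (assuming `pgvector` is installed on the Postgres server — the stock `postgres` image may not include it — and the default connection settings above):
```python
import psycopg2

# One-time setup: enable the pgvector extension in the target database.
conn = psycopg2.connect("postgresql://postgres:test@localhost:5432/vectordb")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()
```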
In order to run RAG over your PostgreSQL database, you will need to generate embeddings for the specific columns you want to search.
This process is covered in the [RAG empowered SQL cookbook](cookbook/retrieval_in_sql.ipynb), but the overall approach consists of:
1. Querying for unique values in the column
2. Generating embeddings for those values
3. Storing the embeddings in a separate column or in an auxiliary table
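As a rough sketch of these steps (not part of the template itself; it mirrors the commented-out ingest code in `chain.py` and assumes a `"Track"` table with a `"Name"` text column and an `"embeddings"` vector column):
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.sql_database import SQLDatabase

db = SQLDatabase.from_uri(
    "postgresql+psycopg2://postgres:test@localhost:5432/vectordb"
)
embeddings_model = OpenAIEmbeddings()

# 1. Query for unique values in the column (db.run returns a string)
rows = eval(db.run('SELECT DISTINCT "Name" FROM "Track"'))
values = [row[0] for row in rows]

# 2. Generate embeddings for those values
embeddings = embeddings_model.embed_documents(values)

# 3. Store each embedding next to its source row
for value, embedding in zip(values, embeddings):
    escaped = value.replace("'", "''")
    db.run(
        f'UPDATE "Track" SET "embeddings" = ARRAY{embedding} '
        f"WHERE \"Name\" = '{escaped}'"
    )
```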
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U langchain-cli
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package sql-pgvector
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add sql-pgvector
```
And add the following code to your `server.py` file:
```python
from sql_pgvector import chain as sql_pgvector_chain
add_routes(app, sql_pgvector_chain, path="/sql-pgvector")
```
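For reference, a minimal `server.py` wiring this in might look like the following sketch (assuming the standard scaffolding generated by `langchain app new`):
```python
from fastapi import FastAPI
from langserve import add_routes

from sql_pgvector import chain as sql_pgvector_chain

app = FastAPI()
add_routes(app, sql_pgvector_chain, path="/sql-pgvector")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```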
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor, and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/sql-pgvector/playground](http://127.0.0.1:8000/sql-pgvector/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/sql-pgvector")
```
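From there you can call the chain; for example (a sketch, using the `question` input field defined in `chain.py`):
```python
runnable.invoke({"question": "Name 5 songs about loneliness"})
```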

templates/sql-pgvector/poetry.lock generated Normal file

File diff suppressed because it is too large

templates/sql-pgvector/pyproject.toml Normal file
@@ -0,0 +1,26 @@
[tool.poetry]
name = "sql-pgvector"
version = "0.0.1"
description = ""
authors = []
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.313, <0.1"
openai = "^0.28.1"
psycopg2 = "^2.9.9"
tiktoken = "^0.5.1"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"
[tool.langserve]
export_module = "sql_pgvector"
export_attr = "chain"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

templates/sql-pgvector/sql_pgvector/__init__.py Normal file
@@ -0,0 +1,3 @@
from sql_pgvector.chain import chain
__all__ = ["chain"]

templates/sql-pgvector/sql_pgvector/chain.py Normal file
@@ -0,0 +1,118 @@
import os
import re
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain.sql_database import SQLDatabase
from sql_pgvector.prompt_templates import final_template, postgresql_template
"""
IMPORTANT: For using this template, you will need to
follow the setup steps in the readme file
"""
if os.environ.get("OPENAI_API_KEY", None) is None:
raise Exception("Missing `OPENAI_API_KEY` environment variable")
postgres_user = os.environ.get("POSTGRES_USER", "postgres")
postgres_password = os.environ.get("POSTGRES_PASSWORD", "test")
postgres_db = os.environ.get("POSTGRES_DB", "vectordb")
postgres_host = os.environ.get("POSTGRES_HOST", "localhost")
postgres_port = os.environ.get("POSTGRES_PORT", "5432")
# Connect to DB
# Replace with your own
CONNECTION_STRING = (
f"postgresql+psycopg2://{postgres_user}:{postgres_password}"
f"@{postgres_host}:{postgres_port}/{postgres_db}"
)
db = SQLDatabase.from_uri(CONNECTION_STRING)
# Choose LLM and embeddings model
llm = ChatOpenAI(temperature=0)
embeddings_model = OpenAIEmbeddings()
# # Ingest code - you will need to run this the first time
# # Insert your query e.g. "SELECT Name FROM Track"
# column_to_embed = db.run('replace-with-your-own-select-query')
# column_values = [s[0] for s in eval(column_to_embed)]
# embeddings = embeddings_model.embed_documents(column_values)
# for i in range(len(embeddings)):
# value = column_values[i].replace("'", "''")
# embedding = embeddings[i]
# # Replace with your own SQL command for your column and table.
# sql_command = (
# f'UPDATE "Track" SET "embeddings" = ARRAY{embedding} WHERE "Name" ='
# + f"'{value}'"
# )
# db.run(sql_command)
# -----------------
# Define functions
# -----------------
def get_schema(_):
return db.get_table_info()
def run_query(query):
return db.run(query)
def replace_brackets(match):
words_inside_brackets = match.group(1).split(", ")
embedded_words = [
str(embeddings_model.embed_query(word)) for word in words_inside_brackets
]
return "', '".join(embedded_words)
def get_query(query):
sql_query = re.sub(r"\[([\w\s,]+)\]", replace_brackets, query)
return sql_query
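# Illustrative note (not part of the original template): if the model generates
#   SELECT "Track"."Name" FROM "Track"
#   ORDER BY "embeddings" <-> '[loneliness]' LIMIT 5
# then get_query embeds the word "loneliness" and substitutes the vector
# literal, producing something like
#   ... ORDER BY "embeddings" <-> '[0.0123, -0.0456, ...]' LIMIT 5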
# -----------------------
# Now we create the chain
# -----------------------
query_generation_prompt = ChatPromptTemplate.from_messages(
[("system", postgresql_template), ("human", "{question}")]
)
sql_query_chain = (
RunnablePassthrough.assign(schema=get_schema)
| query_generation_prompt
| llm.bind(stop=["\nSQLResult:"])
| StrOutputParser()
)
final_prompt = ChatPromptTemplate.from_messages(
[("system", final_template), ("human", "{question}")]
)
full_chain = (
RunnablePassthrough.assign(query=sql_query_chain)
| RunnablePassthrough.assign(
schema=get_schema,
response=RunnableLambda(lambda x: db.run(get_query(x["query"]))),
)
| final_prompt
| llm
)
class InputType(BaseModel):
question: str
chain = full_chain.with_types(input_type=InputType)

templates/sql-pgvector/sql_pgvector/prompt_templates.py Normal file
@@ -0,0 +1,50 @@
postgresql_template = (
"You are a Postgres expert. Given an input question, first create a "
"syntactically correct Postgres query to run, then look at the results "
"of the query and return the answer to the input question.\n"
"Unless the user specifies in the question a specific number of "
"examples to obtain, query for at most 5 results using the LIMIT clause "
"as per Postgres. You can order the results to return the most "
"informative data in the database.\n"
"Never query for all columns from a table. You must query only the "
"columns that are needed to answer the question. Wrap each column name "
'in double quotes (") to denote them as delimited identifiers.\n'
"Pay attention to use only the column names you can see in the tables "
"below. Be careful to not query for columns that do not exist. Also, "
"pay attention to which column is in which table.\n"
"Pay attention to use date('now') function to get the current date, "
'if the question involves "today".\n\n'
"You can use an extra extension which allows you to run semantic "
"similarity using <-> operator on tables containing columns named "
'"embeddings".\n'
"<-> operator can ONLY be used on embeddings vector columns.\n"
"The embeddings value for a given row typically represents the semantic "
"meaning of that row.\n"
"The vector represents an embedding representation of the question, "
"given below. \n"
"Do NOT fill in the vector values directly, but rather specify a "
"`[search_word]` placeholder, which should contain the word that would "
"be embedded for filtering.\n"
"For example, if the user asks for songs about 'the feeling of "
"loneliness' the query could be:\n"
'\'SELECT "[whatever_table_name]"."SongName" FROM '
'"[whatever_table_name]" ORDER BY "embeddings" <-> \'[loneliness]\' '
"LIMIT 5'\n\n"
"Use the following format:\n\n"
"Question: <Question here>\n"
"SQLQuery: <SQL Query to run>\n"
"SQLResult: <Result of the SQLQuery>\n"
"Answer: <Final answer here>\n\n"
"Only use the following tables:\n\n"
"{schema}\n"
)
final_template = (
"Based on the table schema below, question, sql query, and sql response, "
"write a natural language response:\n"
"{schema}\n\n"
"Question: {question}\n"
"SQL Query: {query}\n"
"SQL Response: {response}"
)
