community[minor]: Add DocumentDBVectorSearch VectorStore (#17757)

**Description:**
- Added Amazon DocumentDB Vector Search integration (HNSW index)
- Added integration tests
- Updated AWS documentation with DocumentDB Vector Search instructions
- Added notebook for DocumentDB integration with example usage

---------

Co-authored-by: EC2 Default User <ec2-user@ip-172-31-95-226.ec2.internal>
This commit is contained in:
Sam Khano
2024-03-06 15:11:34 -08:00
committed by GitHub
parent 51f3902bc4
commit 1b4dcf22f3
7 changed files with 1270 additions and 0 deletions

@@ -220,6 +220,35 @@ See a [usage example](/docs/integrations/vectorstores/opensearch#using-aos-amazo
from langchain_community.vectorstores import OpenSearchVectorSearch
```
### Amazon DocumentDB Vector Search
>[Amazon DocumentDB (with MongoDB Compatibility)](https://docs.aws.amazon.com/documentdb/) makes it easy to set up, operate, and scale MongoDB-compatible databases in the cloud.
> With Amazon DocumentDB, you can run the same application code and use the same drivers and tools that you use with MongoDB.
> Vector search for Amazon DocumentDB combines the flexibility and rich querying capability of a JSON-based document database with the power of vector search.
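
As a rough illustration of what a vector search compares, the sketch below computes the three similarity measures named in the accompanying notebook ("cosine", "euclidean", "dotProduct") on toy vectors in plain Python. The function names and the vectors are invented for this example; in practice the database evaluates these measures against its vector index, not client code.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing their lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (hypothetical values).
query = [1.0, 0.0]
doc_a = [2.0, 0.0]  # same direction as the query, different length
doc_b = [0.0, 3.0]  # orthogonal to the query

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(
        name,
        cosine_similarity(query, doc),
        euclidean_distance(query, doc),
        dot_product(query, doc),
    )
```

Cosine similarity ignores vector length, so `doc_a` counts as a perfect match for `query` even though its Euclidean distance is non-zero.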
#### Installation and Setup
See [detailed configuration instructions](/docs/integrations/vectorstores/documentdb).
We need to install the `pymongo` Python package.
```bash
pip install pymongo
```
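The notebook added in this change connects with a URI of the form `mongodb://{username}:{pass}@{cluster_endpoint}:{port}/?{params}`. The following is a minimal sketch of assembling such a URI with only the standard library, percent-encoding the credentials so that characters like `@` or `/` do not break parsing. The helper name, the endpoint, and the query parameters are placeholders assumed for illustration; use the values shown for your own cluster.

```python
from urllib.parse import quote_plus, urlencode


def build_documentdb_uri(username, password, endpoint, port=27017, params=None):
    """Assemble a mongodb:// URI, percent-encoding the credentials."""
    query = urlencode(params or {})
    return (
        f"mongodb://{quote_plus(username)}:{quote_plus(password)}"
        f"@{endpoint}:{port}/?{query}"
    )


# Hypothetical credentials and endpoint, for illustration only.
uri = build_documentdb_uri(
    "demo_user",
    "p@ss/word",
    "example-cluster.cluster-xxxx.us-east-1.docdb.amazonaws.com",
    params={"tls": "true", "retryWrites": "false"},
)
print(uri)
```

The resulting string can be passed to `pymongo.MongoClient` or to the vector store's `from_connection_string` constructor.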
#### Deploy DocumentDB on AWS
[Amazon DocumentDB (with MongoDB Compatibility)](https://docs.aws.amazon.com/documentdb/) is a fast, reliable, and fully managed database service. Amazon DocumentDB makes it easy to set up, operate, and scale MongoDB-compatible databases in the cloud.
AWS offers services for computing, databases, storage, analytics, and other functionality. For an overview of all AWS services, see [Cloud Computing with Amazon Web Services](https://aws.amazon.com/what-is-aws/).
See a [usage example](/docs/integrations/vectorstores/documentdb).
```python
from langchain_community.vectorstores import DocumentDBVectorSearch
```
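A hedged sketch of querying an index that already exists, mirroring the calls used in the notebook from this change (`from_connection_string` and `similarity_search`). The wrapper function and its parameter names are made up for this example, and `OpenAIEmbeddings()` assumes that `langchain-openai` is installed and `OPENAI_API_KEY` is set.

```python
def search_documentdb(connection_string, namespace, index_name, query):
    """Run one similarity search against an existing DocumentDB vector index."""
    # Imports are deferred so this module loads even when the optional
    # dependencies are missing; that is a choice made for this sketch.
    from langchain_community.vectorstores import DocumentDBVectorSearch
    from langchain_openai import OpenAIEmbeddings

    vectorstore = DocumentDBVectorSearch.from_connection_string(
        connection_string=connection_string,
        namespace=namespace,  # "<database>.<collection>"
        embedding=OpenAIEmbeddings(),
        index_name=index_name,
    )
    return vectorstore.similarity_search(query)
```

Calling it requires a reachable cluster whose collection has already been populated and indexed, as the linked usage example describes.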
## Tools
### AWS Lambda

@@ -0,0 +1,477 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "245c0aa70db77606",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# Amazon Document DB\n",
"\n",
">[Amazon DocumentDB (with MongoDB Compatibility)](https://docs.aws.amazon.com/documentdb/) makes it easy to set up, operate, and scale MongoDB-compatible databases in the cloud.\n",
"> With Amazon DocumentDB, you can run the same application code and use the same drivers and tools that you use with MongoDB.\n",
"> Vector search for Amazon DocumentDB combines the flexibility and rich querying capability of a JSON-based document database with the power of vector search.\n",
"\n",
"\n",
"This notebook shows you how to use [Amazon Document DB Vector Search](https://docs.aws.amazon.com/documentdb/latest/developerguide/vector-search.html) to store documents in collections, create indicies and perform vector search queries using approximate nearest neighbor algorithms such \"cosine\", \"euclidean\", and \"dotProduct\". By default, DocumentDB creates Hierarchical Navigable Small World (HNSW) indexes. To learn about other supported vector index types, please refer to the document linked above.\n",
"\n",
"To use DocumentDB, you must first deploy a cluster. Please refer to the [Developer Guide](https://docs.aws.amazon.com/documentdb/latest/developerguide/what-is.html) for more details.\n",
"\n",
"[Sign Up](https://aws.amazon.com/free/) for free to get started today.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ab8e45f5bd435ade",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:20:00.721985Z",
"start_time": "2023-10-10T17:19:57.996265Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"!pip install pymongo"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9c7ce9e7b26efbb0",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:50:03.615234Z",
"start_time": "2023-10-10T17:50:03.604289Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import getpass\n",
"\n",
"# DocumentDB connection string\n",
"# i.e., \"mongodb://{username}:{pass}@{cluster_endpoint}:{port}/?{params}\"\n",
"CONNECTION_STRING = getpass.getpass(\"DocumentDB Cluster URI:\")\n",
"\n",
"INDEX_NAME = \"izzy-test-index\"\n",
"NAMESPACE = \"izzy_test_db.izzy_test_collection\"\n",
"DB_NAME, COLLECTION_NAME = NAMESPACE.split(\".\")"
]
},
{
"cell_type": "markdown",
"id": "f2e66b097c6ce2e3",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"We want to use `OpenAIEmbeddings` so we need to set up our OpenAI environment variables. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4a052d99c6b8a2a7",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:50:11.712929Z",
"start_time": "2023-10-10T17:50:11.703871Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"# Set up the OpenAI Environment Variables\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
"os.environ[\n",
" \"OPENAI_EMBEDDINGS_DEPLOYMENT\"\n",
"] = \"smart-agent-embedding-ada\" # the deployment name for the embedding model\n",
"os.environ[\"OPENAI_EMBEDDINGS_MODEL_NAME\"] = \"text-embedding-ada-002\" # the model name"
]
},
{
"cell_type": "markdown",
"id": "ebaa28c6e2b35063",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Now, we will load the documents into the collection, create the index, and then perform queries against the index.\n",
"\n",
"Please refer to the [documentation](https://docs.aws.amazon.com/documentdb/latest/developerguide/vector-search.html) if you have questions about certain parameters"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "183741cf8f4c7c53",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:50:16.732718Z",
"start_time": "2023-10-10T17:50:16.716642Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores.documentdb import (\n",
" DocumentDBSimilarityType,\n",
" DocumentDBVectorSearch,\n",
")\n",
"\n",
"SOURCE_FILE_NAME = \"../../modules/state_of_the_union.txt\"\n",
"\n",
"loader = TextLoader(SOURCE_FILE_NAME)\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"# OpenAI Settings\n",
"model_deployment = os.getenv(\n",
" \"OPENAI_EMBEDDINGS_DEPLOYMENT\", \"smart-agent-embedding-ada\"\n",
")\n",
"model_name = os.getenv(\"OPENAI_EMBEDDINGS_MODEL_NAME\", \"text-embedding-ada-002\")\n",
"\n",
"\n",
"openai_embeddings: OpenAIEmbeddings = OpenAIEmbeddings(\n",
" deployment=model_deployment, model=model_name\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39ae6058c2f7fdf1",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:51:17.980698Z",
"start_time": "2023-10-10T17:51:11.786336Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"{ 'createdCollectionAutomatically' : false,\n",
" 'numIndexesBefore' : 1,\n",
" 'numIndexesAfter' : 2,\n",
" 'ok' : 1,\n",
" 'operationTime' : Timestamp(1703656982, 1)}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pymongo import MongoClient\n",
"\n",
"INDEX_NAME = \"izzy-test-index-2\"\n",
"NAMESPACE = \"izzy_test_db.izzy_test_collection\"\n",
"DB_NAME, COLLECTION_NAME = NAMESPACE.split(\".\")\n",
"\n",
"client: MongoClient = MongoClient(CONNECTION_STRING)\n",
"collection = client[DB_NAME][COLLECTION_NAME]\n",
"\n",
"model_deployment = os.getenv(\n",
" \"OPENAI_EMBEDDINGS_DEPLOYMENT\", \"smart-agent-embedding-ada\"\n",
")\n",
"model_name = os.getenv(\"OPENAI_EMBEDDINGS_MODEL_NAME\", \"text-embedding-ada-002\")\n",
"\n",
"vectorstore = DocumentDBVectorSearch.from_documents(\n",
" documents=docs,\n",
" embedding=openai_embeddings,\n",
" collection=collection,\n",
" index_name=INDEX_NAME,\n",
")\n",
"\n",
"# number of dimensions used by model above\n",
"dimensions = 1536\n",
"\n",
"# specify similarity algorithm, valid options are:\n",
"# cosine (COS), euclidean (EUC), dotProduct (DOT)\n",
"similarity_algorithm = DocumentDBSimilarityType.COS\n",
"\n",
"vectorstore.create_index(dimensions, similarity_algorithm)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0666efbe",
"metadata": {},
"outputs": [],
"source": [
"# perform a similarity search between the embedding of the query and the embeddings of the documents\n",
"query = \"What did the President say about Ketanji Brown Jackson\"\n",
"docs = vectorstore.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "48b6dcca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "37e4df8c7d7db851",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Once the documents have been loaded and the index has been created, you can now instantiate the vector store directly and run queries against the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c218ab6f59301f7",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:52:14.994861Z",
"start_time": "2023-10-10T17:52:13.986379Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"vectorstore = DocumentDBVectorSearch.from_connection_string(\n",
" connection_string=CONNECTION_STRING,\n",
" namespace=NAMESPACE,\n",
" embedding=openai_embeddings,\n",
" index_name=INDEX_NAME,\n",
")\n",
"\n",
"# perform a similarity search between a query and the ingested documents\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = vectorstore.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba431631-eb5c-4559-b504-4546a9247048",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd67e4d92c9ab32f",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-10T17:53:21.145431Z",
"start_time": "2023-10-10T17:53:20.884531Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# perform a similarity search between a query and the ingested documents\n",
"query = \"Which stats did the President share about the U.S. economy\"\n",
"docs = vectorstore.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b63c73c7e905001c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"And unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind. \n",
"\n",
"And it worked. It created jobs. Lots of jobs. \n",
"\n",
"In fact—our economy created over 6.5 Million new jobs just last year, more jobs created in one year \n",
"than ever before in the history of America. \n",
"\n",
"Our economy grew at a rate of 5.7% last year, the strongest growth in nearly 40 years, the first step in bringing fundamental change to an economy that hasnt worked for the working people of this nation for too long. \n",
"\n",
"For the past 40 years we were told that if we gave tax breaks to those at the very top, the benefits would trickle down to everyone else. \n",
"\n",
"But that trickle-down theory led to weaker economic growth, lower wages, bigger deficits, and the widest gap between those at the top and everyone else in nearly a century.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "0f9ded8b",
"metadata": {},
"source": [
"## Question Answering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67351360",
"metadata": {},
"outputs": [],
"source": [
"qa_retriever = vectorstore.as_retriever(\n",
" search_type=\"similarity\",\n",
" search_kwargs={\"k\": 25},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aadaeca5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"prompt_template = \"\"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n",
"\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"PROMPT = PromptTemplate(\n",
" template=prompt_template, input_variables=[\"context\", \"question\"]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2280140e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain_openai import OpenAI\n",
"\n",
"qa = RetrievalQA.from_chain_type(\n",
" llm=OpenAI(),\n",
" chain_type=\"stuff\",\n",
" retriever=qa_retriever,\n",
" return_source_documents=True,\n",
" chain_type_kwargs={\"prompt\": PROMPT},\n",
")\n",
"\n",
"docs = qa({\"query\": \"gpt-4 compute requirements\"})\n",
"\n",
"print(docs[\"result\"])\n",
"print(docs[\"source_documents\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}