[docs]: vector store integration pages (#24858)

Co-authored-by: Erick Friis <erick@langchain.dev>
Isaac Francisco
2024-08-06 10:20:27 -07:00
committed by GitHub
parent 2c798622cd
commit a72fddbf8d
29 changed files with 5649 additions and 4436 deletions

View File

@@ -0,0 +1,2 @@
# files generated by faiss.ipynb
faiss_index

View File

@@ -5,33 +5,13 @@
"id": "66d0270a-b74f-4110-901e-7960b00297af",
"metadata": {},
"source": [
"# Astra DB\n",
"# Astra DB Vector Store\n",
"\n",
"This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a Vector Store."
]
},
{
"cell_type": "markdown",
"id": "ab8cd64f-3bb2-4f16-a0a9-12d7b1789bf6",
"metadata": {},
"source": [
"> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API."
]
},
{
"cell_type": "markdown",
"id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
"metadata": {},
"source": [
"_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
]
},
{
"cell_type": "markdown",
"id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
"metadata": {},
"source": [
"## Setup and general dependencies"
"This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a Vector Store.\n",
"\n",
"> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API.\n",
"\n",
"## Setup"
]
},
{
@@ -39,7 +19,7 @@
"id": "dbe7c156-0413-47e3-9237-4769c4248869",
"metadata": {},
"source": [
"Use of the integration requires the corresponding Python package:"
"Use of the integration requires the `langchain-astradb` partner package:"
]
},
{
@@ -49,54 +29,61 @@
"metadata": {},
"outputs": [],
"source": [
"pip install -qU langchain-astradb"
"pip install -qU \"langchain-astradb>=0.3.3\""
]
},
{
"cell_type": "markdown",
"id": "2453d83a-bc8f-41e1-a692-befe4dd90156",
"id": "319bf84b",
"metadata": {},
"source": [
"_Make sure you have installed the packages required to run all of this demo:_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56c1f86e-5921-4976-ac8f-1d62e5a512b0",
"metadata": {},
"outputs": [],
"source": [
"pip install -qU langchain langchain-community langchain-openai datasets pypdf"
]
},
{
"cell_type": "markdown",
"id": "c2910035-e61f-48d9-a110-d68c401b62aa",
"metadata": {},
"source": [
"### Import dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06619af-fea2-4863-8149-7f239a8c9c82",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"### Credentials\n",
"\n",
"from astrapy.info import CollectionVectorServiceOptions\n",
"from datasets import load_dataset\n",
"from langchain_community.document_loaders import PyPDFLoader\n",
"from langchain_core.documents import Document\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter"
"In order to use the AstraDB vector store, you must first head to the [AstraDB website](https://astra.datastax.com), create an account, and then create a new database - the initialization might take a few minutes. \n",
"\n",
"Once the database has been initialized, you should [create an application token](https://docs.datastax.com/en/astra-db-serverless/administration/manage-application-tokens.html#generate-application-token) and save it for later use. \n",
"\n",
"You will also want to copy the `API Endpoint` from the `Database Details` and store that in the `ASTRA_DB_API_ENDPOINT` variable.\n",
"\n",
"You may optionally provide a namespace, which you can manage from the `Data Explorer` tab of your database dashboard. If you don't wish to set a namespace, you can leave the `getpass` prompt for `ASTRA_DB_NAMESPACE` empty."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b7843c22",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"\n",
"ASTRA_DB_API_ENDPOINT = getpass.getpass(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_APPLICATION_TOKEN = getpass.getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n",
"\n",
"desired_namespace = getpass.getpass(\"ASTRA_DB_NAMESPACE = \")\n",
"if desired_namespace:\n",
" ASTRA_DB_NAMESPACE = desired_namespace\n",
"else:\n",
" ASTRA_DB_NAMESPACE = None"
]
},
{
"cell_type": "markdown",
"id": "e1c5cd9e",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cb739c0",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
@@ -104,48 +91,59 @@
"id": "22866f09-e10d-4f05-a24b-b9420129462e",
"metadata": {},
"source": [
"## Import the Vector Store"
"## Initialization\n",
"\n",
"There are two ways to create an Astra DB vector store, which differ in how the embeddings are computed.\n",
"\n",
"#### Method 1: Explicit embeddings\n",
"\n",
"You can separately instantiate a `langchain_core.embeddings.Embeddings` class and pass it to the `AstraDBVectorStore` constructor, just like with most other LangChain vector stores.\n",
"\n",
"#### Method 2: Integrated embedding computation\n",
"\n",
"Alternatively, you can use the [Vectorize](https://www.datastax.com/blog/simplifying-vector-embedding-generation-with-astra-vectorize) feature of Astra DB and simply specify the name of a supported embedding model when creating the store. The embedding computations are entirely handled within the database. (To proceed with this method, you must have enabled the desired embedding integration for your database, as described [in the docs](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).)\n",
"\n",
"### Explicit Embedding Initialization\n",
"\n",
"Below, we instantiate our vector store using the explicit embedding class:\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "d71a1dcb",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "0b32730d-176e-414c-9d91-fd3644c54211",
"metadata": {},
"outputs": [],
"source": [
"from langchain_astradb import AstraDBVectorStore"
]
},
{
"cell_type": "markdown",
"id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
"metadata": {},
"source": [
"## DB Connection parameters\n",
"from langchain_astradb import AstraDBVectorStore\n",
"\n",
"These are found on your Astra DB dashboard:\n",
"\n",
"- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
"- the Token looks like `AstraCS:6gBhNmsk135....`\n",
"- you may optionally provide a _Namespace_ such as `my_namespace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
"metadata": {},
"outputs": [],
"source": [
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_APPLICATION_TOKEN = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n",
"\n",
"desired_namespace = input(\"(optional) Namespace = \")\n",
"if desired_namespace:\n",
" ASTRA_DB_KEYSPACE = desired_namespace\n",
"else:\n",
" ASTRA_DB_KEYSPACE = None"
"vector_store = AstraDBVectorStore(\n",
" collection_name=\"astra_vector_langchain\",\n",
" embedding=embeddings,\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_APPLICATION_TOKEN,\n",
" namespace=ASTRA_DB_NAMESPACE,\n",
")"
]
},
{
@@ -153,85 +151,14 @@
"id": "84a1fe85-a42c-4f15-92e1-f79f1dd43ea2",
"metadata": {},
"source": [
"## Create the vector store\n",
"\n",
"There are two ways to create an Astra DB vector store, which differ in how the embeddings are computed.\n",
"\n",
"*Explicit embeddings*. You can separately instantiate a `langchain_core.embeddings.Embeddings` class and pass it to the `AstraDBVectorStore` constructor, just like with most other LangChain vector stores.\n",
"\n",
"*Integrated embedding computation*. Alternatively, you can use the [Vectorize](https://www.datastax.com/blog/simplifying-vector-embedding-generation-with-astra-vectorize) feature of Astra DB and simply specify the name of a supported embedding model when creating the store. The embedding computations are entirely handled within the database. (To proceed with this method, you must have enabled the desired embedding integration for your database, as described [in the docs](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).)\n",
"\n",
"**Please choose one method and run the corresponding cells only.**"
]
},
{
"cell_type": "markdown",
"id": "8c435386-e8d5-41f4-a9e5-7b609ef781f9",
"metadata": {},
"source": [
"### Method 1: provide embeddings explicitly\n",
"\n",
"This demo will use an OpenAI embedding model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfa5c005-9738-4c53-b8a8-8540fcbb8bad",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3accae6f-73e2-483a-83f7-76eb33558a1f",
"metadata": {},
"outputs": [],
"source": [
"my_embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "465b1b16-5363-4c4f-9917-a49e02a86c14",
"metadata": {},
"source": [
"Now you can create the vector store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
"metadata": {},
"outputs": [],
"source": [
"vstore = AstraDBVectorStore(\n",
" embedding=my_embeddings,\n",
" collection_name=\"astra_vector_demo\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_APPLICATION_TOKEN,\n",
" namespace=ASTRA_DB_KEYSPACE,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5d5d2bfa-c071-4a5b-8b6e-3daa1b6de164",
"metadata": {},
"source": [
"### Method 2: use Astra Vectorize (embeddings integrated in Astra DB)\n",
"### Integrated Embedding Initialization\n",
"\n",
"Here it is assumed that you have\n",
"\n",
"- enabled the OpenAI integration in your Astra DB organization,\n",
"- added an API Key named `\"MY_OPENAI_API_KEY\"` to the integration, and\n",
"- scoped it to the database you are using.\n",
"- Enabled the OpenAI integration in your Astra DB organization,\n",
"- Added an API Key named `\"OPENAI_API_KEY\"` to the integration, and scoped it to the database you are using.\n",
"\n",
"For more details please consult the [documentation](https://docs.datastax.com/en/astra-db-serverless/integrations/embedding-providers/openai.html)."
"For more details on how to do this, please consult the [documentation](https://docs.datastax.com/en/astra-db-serverless/integrations/embedding-providers/openai.html)."
]
},
{
@@ -241,312 +168,355 @@
"metadata": {},
"outputs": [],
"source": [
"from astrapy.info import CollectionVectorServiceOptions\n",
"\n",
"openai_vectorize_options = CollectionVectorServiceOptions(\n",
" provider=\"openai\",\n",
" model_name=\"text-embedding-3-small\",\n",
" authentication={\n",
" \"providerKey\": \"MY_OPENAI_API_KEY\",\n",
" \"providerKey\": \"OPENAI_API_KEY\",\n",
" },\n",
")\n",
"\n",
"vstore = AstraDBVectorStore(\n",
" collection_name=\"astra_vectorize_demo\",\n",
"vector_store_integrated = AstraDBVectorStore(\n",
" collection_name=\"astra_vector_langchain_integrated\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_APPLICATION_TOKEN,\n",
" namespace=ASTRA_DB_KEYSPACE,\n",
" namespace=ASTRA_DB_NAMESPACE,\n",
" collection_vector_service_options=openai_vectorize_options,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
"id": "d3796b39",
"metadata": {},
"source": [
"## Load a dataset"
]
},
{
"cell_type": "markdown",
"id": "552e56b0-301a-4b06-99c7-57ba6faa966f",
"metadata": {},
"source": [
"Convert each entry in the source dataset into a `Document`, then write them into the vector store:"
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
"execution_count": 23,
"id": "afb3e155",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[UUID('89a5cea1-5f3d-47c1-89dc-7e36e12cf4de'),\n",
" UUID('d4e78c48-f954-4612-8a38-af22923ba23b'),\n",
" UUID('058e4046-ded0-4fc1-b8ac-60e5a5f08ea0'),\n",
" UUID('50ab2a9a-762c-4b78-b102-942a86d77288'),\n",
" UUID('1da5a3c1-ba51-4f2f-aaaf-79a8f5011ce3'),\n",
" UUID('f3055d9e-2eb1-4d25-838e-2c70548f91b5'),\n",
" UUID('4bf0613d-08d0-4fbc-a43c-4955e4c9e616'),\n",
" UUID('18008625-8fd4-45c2-a0d7-92a2cde23dbc'),\n",
" UUID('c712e06f-790b-4fd4-9040-7ab3898965d0'),\n",
" UUID('a9b84820-3445-4810-a46c-e77b76ab85bc')]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
"from uuid import uuid4\n",
"\n",
"docs = []\n",
"for entry in philo_dataset:\n",
" metadata = {\"author\": entry[\"author\"]}\n",
" doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
" docs.append(doc)\n",
"from langchain_core.documents import Document\n",
"\n",
"inserted_ids = vstore.add_documents(docs)\n",
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "79d4f436-ef04-4288-8f79-97c9abb983ed",
"id": "dfce4edc",
"metadata": {},
"source": [
"In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n",
"### Delete items from vector store\n",
"\n",
"_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._"
]
},
{
"cell_type": "markdown",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
"metadata": {},
"source": [
"Add some more entries, this time with `add_texts`:"
"We can delete items from our vector store by ID by using the `delete` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
"execution_count": 24,
"id": "d3f69315",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
"metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
"ids = [\"desc_01\", \"huss_xy\"]\n",
"vector_store.delete(ids=uuids[-1])"
]
},
{
"cell_type": "markdown",
"id": "d12e1a07",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
]
},
{
"cell_type": "markdown",
"id": "63840eb3-8b29-4017-bc2f-301bf5001f28",
"metadata": {},
"source": [
"_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for_\n",
"_these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings_\n",
"_for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._"
]
},
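A minimal, hedged sketch of such tuning. The `*_concurrency` parameter names below are assumptions based on the note above; verify them against the `AstraDBVectorStore` constructor docstring for your installed version:

```python
# Hypothetical tuning sketch: the *_concurrency parameter names are taken
# from the note above and may differ across versions; check the constructor
# docstring before relying on them.
tuned_store = AstraDBVectorStore(
    embedding=my_embeddings,
    collection_name="astra_vector_demo",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace=ASTRA_DB_KEYSPACE,
    bulk_insert_batch_concurrency=20,  # how many insertion batches run in parallel
    bulk_delete_concurrency=20,  # how many deletions run in parallel
)
```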
{
"cell_type": "markdown",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
"metadata": {},
"source": [
"## Run searches"
]
},
{
"cell_type": "markdown",
"id": "02a77d8e-1aae-4054-8805-01c77947c49f",
"metadata": {},
"source": [
"This section demonstrates metadata filtering and getting the similarity scores back:"
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search with filtering on metadata can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1761806a-1afd-4491-867c-25a80d92b9fe",
"execution_count": 15,
"id": "770b3467",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter={\"source\": \"tweet\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
"cell_type": "markdown",
"id": "ce112165",
"metadata": {},
"outputs": [],
"source": [
"results_filtered = vstore.similarity_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"plato\"},\n",
")\n",
"for res in results_filtered:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
"#### Similarity search with score\n",
"\n",
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
"execution_count": 16,
"id": "5924309a",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.776585] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]\n"
]
}
],
"source": [
"results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
"results = vector_store.similarity_search_with_score(\n",
" \"Will it be hot tomorrow?\", k=1, filter={\"source\": \"news\"}\n",
")\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "b14ea558-bfbe-41ce-807e-d70670060ada",
"id": "fead7af5",
"metadata": {},
"source": [
"### MMR (Maximal-marginal-relevance) search\n",
"#### Other search methods\n",
"\n",
"_Note: the MMR search method is not (yet) supported for vector stores built with Astra Vectorize._"
"There are a variety of other search methods that are not covered in this notebook, such as MMR search or searching by vector. For a full list of the search abilities available for `AstraDBVectorStore` check out the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html)."
]
},
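For instance, a maximal-marginal-relevance (MMR) search can be run directly on the store; a minimal sketch reusing the `vector_store` and documents from above (per the note above, MMR is not available for stores built with Astra Vectorize):

```python
# MMR balances similarity to the query against diversity among the results.
results = vector_store.max_marginal_relevance_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
```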
{
"cell_type": "markdown",
"id": "7e40f714",
"metadata": {},
"source": [
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. \n",
"\n",
"Here is how to transform your vector store into a retriever and then invoke the retreiever with a simple query and filter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
"execution_count": 17,
"id": "dcee50e6",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = vstore.max_marginal_relevance_search(\n",
" \"Our life is what we make of it\",\n",
" k=3,\n",
" filter={\"author\": \"aristotle\"},\n",
"retriever = vector_store.as_retriever(\n",
" search_type=\"similarity_score_threshold\",\n",
" search_kwargs={\"k\": 1, \"score_threshold\": 0.5},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
{
"cell_type": "markdown",
"id": "60fda5df-14e4-4fb0-bd17-65a393fab8a9",
"id": "734e683a",
"metadata": {},
"source": [
"### Async\n",
"## Chain usage\n",
"\n",
"Note that the Astra DB vector store supports all fully async methods (`asimilarity_search`, `afrom_texts`, `adelete` and so on) natively, i.e. without thread wrapping involved."
]
},
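As a short sketch of that native async support (notebooks allow top-level `await`; reusing the `vector_store` from above):

```python
# The async counterparts mirror the sync methods, with no thread pool involved.
results = await vector_store.asimilarity_search(
    "LangChain provides abstractions to make working with LLMs easy", k=2
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
```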
{
"cell_type": "markdown",
"id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
"metadata": {},
"source": [
"## Deleting stored documents"
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38a70ec4-b522-4d32-9ead-c642864fca37",
"execution_count": 25,
"id": "9b3cc97b",
"metadata": {},
"outputs": [],
"source": [
"delete_1 = vstore.delete(inserted_ids[:3])\n",
"print(f\"all_succeed={delete_1}\") # True, all documents deleted"
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
"execution_count": 21,
"id": "08401498",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of such applications. Its capabilities make it a preferred choice for developers in this domain.'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delete_2 = vstore.delete(inserted_ids[2:5])\n",
"print(f\"some_succeeds={delete_2}\") # True, though some IDs were gone already"
]
},
{
"cell_type": "markdown",
"id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
"metadata": {},
"source": [
"## A minimal RAG chain"
]
},
{
"cell_type": "markdown",
"id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
"metadata": {},
"source": [
"The next cells will implement a simple RAG pipeline:\n",
"- download a sample PDF file and load it onto the store;\n",
"- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
"- run the question-answering chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
"metadata": {},
"outputs": [],
"source": [
"!curl -L \\\n",
" \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
" -o \"what-is-philosophy.pdf\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
"metadata": {},
"outputs": [],
"source": [
"pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
"docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
"inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
"print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
"metadata": {},
"outputs": [],
"source": [
"retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"philo_template = \"\"\"\n",
"You are a philosopher that draws inspiration from great thinkers of the past\n",
"to craft well-thought answers to user questions. Use the provided context as the basis\n",
"for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
"Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.\n",
"\n",
"CONTEXT:\n",
"{context}\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"QUESTION: {question}\n",
"\n",
"YOUR ANSWER:\"\"\"\n",
"\n",
"philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
"\n",
"llm = ChatOpenAI()\n",
"\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | philo_prompt\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
"metadata": {},
"outputs": [],
"source": [
"chain.invoke(\"How does Russel elaborate on Peirce's idea of the security blanket?\")"
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
@@ -562,7 +532,7 @@
"id": "177610c7-50d0-4b7b-8634-b03338054c8e",
"metadata": {},
"source": [
"## Cleanup"
"## Cleanup vector store"
]
},
{
@@ -582,7 +552,17 @@
"metadata": {},
"outputs": [],
"source": [
"vstore.delete_collection()"
"vector_store.delete_collection()"
]
},
{
"cell_type": "markdown",
"id": "a14c34be",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `AstraDBVectorStore` features and configurations head to the API reference:https://api.python.langchain.com/en/latest/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html"
]
}
],
@@ -602,7 +582,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -7,30 +7,23 @@
"source": [
"# Chroma\n",
"\n",
">[Chroma](https://docs.trychroma.com/getting-started) is a AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.\n",
"This notebook covers how to get started with the `Chroma` vector store.\n",
"\n",
">[Chroma](https://docs.trychroma.com/getting-started) is a AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0. View the full docs of `Chroma` at [this page](https://docs.trychroma.com/reference/py-collection), and find the API reference for the LangChain integration at [this page](https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html).\n",
"\n",
"Install Chroma with:\n",
"## Setup\n",
"\n",
"```sh\n",
"pip install langchain-chroma\n",
"```\n",
"\n",
"Chroma runs in various modes. See below for examples of each integrated with LangChain.\n",
"- `in-memory` - in a python script or jupyter notebook\n",
"- `in-memory with persistance` - in a script or notebook and save/load to disk\n",
"- `in a docker container` - as a server running your local machine or in the cloud\n",
"\n",
"Like any other database, you can: \n",
"- `.add` \n",
"- `.get` \n",
"- `.update`\n",
"- `.upsert`\n",
"- `.delete`\n",
"- `.peek`\n",
"- and `.query` runs the similarity search.\n",
"\n",
"View full docs at [docs](https://docs.trychroma.com/reference/py-collection). To access these methods directly, you can do `._collection.method()`\n"
"To access `Chroma` vector stores you'll need to install the `langchain-chroma` integration package."
]
},
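A quick, hedged sketch of that direct-access pattern, assuming a `db` store created as in the examples below (`count` and `peek` are standard chromadb collection methods):

```python
# Direct access to the underlying chromadb collection bypasses the
# LangChain wrapper and exposes the raw collection API.
print(db._collection.count())  # number of stored embeddings
print(db._collection.peek(limit=1))  # a sample record
```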
{
"cell_type": "code",
"execution_count": null,
"id": "83a43688",
"metadata": {},
"outputs": [],
"source": [
"pip install -qU \"langchain-chroma>=0.1.2\""
]
},
{
@@ -38,149 +31,94 @@
"id": "2b5ffbf8",
"metadata": {},
"source": [
"## Basic Example\n",
"### Credentials\n",
"\n",
"In this basic example, we take the most recent State of the Union Address, split it into chunks, embed it using an open-source embedding model, load it into Chroma, and then query it."
"You can use the `Chroma` vector store without any credentials, simply installing the package above is enough!"
]
},
{
"cell_type": "markdown",
"id": "cd17cfed",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd7e1243",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"id": "f47f73f4",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"### Basic Initialization \n",
"\n",
"Below is a basic initialization, including the use of a directory to save the data locally.\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ae9fcf3e",
"id": "d3ed0a9a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"outputs": [],
"source": [
"# import\n",
"from langchain_chroma import Chroma\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.embeddings.sentence_transformer import (\n",
" SentenceTransformerEmbeddings,\n",
")\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"# load the document and split it into chunks\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"\n",
"# split it into chunks\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"# create the open-source embedding function\n",
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
"\n",
"# load it into Chroma\n",
"db = Chroma.from_documents(docs, embedding_function)\n",
"\n",
"# query it\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"\n",
"# print results\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "5c9a11cc",
"metadata": {},
"source": [
"## Basic Example (including saving to disk)\n",
"\n",
"Extending the previous example, if you want to save to disk, simply initialize the Chroma client and pass the directory where you want the data to be saved to. \n",
"\n",
"`Caution`: Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stop each other's work. As a best practice, only have one client per path running at any given time."
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "49f9bd49",
"execution_count": 16,
"id": "3ea11a7b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"outputs": [],
"source": [
"# save to disk\n",
"db2 = Chroma.from_documents(docs, embedding_function, persist_directory=\"./chroma_db\")\n",
"docs = db2.similarity_search(query)\n",
"from langchain_chroma import Chroma\n",
"\n",
"# load from disk\n",
"db3 = Chroma(persist_directory=\"./chroma_db\", embedding_function=embedding_function)\n",
"docs = db3.similarity_search(query)\n",
"print(docs[0].page_content)"
"vector_store = Chroma(\n",
" collection_name=\"example_collection\",\n",
" embedding_function=embeddings,\n",
" persist_directory=\"./chroma_langchain_db\", # Where to save data locally, remove if not neccesary\n",
")"
]
},
{
"cell_type": "markdown",
"id": "63318cc9",
"id": "ccb62a8c",
"metadata": {},
"source": [
"## Passing a Chroma Client into Langchain\n",
"### Initialization from client\n",
"\n",
"You can also create a Chroma Client and pass it to LangChain. This is particularly useful if you want easier access to the underlying database.\n",
"\n",
"You can also specify the collection name that you want LangChain to use."
"You can also initialize from a `Chroma` client, which is particularly useful if you want easier access to the underlying database."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22f4a0ce",
"id": "3fe4457f",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Add of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Add of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Add of existing embedding ID: 1\n",
"Insert of existing embedding ID: 1\n",
"Add of existing embedding ID: 2\n",
"Insert of existing embedding ID: 2\n",
"Add of existing embedding ID: 3\n",
"Insert of existing embedding ID: 3\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 3 in the collection\n"
]
}
],
"outputs": [],
"source": [
"import chromadb\n",
"\n",
@@ -188,320 +126,320 @@
"collection = persistent_client.get_or_create_collection(\"collection_name\")\n",
"collection.add(ids=[\"1\", \"2\", \"3\"], documents=[\"a\", \"b\", \"c\"])\n",
"\n",
"langchain_chroma = Chroma(\n",
"vector_store_from_client = Chroma(\n",
" client=persistent_client,\n",
" collection_name=\"collection_name\",\n",
" embedding_function=embedding_function,\n",
")\n",
"\n",
"print(\"There are\", langchain_chroma._collection.count(), \"in the collection\")"
" embedding_function=embeddings,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e9cf6d70",
"id": "9d037340",
"metadata": {},
"source": [
"## Basic Example (using the Docker Container)\n",
"## Manage vector store\n",
"\n",
"You can also run the Chroma Server in a Docker container separately, create a Client to connect to it, and then pass that to LangChain. \n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"Chroma has the ability to handle multiple `Collections` of documents, but the LangChain interface expects one, so we need to specify the collection name. The default collection name used by LangChain is \"langchain\".\n",
"### Add items to vector store\n",
"\n",
"Here is how to clone, build, and run the Docker Image:\n",
"```sh\n",
"git clone git@github.com:chroma-core/chroma.git\n",
"```\n",
"\n",
"Edit the `docker-compose.yml` file and add `ALLOW_RESET=TRUE` under `environment`\n",
"```yaml\n",
" ...\n",
" command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8000 --log-config log_config.yml\n",
" environment:\n",
" - IS_PERSISTENT=TRUE\n",
" - ALLOW_RESET=TRUE\n",
" ports:\n",
" - 8000:8000\n",
" ...\n",
"```\n",
"\n",
"Then run `docker-compose up -d --build`"
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "74aee70e",
"execution_count": 17,
"id": "da279339",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"# create the chroma client\n",
"import uuid\n",
"\n",
"import chromadb\n",
"from chromadb.config import Settings\n",
"\n",
"client = chromadb.HttpClient(settings=Settings(allow_reset=True))\n",
"client.reset() # resets the database\n",
"collection = client.create_collection(\"my_collection\")\n",
"for doc in docs:\n",
" collection.add(\n",
" ids=[str(uuid.uuid1())], metadatas=doc.metadata, documents=doc.page_content\n",
" )\n",
"\n",
"# tell LangChain to use our client and collection name\n",
"db4 = Chroma(\n",
" client=client,\n",
" collection_name=\"my_collection\",\n",
" embedding_function=embedding_function,\n",
")\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db4.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "9ed3ec50",
"metadata": {},
"source": [
"## Update and Delete\n",
"\n",
"While building toward a real application, you want to go beyond adding data, and also update and delete data. \n",
"\n",
"Chroma has users provide `ids` to simplify the bookkeeping here. `ids` can be the name of the file, or a combined has like `filename_paragraphNumber`, etc.\n",
"\n",
"Chroma supports all these operations - though some of them are still being integrated all the way through the LangChain interface. Additional workflow improvements will be added soon.\n",
"\n",
"Here is a basic example showing how to do various operations:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "81a02810",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': '../../../state_of_the_union.txt'}\n",
"{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../../state_of_the_union.txt'}], 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.']}\n",
"count before 46\n",
"count after 45\n"
]
}
],
"source": [
"# create simple ids\n",
"ids = [str(i) for i in range(1, len(docs) + 1)]\n",
"\n",
"# add data\n",
"example_db = Chroma.from_documents(docs, embedding_function, ids=ids)\n",
"docs = example_db.similarity_search(query)\n",
"print(docs[0].metadata)\n",
"\n",
"# update the metadata for a document\n",
"docs[0].metadata = {\n",
" \"source\": \"../../how_to/state_of_the_union.txt\",\n",
" \"new_value\": \"hello world\",\n",
"}\n",
"example_db.update_document(ids[0], docs[0])\n",
"print(example_db._collection.get(ids=[ids[0]]))\n",
"\n",
"# delete the last document\n",
"print(\"count before\", example_db._collection.count())\n",
"example_db._collection.delete(ids=[ids[-1]])\n",
"print(\"count after\", example_db._collection.count())"
]
},
{
"cell_type": "markdown",
"id": "ac6bc71a",
"metadata": {},
"source": [
"## Use OpenAI Embeddings\n",
"\n",
"Many people like to use OpenAIEmbeddings, here is how to set that up."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "42080f37-8fd1-4cec-acd9-15d2b03b2f4d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# get a token: https://platform.openai.com/account/api-keys\n",
"\n",
"from getpass import getpass\n",
"\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"OPENAI_API_KEY = getpass()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c7a94d6c-b4d4-4498-9bdd-eb50c92b85c5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5eabdb75",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"embeddings = OpenAIEmbeddings()\n",
"new_client = chromadb.EphemeralClient()\n",
"openai_lc_client = Chroma.from_documents(\n",
" docs, embeddings, client=new_client, collection_name=\"openai_collection\"\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = openai_lc_client.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "6d9c28ad",
"metadata": {},
"source": [
"***\n",
"\n",
"## Other Information"
]
},
{
"cell_type": "markdown",
"id": "18152965",
"metadata": {},
"source": [
"### Similarity search with score"
]
},
{
"cell_type": "markdown",
"id": "346347d7",
"metadata": {},
"source": [
"The returned distance score is cosine distance. Therefore, a lower score is better."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "72aaa9c8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d88e958e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),\n",
" 1.1972057819366455)"
"['f22ed484-6db3-4b76-adb1-18a777426cd6',\n",
" 'e0d5bab4-6453-4511-9a37-023d9d288faa',\n",
" '877d76b8-3580-4d9e-a13f-eed0fa3d134a',\n",
" '26eaccab-81ce-4c0a-8e76-bf542647df18',\n",
" 'bcaa8239-7986-4050-bf40-e14fb7dab997',\n",
" 'cdc44b38-a83f-4e49-b249-7765b334e09d',\n",
" 'a7a35354-2687-4bc2-8242-3849a4d18d34',\n",
" '8780caf1-d946-4f27-a707-67d037e9e1d8',\n",
" 'dec6af2a-7326-408f-893d-7d7d717dfda9',\n",
" '3b18e210-bb59-47a0-8e17-c8e51176ea5e']"
]
},
"execution_count": 10,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=1,\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
" id=2,\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=3,\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
" id=4,\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=5,\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
" id=6,\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
" id=7,\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=8,\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
" id=9,\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=10,\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "794a7552",
"id": "7add6366",
"metadata": {},
"source": [
"### Retriever options\n",
"### Update items in vector store\n",
"\n",
"This section goes over different options for how to use Chroma as a retriever.\n",
"\n",
"#### MMR\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr`."
"Now that we have added documents to our vector store, we can update existing documents by using the `update_documents` function. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "96ff911a",
"execution_count": 5,
"id": "ef5dbd1e",
"metadata": {},
"outputs": [],
"source": [
"retriever = db.as_retriever(search_type=\"mmr\")"
"updated_document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and fried eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
" id=1,\n",
")\n",
"\n",
"updated_document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
" id=2,\n",
")\n",
"\n",
"vector_store.update_document(document_id=uuids[0], document=updated_document_1)\n",
"# You can also update multiple documents at once\n",
"vector_store.update_documents(\n",
" ids=uuids[:2], documents=[updated_document_1, updated_document_1]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "74b9a13a",
"metadata": {},
"source": [
"### Delete items from vector store\n",
"\n",
"We can also delete items from our vector store as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "56f17791",
"metadata": {},
"outputs": [],
"source": [
"vector_store.delete(ids=uuids[-1])"
]
},
{
"cell_type": "markdown",
"id": "213acf08",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e2b96fcf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter={\"source\": \"tweet\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "cdd117ea",
"metadata": {},
"source": [
"#### Similarity search with score\n",
"\n",
"If you want to execute a similarity search and receive the corresponding scores you can run:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2768a331",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=1.726390] The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\n",
" \"Will it be hot tomorrow?\", k=1, filter={\"source\": \"news\"}\n",
")\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "92b436c8",
"metadata": {},
"source": [
"#### Search by vector\n",
"\n",
"You can also search by vector:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8ea434a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* I had chocalate chip pancakes and fried eggs for breakfast this morning. [{'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_by_vector(\n",
" embedding=embeddings.embed_query(\"I love green eggs and ham!\"), k=1\n",
")\n",
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "9c1c1e6f",
"metadata": {},
"source": [
"#### Other search methods\n",
"\n",
"There are a variety of other search methods that are not covered in this notebook, such as MMR search or searching by vector. For a full list of the search abilities available for `AstraDBVectorStore` check out the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html).\n",
"\n",
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. For more information on the different search types and kwargs you can pass, please visit the API reference [here](https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma.as_retriever)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "f00be6d0",
"id": "7b6f7867",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
"[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 12,
@@ -510,41 +448,89 @@
}
],
"source": [
"retriever.invoke(query)[0]"
"retriever = vector_store.as_retriever(\n",
" search_type=\"mmr\", search_kwargs={\"k\": 1, \"fetch_k\": 5}\n",
")\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
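As referenced under "Other search methods" above, here is a hedged sketch of a direct MMR search on the store, via the generic `max_marginal_relevance_search` method that LangChain vector stores expose:

```python
# MMR re-ranks the fetch_k most similar candidates to return k diverse results.
results = vector_store.max_marginal_relevance_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    fetch_k=10,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
```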
{
"cell_type": "markdown",
"id": "275dbd0a",
"id": "a2b7b73c",
"metadata": {},
"source": [
"### Filtering on metadata\n",
"## Chain usage\n",
"\n",
"It can be helpful to narrow down the collection before working with it.\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"For example, collections can be filtered on metadata using the get method."
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "81600dc1",
"id": "9aad065b",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "84a19f48",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': []}"
"'LangGraph is used for building stateful, agentic applications. It provides a framework that supports the development of such applications efficiently.'"
]
},
"execution_count": 13,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# filter collection for updated source\n",
"example_db.get(where={\"source\": \"some_other_source\"})"
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
"cell_type": "markdown",
"id": "fed28359",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `Chroma` vector store features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html"
]
}
],
@@ -564,7 +550,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -9,37 +9,18 @@
"\n",
"> [ClickHouse](https://clickhouse.com/) is the fastest and most resource efficient open-source database for real-time apps and analytics with full SQL support and a wide range of functions to assist users in writing analytical queries. Lately added data structures and distance search functions (like `L2Distance`) as well as [approximate nearest neighbor search indexes](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes) enable ClickHouse to be used as a high performance and scalable vector database to store and search vectors with SQL.\n",
"\n",
"You'll need to install `langchain-community` with `pip install -qU langchain-community` to use this integration\n",
"This notebook shows how to use functionality related to the `ClickHouse` vector store.\n",
"\n",
"This notebook shows how to use functionality related to the `ClickHouse` vector search."
]
},
{
"cell_type": "markdown",
"id": "43ead5d5-2c1f-4dce-a69a-cb00e4f9d6f0",
"metadata": {},
"source": [
"## Setting up environments"
]
},
{
"cell_type": "markdown",
"id": "b2c434bc",
"metadata": {},
"source": [
"Setting up local clickhouse server with docker (optional)"
"## Setup\n",
"\n",
"First set up a local clickhouse server with docker:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "249a7751",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:43:43.035606Z",
"start_time": "2023-06-03T08:43:42.618531Z"
}
},
"id": "8c4d2e16",
"metadata": {},
"outputs": [],
"source": [
"! docker run -d -p 8123:8123 -p9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11"
@@ -47,52 +28,82 @@
},
{
"cell_type": "markdown",
"id": "7bd3c1c0",
"id": "0acb2a8d",
"metadata": {},
"source": [
"Setup up clickhouse client driver"
"You'll need to install `langchain-community` and `clickhouse-connect` to use this integration"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d614bf8",
"id": "d454fb7c",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet clickhouse-connect"
"pip install -qU langchain-community clickhouse-connect"
]
},
{
"cell_type": "markdown",
"id": "15a1d477-9cdb-4d82-b019-96951ecb2b72",
"id": "3df5501b",
"metadata": {},
"source": [
"We want to use OpenAIEmbeddings so we have to get the OpenAI API Key."
"### Credentials\n",
"\n",
"There are no credentials for this notebook, just make sure you have installed the packages as shown above."
]
},
{
"cell_type": "markdown",
"id": "54d5276f",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "91003ea5-0c8c-436c-a5de-aaeaeef2f458",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:49:35.383673Z",
"start_time": "2023-06-03T08:49:33.984547Z"
}
},
"execution_count": null,
"id": "f6fd5b03",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"id": "2b87fe34",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"if not os.environ[\"OPENAI_API_KEY\"]:\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "60276097",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aac9563e",
"metadata": {
"ExecuteTime": {
@@ -104,176 +115,178 @@
"outputs": [],
"source": [
"from langchain_community.vectorstores import Clickhouse, ClickhouseSettings\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter"
"\n",
"settings = ClickhouseSettings(table=\"clickhouse_example\")\n",
"vector_store = Clickhouse(embeddings, config=settings)"
]
},
{
"cell_type": "markdown",
"id": "32dd3f67",
"metadata": {},
"source": [
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3c3999a",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:33:32.527387Z",
"start_time": "2023-06-03T08:33:32.501312Z"
},
"tags": []
},
"execution_count": null,
"id": "944743ee",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from uuid import uuid4\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"from langchain_core.documents import Document\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6e104aee",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:33:35.503823Z",
"start_time": "2023-06-03T08:33:33.745832Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 2801.49it/s]\n"
]
}
],
"source": [
"for d in docs:\n",
" d.metadata = {\"some\": \"metadata\"}\n",
"settings = ClickhouseSettings(table=\"clickhouse_vector_search_example\")\n",
"docsearch = Clickhouse.from_documents(docs, embeddings, config=settings)\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "18af81cc",
"metadata": {},
"source": [
"### Delete items from vector store\n",
"\n",
"We can delete items from our vector store by ID by using the `delete` function."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9c608226",
"execution_count": null,
"id": "12b32762",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"outputs": [],
"source": [
"print(docs[0].page_content)"
"vector_store.delete(ids=uuids[-1])"
]
},
{
"cell_type": "markdown",
"id": "e3a8b105",
"id": "ada27577",
"metadata": {},
"source": [
"## Get connection info and data schema"
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "69996818",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:28:58.252991Z",
"start_time": "2023-06-03T08:28:58.197560Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[92m\u001b[1mdefault.clickhouse_vector_search_example @ localhost:8123\u001b[0m\n",
"\n",
"\u001b[1musername: None\u001b[0m\n",
"\n",
"Table Schema:\n",
"---------------------------------------------------\n",
"|\u001b[94mid \u001b[0m|\u001b[96mNullable(String) \u001b[0m|\n",
"|\u001b[94mdocument \u001b[0m|\u001b[96mNullable(String) \u001b[0m|\n",
"|\u001b[94membedding \u001b[0m|\u001b[96mArray(Float32) \u001b[0m|\n",
"|\u001b[94mmetadata \u001b[0m|\u001b[96mObject('json') \u001b[0m|\n",
"|\u001b[94muuid \u001b[0m|\u001b[96mUUID \u001b[0m|\n",
"---------------------------------------------------\n",
"\n"
]
}
],
"execution_count": null,
"id": "015831a3",
"metadata": {},
"outputs": [],
"source": [
"print(str(docsearch))"
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\", k=2\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "324ac147",
"id": "623d3b9d",
"metadata": {},
"source": [
"### Clickhouse table schema"
]
},
{
"cell_type": "markdown",
"id": "b5bd7c5b",
"metadata": {},
"source": [
"> Clickhouse table will be automatically created if not exist by default. Advanced users could pre-create the table with optimized settings. For distributed Clickhouse cluster with sharding, table engine should be configured as `Distributed`."
"#### Similarity search with score\n",
"\n",
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "54f4f561",
"execution_count": null,
"id": "e7d43430",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Clickhouse Table DDL:\n",
"\n",
"CREATE TABLE IF NOT EXISTS default.clickhouse_vector_search_example(\n",
" id Nullable(String),\n",
" document Nullable(String),\n",
" embedding Array(Float32),\n",
" metadata JSON,\n",
" uuid UUID DEFAULT generateUUIDv4(),\n",
" CONSTRAINT cons_vec_len CHECK length(embedding) = 1536,\n",
" INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000\n",
") ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192\n"
]
}
],
"outputs": [],
"source": [
"print(f\"Clickhouse Table DDL:\\n\\n{docsearch.schema}\")"
"results = vector_store.similarity_search_with_score(\"Will it be hot tomorrow?\", k=1)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "f59360c0",
"id": "f5a90c12",
"metadata": {},
"source": [
"## Filtering\n",
@@ -287,94 +300,131 @@
},
{
"cell_type": "code",
"execution_count": 9,
"id": "232055f6",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:29:36.680805Z",
"start_time": "2023-06-03T08:29:34.963676Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 6939.56it/s]\n"
]
}
],
"execution_count": null,
"id": "169d01d1",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.vectorstores import Clickhouse, ClickhouseSettings\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"for i, d in enumerate(docs):\n",
" d.metadata = {\"doc_id\": i}\n",
"\n",
"docsearch = Clickhouse.from_documents(docs, embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ddbcee77",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:29:43.487436Z",
"start_time": "2023-06-03T08:29:43.040831Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6779101415357189 {'doc_id': 0} Madam Speaker, Madam...\n",
"0.6997970363474885 {'doc_id': 8} And so many families...\n",
"0.7044504914336727 {'doc_id': 1} Groups of citizens b...\n",
"0.7053558702165094 {'doc_id': 6} And Im taking robus...\n"
]
}
],
"source": [
"meta = docsearch.metadata_column\n",
"output = docsearch.similarity_search_with_relevance_scores(\n",
" \"What did the president say about Ketanji Brown Jackson?\",\n",
"meta = vector_store.metadata_column\n",
"results = vector_store.similarity_search_with_relevance_scores(\n",
" \"What did I eat for breakfast?\",\n",
" k=4,\n",
" where_str=f\"{meta}.doc_id<10\",\n",
" where_str=f\"{meta}.source = 'tweet'\",\n",
")\n",
"for d, dist in output:\n",
" print(dist, d.metadata, d.page_content[:20] + \"...\")"
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "a359ed74",
"id": "d86fa4bf",
"metadata": {},
"source": [
"## Deleting your data"
"#### Other search methods\n",
"\n",
"There are a variety of other search methods that are not covered in this notebook, such as MMR search or searching by vector. For a full list of the search abilities available for `Clickhouse` vector store check out the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.clickhouse.Clickhouse.html)."
]
},
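{
 "cell_type": "code",
 "execution_count": null,
 "id": "3f9d2b71",
 "metadata": {},
 "outputs": [],
 "source": [
  "# A minimal search-by-vector sketch: embed the query text yourself and pass\n",
  "# the raw vector in. This assumes the `embeddings` object defined above.\n",
  "query_vector = embeddings.embed_query(\"I love green eggs and ham!\")\n",
  "results = vector_store.similarity_search_by_vector(query_vector, k=1)\n",
  "for doc in results:\n",
  "    print(f\"* {doc.page_content} [{doc.metadata}]\")"
 ]
},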
{
"cell_type": "markdown",
"id": "afacfd4e",
"metadata": {},
"source": [
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. \n",
"\n",
"Here is how to transform your vector store into a retriever and then invoke the retreiever with a simple query and filter."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fb6a9d36",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-03T08:30:24.822384Z",
"start_time": "2023-06-03T08:30:24.798571Z"
}
},
"execution_count": null,
"id": "97187188",
"metadata": {},
"outputs": [],
"source": [
"docsearch.drop()"
"retriever = vector_store.as_retriever(\n",
" search_type=\"similarity_score_threshold\",\n",
" search_kwargs={\"k\": 1, \"score_threshold\": 0.5},\n",
")\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
{
"cell_type": "markdown",
"id": "57fade30",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a7fec6b",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae6871dc",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
"cell_type": "markdown",
"id": "02452d34",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `AstraDBVectorStore` features and configurations head to the API reference:https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.clickhouse.Clickhouse.html"
]
}
],
@@ -394,7 +444,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -10,7 +10,7 @@
"\n",
"Vector Search is a part of the [Full Text Search Service](https://docs.couchbase.com/server/current/learn/services-and-indexes/services/search-service.html) (Search Service) in Couchbase.\n",
"\n",
"This tutorial explains how to use Vector Search in Couchbase. You can work with both [Couchbase Capella](https://www.couchbase.com/products/capella/) and your self-managed Couchbase Server."
"This tutorial explains how to use Vector Search in Couchbase. You can work with either [Couchbase Capella](https://www.couchbase.com/products/capella/) and your self-managed Couchbase Server."
]
},
{
@@ -18,30 +18,64 @@
"id": "43326be4-4433-4de2-ad42-6eb91a722bad",
"metadata": {},
"source": [
"## Installation"
"## Setup\n",
"\n",
"To access the `CouchbaseVectorStore` you first need to install the `langchain-couchbase` partner package:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "bec8d532-fec7-4dc7-9be3-020aa7bdb01f",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain langchain-openai langchain-couchbase"
"pip install -qU langchain-couchbase"
]
},
{
"cell_type": "markdown",
"id": "30d6861e",
"metadata": {},
"source": [
"### Credentials\n",
"\n",
"Head over to the Couchbase [website](https://cloud.couchbase.com) and create a new connection, making sure to save your database username and password:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4a972cbc-bf59-46eb-9b50-e5dc3a69dcf0",
"execution_count": null,
"id": "d98e3baa",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
"COUCHBASE_CONNECTION_STRING = getpass.getpass(\n",
" \"Enter the connection string for the Couchbase cluster: \"\n",
")\n",
"DB_USERNAME = getpass.getpass(\"Enter the username for the Couchbase cluster: \")\n",
"DB_PASSWORD = getpass.getpass(\"Enter the password for the Couchbase cluster: \")"
]
},
{
"cell_type": "markdown",
"id": "23ac2c64",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c25ec38",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
"# os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
]
},
{
@@ -49,18 +83,9 @@
"id": "acf1b168-622f-465c-a9a5-d27a6d7e7a8f",
"metadata": {},
"source": [
"## Import the Vector Store and Embeddings"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "23ce45ab-bfd2-42e1-b681-514a550f0232",
"metadata": {},
"outputs": [],
"source": [
"from langchain_couchbase.vectorstores import CouchbaseVectorStore\n",
"from langchain_openai import OpenAIEmbeddings"
"## Initialization\n",
"\n",
"Before instantiating we need to create a connection."
]
},
{
@@ -68,31 +93,18 @@
"id": "3144ba02-1eaa-4449-853e-f034ca5706bf",
"metadata": {},
"source": [
"## Create Couchbase Connection Object\n",
"### Create Couchbase Connection Object\n",
"\n",
"We create a connection to the Couchbase cluster initially and then pass the cluster object to the Vector Store. \n",
"\n",
"Here, we are connecting using the username and password. You can also connect using any other supported way to your cluster. \n",
"Here, we are connecting using the username and password from above. You can also connect using any other supported way to your cluster. \n",
"\n",
"For more information on connecting to the Couchbase cluster, please check the [Python SDK documentation](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html#connect)."
"For more information on connecting to the Couchbase cluster, please check the [documentation](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html#connect)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "52fe583a-12db-4dc2-9281-1174bf1d4e5c",
"metadata": {},
"outputs": [],
"source": [
"COUCHBASE_CONNECTION_STRING = (\n",
" \"couchbase://localhost\" # or \"couchbases://localhost\" if using TLS\n",
")\n",
"DB_USERNAME = \"Administrator\"\n",
"DB_PASSWORD = \"Password\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "9986c6b9",
"metadata": {},
"outputs": [],
@@ -123,145 +135,15 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"id": "1b1d0a26-e9d4-4823-9800-9549d24d3d16",
"metadata": {},
"outputs": [],
"source": [
"BUCKET_NAME = \"testing\"\n",
"BUCKET_NAME = \"langchain_bucket\"\n",
"SCOPE_NAME = \"_default\"\n",
"COLLECTION_NAME = \"_default\"\n",
"SEARCH_INDEX_NAME = \"vector-index\""
]
},
{
"cell_type": "markdown",
"id": "efbac6ff-c2ac-4443-9250-7cc88061346b",
"metadata": {},
"source": [
"For this tutorial, we will use OpenAI embeddings"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "87625579-86d7-4de4-8a4d-cee674a6b676",
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "3677b4b0-3711-419c-89ff-32ef4d3e3022",
"metadata": {},
"source": [
"## Create the Search Index\n",
"Currently, the Search index needs to be created from the Couchbase Capella or Server UI or using the REST interface. \n",
"\n",
"Let us define a Search index with the name `vector-index` on the testing bucket\n",
"\n",
"For this example, let us use the Import Index feature on the Search Service on the UI. \n",
"\n",
"We are defining an index on the `testing` bucket's `_default` scope on the `_default` collection with the vector field set to `embedding` with 1536 dimensions and the text field set to `text`. We are also indexing and storing all the fields under `metadata` in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to `dot_product`."
]
},
{
"cell_type": "markdown",
"id": "655117ae-9b1f-4139-b437-ca7685975a54",
"metadata": {},
"source": [
"### How to Import an Index to the Full Text Search service?\n",
" - [Couchbase Server](https://docs.couchbase.com/server/current/search/import-search-index.html)\n",
" - Click on Search -> Add Index -> Import\n",
" - Copy the following Index definition in the Import screen\n",
" - Click on Create Index to create the index.\n",
" - [Couchbase Capella](https://docs.couchbase.com/cloud/search/import-search-index.html)\n",
" - Copy the index definition to a new file `index.json`\n",
" - Import the file in Capella using the instructions in the documentation.\n",
" - Click on Create Index to create the index.\n",
" \n"
]
},
{
"cell_type": "markdown",
"id": "f85bc468-d9b8-487d-999a-3b5d2fb78e41",
"metadata": {},
"source": [
"### Index Definition\n",
"```\n",
"{\n",
" \"name\": \"vector-index\",\n",
" \"type\": \"fulltext-index\",\n",
" \"params\": {\n",
" \"doc_config\": {\n",
" \"docid_prefix_delim\": \"\",\n",
" \"docid_regexp\": \"\",\n",
" \"mode\": \"type_field\",\n",
" \"type_field\": \"type\"\n",
" },\n",
" \"mapping\": {\n",
" \"default_analyzer\": \"standard\",\n",
" \"default_datetime_parser\": \"dateTimeOptional\",\n",
" \"default_field\": \"_all\",\n",
" \"default_mapping\": {\n",
" \"dynamic\": true,\n",
" \"enabled\": true,\n",
" \"properties\": {\n",
" \"metadata\": {\n",
" \"dynamic\": true,\n",
" \"enabled\": true\n",
" },\n",
" \"embedding\": {\n",
" \"enabled\": true,\n",
" \"dynamic\": false,\n",
" \"fields\": [\n",
" {\n",
" \"dims\": 1536,\n",
" \"index\": true,\n",
" \"name\": \"embedding\",\n",
" \"similarity\": \"dot_product\",\n",
" \"type\": \"vector\",\n",
" \"vector_index_optimized_for\": \"recall\"\n",
" }\n",
" ]\n",
" },\n",
" \"text\": {\n",
" \"enabled\": true,\n",
" \"dynamic\": false,\n",
" \"fields\": [\n",
" {\n",
" \"index\": true,\n",
" \"name\": \"text\",\n",
" \"store\": true,\n",
" \"type\": \"text\"\n",
" }\n",
" ]\n",
" }\n",
" }\n",
" },\n",
" \"default_type\": \"_default\",\n",
" \"docvalues_dynamic\": false,\n",
" \"index_dynamic\": true,\n",
" \"store_dynamic\": true,\n",
" \"type_field\": \"_type\"\n",
" },\n",
" \"store\": {\n",
" \"indexType\": \"scorch\",\n",
" \"segmentVersion\": 16\n",
" }\n",
" },\n",
" \"sourceType\": \"gocbcore\",\n",
" \"sourceName\": \"testing\",\n",
" \"sourceParams\": {},\n",
" \"planParams\": {\n",
" \"maxPartitionsPerPIndex\": 103,\n",
" \"indexPartitions\": 10,\n",
" \"numReplicas\": 0\n",
" }\n",
"}\n",
"```"
"COLLECTION_NAME = \"default\"\n",
"SEARCH_INDEX_NAME = \"langchain-test-index\""
]
},
{
@@ -269,7 +151,7 @@
"id": "556dc68c-9089-4390-8dc9-b77051e7fc34",
"metadata": {},
"source": [
"For more details on how to create a Search index with support for Vector fields, please refer to the documentation.\n",
"For details on how to create a Search index with support for Vector fields, please refer to the documentation.\n",
"\n",
"- [Couchbase Capella](https://docs.couchbase.com/cloud/vector-search/create-vector-search-index-ui.html)\n",
" \n",
@@ -281,17 +163,40 @@
"id": "75f4037d-e509-4de7-a8d1-63a05de24e9d",
"metadata": {},
"source": [
"## Create Vector Store\n",
"We create the vector store object with the cluster information and the search index name."
"### Simple Instantiation\n",
"\n",
"Below, we create the vector store object with the cluster information and the search index name. \n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"id": "6706efdd",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "33db4670-76c5-49ba-94d6-a8fa35583058",
"metadata": {},
"outputs": [],
"source": [
"from langchain_couchbase.vectorstores import CouchbaseVectorStore\n",
"\n",
"vector_store = CouchbaseVectorStore(\n",
" cluster=cluster,\n",
" bucket_name=BUCKET_NAME,\n",
@@ -308,9 +213,18 @@
"metadata": {},
"source": [
"### Specify the Text & Embeddings Field\n",
"You can optionally specify the text & embeddings field for the document using the `text_key` and `embedding_key` fields.\n",
"```\n",
"vector_store = CouchbaseVectorStore(\n",
"\n",
"You can optionally specify the text & embeddings field for the document using the `text_key` and `embedding_key` fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49c38634",
"metadata": {},
"outputs": [],
"source": [
"vector_store_specific = CouchbaseVectorStore(\n",
" cluster=cluster,\n",
" bucket_name=BUCKET_NAME,\n",
" scope_name=SCOPE_NAME,\n",
@@ -319,73 +233,148 @@
" index_name=SEARCH_INDEX_NAME,\n",
" text_key=\"text\",\n",
" embedding_key=\"embedding\",\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "790dc1ac-0ab8-4cb5-989d-31ca7c241068",
"metadata": {},
"source": [
"## Basic Vector Search Example\n",
"For this example, we are going to load the \"state_of_the_union.txt\" file via the TextLoader, chunk the text into 500 character chunks with no overlaps and index all these chunks into Couchbase.\n",
"\n",
"After the data is indexed, we perform a simple query to find the top 4 chunks that are similar to the query \"What did president say about Ketanji Brown Jackson\".\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "440350df-cbc6-48f7-8009-2e783be18306",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9d3b4c7c-abd6-4dfa-ad63-470f16661319",
"metadata": {},
"outputs": [],
"source": [
"vector_store = CouchbaseVectorStore.from_documents(\n",
" documents=docs,\n",
" embedding=embeddings,\n",
" cluster=cluster,\n",
" bucket_name=BUCKET_NAME,\n",
" scope_name=SCOPE_NAME,\n",
" collection_name=COLLECTION_NAME,\n",
" index_name=SEARCH_INDEX_NAME,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "91fdce6c-8f7c-4060-865a-2fd742846664",
"cell_type": "markdown",
"id": "50e95fa6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.' metadata={'source': '../../how_to/state_of_the_union.txt'}\n"
]
}
],
"source": [
"query = \"What did president say about Ketanji Brown Jackson\"\n",
"results = vector_store.similarity_search(query)\n",
"print(results[0])"
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65a35f00",
"metadata": {},
"outputs": [],
"source": [
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "dd33b030",
"metadata": {},
"source": [
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a05f294",
"metadata": {},
"outputs": [],
"source": [
"vector_store.delete(ids=[uuids[-1]])"
]
},
{
"cell_type": "markdown",
"id": "d2cc4126",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent.\n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e00bb23",
"metadata": {},
"outputs": [],
"source": [
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
@@ -393,31 +382,21 @@
"id": "d9b46c93-65f6-4e4f-87a2-5cebea3b7a6b",
"metadata": {},
"source": [
"## Similarity Search with Score\n",
"You can fetch the scores for the results by calling the `similarity_search_with_score` method."
"#### Similarity search with Score\n",
"\n",
"You can also fetch the scores for the results by calling the `similarity_search_with_score` method."
]
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": null,
"id": "24b146b2-55a2-4fe8-8659-3649032f5dc7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.' metadata={'source': '../../how_to/state_of_the_union.txt'}\n",
"Score: 0.8211871385574341\n"
]
}
],
"outputs": [],
"source": [
"query = \"What did president say about Ketanji Brown Jackson\"\n",
"results = vector_store.similarity_search_with_score(query)\n",
"document, score = results[0]\n",
"print(document)\n",
"print(f\"Score: {score}\")"
"results = vector_store.similarity_search_with_score(\"Will it be hot tomorrow?\", k=1)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
@@ -425,7 +404,8 @@
"id": "9983e83d-efd0-4b75-80db-150e0694e822",
"metadata": {},
"source": [
"## Specifying Fields to Return\n",
"### Specifying Fields to Return\n",
"\n",
"You can specify the fields to return from the document using `fields` parameter in the searches. These fields are returned as part of the `metadata` object in the returned Document. You can fetch any field that is stored in the Search index. The `text_key` of the document is returned as part of the document's `page_content`.\n",
"\n",
"If you do not specify any fields to be fetched, all the fields stored in the index are returned.\n",
@@ -437,20 +417,12 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"id": "ffa743dc-4e89-405b-ad71-7390338889e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.' metadata={'source': '../../how_to/state_of_the_union.txt'}\n"
]
}
],
"outputs": [],
"source": [
"query = \"What did president say about Ketanji Brown Jackson\"\n",
"query = \"What did I eat for breakfast today?\"\n",
"results = vector_store.similarity_search(query, fields=[\"metadata.source\"])\n",
"print(results[0])"
]
@@ -460,7 +432,8 @@
"id": "a5e45eb2-aa97-45df-bcc5-410e9626e506",
"metadata": {},
"source": [
"## Hybrid Search\n",
"### Hybrid Queries\n",
"\n",
"Couchbase allows you to do hybrid searches by combining Vector Search results with searches on non-vector fields of the document like the `metadata` object. \n",
"\n",
"The results will be based on the combination of the results from both Vector Search and the searches supported by Search Service. The scores of each of the component searches are added up to get the total score of the result.\n",
@@ -474,26 +447,26 @@
"id": "a5db3685-1918-4c63-8148-0bb3a71ea677",
"metadata": {},
"source": [
"### Create Diverse Metadata for Hybrid Search\n",
"#### Create Diverse Metadata for Hybrid Search\n",
"In order to simulate hybrid search, let us create some random metadata from the existing documents. \n",
"We uniformly add three fields to the metadata, `date` between 2010 & 2020, `rating` between 1 & 5 and `author` set to either John Doe or Jane Doe. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"id": "7d2e607d-6bbc-4cef-83e3-b6a28bb269ea",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'author': 'John Doe', 'date': '2016-01-01', 'rating': 2, 'source': '../../how_to/state_of_the_union.txt'}\n"
]
}
],
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"# Adding metadata to documents\n",
"for i, doc in enumerate(docs):\n",
" doc.metadata[\"date\"] = f\"{range(2010, 2020)[i % 10]}-01-01\"\n",
@@ -512,24 +485,16 @@
"id": "6cad893b-3977-4556-ab1d-d12bce68b306",
"metadata": {},
"source": [
"### Example: Search by Exact Value\n",
"### Query by Exact Value\n",
"We can search for exact matches on a textual field like the author in the `metadata` object."
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": null,
"id": "dc06ba4a-8a6b-4c55-bb69-95cd92db273f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='This is personal to me and Jill, to Kamala, and to so many of you. \\n\\nCancer is the #2 cause of death in Americasecond only to heart disease. \\n\\nLast month, I announced our plan to supercharge \\nthe Cancer Moonshot that President Obama asked me to lead six years ago. \\n\\nOur goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases. \\n\\nMore support for patients and families.' metadata={'author': 'John Doe'}\n"
]
}
],
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = vector_store.similarity_search(\n",
@@ -545,7 +510,7 @@
"id": "9106b594-b41e-4329-b98c-9b9f8a34d6f7",
"metadata": {},
"source": [
"### Example: Search by Partial Match\n",
"### Query by Partial Match\n",
"We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to search for slight variations or misspellings of a search query.\n",
"\n",
"Here, \"Jae\" is close (fuzziness of 1) to \"Jane\"."
@@ -553,18 +518,10 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": null,
"id": "fd4749e6-ef4f-4cb5-95ff-37c4fa8283d8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \\n\\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.' metadata={'author': 'Jane Doe'}\n"
]
}
],
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"results = vector_store.similarity_search(\n",
@@ -582,24 +539,16 @@
"id": "1bbf9449-6e30-4bd1-9eeb-f3b60952fcab",
"metadata": {},
"source": [
"### Example: Search by Date Range Query\n",
"### Query by Date Range Query\n",
"We can search for documents that are within a date range query on a date field like `metadata.date`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": null,
"id": "b7b47e7d-c32f-4999-bce9-3c3c3cebffd0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='He will never extinguish their love of freedom. He will never weaken the resolve of the free world. \\n\\nWe meet tonight in an America that has lived through two of the hardest years this nation has ever faced. \\n\\nThe pandemic has been punishing. \\n\\nAnd so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \\n\\nI understand.' metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}\n"
]
}
],
"outputs": [],
"source": [
"query = \"Any mention about independence?\"\n",
"results = vector_store.similarity_search(\n",
@@ -622,24 +571,16 @@
"id": "a18d4ea2-bfab-4f15-9839-674faf1c6f0d",
"metadata": {},
"source": [
"### Example: Search by Numeric Range Query\n",
"### Query by Numeric Range Query\n",
"We can search for documents that are within a range for a numeric field like `metadata.rating`."
]
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": null,
"id": "7e8bf7c5-07d1-4c3f-86d7-1fa3a454dc7f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Document(page_content='He will never extinguish their love of freedom. He will never weaken the resolve of the free world. \\n\\nWe meet tonight in an America that has lived through two of the hardest years this nation has ever faced. \\n\\nThe pandemic has been punishing. \\n\\nAnd so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \\n\\nI understand.', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}), 0.9000703597577832)\n"
]
}
],
"outputs": [],
"source": [
"query = \"Any mention about independence?\"\n",
"results = vector_store.similarity_search_with_score(\n",
@@ -662,7 +603,7 @@
"id": "0f16bf86-f01c-4a77-8406-275f7313f493",
"metadata": {},
"source": [
"### Example: Combining Multiple Search Queries\n",
"### Combining Multiple Search Queries\n",
"Different search queries can be combined using AND (conjuncts) or OR (disjuncts) operators.\n",
"\n",
"In this example, we are checking for documents with a rating between 3 & 4 and dated between 2015 & 2018."
@@ -670,18 +611,10 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"id": "dd0fe7f1-aa40-4c6f-889b-99ad5efcd88b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Document(page_content='He will never extinguish their love of freedom. He will never weaken the resolve of the free world. \\n\\nWe meet tonight in an America that has lived through two of the hardest years this nation has ever faced. \\n\\nThe pandemic has been punishing. \\n\\nAnd so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \\n\\nI understand.', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}), 1.3598770370389914)\n"
]
}
],
"outputs": [],
"source": [
"query = \"Any mention about independence?\"\n",
"results = vector_store.similarity_search_with_score(\n",
@@ -710,6 +643,90 @@
"- [Couchbase Server](https://docs.couchbase.com/server/current/search/search-request-params.html#query-object)"
]
},
{
"cell_type": "markdown",
"id": "db0a1d74",
"metadata": {},
"source": [
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. \n",
"\n",
"Here is how to transform your vector store into a retriever and then invoke the retreiever with a simple query and filter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3666265a",
"metadata": {},
"outputs": [],
"source": [
"retriever = vector_store.as_retriever(\n",
" search_type=\"similarity_score_threshold\",\n",
" search_kwargs={\"k\": 1, \"score_threshold\": 0.5},\n",
")\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
{
"cell_type": "markdown",
"id": "28ab35ec",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6a849aa",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e34c9e3a",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
"cell_type": "markdown",
"id": "80958c2b-6a67-45e6-b7f0-fd2461d75e0f",
@@ -761,6 +778,16 @@
"* [Couchbase Capella](https://docs.couchbase.com/cloud/search/create-child-mapping.html)\n",
"* [Couchbase Server](https://docs.couchbase.com/server/current/search/create-child-mapping.html)"
]
},
{
"cell_type": "markdown",
"id": "d876b769",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `CouchbaseVectorStore` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_couchbase.vectorstores.CouchbaseVectorStore.html"
]
}
],
"metadata": {
@@ -779,7 +806,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.11.9"
}
},
"nbformat": 4,

File diff suppressed because it is too large

View File

@@ -7,11 +7,9 @@
"source": [
"# Faiss\n",
"\n",
">[Facebook AI Similarity Search (Faiss)](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.\n",
">[Facebook AI Similarity Search (FAISS)](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.\n",
"\n",
"[Faiss documentation](https://faiss.ai/).\n",
"\n",
"You'll need to install `langchain-community` with `pip install -qU langchain-community` to use this integration\n",
"You can find the FAISS documentation at [this page](https://faiss.ai/).\n",
"\n",
"This notebook shows how to use functionality related to the `FAISS` vector database. It will show functionality specific to this integration. After going through, it may be useful to explore [relevant use-case pages](/docs/how_to#qa-with-rag) to learn how to use this vectorstore as part of a larger chain."
]
@@ -25,28 +23,19 @@
"source": [
"## Setup\n",
"\n",
"The integration lives in the `langchain-community` package. We also need to install the `faiss` package itself. We will also be using OpenAI for embeddings, so we need to install those requirements. We can install these with:\n",
"The integration lives in the `langchain-community` package. We also need to install the `faiss` package itself. We can install these with:\n",
"\n",
"```bash\n",
"pip install -U langchain-community faiss-cpu langchain-openai tiktoken\n",
"```\n",
"\n",
"Note that you can also install `faiss-gpu` if you want to use the GPU enabled version\n",
"\n",
"Since we are using OpenAI, you will need an OpenAI API Key."
"Note that you can also install `faiss-gpu` if you want to use the GPU enabled version"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23984e60-c29a-461a-be2b-219108ac37ee",
"id": "08165d56",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass()"
"pip install -qU langchain-community faiss-cpu"
]
},
{
@@ -54,7 +43,7 @@
"id": "408be78f-7b0e-44d4-8d48-56a6cb9b3fb9",
"metadata": {},
"source": [
"It's also helpful (but not needed) to set up [LangSmith](https://smith.langchain.com/) for best-in-class observability"
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
@@ -73,200 +62,366 @@
"id": "78dde98a-584f-4f2a-98d5-e776fd9558fa",
"metadata": {},
"source": [
"## Ingestion\n",
"## Initialization\n",
"\n",
"Here, we ingest documents into the vectorstore"
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dc37144c-208d-4ab3-9f3a-0407a69fe052",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"42"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization\n",
"# os.environ['FAISS_NO_AVX2'] = '1'\n",
"\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"embeddings = OpenAIEmbeddings()\n",
"db = FAISS.from_documents(docs, embeddings)\n",
"print(db.index.ntotal)"
]
},
{
"cell_type": "markdown",
"id": "ecdd7a65-f310-4b36-bc1e-2a39dfd58d5f",
"id": "5b394da3",
"metadata": {},
"outputs": [],
"source": [
"## Querying\n",
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"Now, we can query the vectorstore. There a few methods to do this. The most standard is to use `similarity_search`."
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5eabdb75",
"id": "dc37144c-208d-4ab3-9f3a-0407a69fe052",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)"
"import faiss\n",
"from langchain_community.docstore.in_memory import InMemoryDocstore\n",
"from langchain_community.vectorstores import FAISS\n",
"\n",
"index = faiss.IndexFlatL2(len(embeddings.embed_query(\"hello world\")))\n",
"\n",
"vector_store = FAISS(\n",
" embedding_function=embeddings,\n",
" index=index,\n",
" docstore=InMemoryDocstore(),\n",
" index_to_docstore_id={},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d8761614",
"metadata": {},
"source": [
"## Manage vector store\n",
"\n",
"### Add items to vector store"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4b172de8",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "6d9286c2-0802-4f02-8f9a-9f7fae7c79b0",
"metadata": {},
"source": [
"## As a Retriever\n",
"\n",
"We can also convert the vectorstore into a [Retriever](/docs/how_to#retrievers) class. This allows us to easily use it in other LangChain methods, which largely work with retrievers"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6e91b475-3878-44e0-8720-98d903754b46",
"metadata": {},
"outputs": [],
"source": [
"retriever = db.as_retriever()\n",
"docs = retriever.invoke(query)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "046739d2-91fe-4101-8b72-c0bcdd9e02b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "f13473b5",
"metadata": {},
"source": [
"## Similarity Search with score\n",
"There are some FAISS specific methods. One of them is `similarity_search_with_score`, which allows you to return not only the documents but also the distance score of the query to them. The returned distance score is L2 distance. Therefore, a lower score is better."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "186ee1d8",
"metadata": {},
"outputs": [],
"source": [
"docs_and_scores = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "284e04b5",
"id": "3867e154",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../how_to/state_of_the_union.txt'}),\n",
" 0.36913747)"
"['22f5ce99-cd6f-4e0c-8dab-664128307c72',\n",
" 'dc3f061b-5f88-4fa1-a966-413550c51891',\n",
" 'd33d890b-baad-47f7-b7c1-175f5f7b4e59',\n",
" '6e6c01d2-6020-4a7b-95da-ef43d43f01b5',\n",
" 'e677223d-ad75-4c1a-bef6-b5912bd1de03',\n",
" '47e2a168-6462-4ed2-b1d9-d9edfd7391d6',\n",
" '1e4d66d6-e155-4891-9212-f7be97f36c6a',\n",
" 'c0663096-e1a5-4665-b245-1c2e6c4fb653',\n",
" '8297474a-7f7c-4006-9865-398c1781b1bc',\n",
" '44e4be03-0a8d-4316-b3c4-f35f4bb2b532']"
]
},
"execution_count": 8,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_and_scores[0]"
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "f34420cf",
"id": "a410a2dc",
"metadata": {},
"source": [
"It is also possible to do a search for documents similar to a given embedding vector using `similarity_search_by_vector` which accepts an embedding vector as a parameter instead of a string."
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b558ebb7",
"execution_count": 4,
"id": "c3db04bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector_store.delete(ids=[uuids[-1]])"
]
},
{
"cell_type": "markdown",
"id": "77de24ff",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search with filtering on metadata can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "53d95d3f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter={\"source\": \"tweet\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "5ae35069",
"metadata": {},
"source": [
"#### Similarity search with score\n",
"\n",
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a9078ce9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.893688] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\n",
" \"Will it be hot tomorrow?\", k=1, filter={\"source\": \"news\"}\n",
")\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "e9091b1f",
"metadata": {},
"source": [
"#### Other search methods\n",
"\n",
"\n",
"There are a variety of other ways to search a FAISS vector store. For a complete list of those methods, please refer to the [API Reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html)\n",
"\n",
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "10da64fa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
{
"cell_type": "markdown",
"id": "5edd1909",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6b792eaa",
"metadata": {},
"outputs": [],
"source": [
"embedding_vector = embeddings.embed_query(query)\n",
"docs_and_scores = db.similarity_search_by_vector(embedding_vector)"
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1aca9435",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of these types of applications.'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
@@ -280,31 +435,33 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "1b31fe27-e0b3-42c6-b17c-8270b517ee1f",
"metadata": {},
"outputs": [],
"source": [
"db.save_local(\"faiss_index\")\n",
"vector_store.save_local(\"faiss_index\")\n",
"\n",
"new_db = FAISS.load_local(\"faiss_index\", embeddings)\n",
"new_vector_store = FAISS.load_local(\n",
" \"faiss_index\", embeddings, allow_dangerous_deserialization=True\n",
")\n",
"\n",
"docs = new_db.similarity_search(query)"
"docs = new_vector_store.similarity_search(\"qux\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 12,
"id": "98378c4e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
"Document(metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!')"
]
},
"execution_count": 9,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -313,33 +470,6 @@
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "30c8f57b",
"metadata": {},
"source": [
"# Serializing and De-Serializing to bytes\n",
"\n",
"you can pickle the FAISS Index by these functions. If you use embeddings model which is of 90 mb (sentence-transformers/all-MiniLM-L6-v2 or any other model), the resultant pickle size would be more than 90 mb. the size of the model is also included in the overall size. To overcome this, use the below functions. These functions only serializes FAISS index and size would be much lesser. this can be helpful if you wish to store the index in database like sql."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8faead5",
"metadata": {},
"outputs": [],
"source": [
"from langchain_huggingface import HuggingFaceEmbeddings\n",
"\n",
"pkl = db.serialize_to_bytes() # serializes the faiss\n",
"embeddings = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
"\n",
"db = FAISS.deserialize_from_bytes(\n",
" embeddings=embeddings, serialized=pkl\n",
") # Load the index"
]
},
{
"cell_type": "markdown",
"id": "57da60d4",
@@ -351,10 +481,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 13,
"id": "9b8f5e31-3f40-4e94-8d97-5883125efba7",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"{'b752e805-350e-4cf5-ba54-0883d46a3a44': Document(page_content='foo')}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db1 = FAISS.from_texts([\"foo\"], embeddings)\n",
"db2 = FAISS.from_texts([\"bar\"], embeddings)\n",
@@ -364,17 +505,17 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 14,
"id": "83392605",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}"
"{'08192d92-746d-4cd1-b681-bdfba411f459': Document(page_content='bar')}"
]
},
"execution_count": 10,
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
@@ -385,7 +526,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 15,
"id": "a3fcc1c7",
"metadata": {},
"outputs": [],
@@ -395,18 +536,18 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 16,
"id": "41c51f89",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={}),\n",
" '807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}"
"{'b752e805-350e-4cf5-ba54-0883d46a3a44': Document(page_content='foo'),\n",
" '08192d92-746d-4cd1-b681-bdfba411f459': Document(page_content='bar')}"
]
},
"execution_count": 13,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -417,169 +558,12 @@
},
{
"cell_type": "markdown",
"id": "f4294b96",
"id": "65654d80",
"metadata": {},
"source": [
"## Similarity Search with filtering\n",
"FAISS vectorstore can also support filtering, since the FAISS does not natively support filtering we have to do it manually. This is done by first fetching more results than `k` and then filtering them. This filter is either a callble that takes as input a metadata dict and returns a bool, or a metadata dict where each missing key is ignored and each present k must be in a list of values. You can also set the `fetch_k` parameter when calling any search method to set how many documents you want to fetch before filtering. Here is a small example:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "d5bf812c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15\n",
"Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15\n",
"Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15\n",
"Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15\n"
]
}
],
"source": [
"from langchain_core.documents import Document\n",
"## API reference\n",
"\n",
"list_of_documents = [\n",
" Document(page_content=\"foo\", metadata=dict(page=1)),\n",
" Document(page_content=\"bar\", metadata=dict(page=1)),\n",
" Document(page_content=\"foo\", metadata=dict(page=2)),\n",
" Document(page_content=\"barbar\", metadata=dict(page=2)),\n",
" Document(page_content=\"foo\", metadata=dict(page=3)),\n",
" Document(page_content=\"bar burr\", metadata=dict(page=3)),\n",
" Document(page_content=\"foo\", metadata=dict(page=4)),\n",
" Document(page_content=\"bar bruh\", metadata=dict(page=4)),\n",
"]\n",
"db = FAISS.from_documents(list_of_documents, embeddings)\n",
"results_with_scores = db.similarity_search_with_score(\"foo\")\n",
"for doc, score in results_with_scores:\n",
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}\")"
]
},
{
"cell_type": "markdown",
"id": "3d33c126",
"metadata": {},
"source": [
"Now we make the same query call but we filter for only `page = 1` "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "83159330",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15\n",
"Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906\n"
]
}
],
"source": [
"results_with_scores = db.similarity_search_with_score(\"foo\", filter=dict(page=1))\n",
"# Or with a callable:\n",
"# results_with_scores = db.similarity_search_with_score(\"foo\", filter=lambda d: d[\"page\"] == 1)\n",
"for doc, score in results_with_scores:\n",
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}\")"
]
},
{
"cell_type": "markdown",
"id": "0be136e0",
"metadata": {},
"source": [
"Same thing can be done with the `max_marginal_relevance_search` as well."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "432c6980",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content: foo, Metadata: {'page': 1}\n",
"Content: bar, Metadata: {'page': 1}\n"
]
}
],
"source": [
"results = db.max_marginal_relevance_search(\"foo\", filter=dict(page=1))\n",
"for doc in results:\n",
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}\")"
]
},
{
"cell_type": "markdown",
"id": "1b4ecd86",
"metadata": {},
"source": [
"Here is an example of how to set `fetch_k` parameter when calling `similarity_search`. Usually you would want the `fetch_k` parameter >> `k` parameter. This is because the `fetch_k` parameter is the number of documents that will be fetched before filtering. If you set `fetch_k` to a low number, you might not get enough documents to filter from."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "1fd60fd1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Content: foo, Metadata: {'page': 1}\n"
]
}
],
"source": [
"results = db.similarity_search(\"foo\", filter=dict(page=1), k=1, fetch_k=4)\n",
"for doc in results:\n",
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}\")"
]
},
{
"cell_type": "markdown",
"id": "1becca53",
"metadata": {},
"source": [
"## Delete\n",
"\n",
"You can also delete records from vectorstore. In the example below `db.index_to_docstore_id` represents a dictionary with elements of the FAISS index."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "1408b870",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count before: 8\n",
"count after: 7"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"count before:\", db.index.ntotal)\n",
"db.delete([db.index_to_docstore_id[0]])\n",
"print(\"count after:\", db.index.ntotal)"
"For detailed documentation of all `FAISS` vector store features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html"
]
}
],
@@ -599,7 +583,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.1"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,29 @@
---
sidebar_position: 0
sidebar_class_name: hidden
keywords: [compatibility]
custom_edit_url:
---
# Vectorstores
## Features
The table below lists the features for some of our most popular vector stores.
Vectorstore|Delete by ID|Filtering|Search by Vector|Search with score|Async|Passes Standard Tests|Multi Tenancy|Local/Cloud|IDs in add Documents
:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
AstraDBVectorStore|✅|✅|✅|✅|✅|❌|❌|Local|✅
Chroma|✅|✅|✅|✅|✅|❌|❌|Local|✅
Clickhouse|✅|✅|❌|✅|❌|❌|❌|Local|✅
CouchbaseVectorStore|✅|✅|❌|✅|✅|❌|❌|Local|✅
ElasticsearchStore|✅|✅|✅|❌|✅|❌|❌|Local|✅
FAISS|✅|✅|✅|✅|✅|❌|❌|Local|✅
InMemoryVectorStore|✅|✅|❌|✅|✅|❌|❌|Local|✅
Milvus|✅|✅|❌|✅|✅|❌|❌|Local|✅
MongoDBAtlasVectorSearch|✅|✅|❌|❌|✅|❌|❌|Local|✅
PGVector|✅|✅|✅|✅|✅|❌|❌|Local|✅
PineconeVectorStore|✅|✅|✅|❌|✅|❌|❌|Local|✅
QdrantVectorStore|✅|✅|✅|✅|✅|❌|❌|Local|✅
Redis|✅|✅|✅|✅|✅|❌|❌|Local|✅

View File

@@ -11,7 +11,9 @@
"\n",
"This notebook shows how to use functionality related to the Milvus vector database.\n",
"\n",
"You'll need to install `langchain-milvus` with `pip install -qU langchain-milvus` to use this integration\n"
"## Setup\n",
"\n",
"You'll need to install `langchain-milvus` with `pip install -qU langchain-milvus` to use this integration.\n"
]
},
{
@@ -23,7 +25,7 @@
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain_milvus"
"%pip install -qU langchain_milvus"
]
},
{
@@ -31,119 +33,59 @@
"id": "633addc3",
"metadata": {},
"source": [
"The latest version of pymilvus comes with a local vector database Milvus Lite, good for prototyping. If you have large scale of data such as more than a million docs, we recommend setting up a more performant Milvus server on [docker or kubernetes](https://milvus.io/docs/install_standalone-docker.md#Start-Milvus)."
"The latest version of pymilvus comes with a local vector database Milvus Lite, good for prototyping. If you have large scale of data such as more than a million docs, we recommend setting up a more performant Milvus server on [docker or kubernetes](https://milvus.io/docs/install_standalone-docker.md#Start-Milvus).\n",
"\n",
"### Credentials\n",
"\n",
"No credentials are needed to use the `Milvus` vector store.\n",
"\n",
"## Initialization\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "7a0f9e02-8eb0-4aef-b11f-8861360472ee",
"cell_type": "code",
"execution_count": 25,
"id": "a7dd253f",
"metadata": {},
"source": [
"We want to use OpenAIEmbeddings so we have to get the OpenAI API Key."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8b6ed9cd-81b9-46e5-9c20-5aafca2844d0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "aac9563e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_milvus.vectorstores import Milvus\n",
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3c3999a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 28,
"id": "dcf88bdf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_milvus import Milvus\n",
"\n",
"# The easiest way is to use Milvus Lite where everything is stored in a local file.\n",
"# If you have a Milvus server you can use the server URI such as \"http://localhost:19530\".\n",
"URI = \"./milvus_demo.db\"\n",
"URI = \"./milvus_example.db\"\n",
"\n",
"vector_db = Milvus.from_documents(\n",
" docs,\n",
" embeddings,\n",
"vector_store = Milvus(\n",
" embedding_function=embeddings,\n",
" connection_args={\"uri\": URI},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a8c513ab",
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = vector_db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fc516993",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content"
]
},
{
"cell_type": "markdown",
"id": "e40d558b",
"id": "cae1a7d5",
"metadata": {},
"source": [
"### Compartmentalize the data with Milvus Collections\n",
@@ -153,7 +95,7 @@
},
{
"cell_type": "markdown",
"id": "82c00f6e",
"id": "c07cd24b",
"metadata": {},
"source": [
"Here's how you can create a new collection"
@@ -161,22 +103,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7ff38ab",
"execution_count": 29,
"id": "c6f4973d",
"metadata": {},
"outputs": [],
"source": [
"vector_db = Milvus.from_documents(\n",
" docs,\n",
"from langchain_core.documents import Document\n",
"\n",
"vector_store_saved = Milvus.from_documents(\n",
" [Document(page_content=\"foo!\")],\n",
" embeddings,\n",
" collection_name=\"collection_1\",\n",
" collection_name=\"langchain_example\",\n",
" connection_args={\"uri\": URI},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "891cec1f",
"id": "3b12df8c",
"metadata": {},
"source": [
"And here is how you retrieve that stored collection"
@@ -184,24 +128,333 @@
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9e873e9",
"execution_count": 30,
"id": "12817d16",
"metadata": {},
"outputs": [],
"source": [
"vector_db = Milvus(\n",
"vector_store_loaded = Milvus(\n",
" embeddings,\n",
" connection_args={\"uri\": URI},\n",
" collection_name=\"collection_1\",\n",
" collection_name=\"langchain_example\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9cc65535",
"id": "f1fc3818",
"metadata": {},
"source": [
"After retrieval you can go on querying it as usual."
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "3ced24f6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['b0248595-2a41-4f6b-9c25-3a24c1278bb3',\n",
" 'fa642726-5329-4495-a072-187e948dd71f',\n",
" '9905001c-a4a3-455e-ab94-72d0ed11b476',\n",
" 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5',\n",
" '7508f7ff-c0c9-49ea-8189-634f8a0244d8',\n",
" '2e179609-3ff7-4c6a-9e05-08978903fe26',\n",
" 'fab1f2ac-43e1-45f9-b81b-fc5d334c6508',\n",
" '1206d237-ee3a-484f-baf2-b5ac38eeb314',\n",
" 'd43cbf9a-a772-4c40-993b-9439065fec01',\n",
" '25e667bb-6f09-4574-a368-661069301906']"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "e23c22d8",
"metadata": {},
"source": [
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "1f387fa8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(insert count: 0, delete count: 1, upsert count: 0, timestamp: 0, success count: 0, err count: 0, cost: 0)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector_store.delete(ids=[uuids[-1]])"
]
},
{
"cell_type": "markdown",
"id": "fb12fa75",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search with filtering on metadata can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "35801a55",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'pk': '9905001c-a4a3-455e-ab94-72d0ed11b476', 'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'pk': '1206d237-ee3a-484f-baf2-b5ac38eeb314', 'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter={\"source\": \"tweet\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "35574409",
"metadata": {},
"source": [
"#### Similarity search with score\n",
"\n",
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "c360af3d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=21192.628906] bar [{'pk': '2', 'source': 'https://example.com'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\n",
" \"Will it be hot tomorrow?\", k=1, filter={\"source\": \"news\"}\n",
")\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "14db337f",
"metadata": {},
"source": [
"For a full list of all the search options available when using the `Milvus` vector store, you can visit the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html).\n",
"\n",
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. "
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "f6d9357c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'pk': 'eacc7256-d7fa-4036-b1f7-83d7a4bee0c5', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
{
"cell_type": "markdown",
"id": "8ac953f1",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "d17118c2",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "7bbe3b95",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of such applications effectively.'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
@@ -325,47 +578,12 @@
},
{
"cell_type": "markdown",
"id": "89756e9e",
"id": "f1a873c5",
"metadata": {},
"source": [
"### To delete or upsert (update/insert) one or more entities"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21c4edcf",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.documents import Document\n",
"## API reference\n",
"\n",
"# Insert data sample\n",
"docs = [\n",
" Document(page_content=\"foo\", metadata={\"id\": 1}),\n",
" Document(page_content=\"bar\", metadata={\"id\": 2}),\n",
" Document(page_content=\"baz\", metadata={\"id\": 3}),\n",
"]\n",
"vector_db = Milvus.from_documents(\n",
" docs,\n",
" embeddings,\n",
" connection_args={\"uri\": URI},\n",
")\n",
"\n",
"# Search pks (primary keys) using expression\n",
"expr = \"id in [1,2]\"\n",
"pks = vector_db.get_pks(expr)\n",
"\n",
"# Delete entities by pks\n",
"result = vector_db.delete(pks)\n",
"\n",
"# Upsert (Update/Insert)\n",
"new_docs = [\n",
" Document(page_content=\"new_foo\", metadata={\"id\": 1}),\n",
" Document(page_content=\"new_bar\", metadata={\"id\": 2}),\n",
" Document(page_content=\"upserted_bak\", metadata={\"id\": 3}),\n",
"]\n",
"upserted_pks = vector_db.upsert(pks, new_docs)"
"For detailed documentation of all __ModuleName__VectorStore features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_milvus.vectorstores.milvus.Milvus.html"
]
}
],
@@ -385,7 +603,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -19,74 +19,40 @@
"id": "359b8e9b",
"metadata": {},
"source": [
"## Prerequisites\n",
"## Setup\n",
"\n",
">*An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs).\n",
"\n",
">*An OpenAI API Key. You must have a paid OpenAI account with credits available for API requests.\n",
"To use MongoDB Atlas, you must first deploy a cluster. We have a Forever-Free tier of clusters available. To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/).\n",
"\n",
"You'll need to install `langchain-mongodb` to use this integration"
]
},
{
"cell_type": "markdown",
"id": "d899e588",
"metadata": {},
"source": [
"## Setting up MongoDB Atlas Cluster\n",
"To use MongoDB Atlas, you must first deploy a cluster. We have a Forever-Free tier of clusters available. To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/)."
]
},
{
"cell_type": "markdown",
"id": "1b5ce18d",
"metadata": {},
"source": [
"## Usage\n",
"In the notebook we will demonstrate how to perform `Retrieval Augmented Generation` (RAG) using MongoDB Atlas, OpenAI and Langchain. We will be performing Similarity Search, Similarity Search with Metadata Pre-Filtering, and Question Answering over the PDF document for [GPT 4 technical report](https://arxiv.org/pdf/2303.08774.pdf) that came out in March 2023 and hence is not part of the OpenAI's Large Language Model(LLM)'s parametric memory, which had a knowledge cutoff of September 2021."
]
},
{
"cell_type": "markdown",
"id": "457ace44-1d95-4001-9dd5-78811ab208ad",
"metadata": {},
"source": [
"We want to use `OpenAIEmbeddings` so we need to set up our OpenAI API Key. "
"You'll need to install `langchain-mongodb` and `pymongo` to use this integration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d8f240d",
"id": "73cf7c9f",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
"pip install -qU langchain-mongodb pymongo"
]
},
{
"cell_type": "markdown",
"id": "70482cd8",
"id": "a61832ea",
"metadata": {},
"source": [
"Now we will setup the environment variables for the MongoDB Atlas cluster"
"### Credentials\n",
"\n",
"For this notebook you will need to find your MongoDB cluster URI.\n",
"\n",
"For information on finding your cluster URI read through [this guide](https://www.mongodb.com/docs/manual/reference/connection-string/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d7788cf",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain langchain-mongodb pypdf pymongo langchain-openai tiktoken"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 33,
"id": "7ef41b37",
"metadata": {},
"outputs": [],
@@ -96,76 +62,78 @@
"MONGODB_ATLAS_CLUSTER_URI = getpass.getpass(\"MongoDB Atlas Cluster URI:\")"
]
},
{
"cell_type": "markdown",
"id": "1f23de23",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "908e7772",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"id": "a53673ae",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "f5fed614",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-small\")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "00d78318",
"metadata": {},
"outputs": [],
"source": [
"from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch\n",
"from pymongo import MongoClient\n",
"\n",
"# initialize MongoDB python client\n",
"client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)\n",
"\n",
"DB_NAME = \"langchain_db\"\n",
"COLLECTION_NAME = \"test\"\n",
"ATLAS_VECTOR_SEARCH_INDEX_NAME = \"index_name\"\n",
"DB_NAME = \"langchain_test_db\"\n",
"COLLECTION_NAME = \"langchain_test_vectorstores\"\n",
"ATLAS_VECTOR_SEARCH_INDEX_NAME = \"langchain-test-index-vectorstores\"\n",
"\n",
"MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]"
]
},
{
"cell_type": "markdown",
"id": "eb0cc10f-b84e-4e5e-b445-eb61f10bf085",
"metadata": {},
"source": [
"## Create Vector Search Index"
]
},
{
"cell_type": "markdown",
"id": "1f3ecc42",
"metadata": {},
"source": [
"Now, let's create a vector search index on your cluster. More detailed steps can be found at [Create Vector Search Index for LangChain](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#create-the-atlas-vector-search-index) section.\n",
"In the below example, `embedding` is the name of the field that contains the embedding vector. Please refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/) to get more details on how to define an Atlas Vector Search index.\n",
"You can name the index `{ATLAS_VECTOR_SEARCH_INDEX_NAME}` and create the index on the namespace `{DB_NAME}.{COLLECTION_NAME}`. Finally, write the following definition in the JSON editor on MongoDB Atlas:\n",
"MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]\n",
"\n",
"```json\n",
"{\n",
" \"fields\":[\n",
" {\n",
" \"type\": \"vector\",\n",
" \"path\": \"embedding\",\n",
" \"numDimensions\": 1536,\n",
" \"similarity\": \"cosine\"\n",
" }\n",
" ]\n",
"}\n",
"```\n",
"\n",
"Additionally, if you are running a MongoDB M10 cluster with server version 6.0+, you can leverage the `MongoDBAtlasVectorSearch.create_index`. To add the above index its usage would look like this.\n",
"\n",
"```python\n",
"from langchain_community.embeddings.openai import OpenAIEmbeddings\n",
"from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch\n",
"from pymongo import MongoClient\n",
"\n",
"mongo_client = MongoClient(\"<YOUR-CONNECTION-STRING>\")\n",
"collection = mongo_client[\"<db_name>\"][\"<collection_name>\"]\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"vectorstore = MongoDBAtlasVectorSearch(\n",
" collection=collection,\n",
" embedding=embeddings,\n",
" index_name=\"<ATLAS_VECTOR_SEARCH_INDEX_NAME>\",\n",
" relevance_score_fn=\"cosine\",\n",
")\n",
"\n",
"# Creates an index using the index_name provided and relevance_score_fn type\n",
"vectorstore.create_index(dimensions=1536)\n",
"```"
"vector_store = MongoDBAtlasVectorSearch(\n",
" collection=MONGODB_COLLECTION,\n",
" embedding=embeddings,\n",
" index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
" relevance_score_fn=\"cosine\",\n",
")"
]
},
{
@@ -173,126 +141,224 @@
"id": "42873e5a",
"metadata": {},
"source": [
"# Insert Data"
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 57,
"id": "aac9563e",
"metadata": {
"tags": []
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"['03ad81e8-32a0-46f0-b7d8-f5b977a6b52a',\n",
" '8396a68d-f4a3-4176-a581-a1a8c303eea4',\n",
" 'e7d95150-67f6-499f-b611-84367c50fa60',\n",
" '8c31b84e-2636-48b6-8b99-9fccb47f7051',\n",
" 'aa02e8a2-a811-446a-9785-8cea0faba7a9',\n",
" '19bd72ff-9766-4c3b-b1fd-195c732c562b',\n",
" '642d6f2f-3e34-4efa-a1ed-c4ba4ef0da8d',\n",
" '7614bb54-4eb5-4b3b-990c-00e35cb31f99',\n",
" '69e18c67-bf1b-43e5-8a6e-64fb3f240e52',\n",
" '30d599a7-4a1a-47a9-bbf8-6ed393e2e33c']"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import PyPDFLoader\n",
"from uuid import uuid4\n",
"\n",
"# Load the PDF\n",
"loader = PyPDFLoader(\"https://arxiv.org/pdf/2303.08774.pdf\")\n",
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5578113",
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"from langchain_core.documents import Document\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)\n",
"docs = text_splitter.split_documents(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d378168f",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import MongoDBAtlasVectorSearch\n",
"from langchain_openai import OpenAIEmbeddings\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"# insert the documents in MongoDB Atlas with their embedding\n",
"vector_search = MongoDBAtlasVectorSearch.from_documents(\n",
" documents=docs,\n",
" embedding=OpenAIEmbeddings(disallowed_special=()),\n",
" collection=MONGODB_COLLECTION,\n",
" index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bf6841e",
"metadata": {},
"outputs": [],
"source": [
"# Perform a similarity search between the embedding of the query and the embeddings of the documents\n",
"query = \"What were the compute requirements for training GPT 4\"\n",
"results = vector_search.similarity_search(query)\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"print(results[0].page_content)"
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "9e58c2d8",
"id": "639f29da",
"metadata": {},
"source": [
"# Querying data"
]
},
{
"cell_type": "markdown",
"id": "851a2ec9-9390-49a4-8412-3e132c9f789d",
"metadata": {},
"source": [
"We can also instantiate the vector store directly and execute a query as follows:"
"### Delete items from vector store\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "985d28fe",
"execution_count": 58,
"id": "bbb5fd5c",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.vectorstores import MongoDBAtlasVectorSearch\n",
"from langchain_openai import OpenAIEmbeddings\n",
"vector_store.delete(ids=[uuids[-1]])"
]
},
{
"cell_type": "markdown",
"id": "d6111eb6",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"vector_search = MongoDBAtlasVectorSearch.from_connection_string(\n",
" MONGODB_ATLAS_CLUSTER_URI,\n",
" DB_NAME + \".\" + COLLECTION_NAME,\n",
" OpenAIEmbeddings(disallowed_special=()),\n",
" index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,\n",
")"
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"#### Similarity search\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "19b60ac0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'_id': 'e7d95150-67f6-499f-b611-84367c50fa60', 'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'_id': '7614bb54-4eb5-4b3b-990c-00e35cb31f99', 'source': 'tweet'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\", k=2\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "02aef29c-5da0-41b8-b4fc-98fd71b94abf",
"id": "6c624606",
"metadata": {},
"source": [
"## Pre-filtering with Similarity Search"
"#### Similarity search with score\n",
"\n",
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "e919fa51",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.784560] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'_id': '8396a68d-f4a3-4176-a581-a1a8c303eea4', 'source': 'news'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\"Will it be hot tomorrow?\", k=1)\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "f3b2d36d-d47a-482f-999d-85c23eb67eed",
"id": "513a1416",
"metadata": {},
"source": [
"### Pre-filtering with Similarity Search"
]
},
{
"cell_type": "markdown",
"id": "ac58c6c7",
"metadata": {},
"source": [
"Atlas Vector Search supports pre-filtering using MQL Operators for filtering. Below is an example index and query on the same data loaded above that allows you do metadata filtering on the \"page\" field. You can update your existing index with the filter defined and do pre-filtering with vector search."
@@ -300,7 +366,7 @@
},
{
"cell_type": "markdown",
"id": "2b385a46-1e54-471f-95b2-202813d90bb2",
"id": "dacac7b8",
"metadata": {},
"source": [
"```json\n",
@@ -314,7 +380,7 @@
" },\n",
" {\n",
" \"type\": \"filter\",\n",
" \"path\": \"page\"\n",
" \"path\": \"source\"\n",
" }\n",
" ]\n",
"}\n",
@@ -325,128 +391,134 @@
"```python\n",
"vectorstore.create_index(\n",
" dimensions=1536,\n",
" filters=[{\"type\":\"filter\", \"path\":\"page\"}],\n",
" filters=[{\"type\":\"filter\", \"path\":\"source\"}],\n",
" update=True\n",
")\n",
"```\n",
"\n",
"And then you can run a query with filter as follows:\n",
"\n",
"```python\n",
"results = vector_store.similarity_search(query=\"foo\",k=1,pre_filter={\"source\": {\"$eq\": \"https://example.com\"}})\n",
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "32b13a9b",
"metadata": {},
"source": [
"#### Other search methods\n",
"\n",
"There are a variety of other search methods that are not covered in this notebook, such as MMR search or searching by vector. For a full list of the search abilities available for `AstraDBVectorStore` check out the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html)."
]
},
{
"cell_type": "markdown",
"id": "01316a42",
"metadata": {},
"source": [
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. \n",
"\n",
"Here is how to transform your vector store into a retriever and then invoke the retreiever with a simple query and filter."
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "8f246301",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'_id': '8c31b84e-2636-48b6-8b99-9fccb47f7051', 'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = vector_store.as_retriever(\n",
" search_type=\"similarity_score_threshold\",\n",
" search_kwargs={\"k\": 1, \"score_threshold\": 0.2},\n",
")\n",
"retriever.invoke(\"Stealing from the bank is a crime\")"
]
},
{
"cell_type": "markdown",
"id": "72312657",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfc8487d-14ec-42c9-9670-80fe02816196",
"execution_count": 66,
"id": "a42da723",
"metadata": {},
"outputs": [],
"source": [
"query = \"What were the compute requirements for training GPT 4\"\n",
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"results = vector_search.similarity_search_with_score(\n",
" query=query, k=5, pre_filter={\"page\": {\"$eq\": 1}}\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "80c1130f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of such applications.'"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"# Display results\n",
"for result in results:\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"id": "6d9a2dbe",
"metadata": {},
"source": [
"## Similarity Search with Score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "497baffa",
"metadata": {},
"outputs": [],
"source": [
"query = \"What were the compute requirements for training GPT 4\"\n",
"\n",
"results = vector_search.similarity_search_with_score(\n",
" query=query,\n",
" k=5,\n",
")\n",
"\n",
"# Display results\n",
"for result in results:\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"id": "cbade5f0",
"metadata": {},
"source": [
"## Question Answering "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc6475f9",
"metadata": {},
"outputs": [],
"source": [
"qa_retriever = vector_search.as_retriever(\n",
" search_type=\"similarity\",\n",
" search_kwargs={\"k\": 25},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e13e96c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"prompt_template = \"\"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n",
"\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"PROMPT = PromptTemplate(\n",
" template=prompt_template, input_variables=[\"context\", \"question\"]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff0edb02",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain_openai import OpenAI\n",
"\n",
"qa = RetrievalQA.from_chain_type(\n",
" llm=OpenAI(),\n",
" chain_type=\"stuff\",\n",
" retriever=qa_retriever,\n",
" return_source_documents=True,\n",
" chain_type_kwargs={\"prompt\": PROMPT},\n",
")\n",
"\n",
"docs = qa({\"query\": \"gpt-4 compute requirements\"})\n",
"\n",
"print(docs[\"result\"])\n",
"print(docs[\"source_documents\"])"
]
},
{
"cell_type": "markdown",
"id": "61636bb2",
"metadata": {},
"source": [
"GPT-4 requires significantly more compute than earlier GPT models. On a dataset derived from OpenAI's internal codebase, GPT-4 requires 100p (petaflops) of compute to reach the lowest loss, while the smaller models require 1-10n (nanoflops)."
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
@@ -460,6 +532,16 @@
">* The langchain version 0.0.305 ([release notes](https://github.com/langchain-ai/langchain/releases/tag/v0.0.305)) introduces the support for $vectorSearch MQL stage, which is available with MongoDB Atlas 6.0.11 and 7.0.2. Users utilizing earlier versions of MongoDB Atlas need to pin their LangChain version to <=0.0.304\n",
"> "
]
},
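{
"cell_type": "markdown",
"id": "f7c9d3b2",
"metadata": {},
"source": [
"Such a pin might look like this (a hypothetical command; adjust it to your package manager):\n",
"\n",
"```shell\n",
"pip install \"langchain<=0.0.304\"\n",
"```"
]
},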
{
"cell_type": "markdown",
"id": "186ef502",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `MongoDBAtlasVectorSearch` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/mongodb_api_reference.html"
]
}
],
"metadata": {
@@ -478,7 +560,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -11,12 +11,6 @@
"\n",
"The code lives in an integration package called: [langchain_postgres](https://github.com/langchain-ai/langchain-postgres/).\n",
"\n",
"You can run the following command to spin up a a postgres container with the `pgvector` extension:\n",
"\n",
"```shell\n",
"docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n",
"```\n",
"\n",
"## Status\n",
"\n",
"This code has been ported over from `langchain_community` into a dedicated package called `langchain-postgres`. The following changes have been made:\n",
@@ -27,30 +21,39 @@
"\n",
"\n",
"Currently, there is **no mechanism** that supports easy data migration on schema changes. So any schema changes in the vectorstore will require the user to recreate the tables and re-add the documents.\n",
"If this is a concern, please use a different vectorstore. If not, this implementation should be fine for your use case."
]
},
{
"cell_type": "markdown",
"id": "342cd5e9-f349-42b4-9713-12e63779835b",
"metadata": {},
"source": [
"## Install dependencies\n",
"If this is a concern, please use a different vectorstore. If not, this implementation should be fine for your use case.\n",
"\n",
"Here, we're using `langchain_cohere` for embeddings, but you can use other embeddings providers."
"## Setup\n",
"\n",
"First donwload the partner package:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "42d42297-11b8-44e3-bf21-7c3d1bce8277",
"metadata": {
"tags": []
},
"execution_count": null,
"id": "92df32f0",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet -U langchain_cohere\n",
"!pip install --quiet -U langchain_postgres"
"pip install -qU langchain_postgres"
]
},
{
"cell_type": "markdown",
"id": "0dd87fcc",
"metadata": {},
"source": [
"You can run the following command to spin up a a postgres container with the `pgvector` extension:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2acbaf9b",
"metadata": {},
"outputs": [],
"source": [
"%docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16"
]
},
{
@@ -58,7 +61,56 @@
"id": "eee31ce1-2c28-484d-82be-d22d9f9a31fd",
"metadata": {},
"source": [
"## Initialize the vectorstore"
"### Credentials\n",
"\n",
"There are no credentials needed to run this notebook, just make sure you downloaded the `langchain_postgres` package and correctly started the postgres container."
]
},
{
"cell_type": "markdown",
"id": "fa4026f7",
"metadata": {},
"source": [
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f8e2f23",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"id": "ec44dfcc",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "94f5c129",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
@@ -70,7 +122,6 @@
},
"outputs": [],
"source": [
"from langchain_cohere import CohereEmbeddings\n",
"from langchain_core.documents import Document\n",
"from langchain_postgres import PGVector\n",
"from langchain_postgres.vectorstores import PGVector\n",
@@ -78,9 +129,9 @@
"# See docker command above to launch a postgres instance with pgvector enabled.\n",
"connection = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\" # Uses psycopg3!\n",
"collection_name = \"my_docs\"\n",
"embeddings = CohereEmbeddings(model=\"embed-english-v3.0\")\n",
"\n",
"vectorstore = PGVector(\n",
"\n",
"vector_store = PGVector(\n",
" embeddings=embeddings,\n",
" collection_name=collection_name,\n",
" connection=connection,\n",
@@ -88,95 +139,22 @@
")"
]
},
{
"cell_type": "markdown",
"id": "61a224a1-d70b-4daf-86ba-ab6e43c08b50",
"metadata": {},
"source": [
"## Add documents\n",
"## Manage vector store\n",
"\n",
"Add documents to the vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "88a288cc-ffd4-4800-b011-750c72b9fd10",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = [\n",
" Document(\n",
" page_content=\"there are cats in the pond\",\n",
" metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"ducks are also found in the pond\",\n",
" metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"fresh apples are available at the market\",\n",
" metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the market also sells fresh oranges\",\n",
" metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the new art exhibit is fascinating\",\n",
" metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a sculpture exhibit is also at the museum\",\n",
" metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a new coffee shop opened on Main Street\",\n",
" metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the book club meets at the library\",\n",
" metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the library hosts a weekly story time for kids\",\n",
" metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
" ),\n",
"]"
"### Add items to vector store\n",
"\n",
"Note that adding documents by ID will over-write any existing documents that match that ID."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "73aa9124-9d49-4e10-8ed3-82255e7a4106",
"id": "88a288cc-ffd4-4800-b011-750c72b9fd10",
"metadata": {
"tags": []
},
@@ -192,58 +170,6 @@
"output_type": "execute_result"
}
],
"source": [
"vectorstore.add_documents(docs, ids=[doc.metadata[\"id\"] for doc in docs])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a5b2b71f-49eb-407d-b03a-dea4c0a517d6",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),\n",
" Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),\n",
" Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),\n",
" Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),\n",
" Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),\n",
" Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),\n",
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),\n",
" Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'}),\n",
" Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),\n",
" Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'})]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\"kitty\", k=10)"
]
},
{
"cell_type": "markdown",
"id": "1d87a413-015a-4b46-a64e-332f30806524",
"metadata": {},
"source": [
"Adding documents by ID will over-write any existing documents that match that ID."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "13c69357-aaee-4de0-bcc2-7ab4419c920e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"docs = [\n",
" Document(\n",
@@ -286,7 +212,29 @@
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
" ),\n",
"]"
"]\n",
"\n",
"vector_store.add_documents(docs, ids=[doc.metadata[\"id\"] for doc in docs])"
]
},
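{
"cell_type": "markdown",
"id": "9c7e1a55",
"metadata": {},
"source": [
"For example, re-adding a document whose ID is already present replaces the stored entry (a minimal sketch reusing the `Document` class and `vector_store` from above):\n",
"\n",
"```python\n",
"updated = Document(\n",
"    page_content=\"there are cats and ducks in the pond\",\n",
"    metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
")\n",
"vector_store.add_documents([updated], ids=[updated.metadata[\"id\"]])\n",
"```"
]
},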
{
"cell_type": "markdown",
"id": "0c712fa3",
"metadata": {},
"source": [
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "a5b2b71f-49eb-407d-b03a-dea4c0a517d6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"vector_store.delete(ids=[\"3\"])"
]
},
{
@@ -294,7 +242,11 @@
"id": "59f82250-7903-4279-8300-062542c83416",
"metadata": {},
"source": [
"## Filtering Support\n",
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Filtering Support\n",
"\n",
"The vectorstore supports a set of filters that can be applied against the metadata fields of the documents.\n",
"\n",
@@ -312,33 +264,38 @@
"| \\$like | Text (like) |\n",
"| \\$ilike | Text (case-insensitive like) |\n",
"| \\$and | Logical (and) |\n",
"| \\$or | Logical (or) |"
"| \\$or | Logical (or) |\n",
"\n",
"### Query directly\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 15,
"id": "f15a2359-6dc3-4099-8214-785f167a9ca4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),\n",
" Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),\n",
" Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),\n",
" Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'})]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"* there are cats in the pond [{'id': 1, 'topic': 'animals', 'location': 'pond'}]\n",
"* the library hosts a weekly story time for kids [{'id': 9, 'topic': 'reading', 'location': 'library'}]\n",
"* ducks are also found in the pond [{'id': 2, 'topic': 'animals', 'location': 'pond'}]\n",
"* the new art exhibit is fascinating [{'id': 5, 'topic': 'art', 'location': 'museum'}]\n"
]
}
],
"source": [
"vectorstore.similarity_search(\"kitty\", k=10, filter={\"id\": {\"$in\": [1, 5, 2, 9]}})"
"results = vector_store.similarity_search(\n",
" \"kitty\", k=10, filter={\"id\": {\"$in\": [1, 5, 2, 9]}}\n",
")\n",
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")"
]
},
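{
"cell_type": "markdown",
"id": "4b3d8e21",
"metadata": {},
"source": [
"The remaining operators from the table above are used the same way; for example, a case-insensitive text match on a metadata field can be sketched as:\n",
"\n",
"```python\n",
"vector_store.similarity_search(\n",
"    \"ducks\",\n",
"    k=10,\n",
"    filter={\"location\": {\"$ilike\": \"%pond%\"}},\n",
")\n",
"```"
]
},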
{
@@ -351,7 +308,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 16,
"id": "88f919e4-e4b0-4b5f-99b3-24c675c26d33",
"metadata": {
"tags": []
@@ -360,17 +317,17 @@
{
"data": {
"text/plain": [
"[Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),\n",
" Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]"
"[Document(metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond'),\n",
" Document(metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}, page_content='ducks are also found in the pond')]"
]
},
"execution_count": 10,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\n",
"vector_store.similarity_search(\n",
" \"ducks\",\n",
" k=10,\n",
" filter={\"id\": {\"$in\": [1, 5, 2, 9]}, \"location\": {\"$in\": [\"pond\", \"market\"]}},\n",
@@ -379,7 +336,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 17,
"id": "88f423a4-6575-4fb8-9be2-a3da01106591",
"metadata": {
"tags": []
@@ -388,17 +345,17 @@
{
"data": {
"text/plain": [
"[Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),\n",
" Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]"
"[Document(metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond'),\n",
" Document(metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}, page_content='ducks are also found in the pond')]"
]
},
"execution_count": 11,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\n",
"vector_store.similarity_search(\n",
" \"ducks\",\n",
" k=10,\n",
" filter={\n",
@@ -410,34 +367,145 @@
")"
]
},
{
"cell_type": "markdown",
"id": "2e65adc1",
"metadata": {},
"source": [
"If you want to execute a similarity search and receive the corresponding scores you can run:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "65133340-2acd-4957-849e-029b6b5d60f0",
"metadata": {
"tags": []
},
"execution_count": 18,
"id": "7d92e7b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.763449] there are cats in the pond [{'id': 1, 'topic': 'animals', 'location': 'pond'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(query=\"cats\", k=1)\n",
"for doc, score in results:\n",
" print(f\"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "8d40db8c",
"metadata": {},
"source": [
"For a full list of the different searches you can execute on a `PGVector` vector store, please refer to the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_postgres.vectorstores.PGVector.html).\n",
"\n",
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7cd1fb75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),\n",
" Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),\n",
" Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),\n",
" Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),\n",
" Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),\n",
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),\n",
" Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'}),\n",
" Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'})]"
"[Document(metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}, page_content='there are cats in the pond')]"
]
},
"execution_count": 12,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\"bird\", k=10, filter={\"location\": {\"$ne\": \"pond\"}})"
"retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
"retriever.invoke(\"kitty\")"
]
},
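{
"cell_type": "markdown",
"id": "d1f8a7c3",
"metadata": {},
"source": [
"Other retriever types are configured the same way; for instance, a similarity-score-threshold retriever (a minimal sketch; the variable name is our own):\n",
"\n",
"```python\n",
"threshold_retriever = vector_store.as_retriever(\n",
"    search_type=\"similarity_score_threshold\",\n",
"    search_kwargs={\"k\": 1, \"score_threshold\": 0.5},\n",
")\n",
"threshold_retriever.invoke(\"kitty\")\n",
"```"
]
},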
{
"cell_type": "markdown",
"id": "7ecd77a0",
"metadata": {},
"source": [
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f0b14168",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a4eba12c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'There are cats in the pond right now.'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"Who is at the pond right now?\")"
]
},
{
"cell_type": "markdown",
"id": "f451f361",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all __ModuleName__VectorStore features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_postgres.vectorstores.PGVector.html"
]
}
],
@@ -457,7 +525,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -12,8 +12,9 @@
"\n",
"This notebook shows how to use functionality related to the `Pinecone` vector database.\n",
"\n",
"Set the following environment variables to follow along in this doc:\n",
"- `OPENAI_API_KEY`: Your OpenAI API key, for using `OpenAIEmbeddings`"
"## Setup\n",
"\n",
"To use the `PineconeVectorStore` you first need to install the partner package, as well as the other packages used throughout this notebook."
]
},
{
@@ -25,12 +26,7 @@
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet \\\n",
" langchain-pinecone \\\n",
" langchain-openai \\\n",
" langchain \\\n",
" langchain-community \\\n",
" pinecone-notebooks"
"%pip install -qU langchain-pinecone pinecone-notebooks"
]
},
{
@@ -43,76 +39,52 @@
},
{
"cell_type": "markdown",
"id": "42f2ea67",
"id": "ef6dc4de",
"metadata": {},
"source": [
"First, let's split our state of the union document into chunked `docs`."
"### Credentials\n",
"\n",
"Create a new Pinecone account, or sign into your existing one, and create an API key to use in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "eb554814",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"import time\n",
"\n",
"from pinecone import Pinecone, ServerlessSpec\n",
"\n",
"if not os.getenv(\"PINECONE_API_KEY\"):\n",
" os.environ[\"PINECONE_API_KEY\"] = getpass.getpass(\"Enter your Pinecone API key: \")\n",
"\n",
"pinecone_api_key = os.environ.get(\"PINECONE_API_KEY\")\n",
"\n",
"pc = Pinecone(api_key=pinecone_api_key)"
]
},
{
"cell_type": "markdown",
"id": "6ef1d828",
"metadata": {},
"source": [
"If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a3c3999a",
"id": "23b5ac5e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "ef6dc4de",
"metadata": {},
"source": [
"Now let's create a new Pinecone account, or sign into your existing one, and create an API key to use in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1fdc3c36",
"metadata": {},
"outputs": [],
"source": [
"from pinecone_notebooks.colab import Authenticate\n",
"\n",
"Authenticate()"
]
},
{
"cell_type": "markdown",
"id": "54da1a39",
"metadata": {},
"source": [
"The newly created API key has been stored in the `PINECONE_API_KEY` environment variable. We will use it to setup the Pinecone client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb554814",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"pinecone_api_key = os.environ.get(\"PINECONE_API_KEY\")\n",
"pinecone_api_key\n",
"\n",
"import time\n",
"\n",
"from pinecone import Pinecone, ServerlessSpec\n",
"\n",
"pc = Pinecone(api_key=pinecone_api_key)"
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
@@ -120,26 +92,28 @@
"id": "658706a3",
"metadata": {},
"source": [
"Next, let's connect to your Pinecone index. If one named `index_name` doesn't exist, it will be created."
"## Initialization\n",
"\n",
"Before initializing our vector store, let's connect to a Pinecone index. If one named `index_name` doesn't exist, it will be created."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"id": "276a06dd",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"index_name = \"langchain-index\" # change if desired\n",
"index_name = \"langchain-test-index\" # change if desired\n",
"\n",
"existing_indexes = [index_info[\"name\"] for index_info in pc.list_indexes()]\n",
"\n",
"if index_name not in existing_indexes:\n",
" pc.create_index(\n",
" name=index_name,\n",
" dimension=1536,\n",
" dimension=3072,\n",
" metric=\"cosine\",\n",
" spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),\n",
" )\n",
@@ -154,24 +128,188 @@
"id": "3a4d377f",
"metadata": {},
"source": [
"Now that our Pinecone index is setup, we can upsert those chunked docs as contents with `PineconeVectorStore.from_documents`."
"Now that our Pinecone index is setup, we can initialize our vector store. \n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 13,
"id": "1485db56",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"from langchain_pinecone import PineconeVectorStore\n",
"\n",
"docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)"
"vector_store = PineconeVectorStore(index=index, embedding=embeddings)"
]
},
{
"cell_type": "markdown",
"id": "48721e29",
"metadata": {},
"source": [
"## Manage vector store\n",
"\n",
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 15,
"id": "70e688f4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['167b8681-5974-467f-adcb-6e987a18df01',\n",
" 'd16010fd-41f8-4d49-9c22-c66d5555a3fe',\n",
" 'ffcacfb3-2bc2-44c3-a039-c2256a905c0e',\n",
" 'cf3bfc9f-5dc7-4f5e-bb41-edb957394126',\n",
" 'e99b07eb-fdff-4cb9-baa8-619fd8efeed3',\n",
" '68c93033-a24f-40bd-8492-92fa26b631a4',\n",
" 'b27a4ecb-b505-4c5d-89ff-526e3d103558',\n",
" '4868a9e6-e6fb-4079-b400-4a1dfbf0d4c4',\n",
" '921c0e9c-0550-4eb5-9a6c-ed44410788b2',\n",
" 'c446fc23-64e8-47e7-8c19-ecf985e9411e']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "120922b3",
"metadata": {},
"source": [
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5b8437cd",
"metadata": {},
"outputs": [],
"source": [
"vector_store.delete(ids=[uuids[-1]])"
]
},
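{
"cell_type": "markdown",
"id": "e3a61c90",
"metadata": {},
"source": [
"`delete` also accepts a `delete_all` flag if you want to clear the index (a sketch; this is destructive, so it is left commented out):\n",
"\n",
"```python\n",
"# vector_store.delete(delete_all=True)\n",
"```"
]
},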
{
"cell_type": "markdown",
"id": "5ee21c89",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "ffbcb3fb",
"metadata": {},
"outputs": [
@@ -179,214 +317,169 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
"* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)\n",
"print(docs[0].page_content)"
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\",\n",
" k=2,\n",
" filter={\"source\": \"tweet\"},\n",
")\n",
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "86a4b96b",
"id": "79f3494d",
"metadata": {},
"source": [
"### Adding More Text to an Existing Index\n",
"#### Similarity search with score\n",
"\n",
"More text can embedded and upserted to an existing Pinecone index using the `add_texts` function\n"
"You can also search with score:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "38a7a60e",
"execution_count": 18,
"id": "5fb24583",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.553187] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\n",
" \"Will it be hot tomorrow?\", k=1, filter={\"source\": \"news\"}\n",
")\n",
"for res, score in results:\n",
" print(f\"* [SIM={score:3f}] {res.page_content} [{res.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "1855941b",
"metadata": {},
"source": [
"#### Other search methods\n",
"\n",
"There are more search methods (such as MMR) not listed in this notebook, to find all of them be sure to read the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html).\n",
"\n",
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "78140e87",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['24631802-4bad-44a7-a4ba-fd71f00cc160']"
"[Document(metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 8,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)\n",
"\n",
"vectorstore.add_texts([\"More text!\"])"
"retriever = vector_store.as_retriever(\n",
" search_type=\"similarity_score_threshold\",\n",
" search_kwargs={\"k\": 1, \"score_threshold\": 0.5},\n",
")\n",
"retriever.invoke(\"Stealing from the bank is a crime\", filter={\"source\": \"news\"})"
]
},
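{
"cell_type": "markdown",
"id": "a8b4f2d6",
"metadata": {},
"source": [
"And here is the MMR sketch mentioned above (a minimal example; see the API reference for the full parameter list):\n",
"\n",
"```python\n",
"vector_store.max_marginal_relevance_search(\n",
"    \"LangChain provides abstractions to make working with LLMs easy\",\n",
"    k=2,\n",
"    fetch_k=10,\n",
")\n",
"```"
]
},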
{
"attachments": {},
"cell_type": "markdown",
"id": "d46d1452",
"id": "72990cb5",
"metadata": {},
"source": [
"### Maximal Marginal Relevance Searches\n",
"## Chain usage\n",
"\n",
"In addition to using similarity search in the retriever object, you can also use `mmr` as retriever.\n"
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a359ed74",
"execution_count": 20,
"id": "f12560cb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"## Document 0\n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"\n",
"## Document 1\n",
"\n",
"And Im taking robust action to make sure the pain of our sanctions is targeted at Russias economy. And I will use every tool at our disposal to protect American businesses and consumers. \n",
"\n",
"Tonight, I can announce that the United States has worked with 30 other countries to release 60 Million barrels of oil from reserves around the world. \n",
"\n",
"America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies. \n",
"\n",
"These steps will help blunt gas prices here at home. And I know the news about whats happening can seem alarming. \n",
"\n",
"But I want you to know that we are going to be okay. \n",
"\n",
"When the history of this era is written Putins war on Ukraine will have left Russia weaker and the rest of the world stronger. \n",
"\n",
"While it shouldnt have taken something so terrible for people around the world to see whats at stake now everyone sees it clearly.\n",
"\n",
"## Document 2\n",
"\n",
"We cant change how divided weve been. But we can change how we move forward—on COVID-19 and other issues we must face together. \n",
"\n",
"I recently visited the New York City Police Department days after the funerals of Officer Wilbert Mora and his partner, Officer Jason Rivera. \n",
"\n",
"They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime prevention and community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety.\n",
"\n",
"## Document 3\n",
"\n",
"One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war—medical and hazard material, jet fuel, and more. \n",
"\n",
"When they came home, many of the worlds fittest and best trained warriors were never the same. \n",
"\n",
"Headaches. Numbness. Dizziness. \n",
"\n",
"A cancer that would put them in a flag-draped coffin. \n",
"\n",
"I know. \n",
"\n",
"One of those soldiers was my son Major Beau Biden. \n",
"\n",
"We dont know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. \n",
"\n",
"But Im committed to finding out everything we can. \n",
"\n",
"Committed to military families like Danielle Robinson from Ohio. \n",
"\n",
"The widow of Sergeant First Class Heath Robinson. \n",
"\n",
"He was born a soldier. Army National Guard. Combat medic in Kosovo and Iraq. \n",
"\n",
"Stationed near Baghdad, just yards from burn pits the size of football fields. \n",
"\n",
"Heaths widow Danielle is here with us tonight. They loved going to Ohio State football games. He loved building Legos with their daughter.\n"
]
}
],
"outputs": [],
"source": [
"retriever = docsearch.as_retriever(search_type=\"mmr\")\n",
"matched_docs = retriever.invoke(query)\n",
"for i, d in enumerate(matched_docs):\n",
" print(f\"\\n## Document {i}\\n\")\n",
" print(d.page_content)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7c477287",
"metadata": {},
"source": [
"Or use `max_marginal_relevance_search` directly:"
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9ca82740",
"execution_count": 21,
"id": "262651fc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. \n",
"\n",
"2. We cant change how divided weve been. But we can change how we move forward—on COVID-19 and other issues we must face together. \n",
"\n",
"I recently visited the New York City Police Department days after the funerals of Officer Wilbert Mora and his partner, Officer Jason Rivera. \n",
"\n",
"They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime prevention and community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety. \n",
"\n"
]
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of these types of applications.'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)\n",
"for i, doc in enumerate(found_docs):\n",
" print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
"cell_type": "markdown",
"id": "0d5722bc",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all __ModuleName__VectorStore features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html"
]
}
],
@@ -406,7 +499,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
"version": "3.11.9"
}
},
"nbformat": 4,

View File

@@ -14,6 +14,9 @@
"\n",
"> This page documents the `QdrantVectorStore` class that supports multiple retrieval modes via Qdrant's new [Query API](https://qdrant.tech/blog/qdrant-1.10.x/). It requires you to run Qdrant v1.10.0 or above.\n",
"\n",
"\n",
"## Setup\n",
"\n",
"There are various modes of how to run `Qdrant`, and depending on the chosen one, there will be some subtle differences. The options include:\n",
"- Local mode, no server required\n",
"- Docker deployments\n",
@@ -31,56 +34,30 @@
},
"outputs": [],
"source": [
"%pip install langchain-qdrant langchain-openai langchain"
"%pip install -qU langchain-qdrant 'qdrant-client[fastembed]'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7b2f111b-357a-4f42-9730-ef0603bdc1b5",
"id": "7d387fea",
"metadata": {},
"source": [
"We will use `OpenAIEmbeddings` for demonstration."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "aac9563e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.282884Z",
"start_time": "2023-04-04T10:51:21.408077Z"
},
"tags": []
},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_qdrant import QdrantVectorStore\n",
"from langchain_text_splitters import CharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a3c3999a",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.520144Z",
"start_time": "2023-04-04T10:51:22.285826Z"
},
"tags": []
},
"outputs": [],
"source": [
"loader = TextLoader(\"some-file.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"### Credentials\n",
"\n",
"embeddings = OpenAIEmbeddings()"
"There are no credentials needed to run the code in this notebook.\n",
"\n",
"If you want to get best in-class automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4912937d",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
@@ -89,7 +66,7 @@
"id": "eeead681",
"metadata": {},
"source": [
"## Connecting to Qdrant from LangChain\n",
"## Initialization\n",
"\n",
"### Local mode\n",
"\n",
@@ -97,12 +74,33 @@
"\n",
"#### In-memory\n",
"\n",
"For some testing scenarios and quick experiments, you may prefer to keep all the data in memory only, so it gets lost when the client is destroyed - usually at the end of your script/notebook."
"For some testing scenarios and quick experiments, you may prefer to keep all the data in memory only, so it gets lost when the client is destroyed - usually at the end of your script/notebook.\n",
"\n",
"\n",
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 1,
"id": "1df86797",
"metadata": {},
"outputs": [],
"source": [
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8429667e",
"metadata": {
"ExecuteTime": {
@@ -113,11 +111,21 @@
},
"outputs": [],
"source": [
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
" location=\":memory:\", # Local mode with in-memory storage only\n",
" collection_name=\"my_documents\",\n",
"from langchain_qdrant import QdrantVectorStore\n",
"from qdrant_client import QdrantClient\n",
"from qdrant_client.http.models import Distance, VectorParams\n",
"\n",
"client = QdrantClient(\":memory:\")\n",
"\n",
"client.create_collection(\n",
" collection_name=\"demo_collection\",\n",
" vectors_config=VectorParams(size=3072, distance=Distance.COSINE),\n",
")\n",
"\n",
"vector_store = QdrantVectorStore(\n",
" client=client,\n",
" collection_name=\"demo_collection\",\n",
" embedding=embeddings,\n",
")"
]
},
@@ -134,7 +142,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "24b370e2",
"metadata": {
"ExecuteTime": {
@@ -145,11 +153,17 @@
},
"outputs": [],
"source": [
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
" path=\"/tmp/local_qdrant\",\n",
" collection_name=\"my_documents\",\n",
"client = QdrantClient(path=\"/tmp/langchain_qdrant\")\n",
"\n",
"client.create_collection(\n",
" collection_name=\"demo_collection\",\n",
" vectors_config=VectorParams(size=3072, distance=Distance.COSINE),\n",
")\n",
"\n",
"vector_store = QdrantVectorStore(\n",
" client=client,\n",
" collection_name=\"demo_collection\",\n",
" embedding=embeddings,\n",
")"
]
},
@@ -177,6 +191,7 @@
"outputs": [],
"source": [
"url = \"<---qdrant url here --->\"\n",
"docs = [] # put docs here\n",
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
@@ -252,37 +267,144 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "93540013",
"id": "3cddef6e",
"metadata": {},
"source": [
"## Recreating the collection\n",
"## Manage vector store\n",
"\n",
"The collection is reused if it already exists. Setting `force_recreate` to `True` allows to remove the old collection and start from scratch."
"Once you have created your vector store, we can interact with it by adding and deleting different items.\n",
"\n",
"### Add items to vector store\n",
"\n",
"We can add items to our vector store by using the `add_documents` function."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "30a87570",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:24.854117Z",
"start_time": "2023-04-04T10:51:24.845385Z"
"id": "7697a362",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['c04134c3-273d-4766-949a-eee46052ad32',\n",
" '9e6ba50c-794f-4b88-94e5-411f15052a02',\n",
" 'd3202666-6f2b-4186-ac43-e35389de8166',\n",
" '50d8d6ee-69bf-4173-a6a2-b254e9928965',\n",
" 'bd2eae02-74b5-43ec-9fcf-09e9d9db6fd3',\n",
" '6dae6b37-826d-4f14-8376-da4603b35de3',\n",
" 'b0964ab5-5a14-47b4-a983-37fa5c5bd154',\n",
" '91ed6c56-fe53-49e2-8199-c3bb3c33c3eb',\n",
" '42a580cb-7469-4324-9927-0febab57ce92',\n",
" 'ff774e5c-f158-4d12-94e2-0a0162b22f27']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"<---qdrant url here --->\"\n",
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embeddings,\n",
" url=url,\n",
" prefer_grpc=True,\n",
" collection_name=\"my_documents\",\n",
" force_recreate=True,\n",
")"
"from uuid import uuid4\n",
"\n",
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(\n",
" page_content=\"I had chocalate chip pancakes and scrambled eggs for breakfast this morning.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_2 = Document(\n",
" page_content=\"The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_3 = Document(\n",
" page_content=\"Building an exciting new project with LangChain - come check it out!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_4 = Document(\n",
" page_content=\"Robbers broke into the city bank and stole $1 million in cash.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_5 = Document(\n",
" page_content=\"Wow! That was an amazing movie. I can't wait to see it again.\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_6 = Document(\n",
" page_content=\"Is the new iPhone worth the price? Read this review to find out.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_7 = Document(\n",
" page_content=\"The top 10 soccer players in the world right now.\",\n",
" metadata={\"source\": \"website\"},\n",
")\n",
"\n",
"document_8 = Document(\n",
" page_content=\"LangGraph is the best framework for building stateful, agentic applications!\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"document_9 = Document(\n",
" page_content=\"The stock market is down 500 points today due to fears of a recession.\",\n",
" metadata={\"source\": \"news\"},\n",
")\n",
"\n",
"document_10 = Document(\n",
" page_content=\"I have a bad feeling I am going to get deleted :(\",\n",
" metadata={\"source\": \"tweet\"},\n",
")\n",
"\n",
"documents = [\n",
" document_1,\n",
" document_2,\n",
" document_3,\n",
" document_4,\n",
" document_5,\n",
" document_6,\n",
" document_7,\n",
" document_8,\n",
" document_9,\n",
" document_10,\n",
"]\n",
"uuids = [str(uuid4()) for _ in range(len(documents))]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=uuids)"
]
},
{
"cell_type": "markdown",
"id": "5fd23102",
"metadata": {},
"source": [
"### Delete items from vector store"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "999cafcc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector_store.delete(ids=[uuids[-1]])"
]
},
{
@@ -296,33 +418,18 @@
}
},
"source": [
"## Similarity search\n",
"## Query vector store\n",
"\n",
"The simplest scenario for using Qdrant vector store is to perform a similarity search. Under the hood, our query will be encoded into vector embeddings and used to find similar documents in Qdrant collection.\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"`QdrantVectorStore` supports 3 modes for similarity searches. They can be configured using the `retrieval_mode` parameter when setting up the class.\n",
"### Query directly\n",
"\n",
"- Dense Vector Search(Default)\n",
"- Sparse Vector Search\n",
"- Hybrid Search"
]
},
{
"cell_type": "markdown",
"id": "b3a78d46",
"metadata": {},
"source": [
"### Dense Vector Search\n",
"\n",
"To search with only dense vectors,\n",
"\n",
"- The `retrieval_mode` parameter should be set to `RetrievalMode.DENSE`(default).\n",
"- A [dense embeddings](https://python.langchain.com/v0.2/docs/integrations/text_embedding/) value should be provided to the `embedding` parameter."
"The simplest scenario for using Qdrant vector store is to perform a similarity search. Under the hood, our query will be encoded into vector embeddings and used to find similar documents in Qdrant collection."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"id": "a8c513ab",
"metadata": {
"ExecuteTime": {
@@ -331,20 +438,22 @@
},
"tags": []
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet', '_id': 'd3202666-6f2b-4186-ac43-e35389de8166', '_collection_name': 'demo_collection'}]\n",
"* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet', '_id': '91ed6c56-fe53-49e2-8199-c3bb3c33c3eb', '_collection_name': 'demo_collection'}]\n"
]
}
],
"source": [
"from langchain_qdrant import RetrievalMode\n",
"\n",
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embedding=embeddings,\n",
" location=\":memory:\",\n",
" collection_name=\"my_documents\",\n",
" retrieval_mode=RetrievalMode.DENSE,\n",
"results = vector_store.similarity_search(\n",
" \"LangChain provides abstractions to make working with LLMs easy\", k=2\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.similarity_search(query)"
"for res in results:\n",
" print(f\"* {res.page_content} [{res.metadata}]\")"
]
},
{
@@ -352,6 +461,19 @@
"id": "dbd93d85",
"metadata": {},
"source": [
"`QdrantVectorStore` supports 3 modes for similarity searches. They can be configured using the `retrieval_mode` parameter when setting up the class.\n",
"\n",
"- Dense Vector Search(Default)\n",
"- Sparse Vector Search\n",
"- Hybrid Search\n",
"\n",
"### Dense Vector Search\n",
"\n",
"To search with only dense vectors,\n",
"\n",
"- The `retrieval_mode` parameter should be set to `RetrievalMode.DENSE`(default).\n",
"- A [dense embeddings](https://python.langchain.com/v0.2/docs/integrations/text_embedding/) value should be provided to the `embedding` parameter.\n",
"\n",
"### Sparse Vector Search\n",
"\n",
"To search with only sparse vectors,\n",
@@ -361,47 +483,6 @@
"\n",
"The `langchain-qdrant` package provides a [FastEmbed](https://github.com/qdrant/fastembed) based implementation out of the box.\n",
"\n",
"To use it, install the FastEmbed package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ceb493a3",
"metadata": {},
"outputs": [],
"source": [
"%pip install fastembed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "052e3412",
"metadata": {},
"outputs": [],
"source": [
"from langchain_qdrant import FastEmbedSparse, RetrievalMode\n",
"\n",
"sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/BM25\")\n",
"\n",
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" sparse_embedding=sparse_embeddings,\n",
" location=\":memory:\",\n",
" collection_name=\"my_documents\",\n",
" retrieval_mode=RetrievalMode.SPARSE,\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.similarity_search(query)"
]
},
{
"cell_type": "markdown",
"id": "f4b6c456",
"metadata": {},
"source": [
"### Hybrid Vector Search\n",
"\n",
"To perform a hybrid search using dense and sparse vectors with score fusion,\n",
"\n",
"- The `retrieval_mode` parameter should be set to `RetrievalMode.HYBRID`.\n",
"- A dense embeddings value should be provided to the `embedding` parameter.\n",
"- A sparse embeddings implementation should be provided to the `sparse_embedding` parameter.\n",
"\n",
"Note that if you've added documents with the `HYBRID` mode, you can switch to any retrieval mode when searching. Since both the dense and sparse vectors are available in the collection."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce56f6e9",
"metadata": {},
"outputs": [],
"source": [
"from langchain_qdrant import FastEmbedSparse, RetrievalMode\n",
"\n",
"sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/BM25\")\n",
"\n",
"qdrant = QdrantVectorStore.from_documents(\n",
" docs,\n",
" embedding=embeddings,\n",
" sparse_embedding=sparse_embeddings,\n",
" location=\":memory:\",\n",
" collection_name=\"my_documents\",\n",
" retrieval_mode=RetrievalMode.HYBRID,\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.similarity_search(query)"
]
},
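{
"cell_type": "markdown",
"id": "retrieval-mode-switch-note",
"metadata": {},
"source": [
"For example, here is a sketch of querying the hybrid collection with dense-only retrieval. It assumes the `QdrantVectorStore` constructor accepts an existing client, and reuses the `qdrant`, `embeddings`, and `query` objects defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "retrieval-mode-switch-example",
"metadata": {},
"outputs": [],
"source": [
"# Same collection (ingested in HYBRID mode), now searched with dense vectors only.\n",
"dense_view = QdrantVectorStore(\n",
" client=qdrant.client,\n",
" collection_name=\"my_documents\",\n",
" embedding=embeddings,\n",
" retrieval_mode=RetrievalMode.DENSE,\n",
")\n",
"\n",
"dense_view.similarity_search(query)"
]
},
{
"cell_type": "markdown",
"id": "scroll-note",
"metadata": {},
"source": [
"You can inspect the raw records currently stored in the collection through the underlying client (assuming `client` is the `QdrantClient` instance created during setup):"
]
},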
{
"cell_type": "code",
"execution_count": 11,
"id": "cf772328",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([Record(id='42a580cb-7469-4324-9927-0febab57ce92', payload={'page_content': 'The stock market is down 500 points today due to fears of a recession.', 'metadata': {'source': 'news'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='50d8d6ee-69bf-4173-a6a2-b254e9928965', payload={'page_content': 'Robbers broke into the city bank and stole $1 million in cash.', 'metadata': {'source': 'news'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='6dae6b37-826d-4f14-8376-da4603b35de3', payload={'page_content': 'Is the new iPhone worth the price? Read this review to find out.', 'metadata': {'source': 'website'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='91ed6c56-fe53-49e2-8199-c3bb3c33c3eb', payload={'page_content': 'LangGraph is the best framework for building stateful, agentic applications!', 'metadata': {'source': 'tweet'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='9e6ba50c-794f-4b88-94e5-411f15052a02', payload={'page_content': 'The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.', 'metadata': {'source': 'news'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='b0964ab5-5a14-47b4-a983-37fa5c5bd154', payload={'page_content': 'The top 10 soccer players in the world right now.', 'metadata': {'source': 'website'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='bd2eae02-74b5-43ec-9fcf-09e9d9db6fd3', payload={'page_content': \"Wow! That was an amazing movie. I can't wait to see it again.\", 'metadata': {'source': 'tweet'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='c04134c3-273d-4766-949a-eee46052ad32', payload={'page_content': 'I had chocalate chip pancakes and scrambled eggs for breakfast this morning.', 'metadata': {'source': 'tweet'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='d3202666-6f2b-4186-ac43-e35389de8166', payload={'page_content': 'Building an exciting new project with LangChain - come check it out!', 'metadata': {'source': 'tweet'}}, vector=None, shard_key=None, order_value=None),\n",
" Record(id='ff774e5c-f158-4d12-94e2-0a0162b22f27', payload={'page_content': 'I have a bad feeling I am going to get deleted :(', 'metadata': {'source': 'tweet'}}, vector=None, shard_key=None, order_value=None)],\n",
" None)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client.scroll(collection_name=\"demo_collection\")"
]
},
{
"cell_type": "markdown",
"id": "1bda9bf5",
"metadata": {},
"source": [
"If you want to execute a similarity search and receive the corresponding scores you can run:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8804a21d",
"metadata": {
"ExecuteTime": {
@@ -459,27 +544,21 @@
"start_time": "2023-04-04T10:51:25.227384Z"
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "756a6887",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:25.642282Z",
"start_time": "2023-04-04T10:51:25.635947Z"
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.531834] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news', '_id': '9e6ba50c-794f-4b88-94e5-411f15052a02', '_collection_name': 'demo_collection'}]\n"
]
}
],
"source": [
"document, score = found_docs[0]\n",
"print(document.page_content)\n",
"print(f\"\\nScore: {score}\")"
"results = vector_store.similarity_search_with_score(\n",
" query=\"Will it be hot tomorrow\", k=1\n",
")\n",
"for doc, score in results:\n",
" print(f\"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "525e3582",
"metadata": {},
"source": [
"For a full list of all the search functions available for a `QdrantVectorStore`, read the [API reference](https://api.python.langchain.com/en/latest/vectorstores/langchain_qdrant.vectorstores.Qdrant.html)\n",
"\n",
"### Metadata filtering\n",
"\n",
"Qdrant has an [extensive filtering system](https://qdrant.tech/documentation/concepts/filtering/) with rich type support. It is also possible to use the filters in Langchain, by passing an additional param to both the `similarity_search_with_score` and `similarity_search` methods."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1c2c58dc",
"cell_type": "code",
"execution_count": 14,
"id": "dc7cffc8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* The top 10 soccer players in the world right now. [{'source': 'website', '_id': 'b0964ab5-5a14-47b4-a983-37fa5c5bd154', '_collection_name': 'demo_collection'}]\n"
]
}
],
"source": [
"```python\n",
"from qdrant_client.http import models\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.similarity_search_with_score(query, filter=models.Filter(...))\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c58c30bf",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:39:53.032744Z",
"start_time": "2023-04-04T10:39:53.028673Z"
}
},
"source": [
"## Maximum marginal relevance search (MMR)\n",
"\n",
"If you'd like to look up some similar documents, but you'd also like to receive diverse results, MMR is the method you should consider. Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents.\n",
"\n",
"Note that MMR search is only available if you've added documents with `DENSE` or `HYBRID` modes. Since it requires dense vectors."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "76810fb6",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.010947Z",
"start_time": "2023-04-04T10:51:25.647687Z"
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = qdrant.max_marginal_relevance_search(query, k=2, fetch_k=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80c6db11",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.016979Z",
"start_time": "2023-04-04T10:51:26.013329Z"
}
},
"outputs": [],
"source": [
"for i, doc in enumerate(found_docs):\n",
" print(f\"{i + 1}.\", doc.page_content, \"\\n\")"
"results = vector_store.similarity_search(\n",
" query=\"Who are the best soccer players in the world?\",\n",
" k=1,\n",
" filter=models.Filter(\n",
" should=[\n",
" models.FieldCondition(\n",
" key=\"page_content\",\n",
" match=models.MatchValue(\n",
" value=\"The top 10 soccer players in the world right now.\"\n",
" ),\n",
" ),\n",
" ]\n",
" ),\n",
")\n",
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "691a82d6",
"metadata": {},
"source": [
"## Qdrant as a Retriever\n",
"### Query by turning into retriever\n",
"\n",
"Qdrant, as all the other vector stores, is a LangChain Retriever. "
"You can also transform the vector store into a retriever for easier usage in your chains. "
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"id": "9427195f",
"metadata": {
"ExecuteTime": {
@@ -578,49 +630,90 @@
"start_time": "2023-04-04T10:51:26.018763Z"
}
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'source': 'news', '_id': '50d8d6ee-69bf-4173-a6a2-b254e9928965', '_collection_name': 'demo_collection'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = qdrant.as_retriever()"
"retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
"retriever.invoke(\"Stealing from the bank is a crime\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0c851b4f",
"id": "6ac07288",
"metadata": {},
"source": [
"It might be also specified to use MMR as a search strategy, instead of similarity."
"## Chain usage\n",
"\n",
"The code below shows how to use the vector store as a retriever in a simple RAG chain:\n",
"\n",
"```{=mdx}\n",
"import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
"\n",
"<ChatModelTabs customVarName=\"llm\" />\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64348f1b",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.043909Z",
"start_time": "2023-04-04T10:51:26.034284Z"
}
},
"execution_count": 16,
"id": "07bd9785",
"metadata": {},
"outputs": [],
"source": [
"retriever = qdrant.as_retriever(search_type=\"mmr\")"
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3c70c31",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.495652Z",
"start_time": "2023-04-04T10:51:26.046407Z"
"execution_count": 17,
"id": "d97f0c91",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'LangGraph is used for building stateful, agentic applications. It provides a framework that facilitates the development of such applications.'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"retriever.invoke(query)[0]"
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What is LangGraph used for?\")"
]
},
{
@@ -647,10 +740,12 @@
},
"outputs": [],
"source": [
"from langchain_qdrant import RetrievalMode, SparseEmbeddings\n",
"\n",
"QdrantVectorStore.from_documents(\n",
" docs,\n",
" embedding=embeddings,\n",
" sparse_embedding=sparse_embeddings,\n",
" sparse_embedding=SparseEmbeddings(),\n",
" location=\":memory:\",\n",
" collection_name=\"my_documents_2\",\n",
" retrieval_mode=RetrievalMode.HYBRID,\n",
@@ -707,12 +802,14 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"id": "2300e785",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `QdrantVectorStore` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_qdrant.vectorstores.Qdrant.html"
]
}
],
"metadata": {
@@ -731,7 +828,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.11.9"
}
},
"nbformat": 4,

File diff suppressed because it is too large

View File

@@ -0,0 +1,269 @@
import inspect
import sys
from pathlib import Path
from langchain_astradb import AstraDBVectorStore
from langchain_chroma import Chroma
from langchain_community import vectorstores
from langchain_core.vectorstores import VectorStore
from langchain_couchbase import CouchbaseVectorStore
from langchain_milvus import Milvus
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_pinecone import PineconeVectorStore
from langchain_qdrant import QdrantVectorStore
vectorstore_list = [
"FAISS",
"ElasticsearchStore",
"PGVector",
"Redis",
"Clickhouse",
"InMemoryVectorStore",
]
from_partners = [
("Chroma", Chroma),
("AstraDBVectorStore", AstraDBVectorStore),
("QdrantVectorStore", QdrantVectorStore),
("PineconeVectorStore", PineconeVectorStore),
("Milvus", Milvus),
("MongoDBAtlasVectorSearch", MongoDBAtlasVectorSearch),
("CouchbaseVectorStore", CouchbaseVectorStore),
]
VECTORSTORE_TEMPLATE = """\
---
sidebar_position: 1
sidebar_class_name: hidden
keywords: [compatibility]
custom_edit_url:
---
# Vectorstores
## Features
The table below lists the features for some of our most popular vector stores.
{table}
"""
def get_vectorstore_table():
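    """Build the markdown feature-support table for the vector store index page."""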
vectorstore_feat_table = {
"FAISS": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"ElasticsearchStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"PGVector": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"Redis": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"Clickhouse": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"InMemoryVectorStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"Chroma": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"AstraDBVectorStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"QdrantVectorStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"PineconeVectorStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"Milvus": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"MongoDBAtlasVectorSearch": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
"CouchbaseVectorStore": {
"Delete by ID": True,
"Filtering": True,
"similarity_search_by_vector": True,
"similarity_search_with_score": True,
"asearch": True,
"Passes Standard Tests": False,
"Multi Tenancy": False,
"Local/Cloud": "Local",
"IDs in add Documents": True,
},
}
for vs in vectorstore_list + from_partners:
if isinstance(vs, tuple):
cls = vs[1]
vs_name = vs[0]
else:
cls = getattr(vectorstores, vs)
vs_name = vs
for feat in (
"similarity_search_with_score",
"similarity_search_by_vector",
"asearch",
):
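            # A method counts as supported only when the class overrides the
            # default implementation inherited from the VectorStore base class.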
            vectorstore_feat_table[vs_name][feat] = getattr(cls, feat) != getattr(
                VectorStore, feat
            )
if "filter" not in [
key
for key, _ in inspect.signature(
getattr(cls, "similarity_search")
).parameters.items()
]:
vectorstore_feat_table[vs_name]["Filtering"] = False
header = [
"Vectorstore",
"Delete by ID",
"Filtering",
"similarity_search_by_vector",
"similarity_search_with_score",
"asearch",
"Passes Standard Tests",
"Multi Tenancy",
"Local/Cloud",
"IDs in add Documents",
]
title = [
"Vectorstore",
"Delete by ID",
"Filtering",
"Search by Vector",
"Search with score",
"Async",
"Passes Standard Tests",
"Multi Tenancy",
"Local/Cloud",
"IDs in add Documents",
]
rows = [title, [":-"] + [":-:"] * (len(title) - 1)]
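    # One row per store: ✅/❌ for boolean features, the raw value for Local/Cloud.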
for vs, feats in sorted(vectorstore_feat_table.items()):
rows += [
[vs, ""]
+ [
("" if feats.get(h) else "") if h != "Local/Cloud" else feats.get(h)
for h in header[1:]
]
]
return "\n".join(["|".join(row) for row in rows])
if __name__ == "__main__":
output_dir = Path(sys.argv[1])
output_integrations_dir = output_dir / "integrations"
output_integrations_dir_vectorstore = output_integrations_dir / "vectorstores"
output_integrations_dir_vectorstore.mkdir(parents=True, exist_ok=True)
vectorstore_page = VECTORSTORE_TEMPLATE.format(table=get_vectorstore_table())
    with open(output_integrations_dir_vectorstore / "index.mdx", "w") as f:
f.write(vectorstore_page)

View File

@@ -0,0 +1,75 @@
import React from "react";
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import CodeBlock from "@theme-original/CodeBlock";
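
/**
 * Tabbed "install + instantiate" snippets for embedding providers.
 * Pages can hide individual providers (e.g. `hideHuggingFace`) or rename the
 * variable the snippet assigns via `customVarName`.
 * Hypothetical MDX usage: <EmbeddingTabs customVarName="embeddings" hideFakeEmbedding />
 */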
export default function EmbeddingTabs(props) {
const {
openaiParams,
hideOpenai,
huggingFaceParams,
hideHuggingFace,
fakeEmbeddingParams,
hideFakeEmbedding,
customVarName,
} = props;
const openAIParamsOrDefault = openaiParams ?? `model="text-embedding-3-large"`;
const huggingFaceParamsOrDefault = huggingFaceParams ?? `model="sentence-transformers/all-mpnet-base-v2"`;
const fakeEmbeddingParamsOrDefault = fakeEmbeddingParams ?? `size=4096`;
const embeddingVarName = customVarName ?? "embeddings";
const tabItems = [
{
value: "OpenAI",
label: "OpenAI",
text: `from langchain_openai import OpenAIEmbeddings\n\n${embeddingVarName} = OpenAIEmbeddings(${openAIParamsOrDefault})`,
apiKeyName: "OPENAI_API_KEY",
packageName: "langchain-openai",
default: true,
shouldHide: hideOpenai,
},
{
value: "HuggingFace",
label: "HuggingFace",
text: `from langchain_huggingface import HuggingFaceEmbeddings\n\n${embeddingVarName} = HuggingFaceEmbeddings(${huggingFaceParamsOrDefault})`,
apiKeyName: undefined,
packageName: "langchain-huggingface",
default: false,
shouldHide: hideHuggingFace,
},
{
value: "Fake Embedding",
label: "Fake Embedding",
text: `from langchain_core.embeddings import FakeEmbeddings\n\n${embeddingVarName} = FakeEmbeddings(${fakeEmbeddingParamsOrDefault})`,
apiKeyName: undefined,
packageName: "langchain-core",
default: false,
shouldHide: hideFakeEmbedding,
},
];
return (
<Tabs groupId="modelTabs">
{tabItems
.filter((tabItem) => !tabItem.shouldHide)
.map((tabItem) => {
      const apiKeyText = tabItem.apiKeyName ? `import getpass
import os

os.environ["${tabItem.apiKeyName}"] = getpass.getpass()` : '';
return (
          <TabItem
            key={tabItem.value}
            value={tabItem.value}
label={tabItem.label}
default={tabItem.default}
>
<CodeBlock language="bash">{`pip install -qU ${tabItem.packageName}`}</CodeBlock>
<CodeBlock language="python">{apiKeyText + (apiKeyText ? "\n\n" : '') + tabItem.text}</CodeBlock>
</TabItem>
);
})
}
</Tabs>
);
}