databricks: add vector search and embeddings (#25648)

### Summary

Add `DatabricksVectorSearch` and `DatabricksEmbeddings` classes to the
`langchain-databricks` partner packages. Core functionality is
unchanged, but the vector search class is largely refactored for
readability and maintainability.

This PR does not add integration tests yet; they will be added once the
Databricks test workspace is ready.
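
A minimal usage sketch of the two new classes, mirroring the notebook examples added in this PR (the endpoint and index names below are placeholders):

```python
from langchain_databricks import DatabricksEmbeddings
from langchain_databricks.vectorstores import DatabricksVectorSearch

# Wrap an embedding model endpoint hosted on Databricks Model Serving.
embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")

# Query a Databricks Vector Search index. For a direct-access index (or a
# delta-sync index with self-managed embeddings), also pass the embedding
# model and the text column of the index.
vector_store = DatabricksVectorSearch(
    endpoint="<your-endpoint-name>",              # placeholder
    index_name="<catalog>.<schema>.<index-name>",  # placeholder
    embedding=embeddings,
    text_column="text",
)
vector_store.similarity_search("What is Databricks Vector Search?", k=1)
```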

Tagging @efriis as POC


### Tracker
[ ] Create a package and migrate ChatDatabricks
[✍️] Migrate DatabricksVectorSearch, DatabricksEmbeddings, and their
docs
~[ ] Migrate UCFunctionToolkit and its doc~
[ ] Add provider document and update README.md
[ ] Add integration tests and set up secrets (after moving to an external
package)
[ ] Add a deprecation note to the community implementations.

---------

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Yuki Watanabe
2024-08-24 09:40:21 +09:00
committed by GitHub
parent 71c039571a
commit c7a8af2e75
12 changed files with 2321 additions and 215 deletions


@@ -1,22 +1,34 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"cell_type": "raw",
"id": "afaf8039",
"metadata": {},
"source": [
"# Databricks\n",
"---\n",
"sidebar_label: Databricks\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "9a3d6f34",
"metadata": {},
"source": [
"# DatabricksEmbeddings\n",
"\n",
"> [Databricks](https://www.databricks.com/) Lakehouse Platform unifies data, analytics, and AI on one platform.\n",
"\n",
"This notebook provides a quick overview for getting started with Databricks [embedding models](/docs/concepts/#embedding-models). For detailed documentation of all DatabricksEmbeddings features and configurations head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n",
"This notebook provides a quick overview for getting started with Databricks [embedding models](/docs/concepts/#embedding-models). For detailed documentation of all `DatabricksEmbeddings` features and configurations head to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n",
"\n",
"\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"`DatabricksEmbeddings` class wraps an embedding model endpoint hosted on [Databricks Model Serving](https://docs.databricks.com/en/machine-learning/model-serving/index.html). This example notebook shows how to wrap your serving endpoint and use it as a embedding model in your LangChain application.\n",
"\n",
"| Class | Package |\n",
"| :--- | :--- |\n",
"| [DatabricksEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_databricks.embeddings.DatabricksEmbeddings.html) | [langchain-databricks](https://api.python.langchain.com/en/latest/databricks_api_reference.html) |\n",
"\n",
"### Supported Methods\n",
"\n",
@@ -30,13 +42,9 @@
"1. Foundation Models - Curated list of state-of-the-art foundation models such as BAAI General Embedding (BGE). These endpoint are ready to use in your Databricks workspace without any set up.\n",
"2. Custom Models - You can also deploy custom embedding models to a serving endpoint via MLflow with\n",
"your choice of framework such as LangChain, Pytorch, Transformers, etc.\n",
"3. External Models - Databricks endpoints can serve models that are hosted outside Databricks as a proxy, such as proprietary model service like OpenAI text-embedding-3.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. External Models - Databricks endpoints can serve models that are hosted outside Databricks as a proxy, such as proprietary model service like OpenAI text-embedding-3.\n",
"\n",
"\n",
"## Setup\n",
"\n",
"To access Databricks models you'll need to create a Databricks account, set up credentials (only if you are outside Databricks workspace), and install required packages.\n",
@@ -51,6 +59,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "36521c2a",
"metadata": {},
"outputs": [],
"source": [
@@ -63,33 +72,27 @@
},
{
"cell_type": "markdown",
"id": "d9664366",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"The LangChain Databricks integration lives in the `langchain-community` package. Also, `mlflow >= 2.9 ` is required to run the code in this notebook."
"The LangChain Databricks integration lives in the `langchain-databricks` package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64853226",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community mlflow>=2.9.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first demonstrates how to query BGE model hosted as Foundation Models endpoint with `DatabricksEmbeddings`.\n",
"\n",
"For other type of endpoints, there are some difference in how to set up the endpoint itself, however, once the endpoint is ready, there is no difference in how to query it."
"%pip install -qU langchain-databricks"
]
},
{
"cell_type": "markdown",
"id": "45dd1724",
"metadata": {},
"source": [
"## Instantiation"
@@ -98,10 +101,11 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9ea7a09b",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import DatabricksEmbeddings\n",
"from langchain_databricks import DatabricksEmbeddings\n",
"\n",
"embeddings = DatabricksEmbeddings(\n",
" endpoint=\"databricks-bge-large-en\",\n",
@@ -113,65 +117,131 @@
},
{
"cell_type": "markdown",
"id": "77d271b6",
"metadata": {},
"source": [
"## Embed single text"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.051055908203125, 0.007221221923828125, 0.003879547119140625]\n"
]
}
],
"source": [
"embeddings.embed_query(\"hello\")[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Embed documents"
"## Indexing and Retrieval\n",
"\n",
"Embedding models are often used in retrieval-augmented generation (RAG) flows, both as part of indexing data as well as later retrieving it. For more detailed instructions, please see our RAG tutorials under the [working with external knowledge tutorials](/docs/tutorials/#working-with-external-knowledge).\n",
"\n",
"Below, see how to index and retrieve data using the `embeddings` object we initialized above. In this example, we will index and retrieve a sample document in the `InMemoryVectorStore`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d817716b",
"metadata": {},
"outputs": [],
"source": [
"documents = [\"This is a dummy document.\", \"This is another dummy document.\"]\n",
"response = embeddings.embed_documents(documents)\n",
"print([e[:3] for e in response]) # Show first 3 elements of each embedding"
"# Create a vector store with a sample text\n",
"from langchain_core.vectorstores import InMemoryVectorStore\n",
"\n",
"text = \"LangChain is the framework for building context-aware reasoning applications\"\n",
"\n",
"vectorstore = InMemoryVectorStore.from_texts(\n",
" [text],\n",
" embedding=embeddings,\n",
")\n",
"\n",
"# Use the vectorstore as a retriever\n",
"retriever = vectorstore.as_retriever()\n",
"\n",
"# Retrieve the most similar text\n",
"retrieved_document = retriever.invoke(\"What is LangChain?\")\n",
"\n",
"# show the retrieved document's content\n",
"retrieved_document[0].page_content"
]
},
{
"cell_type": "markdown",
"id": "e02b9855",
"metadata": {},
"source": [
"## Wrapping Other Types of Endpoints\n",
"## Direct Usage\n",
"\n",
"The example above uses an embedding model hosted as a Foundation Models API. To learn about how to use the other endpoint types, please refer to the documentation for `ChatDatabricks`. While the model type is different, required steps are the same.\n",
"Under the hood, the vectorstore and retriever implementations are calling `embeddings.embed_documents(...)` and `embeddings.embed_query(...)` to create embeddings for the text(s) used in `from_texts` and retrieval `invoke` operations, respectively.\n",
"\n",
"* [Custom Model Endpoint](https://python.langchain.com/v0.2/docs/integrations/chat/databricks/#wrapping-custom-model-endpoint)\n",
"* [External Models](https://python.langchain.com/v0.2/docs/integrations/chat/databricks/#wrapping-external-models)"
"You can directly call these methods to get embeddings for your own use cases.\n",
"\n",
"### Embed single texts\n",
"\n",
"You can embed single texts or documents with `embed_query`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d2befcd",
"metadata": {},
"outputs": [],
"source": [
"single_vector = embeddings.embed_query(text)\n",
"print(str(single_vector)[:100]) # Show the first 100 characters of the vector"
]
},
{
"cell_type": "markdown",
"id": "1b5a7d03",
"metadata": {},
"source": [
"## API reference\n",
"### Embed multiple texts\n",
"\n",
"For detailed documentation of all ChatDatabricks features and configurations head to the API reference: https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html"
"You can embed multiple texts with `embed_documents`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f4d6e97",
"metadata": {},
"outputs": [],
"source": [
"text2 = (\n",
" \"LangGraph is a library for building stateful, multi-actor applications with LLMs\"\n",
")\n",
"two_vectors = embeddings.embed_documents([text, text2])\n",
"for vector in two_vectors:\n",
" print(str(vector)[:100]) # Show the first 100 characters of the vector"
]
},
{
"cell_type": "markdown",
"id": "98785c12",
"metadata": {},
"source": [
"### Async Usage\n",
"\n",
"You can also use `aembed_query` and `aembed_documents` for producing embeddings asynchronously:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c3bef91",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"\n",
"async def async_example():\n",
" single_vector = await embeddings.aembed_query(text)\n",
" print(str(single_vector)[:100]) # Show the first 100 characters of the vector\n",
"\n",
"\n",
"asyncio.run(async_example())"
]
},
{
"cell_type": "markdown",
"id": "0d053b64",
"metadata": {},
"source": [
"## API Reference\n",
"\n",
"For detailed documentation on `DatabricksEmbeddings` features and configuration options, please refer to the [API reference](https://python.langchain.com/v0.2/api_reference/community/embeddings/langchain_community.embeddings.databricks.DatabricksEmbeddings.html).\n"
]
}
],
@@ -191,9 +261,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
"nbformat_minor": 5
}


@@ -1,139 +1,185 @@
{
"cells": [
{
"cell_type": "markdown",
"cell_type": "raw",
"id": "1957f5cb",
"metadata": {},
"source": [
"# Databricks Vector Search\n",
"---\n",
"sidebar_label: Databricks\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "ef1f0986",
"metadata": {},
"source": [
"# DatabricksVectorSearch\n",
"\n",
"Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n",
"[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n",
"\n",
"This notebook shows how to use LangChain with Databricks Vector Search."
]
},
{
"cell_type": "markdown",
"id": "36fdc060",
"metadata": {},
"source": [
"Install `databricks-vectorsearch` and related Python packages used in this notebook."
"## Setup\n",
"\n",
"To access Databricks models you'll need to create a Databricks account, set up credentials (only if you are outside Databricks workspace), and install required packages.\n",
"\n",
"### Credentials (only if you are outside Databricks)\n",
"\n",
"If you are running LangChain app inside Databricks, you can skip this step.\n",
"\n",
"Otherwise, you need manually set the Databricks workspace hostname and personal access token to `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, respectively. See [Authentication Documentation](https://docs.databricks.com/en/dev-tools/auth/index.html#databricks-personal-access-tokens) for how to get an access token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain-core databricks-vectorsearch langchain-openai tiktoken"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `OpenAIEmbeddings` for the embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "5fb2788f",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
"os.environ[\"DATABRICKS_HOST\"] = \"https://your-databricks-workspace\"\n",
"os.environ[\"DATABRICKS_TOKEN\"] = getpass.getpass(\"Enter your Databricks access token: \")"
]
},
{
"cell_type": "markdown",
"id": "93df377e",
"metadata": {},
"source": [
"Split documents and get embeddings."
"### Installation\n",
"\n",
"The LangChain Databricks integration lives in the `langchain-databricks` package."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"id": "b03d22f1",
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../how_to/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"emb_dim = len(embeddings.embed_query(\"hello\"))"
"%pip install -qU langchain-databricks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"id": "08c6ef75",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## Setup Databricks Vector Search client"
"### Create a Vector Search Endpoint and Index (if you haven't already)\n",
"\n",
"In this section, we will create a Databricks Vector Search endpoint and an index using the client SDK.\n",
"\n",
"If you already have an endpoint and an index, you can skip the section and go straight to \"Instantiation\" section."
]
},
{
"cell_type": "markdown",
"id": "db62918b",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"First, instantiate the Databricks VectorSearch client:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0f2957b",
"metadata": {},
"outputs": [],
"source": [
"from databricks.vector_search.client import VectorSearchClient\n",
"\n",
"vsc = VectorSearchClient()"
"client = VectorSearchClient()"
]
},
{
"cell_type": "markdown",
"id": "31311046",
"metadata": {},
"source": [
"## Create a Vector Search Endpoint\n",
"This endpoint is used to create and access vector search indexes."
"Next, we will create a new VectorSearch endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"id": "be8f7d3a",
"metadata": {},
"outputs": [],
"source": [
"vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Direct Vector Access Index\n",
"Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n",
"index_name = \"vector_search_demo.vector_search.state_of_the_union_index\"\n",
"endpoint_name = \"<your-endpoint-name>\"\n",
"\n",
"index = vsc.create_direct_access_index(\n",
" endpoint_name=vector_search_endpoint_name,\n",
"client.create_endpoint(name=endpoint_name, endpoint_type=\"STANDARD\")"
]
},
{
"cell_type": "markdown",
"id": "63498435",
"metadata": {},
"source": [
"Lastly, we will create an index that cna be queried on the endpoint. There are two types of indexes in Databricks Vector Search and the `DatabricksVectorSearch` class support both use cases.\n",
"\n",
"* **Delta Sync Index** automatically syncs with a source Delta Table, automatically and incrementally updating the index as the underlying data in the Delta Table changes.\n",
"\n",
"* **Direct Vector Access Index** supports direct read and write of vectors and metadata. The user is responsible for updating this table using the REST API or the Python SDK.\n",
"\n",
"Also for delta-sync index, you can choose to use Databricks-managed embeddings or self-managed embeddings (via LangChain embeddings classes)."
]
},
{
"cell_type": "markdown",
"id": "863d7218",
"metadata": {},
"source": [
"The following code creates a **direct-access** index. Please refer to the [Databricks documentation](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html) for the instruction to create the other type of indexes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "474aea5c",
"metadata": {},
"outputs": [],
"source": [
"index_name = \"<your-index-name>\" # Format: \"<catalog>.<schema>.<index-name>\"\n",
"\n",
"index = client.create_direct_access_index(\n",
" endpoint_name=endpoint_name,\n",
" index_name=index_name,\n",
" primary_key=\"id\",\n",
" embedding_dimension=emb_dim,\n",
" # Dimension of the embeddings. Please change according to the embedding model you are using.\n",
" embedding_dimension=3072,\n",
" # A column to store the embedding vectors for the text data\n",
" embedding_vector_column=\"text_vector\",\n",
" schema={\n",
" \"id\": \"string\",\n",
" \"text\": \"string\",\n",
" \"text_vector\": \"array<float>\",\n",
" # Optional metadata columns\n",
" \"source\": \"string\",\n",
" },\n",
")\n",
@@ -141,90 +187,333 @@
"index.describe()"
]
},
{
"cell_type": "markdown",
"id": "979bea9b",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## Instantiation\n",
"\n",
"The instantiation of `DatabricksVectorSearch` is a bit different depending on whether your index uses Databricks-managed embeddings or self-managed embeddings i.e. LangChain Embeddings object of your choice."
]
},
{
"cell_type": "markdown",
"id": "d34c1b01",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"If you are using a delta-sync index with Databricks-managed embeddings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"id": "dc37144c-208d-4ab3-9f3a-0407a69fe052",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_community.vectorstores import DatabricksVectorSearch\n",
"from langchain_databricks.vectorstores import DatabricksVectorSearch\n",
"\n",
"dvs = DatabricksVectorSearch(\n",
" index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n",
"vector_store = DatabricksVectorSearch(\n",
" endpoint=endpoint_name,\n",
" index_name=index_name,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f48e4e85",
"metadata": {},
"source": [
"## Add docs to the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dvs.add_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity search\n",
"Optional keyword arguments to similarity_search include specifying k number of documents to retrive, \n",
"a filters dictionary for metadata filtering based on [this syntax](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html#use-filters-on-queries),\n",
"as well as the [query_type](https://api-docs.databricks.com/python/vector-search/databricks.vector_search.html#databricks.vector_search.index.VectorSearchIndex.similarity_search) which can be ANN or HYBRID "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"dvs.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Work with Delta Sync Index\n",
"If you are using a direct-access index or a delta-sync index with self-managed embeddings,\n",
"you also need to provide the embedding model and text column in your source table to\n",
"use for the embeddings:\n",
"\n",
"You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details."
"```{=mdx}\n",
"import EmbeddingTabs from \"@theme/EmbeddingTabs\";\n",
"\n",
"<EmbeddingTabs/>\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec6288a7",
"metadata": {},
"outputs": [],
"source": [
"delta_sync_index = vsc.create_delta_sync_index(\n",
" endpoint_name=vector_search_endpoint_name,\n",
" source_table_name=\"vector_search_demo.vector_search.state_of_the_union\",\n",
" index_name=\"vector_search_demo.vector_search.state_of_the_union_index\",\n",
" pipeline_type=\"TRIGGERED\",\n",
" primary_key=\"id\",\n",
" embedding_source_column=\"text\",\n",
" embedding_model_endpoint_name=\"e5-small-v2\",\n",
"# | output: false\n",
"# | echo: false\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b1bdbdf",
"metadata": {},
"outputs": [],
"source": [
"vector_store = DatabricksVectorSearch(\n",
" endpoint=endpoint_name,\n",
" index_name=index_name,\n",
" embedding=embeddings,\n",
" # The column name in the index that contains the text data to be embedded\n",
" text_column=\"document_content\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ac6071d4",
"metadata": {},
"source": [
"## Manage vector store\n",
"\n",
"### Add items to vector store\n",
"\n",
"Note: Adding items to vector store via `add_documents` method is only supported for a **direct-access** index."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "17f5efc0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['1', '2', '3']"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.documents import Document\n",
"\n",
"document_1 = Document(page_content=\"foo\", metadata={\"source\": \"https://example.com\"})\n",
"\n",
"document_2 = Document(page_content=\"bar\", metadata={\"source\": \"https://example.com\"})\n",
"\n",
"document_3 = Document(page_content=\"baz\", metadata={\"source\": \"https://example.com\"})\n",
"\n",
"documents = [document_1, document_2, document_3]\n",
"\n",
"vector_store.add_documents(documents=documents, ids=[\"1\", \"2\", \"3\"])"
]
},
{
"cell_type": "markdown",
"id": "dcf1b905",
"metadata": {},
"source": [
"### Delete items from vector store\n",
"\n",
"Note: Deleting items to vector store via `delete` method is only supported for a **direct-access** index."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "ef61e188",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector_store.delete(ids=[\"3\"])"
]
},
{
"cell_type": "markdown",
"id": "c3620501",
"metadata": {},
"source": [
"## Query vector store\n",
"\n",
"Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent. \n",
"\n",
"### Query directly\n",
"\n",
"Performing a simple similarity search can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "aa0a16fa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* foo [{'id': '1'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search(\n",
" query=\"thud\", k=1, filter={\"source\": \"https://example.com\"}\n",
")\n",
"dvs_delta_sync = DatabricksVectorSearch(delta_sync_index)\n",
"dvs_delta_sync.similarity_search(query)"
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "562056dd",
"metadata": {},
"source": [
"Note: By default, similarity search only returns the primary key and text column. If you want to retrieve the custom metadata associated with the document, pass the additional columns in the `columns` parameter when initializing the vector store."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "a1c746a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* foo [{'source': 'https://example.com', 'id': '1'}]\n"
]
}
],
"source": [
"vector_store = DatabricksVectorSearch(\n",
" endpoint=endpoint_name,\n",
" index_name=index_name,\n",
" embedding=embeddings,\n",
" text_column=\"text\",\n",
" columns=[\"source\"],\n",
")\n",
"\n",
"results = vector_store.similarity_search(query=\"thud\", k=1)\n",
"for doc in results:\n",
" print(f\"* {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "3ed9d733",
"metadata": {},
"source": [
"If you want to execute a similarity search and receive the corresponding scores you can run:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "5efd2eaa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* [SIM=0.414035] foo [{'source': 'https://example.com', 'id': '1'}]\n"
]
}
],
"source": [
"results = vector_store.similarity_search_with_score(\n",
" query=\"thud\", k=1, filter={\"source\": \"https://example.com\"}\n",
")\n",
"for doc, score in results:\n",
" print(f\"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]\")"
]
},
{
"cell_type": "markdown",
"id": "0c235cdc",
"metadata": {},
"source": [
"### Query by turning into retriever\n",
"\n",
"You can also transform the vector store into a retriever for easier usage in your chains. "
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "f3460093",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'source': 'https://example.com', 'id': '1'}, page_content='foo')]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = vector_store.as_retriever(search_type=\"mmr\", search_kwargs={\"k\": 1})\n",
"retriever.invoke(\"thud\")"
]
},
{
"cell_type": "markdown",
"id": "901c75dc",
"metadata": {},
"source": [
"## Usage for retrieval-augmented generation\n",
"\n",
"For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:\n",
"\n",
"- [Tutorials: working with external knowledge](https://python.langchain.com/v0.2/docs/tutorials/#working-with-external-knowledge)\n",
"- [How-to: Question and answer with RAG](https://python.langchain.com/v0.2/docs/how_to/#qa-with-rag)\n",
"- [Retrieval conceptual docs](https://python.langchain.com/v0.2/docs/concepts/#retrieval)"
]
},
{
"cell_type": "markdown",
"id": "8a27244f",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all DatabricksVectorSearch features and configurations head to the API reference: https://api.python.langchain.com/en/latest/vectorstores/langchain_databricks.vectorstores.DatabricksVectorSearch.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "langchain-dev",
"language": "python",
"name": "python3"
"name": "langchain-dev"
},
"language_info": {
"codemirror_mode": {
@@ -236,9 +525,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
"nbformat_minor": 5
}